Compare commits


487 commits

Author SHA1 Message Date
Viktor Barzin
9c68d147e0 k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC)
Some checks failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline failed
Digging into "why did the apiserver crash" disproved the earlier OIDC
explanation. An isolated v1.35.6 apiserver repro with authentik reachable
initialises OIDC cleanly (oidc.go:313, no error) and runs fine — so the
--authentication-config -> --oidc-* revert is NOT what crashed it. etcd's
surviving crash-window log is the real cause: 1180 "apply request took too long"
warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as
kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the
shared sdc HDD (beads code-oflt).
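A rough sketch of how the warning count and the worst apply time could be pulled out of an etcd log. It assumes etcd's JSON log lines carry a `msg` field and a Go-style duration string in `took`; the sample lines in the usage note are illustrative, not copied from the incident log, and the helper names are hypothetical.

```python
import json
import re

SLOW_APPLY_MSG = "apply request took too long"

# Go-style single-unit durations such as "4.3s" or "146.4ms" (assumed format).
_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3, "s": 1.0}
_DURATION_RE = re.compile(r"^([0-9]+(?:\.[0-9]+)?)(ns|us|µs|ms|s)$")


def duration_seconds(text):
    """Convert a single-unit duration string to seconds, or None if unparseable."""
    match = _DURATION_RE.match(text)
    if match is None:
        return None
    return float(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def summarise_slow_applies(lines):
    """Return (warning_count, slowest_apply_seconds) for the given log lines."""
    count = 0
    slowest = 0.0
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # not a JSON log line, ignore it
        if not isinstance(record, dict) or record.get("msg") != SLOW_APPLY_MSG:
            continue
        count += 1
        took = duration_seconds(str(record.get("took", "")))
        if took is not None:
            slowest = max(slowest, took)
    return count, slowest
```

Usage: `summarise_slow_applies(open(path))` over the surviving log file; a healthy etcd would be expected to report few or no such warnings.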

A big contributor to that IO load, and the reason the master root fs sat at 73%:
kubeadm dumps a full ~400MB etcd DB backup into
/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before every etcd upgrade and never
cleans it up. 145 dirs / 28GB had accumulated, driving image-GC churn and extra
write-IO onto etcd's spindle. Reclaimed live (73% -> 23%) and added a preflight
prune (>3 days) so it can't re-accumulate.
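A minimal sketch of the prune rule, assuming age is judged by directory mtime. The real preflight lives in the upgrade script and may select differently; the function names and the dry-run default are hypothetical.

```python
import shutil
import time
from pathlib import Path

BACKUP_PREFIX = "kubeadm-backup-etcd-"
MAX_AGE_SECONDS = 3 * 24 * 3600  # the ">3 days" rule from the commit message


def stale_backups(tmp_dir, now=None):
    """List backup dirs under tmp_dir whose mtime is older than MAX_AGE_SECONDS."""
    now = time.time() if now is None else now
    stale = []
    for entry in sorted(Path(tmp_dir).iterdir()):
        # Only real directories with the kubeadm backup prefix are candidates.
        if entry.is_symlink() or not entry.is_dir():
            continue
        if not entry.name.startswith(BACKUP_PREFIX):
            continue
        if now - entry.stat().st_mtime > MAX_AGE_SECONDS:
            stale.append(entry)
    return stale


def prune_backups(tmp_dir, now=None, dry_run=True):
    """Delete stale backup dirs (or only report them when dry_run is True)."""
    victims = stale_backups(tmp_dir, now)
    if not dry_run:
        for victim in victims:
            shutil.rmtree(victim)
    return victims
```

Defaulting to `dry_run=True` keeps an accidental call from deleting anything; a preflight would pass `dry_run=False` against `/etc/kubernetes/tmp`.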

Also corrected the OIDC handling: the kubeadm-config drift is real but only
breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the
chain's restore.sh + the kubeadm-config reconciliation) — it does not crash the
apiserver. So the preflight check is now an ALERT, not a block (was added on the
wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected.

Per Viktor: reclaim the disk and automate so the manual cleanup never recurs;
the durable IO fix remains code-oflt (etcd off the shared HDD).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 15:23:15 +00:00
Viktor Barzin
60a1cb9a25 k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Last night's autonomous 1.34->1.35 run reached the master control-plane phase
for the first time (preflight passed, etcd snapshot taken, etcd upgraded), then
the kube-apiserver upgrade to v1.35.6 crash-looped and kubeadm auto-rolled-back
to 1.34.9. The cluster stayed healthy but the master was left cordoned and the
chain wedged on in_flight.

Root cause: kubeadm upgrade regenerates the apiserver static-pod manifest from
the kubeadm-config ConfigMap. apiserver auth was switched on 2026-06-19 to a
structured multi-issuer --authentication-config (kubectl + dashboard SSO), but
kubeadm-config still carried the legacy single-issuer --oidc-* extraArgs, so the
regenerated manifest reverted structured auth and the new apiserver crash-looped.
Proven via `kubeadm upgrade diff`. The existing post-upgrade OIDC restore step
never ran because the upgrade itself never succeeded.

Fix:
- rbac/apiserver-oidc.tf: the remote script now also reconciles kubeadm-config
  (kubeadm init phase upload-config: drop --oidc-*, add --authentication-config)
  so a future kubeadm upgrade regenerates a correct manifest. Delivered to the
  cluster via the apiserver-oidc-restore ConfigMap the chain re-runs (CI needs no
  ssh key); trigger deliberately not script-hashed since CI cannot ssh.
- k8s-version-upgrade/upgrade-step.sh: new preflight gate runs `kubeadm upgrade
  diff` and BLOCKS+alerts (never drains the master) if --authentication-config
  would still be dropped.
- Post-mortem + runbook updated.
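The preflight check in the list above could be sketched roughly as follows: treat the `kubeadm upgrade diff` output as a unified diff and flag the case where a line carrying `--authentication-config` is removed and never re-added. This is an illustration under that assumption, not the logic in upgrade-step.sh; the function name is hypothetical.

```python
def drops_flag(diff_text, flag="--authentication-config"):
    """True if the unified diff removes `flag` without adding it back."""
    removed = False
    added = False
    for line in diff_text.splitlines():
        # Skip the "--- old" / "+++ new" file header lines of a unified diff.
        if line.startswith("---") or line.startswith("+++"):
            continue
        if line.startswith("-") and flag in line:
            removed = True
        elif line.startswith("+") and flag in line:
            added = True
    return removed and not added
```

A gate built on this would alert (or block) when `drops_flag(output)` is true and stay quiet when the diff shows only the control-plane image bump.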

The live kubeadm-config was reconciled directly on the master and verified
(`kubeadm upgrade diff` now shows only the control-plane image bump), so tonight's
run can complete the 1.34->1.35 upgrade.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 14:16:04 +00:00
Viktor Barzin
c6bba1da6e home-assistant skill: refresh ha-london map (HAOS 2026.5.2, Cowboy revived, Overview redesign)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to redesign the ha-london dashboards and fix the broken integrations (the Cowboy one). The skill's ha-london knowledge map had drifted badly from reality, so this brings it current: it claimed HA 2025.9.1 on a docker-run container (it's HAOS 2026.5.2, managed); listed the now-dead jdejaegh Cowboy integration with sensor.bike_* entities (revived via elsbrock/cowboy-ha v1.2.0 -> entities are sensor.classic_performance_*); and didn't flag that met/metoffice/roomba/hildebrandglow are user-disabled (not broken) or that Tapo P100 is failing. Also documents the redesigned Overview (Home+More sections, Mushroom+mini-graph-card), the dashboard/view/card glossary, the parked-bike 'unknown battery' gotcha, and that london is API-only from the Sofia devvm (no SSH).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 22:03:15 +00:00
Viktor Barzin
b858561bd0 Merge remote-tracking branch 'origin/master'
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-24 20:59:39 +00:00
Viktor Barzin
a7704f46a6 deploy goldmane-edge-aggregator: durable who-talks-to-whom edge trail (#58, ADR-0014)
Infra side of ADR-0014: an mTLS gRPC consumer of Calico Goldmane's Flows API
that records the namespace-pair edge-set in CNPG and posts a daily new-edge
digest to #security. Adds the goldmane-edge-aggregator stack, the
pg-goldmane-edges Vault rotation role (Tier-0 vault state updated here), and the
namespace in the ghcr-credentials allowlist.

Cert: REUSES the operator-minted, Tigera-CA-signed whisker-backend client cert
(Goldmane verifies only the CA chain, not identity) instead of minting from the
Tigera CA private key. This avoids putting the CA key in TF state AND the
hashicorp/tls provider, which is incompatible with this repo's global
generate-providers/lockfile pattern (it broke every stack's lockfile).

Verified live: aggregator streaming flows, 174 edges in Postgres across 50x54
namespaces, db+slack ExternalSecrets synced, digest dry-run formats correctly,
private image pulls via the Kyverno-synced ghcr-credentials.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:59:39 +00:00
Viktor Barzin
aa510e3600 instagram-poster: force_conflicts on ESO manifests (fix apply)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The ESO v1 migration (2026-06-22) made the external-secrets controller own
.spec.refreshInterval via server-side apply, so terraform apply of the two
ExternalSecret manifests fails with a field-manager conflict (Woodpecker #348),
which blocked the replicas=0 scale-down from landing. Add force_conflicts=true
to both, matching the grafana/woodpecker/traefik fix applied to other stacks
the same day.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:49:53 +00:00
Viktor Barzin
53834deb24 instagram-poster: scale to 0 (unused, dead ExternalSecret)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor confirmed the Instagram Graph poster isn't used. Its ExternalSecret
has been dead on missing Vault keys (ig_graph_long_lived_token,
ig_business_account_id), so the deployment sat at 0/1 firing
DeploymentReplicasMismatch. Setting replicas=0 stops the alert and makes the
scale-down durable (a bare kubectl scale reverts on the next stack apply).
Re-set to 1 after minting a Meta long-lived token + populating the Vault keys.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:45:30 +00:00
Viktor Barzin
8dd9a3978d Merge remote-tracking branch 'forgejo/master' into wizard/homelab-vault
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-24 12:25:52 +00:00
Viktor Barzin
65b2df1222 fix(monitoring): force_conflicts on grafana_db_creds ExternalSecret
The external-secrets controller owns .spec.refreshInterval via SSA, so a plain
terraform apply of the monitoring stack conflicts. Latent until 2026-06-24 (the
homelab-vault loki-rules change was the first monitoring apply in a while and
surfaced it). force_conflicts lets TF win — same pattern as woodpecker/traefik/
k8s-version-upgrade stacks.
2026-06-24 12:25:36 +00:00
Viktor Barzin
1d0388da12 Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-24 12:22:58 +00:00
Viktor Barzin
92361f36db calico: enable Goldmane + Whisker (Calico 3.30 OSS flow observability)
Turns on Calico 3.30's native east-west flow observability so we can see which
Service talks to which (ADR-0014, issue #57). Enabled via the operator CRs
directly (kubectl_manifest Goldmane + Whisker, name=default) rather than the
Helm goldmane/whisker flags, because the goldmanes/whiskers CRDs already exist
and this sidesteps the helm-upgrade CR-before-CRD ordering issue. Whisker
notifications=Disabled so the UI doesn't call the external Tigera endpoint.

Applied supervised: creating the Goldmane CR re-rendered calico-node with the
FELIX_FLOWLOGSGOLDMANESERVER env (operator auto-wires Felix — no manual
FelixConfiguration); calico-node rolled cleanly 7/7, tigerastatus healthy,
goldmane is receiving flows from all nodes, Whisker UI serves.

Durable Loki persistence is NOT included here: the Goldmane emitter is Calico
Cloud/Enterprise-gated with no OSS knob to aim it at Loki (the CR can override
only name+resources, not env), so a durable trail needs a small custom gRPC
consumer of goldmane:7443 — tracked in issue #58.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 12:22:48 +00:00
Viktor Barzin
e711b2f971 feat(monitoring): homelab vault traceability alerts (TOTP-fetch + volume)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Build infra CLI / build (push) Has been cancelled
Adds a Loki ruler group (lane=security -> #security) for the homelab vault
op-log: VaultwardenTOTPFetched (every 2nd-factor fetch is visible) and
VaultwardenFetchVolumeHigh (>100 fetches/10m backstop). The audit spine
(Vault audit device, reads of secret/data/workstation/claude-users/*) is
already captured. True CLI-bypass detection needs cross-stream correlation
(follow-up).
2026-06-24 10:31:32 +00:00
Viktor Barzin
64104e56e9 feat(devvm): install Bitwarden CLI for homelab vault 2026-06-24 10:29:57 +00:00
Viktor Barzin
15643d1f44 feat(cli): bare homelab vault help command 2026-06-24 10:29:32 +00:00
Viktor Barzin
772aed5370 fix(cli): vault security review fixes
C1 (critical): setup wrote the master password + API client_secret as
`vault kv patch key=value` argv, leaking them via /proc/<pid>/cmdline to
same-UID processes. Now written via stdin (key=- form); only email +
client_id (non-credentials) remain in argv.
I1: `get --json` now refuses to run on a TTY (it was dumping the secret to scrollback).
M1: vaultLock now holds the per-user flock (it mutates bw state).
M4: bw login-detection parses status JSON instead of substring matching.
M5: clipboard path refuses when stderr is not a TTY (was silently failing).
M6: realRunner trims only trailing newline, preserving secret whitespace;
    secret prompts likewise.
Adds security-property tests: no secret in argv across the get flow,
clipboard decision matrix, --json TTY gate, bw status parsing.
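To illustrate C1 and M6 in isolation: a sketch that hands a secret to a child process on stdin instead of argv, and trims only one trailing newline from captured output. The CLI itself is not written in Python, so the names here are hypothetical stand-ins, not its code.

```python
import subprocess
import sys


def run_with_secret_on_stdin(argv, secret):
    """Run argv, feeding `secret` via stdin so it never shows in /proc/<pid>/cmdline."""
    if any(secret in arg for arg in argv):
        raise ValueError("secret must not appear in argv")
    return subprocess.run(
        argv, input=secret, text=True, capture_output=True, check=True
    )


def trim_trailing_newline(raw):
    """Drop a single trailing newline; keep all other whitespace in the secret."""
    return raw[:-1] if raw.endswith("\n") else raw


if __name__ == "__main__":
    # Stand-in child process: reads stdin and reports only the length it received.
    child = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
    result = run_with_secret_on_stdin(child, "correct horse ")
    print(trim_trailing_newline(result.stdout))
```

Command-line arguments are readable by same-UID processes for the life of the child, while stdin is a private pipe, which is why the `key=-` form is the safer channel.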
2026-06-24 10:28:31 +00:00
Viktor Barzin
5a864cf19c feat(cli): homelab vault setup onboarding (one-time, self-service) 2026-06-24 10:21:57 +00:00
Viktor Barzin
e20033855d feat(cli): vault list/search/code/status/lock 2026-06-24 10:21:07 +00:00
Viktor Barzin
365340b37d feat(cli): homelab vault get with TTY-aware return 2026-06-24 10:20:05 +00:00
Viktor Barzin
2dd12fc6be feat(cli): vault session bootstrap with per-user flock + no-coredump 2026-06-24 10:18:36 +00:00
Viktor Barzin
5bae2a3907 feat(cli): privacy-aware vault op-log (process, never the secret) 2026-06-24 10:17:50 +00:00
Viktor Barzin
81122f8607 feat(cli): TTY-aware return + OSC52 clipboard with terminal gating 2026-06-24 10:17:13 +00:00
Viktor Barzin
06f4b87af1 feat(cli): vault bw engine env/arg builders + unlock 2026-06-24 10:16:19 +00:00
Viktor Barzin
cd44ca5921 feat(cli): vault creds loading from per-user Vault path 2026-06-24 10:15:32 +00:00
Viktor Barzin
6c53ee10b1 feat(cli): register homelab vault command group skeleton 2026-06-24 10:14:24 +00:00
Viktor Barzin
ae0d7984c4 docs: ADR-0014 + glossary — service identity (namespace+label) & Calico Goldmane observability
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Records the design reached in a /grill-with-docs session: how to track which
Service talks to which as more Services are added, using k8s-native options.

Decision: service identity = the workload's namespace (primary) plus a
`service-identity` label only in the few multi-Service namespaces; east-west
observability = Calico 3.30 Goldmane/Whisker (already in our Calico v3.30.7,
currently disabled) emitting to Loki for a durable trail; enforcement reuses the
existing Wave 1 egress track. Dedicated per-Service ServiceAccounts deferred and
a service mesh / mTLS / SPIFFE rejected — the trust model needs attribution-grade
forensics on a trusted, etcd-constrained cluster, not cryptographic
non-repudiation. This is the service-mesh evaluation the 2026-04-20 infra audit
flagged as missing; rejected alternatives (Retina, Hubble, Kiali, a custom Alloy
enricher) are recorded with rationale.

Adds glossary terms (Service identity, Goldmane / Whisker) to CONTEXT.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 10:00:36 +00:00
Viktor Barzin
0293b5c634 android-emulator: fix idle-sleeper dying with SIGPIPE before it could sleep
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Caught live-testing the previous commit: every sleeper run exited 141
(SIGPIPE) in ~1s with no output, never reaching the scale-down. Cause:
`set -o pipefail` + `dumpsys power | awk '...; exit'` — awk closes the pipe
after the first match while `kubectl exec` is still streaming dumpsys, so
the exec gets SIGPIPE, pipefail makes the pipeline 141, and set -e kills the
script before any echo. (My earlier dry-run missed it because it didn't run
under `set -euo pipefail`.)

Fix: drop pipefail; capture each exec to a var (`|| true`) then parse with
awk reading to END (no early `exit`), so nothing can SIGPIPE mid-stream and
a failed/booting exec falls through to the fail-safe "do not sleep" branch.
Also fetch the pod name via jsonpath instead of `-o name | head -1` (no pipe
to SIGPIPE, no `pod/` prefix to strip), and exec `adb` directly without the
`sh -c` wrapper.
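The capture-then-parse shape of the fix, sketched in Python rather than shell: read the command's whole output first, treat any failure as empty output, and only then search it, so the producer is never cut off mid-stream. The helper names are hypothetical.

```python
import re
import subprocess
import sys


def capture(argv, timeout=30):
    """Return the command's full stdout, or '' on any failure (the `|| true` analogue)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return proc.stdout if proc.returncode == 0 else ""


def first_match(text, pattern):
    """Search already-captured text; the producer has exited, so nothing can SIGPIPE."""
    for line in text.splitlines():
        match = re.search(pattern, line)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    # Stand-in for a chatty `kubectl exec`: the match is on line one, then noise.
    chatty = [sys.executable, "-c",
              "print('wake=1'); [print('noise') for _ in range(10000)]"]
    print(first_match(capture(chatty), r"wake=(\d+)"))
```

An empty capture yields `None`, which the caller can map to the fail-safe "do not sleep" branch.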

Verified live: ran the corrected script as the gate ServiceAccount against
the stuck emulator (idle ~120h) — it logged "idle >= 6h ... scaling to zero"
and patched the deployment to replicas=0. The 6+ day pod is now asleep.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 08:57:36 +00:00
Viktor Barzin
839fdb33c2 android-emulator: sleep after 6h idle (activity-based), fix never-sleeping
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The emulator was meant to scale to zero when idle but had been up 6+ days
straight despite ~5 days with no real use. Two bugs:

1. The idle check counted ESTABLISHED TCP connections to the adb/noVNC
   ports. A forgotten `adb connect` (no disconnect) holds that transport
   open forever, so every 15-min run saw "active" and reset the counter --
   it never reached the sleep branch. (Right now: 4 such stale transports
   from pods on k8s-node3/node4.)
2. Even when it did reach the sleep branch, `kubectl scale --replicas=0`
   failed Forbidden -- the gate ServiceAccount can patch `deployments` but
   not `deployments/scale`.

Switch the sleeper to measure actual use: time since last user activity
(taps/keys/app-launches, incl. noVNC clicks) from `dumpsys power` vs guest
uptime. No interaction for 6h -> sleep. This ignores idle/forgotten
connections entirely. Scale down with a direct replicas patch on the named
deployment (same path the wake gate scales up), so it needs only the
existing `deployments` patch grant -- no `deployments/scale`. Now stateless
(drops the idle-counter annotation; gate.py no longer sets it) and lighter
on etcd. Fail-safe: any read error (e.g. mid-boot) does not sleep.
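The sleep decision itself reduces to a small pure function. A sketch, assuming both readings arrive as milliseconds on the guest's uptime clock and that a failed read is represented as `None`; the names are hypothetical.

```python
from typing import Optional

IDLE_LIMIT_MS = 6 * 3600 * 1000  # 6h without user interaction


def should_sleep(last_activity_ms: Optional[int], uptime_ms: Optional[int]) -> bool:
    """Decide whether to scale to zero; any missing or odd reading means 'stay up'."""
    if last_activity_ms is None or uptime_ms is None:
        return False  # fail-safe: read error (e.g. mid-boot) never sleeps
    idle_ms = uptime_ms - last_activity_ms
    if idle_ms < 0:
        return False  # clocks disagree; do not trust the reading
    return idle_ms >= IDLE_LIMIT_MS
```

Because the function looks only at activity timestamps, a forgotten `adb connect` holding a TCP transport open has no effect on the result.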

Requested by Viktor: turn the dev-only emulator off when it hasn't been
used for 6h.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 08:49:23 +00:00
Viktor Barzin
566447a698 k8s-upgrade: preflight kubeadm-plan gate must pass explicit target (minor-upgrade fix)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Last night's 1.34.9->1.35.6 run passed the ESO/kyverno compat gate (the migration
worked!) but ABORTED at the kubeadm-plan-target gate: it ran `kubeadm upgrade plan`
with NO version, so master's old 1.34.9 kubeadm auto-proposed only the current
minor (Loki: "falling back to stable-1.34") and plan_target != 1.35.6 -> abort.
That gate worked for patch upgrades but never for minors. Fix: pass the explicit
`v$TARGET_VERSION` (verified on master: `kubeadm upgrade plan v1.35.6` emits
"kubeadm upgrade apply v1.35.6"). Works for patches too. Applied live to the
ConfigMap before tonight's run; deleted the failed preflight-1-35-6 job.
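A sketch of the gate comparison, assuming the plan output contains the `kubeadm upgrade apply vX.Y.Z` line quoted above. The function names are hypothetical and the real gate is a shell step, so this only mirrors its logic.

```python
import re

_APPLY_RE = re.compile(r"kubeadm upgrade apply (v\d+\.\d+\.\d+)")


def plan_command(target_version):
    """Build the plan invocation with the explicit target (the fix in this commit)."""
    return ["kubeadm", "upgrade", "plan", f"v{target_version}"]


def plan_target(plan_output):
    """Extract the version kubeadm proposes to apply, or None if none is proposed."""
    match = _APPLY_RE.search(plan_output)
    return match.group(1) if match else None


def gate_passes(plan_output, target_version):
    """The gate passes only when the proposed version equals the requested target."""
    return plan_target(plan_output) == f"v{target_version}"
```

With no explicit version, a 1.34.9 kubeadm proposes a 1.34 target, so `gate_passes(output, "1.35.6")` is false and the run aborts, which matches the failure described.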

Also: ESO 2.x took SSA ownership of .spec.refreshInterval, so terraform's apply of
the k8s-upgrade-creds ExternalSecret hit a field-manager conflict. Added
field_manager.force_conflicts=true (benign — interval is semantically identical).
This pattern affects all 104 migrated ESs fleet-wide (follow-up).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 06:06:14 +00:00
Viktor Barzin
98d2b89614 calico: bump tigera-operator mem limit 256Mi -> 512Mi (OOM crashloop fix)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The operator OOM-crashlooped on 2026-06-23: it idles at ~246Mi (just under the
256Mi limit) with a ~266Mi startup spike (over it) while re-listing resources to
build informer caches, so the first time the pod restarted it could never finish startup
(exit 137 OOMKilled, leader-elect, OOM, repeat). A latent landmine — the limit
was always too tight; it only bit once the pod restarted. Data plane was never
affected (calico-node 7/7, tigerastatus green throughout). 512Mi gives headroom
(now ~246Mi steady, verified stable 0 restarts). NOT caused by the ESO migration
(which never touched calico); cluster churn was at most the trigger that exposed
the tight limit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 12:46:28 +00:00
Viktor Barzin
68c240b8de Merge remote-tracking branch 'origin/master'
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-23 09:56:25 +00:00
Viktor Barzin
7d297dc6b1 eso: complete migration — chart 2.6.0, all CRs on v1, 1.35 gate cleared
Phase 3 of the ESO 0.12->2.6 migration (the last k8s-1.35 compat-gate blocker).
Climbed external-secrets 0.16.2 -> 0.17.0 -> ... -> 2.6.0 one minor at a time,
each hop applied + verified (ES sync held at 109 Ready every hop; atomic=true
rollback safety net). Crossed the 0.17 cutoff (v1beta1 serving removed) only
after Phase 2 put all 104 ExternalSecrets + 2 ClusterSecretStores on
external-secrets.io/v1. Result: compat-gate now returns "OK: cluster is safe to
upgrade to 1.35.6" (EXIT 0) — the autonomous version-check chain will take k8s
1.34 -> 1.35 on its next nightly run.

Also fixes the repo-wide stale-lock issue that broke CI pipeline 332: the
terragrunt-generated providers.tf declares gavinbunney/kubectl + telmate/proxmox,
but ~28-39 stacks' committed .terraform.lock.hcl predated that ("Inconsistent
dependency lock file: no version selected"). Reconciled via `tg init -upgrade`
and committed so `terragrunt apply`/CI work cleanly again.

Docs: .claude/CLAUDE.md ESO line corrected (104 ESs, v1, chart 2.6.0); plan doc
marked COMPLETE.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:55:51 +00:00
Viktor Barzin
ff4b01a674 state(external-secrets): update encrypted state 2026-06-23 09:53:36 +00:00
Viktor Barzin
e1a85dd727 state(external-secrets): update encrypted state 2026-06-23 09:52:30 +00:00
Viktor Barzin
af22416d6f state(external-secrets): update encrypted state 2026-06-23 09:51:21 +00:00
Viktor Barzin
c75982f408 state(external-secrets): update encrypted state 2026-06-23 09:50:11 +00:00
Viktor Barzin
0407e3c578 state(external-secrets): update encrypted state 2026-06-23 09:48:33 +00:00
Viktor Barzin
dab8f9446f state(external-secrets): update encrypted state 2026-06-23 09:47:24 +00:00
Viktor Barzin
e815bb0295 state(external-secrets): update encrypted state 2026-06-23 09:46:17 +00:00
Viktor Barzin
8412cd7d54 state(external-secrets): update encrypted state 2026-06-23 09:45:04 +00:00
Viktor Barzin
f2956e1e62 state(external-secrets): update encrypted state 2026-06-23 09:43:57 +00:00
Viktor Barzin
bf2f865eee state(external-secrets): update encrypted state 2026-06-23 09:42:52 +00:00
Viktor Barzin
6f3cfb18c7 state(external-secrets): update encrypted state 2026-06-23 09:41:46 +00:00
Viktor Barzin
6e8e066215 state(external-secrets): update encrypted state 2026-06-23 09:40:14 +00:00
Viktor Barzin
de1fb04d9f state(external-secrets): update encrypted state 2026-06-23 09:39:12 +00:00
Viktor Barzin
606cfdb544 state(external-secrets): update encrypted state 2026-06-23 09:38:12 +00:00
Viktor Barzin
72464e7880 state(external-secrets): update encrypted state 2026-06-23 09:37:11 +00:00
Viktor Barzin
e88ea50304 docs(multi-tenancy): document install_skills (vendored per-user agent skills)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Record the new reconcile step alongside install_memory/install_playwright:
vendored own-copies of the 16-skill set for the SKILL_USERS allowlist (emo),
why it's vendored not npx (upstream drift), and that if-absent keys on the
user's own copy so it heals a stale/cross-user ~/.claude/skills symlink
(emo's grill-me pointed into the admin's home).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:30:27 +00:00
Viktor Barzin
1c8dc6bd6c t3-provision-users: install_skills heals stale symlinks + owns ~/.agents
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Follow-up to the vendored-skills change, from verifying the emo rollout:

- The if-absent guard treated ANY pre-existing ~/.claude/skills/<name> entry
  as "installed", so a manual cross-user symlink emo already had (grill-me ->
  /home/wizard/.claude/skills/grill-me) was skipped — leaving the requested
  skill depending on the admin's home instead of emo's own copy. The guard now
  keys on the user's OWN copy (a real dir under ~/.agents/skills) and (re)points
  the ~/.claude/skills symlink at it, healing a stale/cross-user link while
  still never clobbering a real dir.
- install -d left the intermediate ~/.agents owned by root; now owned by the user.
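The guard described in the list above, sketched in Python under the assumption that the script's behaviour is exactly as stated: copy only if the user's own copy is absent, repoint a stale symlink, and never touch a real directory. Ownership handling (chown to the user) is left out, and the function name is a stand-in for the shell function.

```python
import os
import shutil
from pathlib import Path


def install_skill(vendored_root, home, name):
    """Install one vendored skill for a user and return the action taken on the link."""
    vendored_root, home = Path(vendored_root), Path(home)
    own_copy = home / ".agents" / "skills" / name
    link = home / ".claude" / "skills" / name
    target = os.path.join("..", "..", ".agents", "skills", name)

    # if-absent: key on the user's OWN copy, not on the ~/.claude/skills entry.
    if not os.path.lexists(own_copy):
        own_copy.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(vendored_root / name, own_copy)

    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink():
        if os.readlink(link) == target:
            return "unchanged"
        link.unlink()  # stale or cross-user symlink: heal it
        link.symlink_to(target)
        return "healed"
    if link.exists():
        return "kept-real-dir"  # never clobber a real directory
    link.symlink_to(target)
    return "linked"
```

The relative target resolves from `~/.claude/skills/` to `~/.agents/skills/<name>`, so the link keeps working if the home directory is moved.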

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:27:31 +00:00
Viktor Barzin
987fdd16db t3-provision-users: vendor agent skills + per-user install_skills (emo)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Make the admin's Claude Code agent skills available to the `emo` devvm user.
Viktor asked to install Matt Pocock's skills for emo, starting with grill-me
but covering the full set the admin already uses.

The `npx skills` upstream has drifted off that set (diagnose -> diagnosing-bugs
and write-a-skill -> writing-great-skills were renamed; caveman + zoom-out are
no longer published), so reproducing it via npx is impossible and would also
spray ~70 agent dirs into the user's home + add a GitHub-clone + unpinned-CLI
dependency to the hourly root reconcile. Instead vendor a point-in-time
snapshot of the 16 skills (scripts/workstation/claude-skills/) and copy them
per-user, mirroring install_memory: install_skills() copies each skill into
~/.agents/skills/<name> (owned by the user) and symlinks
~/.claude/skills/<name> -> ../../.agents/skills/<name>. if-absent, additive,
best-effort, scoped to the SKILL_USERS allowlist (emo).

find-skills is from vercel-labs/skills (not Matt Pocock) but included since it
is part of the admin's current set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:23:37 +00:00
Viktor Barzin
59f2beda21 chrome-service: run real Google Chrome (H.264/AAC codecs) for the browser
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Point the chrome-service container at the new chrome-service-browser image and
launch /opt/google/chrome/chrome instead of the bundled Chromium. Fixes
MEDIA_ERR_SRC_NOT_SUPPORTED on H.264/AAC video (Instagram Reels etc.) in the
noVNC view — bundled Chromium has those codecs compiled out; only real Chrome
carries them. connect_over_cdp callers (tripit fare scrape, homelab browser,
snapshot-harvester) attach over raw CDP (version-tolerant) — validated after
rollout. Image is built off-infra on GHA (prior commit) → public ghcr.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 21:15:36 +00:00
Viktor Barzin
df1ec1879d chrome-service: build a real-Chrome browser image (H.264/AAC codecs)
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build chrome-service-browser / build (push) Has been cancelled
Add an infra-owned image (Playwright base + google-chrome-stable) + its GHA
build workflow. The bundled Chromium ships proprietary codecs compiled out, so
H.264/AAC video (Instagram Reels, X, most .mp4) fails in the noVNC view with
MEDIA_ERR_SRC_NOT_SUPPORTED; only real Google Chrome carries those codecs
(libffmpeg swap + Chrome-for-Testing both ruled out). This commit only builds
the image (→ ghcr.io/viktorbarzin/chrome-service-browser); a follow-up flips
main.tf's launch to it once the image exists + is public.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 21:01:17 +00:00
Viktor Barzin
7061b1dfc6 state(external-secrets): update encrypted state 2026-06-22 20:55:27 +00:00
Viktor Barzin
e2f328ff4a state(external-secrets): update encrypted state 2026-06-22 20:45:24 +00:00
Viktor Barzin
a735be9ba4 state(external-secrets): update encrypted state 2026-06-22 20:45:08 +00:00
Viktor Barzin
c670cb7118 eso: Phase 2 — migrate all 104 ExternalSecrets + 2 ClusterSecretStores to v1
Some checks failed
ci/woodpecker/push/default Pipeline failed
The API rewrite half of the ESO 0.12->2.6 migration (last k8s-1.35 compat-gate
blocker). Done on chart 0.16.2, which serves BOTH external-secrets.io/v1beta1
and v1, so this is the safe window — MUST land before 0.17 removes v1beta1
(there is no conversion webhook). Pure apiVersion bump, schema is byte-identical:
106 occurrences (104 ExternalSecrets + 2 ClusterSecretStores vault-kv/vault-database)
across 73 .tf files, v1beta1 -> v1, no other field changes.
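A rewrite this mechanical can be scripted. The sketch below shows one way to do the apiVersion bump and count occurrences; it is not necessarily how the 106 edits were actually made, and the function names are hypothetical.

```python
import re
from pathlib import Path

NEW_API = "external-secrets.io/v1"
# \b stops the pattern from matching a longer suffix after "v1beta1".
_OLD_API_RE = re.compile(r"external-secrets\.io/v1beta1\b")


def bump_text(text):
    """Return (rewritten_text, number_of_replacements) for one file's contents."""
    return _OLD_API_RE.subn(NEW_API, text)


def bump_tree(root, dry_run=True):
    """Rewrite every .tf file under root; return {path: replacement_count}."""
    changed = {}
    for path in sorted(Path(root).rglob("*.tf")):
        new_text, count = bump_text(path.read_text())
        if count:
            changed[path] = count
            if not dry_run:
                path.write_text(new_text)
    return changed
```

Summing the values of `bump_tree(repo_root)` in dry-run mode gives an occurrence count to compare against the expected total before anything is written.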

Validated live first on tandoor (single, non-coupled, synced ES): the
kubernetes_manifest apiVersion bump forces a REPLACE; the target Secret is
cascade-GC'd for ONE ~0.3s poll then ESO recreates it (identical value re-synced
from Vault, new UID) and the ES returns SecretSynced=True on v1. Running pods
keep their mounted copy through the sub-second blip. All 110 target Secrets were
snapshotted to /tmp first as a backstop.

CI applies the changed stacks serially (staged rollout); watching aggregate ES
sync back to 108 synced (2 pre-existing dead: instagram-poster, payslip-ingest).
Next: Phase 3 climb 0.16.2 -> 2.6.0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 19:13:04 +00:00
Viktor Barzin
98cd535b97 authentik: lock chrome.viktorbarzin.me noVNC to Viktor only
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The chrome-service noVNC exposes Viktor's live logged-in browser sessions
(Instagram etc. — he'll sign in there for homelab browser to reuse). It was
auth="required" = any authenticated user, and "Home Server Admins" includes emo
(emil.barzin@gmail.com), so the admin group is not a sufficient gate. Add a
host-specific case to the domain-wide forward-auth restriction allowing only
Viktor's accounts (vbarzin@gmail.com + akadmin break-glass); everyone else,
incl. emo, is denied at the noVNC. emo's AGENT already can't reach the browser
(read-only RBAC blocks port-forward); this closes the human noVNC path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 18:09:27 +00:00
Viktor Barzin
a3cdc0d6d0 chrome-service: size headed Chrome window to fill Xvfb (noVNC cut-off)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The noVNC view showed the browser in the top-left with the rest of the
framebuffer black. Cause: Chrome launched with no --window-size, and there's no
window manager, so it opened at its profile-persisted (smaller) size inside the
1280x720 Xvfb. Add --window-size=1280,720 --window-position=0,0 so the window
fills the screen on every launch (fresh pods/profiles too). Live windows were
already resized via CDP as a stopgap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 18:00:20 +00:00
Viktor Barzin
c7ead032ec chrome-service: fix noVNC stuck-"Connecting" (x11vnc fd-sweep under nofile=2^31)
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build chrome-service-novnc / build (push) Has been cancelled
The noVNC view hung on "Connecting" forever then timed out. Root cause: x11vnc
sweeps the entire fd table (fcntl per fd) on every client connection, and
containerd grants pods RLIMIT_NOFILE=2^31, so the RFB handshake never completes
(websockify accepts the WS and dials localhost:5900, but x11vnc never sends its
banner — verified: handshake timed out at 8s, x11vnc had burned 1h41m CPU
spinning). Same bug + fix the android-emulator stack already carries.

Cap nofile before x11vnc starts, in two places:
- files/novnc/entrypoint.sh: `ulimit -n 65536` (root fix, makes the image correct)
- main.tf novnc container: `command = ["bash","-c","ulimit -n 65536; exec /entrypoint.sh"]`
  so the cap applies deterministically on rollout even though the image is
  :latest/IfNotPresent (a rebuilt entrypoint isn't guaranteed to be re-pulled).
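For comparison, the same cap expressed with Python's `resource` module (Linux/Unix only). This sketch lowers only the soft limit, whereas the shell `ulimit -n 65536` used in the fix sets both soft and hard; the function names are hypothetical.

```python
import resource

NOFILE_CAP = 65536


def capped(value, cap=NOFILE_CAP):
    """Clamp a limit to `cap`, treating RLIM_INFINITY as 'larger than cap'."""
    if value == resource.RLIM_INFINITY or value > cap:
        return cap
    return value


def cap_nofile(cap=NOFILE_CAP):
    """Lower this process's soft RLIMIT_NOFILE to at most `cap`; children inherit it."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = capped(soft, cap)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft, hard
```

Calling `cap_nofile()` before exec'ing a process that sweeps its fd table bounds the sweep at 65536 entries instead of 2^31.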

Also documents the gotcha + diagnosis in docs/architecture/chrome-service.md and
notes the black-when-idle behaviour + the autoconnect URL.

(A live x11vnc relaunch with the cap already unblocked the running pod; this
makes it survive restarts.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 17:34:03 +00:00
Viktor Barzin
20ca5ee624 tripit: REEL_PROVIDER=anonymous — actually fetch reels (was fake canned caption)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
REEL_PROVIDER was unset, so the reel pipeline used FakeReelExtractor, which returns
a CANNED caption — every pasted (tripit #120) or forwarded reel produced a DUMMY
Saved Place instead of reading the real reel. Set REEL_PROVIDER=anonymous in app_env
(covers the web Deployment + the ingest CronJob) so AnonymousReelExtractor does the
real anonymous read. Verified live from the cluster: yt-dlp fetched a real IG /p/
caption (no IG_GRAPHQL_DOC_ID needed — the internal-API path is an optional
optimisation; yt-dlp fallback works). LLM extraction + Nominatim POI geocoding were
already real (prior commits); this was the last fake link in the chain.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 17:30:47 +00:00
Viktor Barzin
f46b69f372 tripit: enable real LLM + Nominatim on the web Deployment (in-app reel paste #120)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The web Deployment ran LLM_MODE=fake with no reel geocoder — only the ingest-plans
CronJob had real providers. The in-app reel-URL paste feature (tripit #120) runs
ingest_reel IN the web pod (BackgroundTask), so the Deployment now needs real
extraction: LLM_MODE=llamacpp (qwen3vl-8b; qwen3-8b segfaults on the current
llama-swap image) with the ADR-0033 claude-agent-service fallback, plus
REEL_GEOCODER_PROVIDER=nominatim for venue->city/country POI geocoding. Set in
app_env (feeds the Deployment; the CronJobs already had these via extra_env). Bonus:
this also un-fakes the in-app booking *share* import, which used the same fake LLM.
MAIL_INGEST_ENABLED stays false on the Deployment (only the CronJob polls mail).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 16:50:04 +00:00
Viktor Barzin
59f2070e56 tripit: switch mail-ingest LLM_MODEL qwen3-8b -> qwen3vl-8b (qwen3-8b segfaults)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The qwen3-8b GGUF segfaults on load on the current llama-swap :cuda image
("common_init_from_params: failed to create context"; llama-swap returns 502),
which broke ALL tripit mail ingest text extraction — booking emails AND forwarded
reels (status=failed, "no place could be read"). The GGUF isn't corrupt (valid
header, full size, worked for weeks) — it's a llama.cpp/image regression. Rather
than pin the SHARED llama-swap image (cross-user blast radius), repoint the
ingest-plans CronJob at qwen3vl-8b, an already-provisioned 8B model that loads
fine and extracts flight numbers + places reliably. Restores the auto-path
(reels resolve via the Nominatim geocoder; bookings parse again). The broken
qwen3-8b GGUF is a separate, non-urgent llama-cpp cleanup.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 15:52:09 +00:00
Viktor Barzin
7dbbb74163 homelab v0.8.1: frame browser as escalation (default headless), match CLAUDE.md
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build infra CLI / build (push) Has been cancelled
Make `homelab browser --help` and chrome-service.md state the same tiered rule
now in ~/code/CLAUDE.md: default to the Playwright MCP/headless browser for all
routine automation; reach for `homelab browser` ONLY when headless is blocked
(loads-but-submit-fails / one request errors while siblings 200 / explicit bot
wall). Removes the "co-equal choice" framing so agents have one non-conflicting
instruction. Adds a test asserting the tiered wording so it can't regress.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 15:44:43 +00:00
Viktor Barzin
f96cde35bd tripit: enable Nominatim POI geocoding for reel→Wishlist ingest
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Forwarded reels (tripit ADR-0031) geocode their venue to map a Saved Place to a
country + city, but the reel route was wired to the global geocoder, which here is
GEOCODER_PROVIDER=openmeteo (city-level, name-based). OpenMeteo returns nothing for
a venue query like "Time Out Market, Lisbon", so reels never resolved and no Saved
Place was created. The app fix (tripit 3c62d596) gave the reel route its own
geocoder behind REEL_GEOCODER_PROVIDER; set it to nominatim on the ingest-plans
CronJob (the only one running the reel route) so forwarded reels resolve to real
venue coords + city + country. Isolated from the global geocoder, which stays
openmeteo for weather/tours. Verified Nominatim resolves the venue from the cluster.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 14:59:37 +00:00
Viktor Barzin
a6b52a5839 homelab v0.8.0: browser verbs for headful anti-bot web automation
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Add `homelab browser run|open` so agents can drive the cluster's headful
Chrome (chrome-service) over CDP from the devvm. The headless playwright/mcp
browser can load anti-bot sites and fill their forms, but the gated submit
silently fails — e.g. the Stirling Ackroyd Fixflo tenant portal returned
net::ERR_FILE_NOT_FOUND on its pre-submit check and hung, creating nothing.
Driving the real headful Chrome submits first try. That capability already
existed but was undiscoverable, so it cost ~40 min + redundant form re-runs to
find; now it is one command, versioned, test-covered, and `browser --help`
carries the when-to-use signature + an error-code cheat-sheet so the right tool
is reached at the right moment (the failure was judgment, not setup).

- port-forward svc/chrome-service:9222 (tunnels API-server->pod, so it bypasses
  the :9222 NetworkPolicy), assert non-headless via /json/version,
  connect_over_cdp, inject the same vendored stealth.js the in-cluster callers
  use; the port-forward is always torn down, on success and on error.
- node CDP client pinned to playwright-core@1.48.2 to match the v1.48.0-noble
  image (Chromium 130); self-provisioned lazily into ~/.cache/homelab, no
  per-user setup.
- default is a fresh incognito context (safe for the shared browser + concurrent
  callers); --shared-context reuses the warmed persistent profile.
- TDD: cmd_browser_test.go covers arg parsing, headless detection, the version
  pin, the help cheat-sheet, and a stealth.js drift guard. Verified end-to-end
  against bot.sannysoft.com (real Chrome UA, webdriver hidden, plugins/WebGL
  spoofed) and `browser open`.
- docs: README v0.8 section, ADR-0013, and a chrome-service.md "driving from
  outside the cluster" section.

Closes: code-nepg

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 12:22:22 +00:00
Viktor Barzin
de163aa6af workstation: switch devvm OOM backstop from systemd-oomd to earlyoom
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
The systemd-oomd backstop added in the previous commit is INERT on this box.
oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan
rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to
reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at
96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl + balloon).
The very swap=0 that kills the IO storm also neuters oomd.

Replace it with earlyoom, which watches free RAM (MemAvailable%) and is
swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored
(-s 100). Its --avoid pattern spares sshd/systemd/dockerd/containerd/t3-dispatch/tmux
(the admin's way in always survives) and its --prefer pattern targets the
agent/browser hogs. Verified
via --dryrun: fires on the RAM threshold and selects a chrome process, not a
protected daemon.

The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user,
docker.slice 8G) are unchanged and remain the PRIMARY guard — earlyoom is the
aggregate net for the rare all-users-maxed case. systemd-oomd purged; its config
+ ManagedOOM drop-ins removed. Post-mortem updated with the finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 10:39:16 +00:00
Viktor Barzin
3a59f4a8bf workstation: per-user memory caps + systemd-oomd backstop on devvm
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
The shared devvm keeps overloading and had to be hard-killed again today
(2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus
stacked max-effort agents) grew unbounded, spilled into the disk swap, and
swap-thrashed the throttled virtual disk into an IO storm until the box wedged.

Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained
by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree
was capped. So one user could starve everyone.

This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G,
MemorySwapMax=0 so work OOMs locally at its ceiling instead of thrashing swap),
adds a systemd-oomd PSI backstop that sheds the single worst work cgroup under
box-wide pressure while leaving system.slice (sshd/services/your way in)
protected, gives system.slice a fair-share CPU/IO priority edge, and routes
docker containers into a capped, oomd-policed docker.slice so they can't dodge
the caps or mis-target oomd. All durable in setup-devvm.sh so a VM rebuild
reproduces them; systemd-oomd added to packages.txt.

Applied live and verified: oomctl shows the backstop armed (not dry-run) on the
work slices with system.slice protected; a capped-balloon stress test OOM-killed
locally at the ceiling with swap flat (no thrash).

Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 10:25:09 +00:00
Viktor Barzin
2169e0de5f workstation: harden memory hooks — prune dead plugin refs + homelab-CLI-only store
All checks were successful
ci/woodpecker/push/default Pipeline was successful
wire-memory-hooks.py now PRUNES any settings.json hook still pointing at the
retired claude-memory plugin (plugins/claude-memory/hooks/) before the additive
pass. install_memory() rm -rf's that dir, so those entries are dangling — and a
missing UserPromptSubmit hook exits 2, a BLOCKING error that erases the prompt
and froze emo's sessions (2026-06-22). The plugin shares basenames with the
homelab hooks, so the old additive-only logic saw the dead plugin path as
"already present" and skipped installing the real ~/.claude/hooks/ copy; pruning
first fixes that. Verified against emo's exact original config: yields the
correct 4-hook set, drops the dead PermissionRequest entry, idempotent on rerun.

auto-learn.py now stores via the `homelab memory` CLI only — dropped the direct
HTTP path and the local-SQLite fallback (memory is homelab-CLI-only per Viktor;
never local files). No-ops silently when no API key is in env (e.g. ancamilea).
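
The prune-then-add ordering described above can be sketched as follows. This is a minimal illustration, not the actual wire-memory-hooks.py: the settings.json layout (event -> groups -> "hooks" list of command entries), the basename comparison in the additive pass, and all paths used in the example are assumptions.

```python
import copy
import os

# Path fragment of the retired plugin, as named in the commit message.
DEAD_FRAGMENT = "plugins/claude-memory/hooks/"


def _basename(command: str) -> str:
    """Basename of the last token of a hook command ('' for an empty command)."""
    parts = command.split()
    return os.path.basename(parts[-1]) if parts else ""


def prune_dead_hooks(settings: dict, dead_fragment: str = DEAD_FRAGMENT) -> dict:
    """Return a copy of settings without hook commands under dead_fragment.

    Assumed layout: settings["hooks"][event] is a list of groups, each with a
    "hooks" list of {"type": "command", "command": "..."} entries.
    """
    result = copy.deepcopy(settings)
    if "hooks" not in result:
        return result
    pruned = {}
    for event, groups in result["hooks"].items():
        kept_groups = []
        for group in groups:
            kept = [h for h in group.get("hooks", [])
                    if dead_fragment not in h.get("command", "")]
            if kept:
                kept_groups.append({**group, "hooks": kept})
        if kept_groups:  # events left with no hooks are dropped entirely
            pruned[event] = kept_groups
    result["hooks"] = pruned
    return result


def add_hook_if_absent(settings: dict, event: str, command: str) -> dict:
    """Additive pass: append command unless a hook with the same basename exists.

    Matching on basename is an assumption that mirrors the failure described
    above, where a dead plugin path counted as "already present".
    """
    result = copy.deepcopy(settings)
    groups = result.setdefault("hooks", {}).setdefault(event, [])
    present = any(_basename(h.get("command", "")) == _basename(command)
                  for g in groups for h in g.get("hooks", []))
    if not present:
        groups.append({"hooks": [{"type": "command", "command": command}]})
    return result
```

Run on its own, the additive pass skips the real hook while a same-named dead entry is present; pruning first lets it install, and a second run changes nothing. Keys outside "hooks" (such as env) are never touched.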

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 09:24:42 +00:00
Viktor Barzin
aeed461591 Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2)"
All checks were successful
ci/woodpecker/push/default Pipeline was successful
This reverts commit 1595bddfc2.
2026-06-22 08:31:17 +00:00
Viktor Barzin
1595bddfc2 feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Re-land Phase 2 after the first attempt's two failure modes, both fixed:
- tempo.resources set under the correct single-binary chart key (was OOMKilled on
  the namespace LimitRange default when mis-placed at top level).
- atomic=true + cleanup_on_fail=true on BOTH helm releases — a failed install
  auto-rolls-back instead of leaving a stuck/orphaned release (memory #6479).

Tempo (single-binary, proxmox-lvm 20Gi, 30d) + OTel Collector (contrib; otlp ->
redaction -> batch -> tempo) + Tempo datasource + additive trace_id->Tempo
derivedField on Loki + tripit LOG_FORMAT=json/OTEL_EXPORTER_OTLP_ENDPOINT.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 08:17:59 +00:00
Viktor Barzin
a0897de7c3 workstation: document homelab-memory hooks + provisioner self-deploy [ci skip]
multi-tenancy.md never mentioned the homelab-memory hooks rollout and still
listed claude_memory credential injection as purely "future". Document what is
actually true now: install_memory provisions the recall/auto-learn/compaction
hooks per user, the provisioner binary self-deploys from the repo (step 0), the
set -e abort fix, and that the hooks no-op without a MEMORY_API_KEY in env (CLI
defaults the URL) — emo has a key, ancamilea is keyless until one is minted.
Also clarify setup-devvm.sh's binary install is now bootstrap-only (ongoing
edits self-deploy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 08:04:38 +00:00
Viktor Barzin
92f35550f2 workstation: self-deploy t3-provision-users from the repo each reconcile [ci skip]
Root cause of emo's lost memory: nothing redeployed /usr/local/bin/t3-provision-users
except the manual setup-devvm.sh, so the homelab-memory rollout (44562535/9aa2438e,
Jun 21) sat committed-but-undeployed for a day — the hourly reconcile kept running the
pre-memory binary and never wired the new memory hooks for emo/anca.

Close the gap the same way the script already treats managed-settings.json and
start-claude.sh (sync_managed_config / deploy_user_launcher): the repo is the
authoring surface. At the top of the run, if the repo copy differs from the deployed
binary, install it and re-exec the fresh one. Guards: a re-exec env flag (no loop),
bash -n (never deploy a broken script), DRY_RUN (no mutation), cmp (no churn when
unchanged). Verified across all four paths in isolation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 08:02:31 +00:00
Viktor Barzin
0b11a28d66 workstation: stop install_memory aborting the reconcile under set -e
install_memory (added in 44562535) ended with `[[ -d <plugin-dir> ]] && rm && log`
and guarded a chmod with a bare `[[ -f settings ]] && chmod`. When the plugin dir
or settings file is absent — the normal case for users who never had the
claude-memory plugin — those return non-zero, and under `set -euo pipefail` the
function returns non-zero and kills the whole hourly reconcile after the FIRST
user, before the rest are processed.

It never fired before because the rollout was committed but the deployed
/usr/local/bin/t3-provision-users was never updated, so install_memory had never
run. On first real run it aborted right after ancamilea, so emo (and wizard)
never got their memory hooks wired — the reason emo's sessions lost memory. Wrap
the cleanup in an if-block, guard the chmod, and end the function with return 0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 07:59:47 +00:00
Viktor Barzin
464e0bfb97 Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing (ADR-0032 Phase 2)"
All checks were successful
ci/woodpecker/push/default Pipeline was successful
This reverts commit 7513468a2d.
2026-06-22 06:46:56 +00:00
Viktor Barzin
72dcb125d5 Revert "fix(monitoring): tempo OOMKilled — move resources under tempo.resources"
This reverts commit a02782d11f.
2026-06-22 06:46:56 +00:00
Viktor Barzin
a02782d11f fix(monitoring): tempo OOMKilled — move resources under tempo.resources
Some checks failed
ci/woodpecker/push/default Pipeline failed
Pipeline #315 failed: tempo-0 CrashLoopBackOff / OOMKilled (exit 137). The
single-binary grafana/tempo chart (v1.24.4) takes container resources at
tempo.resources, not a top-level resources: — so my block was ignored and the pod
fell to the namespace LimitRange default and OOMed. Set tempo.resources explicitly
(req 256Mi / limit 2Gi). tripit + existing monitoring were unaffected throughout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 06:44:31 +00:00
Viktor Barzin
7513468a2d feat(monitoring): Tempo + OTel Collector for tripit tracing (ADR-0032 Phase 2)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Stand up the cluster's first trace store + OTLP ingress so tripit's OpenTelemetry
spans (Phase 1, already live in prod) export and correlate with logs:
- Grafana Tempo (single-binary, filesystem on proxmox-lvm 20Gi, 30d)
- OTel Collector (contrib; otlp -> redaction deny-list backstop -> batch -> tempo)
- Grafana: a Tempo datasource + an ADDITIVE trace_id->Tempo derivedField on the
  Loki datasource (no uid change, so existing dashboards are unaffected)
- tripit deployment: LOG_FORMAT=json + OTEL_EXPORTER_OTLP_ENDPOINT -> the Collector

Additive (new helm releases; Loki/Prometheus/Grafana untouched). Offline
'terraform validate' clean; full plan+apply runs in CI (locked git-crypt blocks a
local plan as non-admin).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 06:31:11 +00:00
Viktor Barzin
1a32c07ffe docs(eso): Phase 1 done (0.16.2) + confirmed Phase 2 GC findings
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Execution log added to the ESO migration plan. Phase 1 complete: ESO at 0.16.2
(both v1beta1+v1 served). Phase 2 findings confirmed live: apiVersion bump forces
a kubernetes_manifest REPLACE, and ESO ESs use creationPolicy=Owner (target Secret
ownerRef → cascade-GC risk on the replace's delete). Phase 2 must snapshot Secrets
+ empirically validate GC-survival on the first live ES + per-stack two-phase
-target apply (fallback: state rm + import). Corrected the doc's k8s assumption
(cluster is on 1.34; whole climb stays on 1.34, no interleave).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:44:50 +00:00
Viktor Barzin
ac27e41fde Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 20:41:35 +00:00
Viktor Barzin
296deda3b4 eso: Phase 1 — climb chart 0.12.1 -> 0.16.2 (transition version) + atomic
First half of the ESO 0.12->2.6 migration (docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md),
clearing the LAST k8s-1.35 compat-gate blocker. Stepped one minor at a time on
k8s 1.34 (no k8s interleave — cluster already on 1.34, ESO bands are conservative
tested ranges not hard limits): 0.12.1 -> 0.13.0 -> 0.14.4 -> 0.15.1 -> 0.16.2.
Each hop applied + verified: controller healthy, all 108 live ExternalSecrets
stayed SecretSynced (2 pre-existing dead — instagram-poster, payslip-ingest —
missing Vault data, untouched). Added atomic=true + timeout=600 (ESO had no
rollback safety net). 0.16.2 serves BOTH v1beta1 AND v1 (storedVersions now
["v1beta1","v1"]) — the safe window to rewrite all 104 CRs to v1 (Phase 2) before
0.17 removes v1beta1. State auto-committed per hop by scripts/tg (Tier-0 SOPS).
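
The "one minor at a time" rule lends itself to a mechanical check before applying a hop list. The sketch below is an illustration only; the helper names are hypothetical and it assumes plain MAJOR.MINOR.PATCH version strings within a single major series.

```python
def parse(version: str) -> tuple[int, int, int]:
    """Split 'v0.16.2' or '0.16.2' into (major, minor, patch)."""
    major, minor, patch = (int(p) for p in version.lstrip("v").split("."))
    return major, minor, patch


def skipped_minors(path: list[str]) -> list[tuple[str, str]]:
    """Return the hops that change major, go backwards, or skip a minor."""
    bad = []
    for prev, nxt in zip(path, path[1:]):
        (pmaj, pmin, _), (nmaj, nmin, _) = parse(prev), parse(nxt)
        if nmaj != pmaj or not 0 <= nmin - pmin <= 1:
            bad.append((prev, nxt))
    return bad


# The hop sequence listed in this commit message.
ESO_PATH = ["0.12.1", "0.13.0", "0.14.4", "0.15.1", "0.16.2"]

if __name__ == "__main__":
    print(skipped_minors(ESO_PATH))
    print(skipped_minors(["0.12.1", "0.16.2"]))
```

A later climb that crosses a major boundary would need its own rule; this check deliberately flags any major change rather than guessing.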

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:41:30 +00:00
Viktor Barzin
0cd59d2c55 state(external-secrets): update encrypted state
2026-06-21 20:41:10 +00:00
Viktor Barzin
b8612e788d state(external-secrets): update encrypted state
2026-06-21 20:39:45 +00:00
Viktor Barzin
877e5c73b2 state(external-secrets): update encrypted state
2026-06-21 20:38:34 +00:00
Viktor Barzin
de2250f667 immich-frame: set photo date format to dd/MM/yyyy
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The photo date overlay was showing US-style MM/dd/yyyy, ImmichFrame's built-in
default when PhotoDateFormat is unset. Viktor wants UK day/month/year ordering
instead. Pin PhotoDateFormat to the date-fns pattern "dd/MM/yyyy" (uppercase MM =
month; lowercase mm would render minutes). The ConfigMap carries
reloader.stakater.com/match, so Reloader restarts the immich-frame pod
automatically on apply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:36:43 +00:00
Viktor Barzin
8e6eff03dd state(external-secrets): update encrypted state
2026-06-21 20:36:37 +00:00
Viktor Barzin
0bae025b9b wealth dashboard: spend-down figures in today's money (inflation-adjusted)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked whether the spend-down numbers were inflation-adjusted —
they were not (all nominal). He chose to switch the card to today's
money, so every row now shows constant purchasing power for life.

Each row is a die-with-zero annuity at the REAL rate (1+g)/1.03−1
(3% inflation), spending a constant inflation-adjusted amount (the
actual pounds withdrawn rise with inflation) until net worth hits £0
at age 100:
  • No growth (0%)  → £12/day, £370/mo,   £4,446/yr   (negative real: loses to inflation)
  • Inflation (3%)  → £43/day, £1,315/mo, £15,776/yr  (0% real: holds value)
  • Market (7%)     → £130/day, £3,942/mo, £47,300/yr (~3.9% real)

Title now flags "(today's £)". Same panel/layout; only the SQL, title,
and tooltip changed.
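
To make the "today's money" basis concrete, the sketch below converts a nominal growth rate to the real rate (1+g)/1.03-1, derives the constant real spend, and then simulates the account year by year with withdrawals that rise with inflation. It is an illustration under stated assumptions, not the dashboard SQL: it uses whole years, end-of-year withdrawals, and a hypothetical net worth.

```python
def real_rate(nominal_growth: float, inflation: float = 0.03) -> float:
    """Real return implied by a nominal growth rate and an inflation rate."""
    return (1 + nominal_growth) / (1 + inflation) - 1


def real_annuity(net_worth: float, nominal_growth: float, years: int,
                 inflation: float = 0.03) -> float:
    """Constant spend in today's money that drains net_worth to zero."""
    r = real_rate(nominal_growth, inflation)
    if abs(r) < 1e-12:  # growth equals inflation: plain division
        return net_worth / years
    return net_worth * r / (1 - (1 + r) ** -years)


def simulate(net_worth: float, nominal_growth: float, years: int,
             inflation: float = 0.03) -> float:
    """Grow the balance nominally, withdraw an inflation-indexed amount at each
    year end, and return the balance left after the final withdrawal."""
    spend_today = real_annuity(net_worth, nominal_growth, years, inflation)
    balance = net_worth
    for year in range(1, years + 1):
        balance *= 1 + nominal_growth
        balance -= spend_today * (1 + inflation) ** year  # actual pounds rise
    return balance


if __name__ == "__main__":
    nw = 1_000_000  # hypothetical net worth, not the dashboard's value
    for g in (0.0, 0.03, 0.07):
        print(g, round(real_annuity(nw, g, 72)), round(simulate(nw, g, 72), 6))
```

For each scenario the simulated closing balance comes out at zero up to floating-point error, which is the property the card relies on: the pounds withdrawn grow every year, but their purchasing power stays flat.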

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:13:59 +00:00
Viktor Barzin
3fb6284e2b immich-frame: use 24-hour clock (ClockFormat HH:mm)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to switch the Immich photo-frame shown on the Portal
kitchen appliance to a 24-hour clock. ImmichFrame defaults ClockFormat
to 'hh:mm' (12-hour) and we never overrode it, so the frame was showing
12-hour time. Set ClockFormat: "HH:mm" (date-fns 24h token) in the
frame Settings.yml ConfigMap; Reloader restarts the pod on apply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:10:51 +00:00
Viktor Barzin
e89de86af0 wealth dashboard: spend-down table → three growth scenarios
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted the spend-down card to compare three portfolio-growth
scenarios rather than the previous floor-vs-4%-real pair.

The table now has three rows, each a die-with-zero annuity (drain net
worth to £0 by age 100) spending a constant number of ACTUAL (nominal)
pounds, differing only by the assumed nominal growth rate:
  • No growth (0%)      → £43/day,  £1,315/mo, £15,776/yr  (= NW ÷ years)
  • Inflation (3%)      → £106/day, £3,233/mo, £38,792/yr  (NEW)
  • Avg market (7%)     → £220/day, £6,703/mo, £80,435/yr

This keeps the £43 no-growth floor he anchored on. The old third row
was "4% real" (£133) expressed in today's money; it's replaced by the
7%-nominal market row (£220, actual pounds) so all three rows share one
basis (nominal pounds) and are directly comparable. 3%/7% are hardcoded
(one-line SQL edit). Table height 4→5 for the extra row; panels below
shifted down 1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:06:29 +00:00
Viktor Barzin
85d42f2c13 wealth dashboard: merge spend-down tiles into one compact table
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted the six separate spend-down stat tiles consolidated into a
single, more compact card with the figures laid out as rows.

Replaces stat panels 9220-9225 with one table panel (id 9220) in the
Overview row: 2 rows (Floor / 4% real) × 3 columns (per day / month /
year). Same underlying math and live values (£43/£1,315/£15,776 floor;
£133/£4,039/£48,463 at 4% real). w=9 instead of the full-width tile row,
so it takes ~a third of the width.

Note: this intentionally overrides the "table panels live at the bottom"
layout convention — Viktor chose to keep this headline KPI glanceable at
the top of the dashboard rather than scroll for it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 19:55:57 +00:00
Viktor Barzin
63add2a126 feat(tripit): finalize ADR-0028 auth env — AUTH_MODE=normal, trips@ sender, trust XFF
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Now that the native-auth rollout is complete:
- AUTH_MODE hybrid->normal: the legacy Authentik OIDC-bearer + forward-auth arms
  were removed in #96, and 'hybrid' already resolved to 'normal' via
  backward-compat parsing; this makes it explicit and corrects the now-false
  comment.
- SMTP_FROM plans@->trips@: the dedicated native-auth sender; the trips@->spam@
  send-as alias is live + verified (RCPT 250).
- TRUST_FORWARDED_FOR=true: so #95's per-IP signup rate-limit keys on the real
  client behind Traefik, not the shared ingress pod IP.

Env-only; the Deployment image is KEEL_IGNORE_IMAGE (lifecycle-ignored), so this
does NOT touch the running image. Reloader restarts the pod to pick up the new env.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 19:50:20 +00:00
Viktor Barzin
166a2bcab4 wealth dashboard: add "spend-down to £0 at 100" stat tiles
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted a glanceable number on the Wealth dashboard for how much
he can spend for the rest of his life — spending the whole net worth
down to zero by age 100.

Adds a third line of six stat tiles to the Overview section, two
equations × three cadences (per day / month / year):

  • FLOOR  — net worth ÷ time remaining to age 100. Treats the money as
    cash (no growth, no inflation): a conservative lower bound.
    ≈ £43/day, £1.3k/mo, £15.8k/yr.
  • 4% REAL — die-with-zero annuity: the constant, inflation-adjusted
    spend that drains the balance to £0 at 100 while it keeps earning
    4% real. PMT = NW·r/(1−(1+r)^−n). ≈ £133/day, £4.0k/mo, £48.5k/yr.

Horizon is today → his 100th birthday (DOB 1998-10-04 → 2098-10-04),
computed live so the figures tick as net worth and the horizon move.
Net worth reuses the existing latest-per-account dav_corrected math, so
the tiles always agree with the "Net worth (current)" stat (pension
included; target £0). The 4% real rate is hard-coded per his "keep it
simple, just a number" steer — a one-line SQL edit to change later.
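
The two equations and the three cadences can be written out as a short sketch. This is not the panel SQL: the net worth below is hypothetical, and the 365.25-day year and 12-month split used for the cadences are assumptions.

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # assumption: average year length for the horizon


def years_until(target: date, today: date) -> float:
    """Fractional years from today to the target date."""
    return (target - today).days / DAYS_PER_YEAR


def floor_spend(net_worth: float, years: float) -> float:
    """FLOOR: treat the money as cash, no growth and no inflation."""
    return net_worth / years


def annuity_spend(net_worth: float, real_rate: float, years: float) -> float:
    """Die-with-zero annuity: PMT = NW*r / (1 - (1+r)^-n)."""
    return net_worth * real_rate / (1 - (1 + real_rate) ** -years)


def cadences(per_year: float) -> dict[str, float]:
    """Express a yearly figure per day, per month, and per year."""
    return {"day": per_year / DAYS_PER_YEAR,
            "month": per_year / 12,
            "year": per_year}


if __name__ == "__main__":
    horizon = years_until(date(2098, 10, 4), date(2026, 6, 21))
    net_worth = 1_000_000  # hypothetical figure for illustration only
    print(cadences(floor_spend(net_worth, horizon)))
    print(cadences(annuity_spend(net_worth, 0.04, horizon)))
```

Because the horizon is recomputed from the current date, both figures drift upward over time for a fixed net worth, which is why the tiles are computed live rather than stored.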

Layout: tiles inserted at y=9; all sections below shifted down 4 rows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 19:48:30 +00:00
c830f9f462 Merge pull request 'workstation: wire-memory-hooks as root (fix non-admin wiring)' (#14) from wizard/mem-fix into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:45:39 +00:00
Viktor Barzin
9aa2438e75 workstation: run wire-memory-hooks as root, not runuser (fix non-admin wiring)
install_memory ran the JSON-merge helper via 'runuser -u $user', but the helper
lives under the admin's mode-700 home ($WORKSTATION_DIR) which non-admin users
can't traverse -> wiring silently failed for emo/anca (hooks copied but never
wired into settings.json). Run the helper as root (it reads both the repo helper
and the user's home) and chown the result back to the user. Verified by the live
all-users rollout: emo + anca now wired correctly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:45:36 +00:00
f318773cb0 Merge pull request 'workstation: homelab-memory for all users (retire claude-memory MCP)' (#13) from wizard/memory-allusers into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:42:51 +00:00
Viktor Barzin
44562535a2 workstation: provision homelab-memory hooks for all users (retire claude-memory MCP)
Roll the wizard MCP->homelab-CLI memory migration out to every devvm user. Adds
install_memory() to t3-provision-users.sh (mirrors install_playwright: per-user,
idempotent, if-absent, as-the-user): installs the 4 memory hook scripts into
~/.claude/hooks, wires them into settings.json additively (wire-memory-hooks.py
never touches env / the per-user MEMORY_API_KEY), and removes ONLY the
claude_memory MCP + plugin if present. Reuses each user's existing key (no
minting; per-user isolation stays deferred per the 2026-06-07 design). The
homelab CLI hits the same remote HTTP API the MCP used; recall runs via the
homelab-memory-recall.py UserPromptSubmit hook. Shared instructions (rules/skills
symlinked from base; root+infra CLAUDE.md) already cover all users.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:42:42 +00:00
Viktor Barzin
79749d7324 Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:27:42 +00:00
Viktor Barzin
5e3fe2e8e2 docs(plans): ESO 0.12->2.6 (v1beta1->v1) migration design — the last k8s-1.35 blocker
Design doc for migrating External Secrets Operator off v0.12 (k8s <=1.31), now
the ONLY remaining compat-gate blocker for autonomous k8s 1.35 (kyverno cleared
to 1.18.1 today). Decisive findings: NO v1beta1->v1 conversion webhook, so all
104 ExternalSecrets (across 73 stacks) + 2 ClusterSecretStores must be rewritten
to external-secrets.io/v1 (byte-identical apiVersion bump) while on 0.16.2, BEFORE
crossing 0.17 (which removes v1beta1 — the point of no return). Step one minor at
a time (no skipping); chart==app version; downstream Secrets survive. 5-phase
ordered plan + per-phase rollback + the plan-time data.kubernetes_secret -target
gotcha (15 stacks) + Tier-0/SOPS handling. Plan only — nothing applied.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:27:37 +00:00
3f81b20fa6 Merge pull request 'docs: memory via homelab CLI (retire memory-tool/MCP refs)' (#12) from wizard/memory-cli-docs into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:24:10 +00:00
Viktor Barzin
e2018f9b6c docs: memory via homelab CLI, not the retired memory-tool/MCP
The claude-memory MCP/plugin was uninstalled 2026-06-21 (recall now via the
homelab-memory-recall.py UserPromptSubmit hook; store/recall/update via the
`homelab memory` CLI, which hits the same remote HTTP API). Updates the
.claude/CLAUDE.md 'remember X' instruction off the obsolete local memory-tool
CLI + memory_search/memory_get onto the homelab CLI. Matches the root monorepo
CLAUDE.md + ~/.claude/rules/execution.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:24:00 +00:00
Viktor Barzin
51838a4ec7 kyverno: 3.6.1 -> 3.8.1 (app 1.16 -> 1.18.1) — clears the k8s-1.35 compat-gate block
All checks were successful
ci/woodpecker/push/default Pipeline was successful
kyverno v1.16 supports k8s <=1.34, so it was one of the two addons blocking the
autonomous 1.35 upgrade (compat gate, nightly). v1.18 supports 1.35.

Stepped one minor at a time per the kyverno upgrade guide (per-minor CRD notes):
3.6.1 (1.16) -> 3.7.2 (1.17.2) -> 3.8.1 (1.18.1), each hop applied + verified
supervised. atomic=true (auto-rollback on a failed rollout) + forceFailurePolicyIgnore
(admissions stay open mid-roll) kept it safe. Values schema confirmed compatible
across 3.6->3.8 (forceFailurePolicyIgnore still under features:).

Verified after each hop: all 17 ClusterPolicies stayed Ready, admission controller
2/2, no destroys/replaces in plan. Final 1.18.1: images v1.18.1, mutating webhook
live (server-side dry-run injects ndots:2 in a non-excluded ns). compat-gate vs
1.35.6 now lists ONLY external-secrets (kyverno cleared). ESO 0.12->2.x
(v1beta1->v1, 73 files) is the last remaining 1.35 blocker — to be planned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:21:38 +00:00
Viktor Barzin
ead876ec65 k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Adds a daily visibility layer so every night's autonomous-upgrade outcome is
reviewable at a glance during the upgrade-cleanup window (Viktor: "track every
night's upgrade for the next 7 days; clean up all bugs and blockers").

Last night (2026-06-20) confirmed BOTH prior fixes work in production: the
detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate
correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO
v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35.

What's here:
- CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning:
  running version, detector freshness, detected target, outcome (no-op /
  blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs.
  Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py
  for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap.
  Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real
  v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack).
- K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.*` to
  `k8s-upgrade-(preflight|master|worker|postflight)-.*` so the new report job
  (or any future helper) can't false-trip the chain-wedged alarm.
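
The effect of narrowing the matcher can be checked offline with a small sketch. Prometheus regex matchers are anchored at both ends, so `re.fullmatch` is used as the closest Python analogue; the job names are hypothetical examples, not names taken from the cluster.

```python
import re

OLD_PATTERN = r"k8s-upgrade-.*"
NEW_PATTERN = r"k8s-upgrade-(preflight|master|worker|postflight)-.*"


def alert_selects(pattern: str, job_name: str) -> bool:
    """True if a fully anchored regex matcher would select this job name."""
    return re.fullmatch(pattern, job_name) is not None


if __name__ == "__main__":
    # Hypothetical job names for illustration.
    for name in ("k8s-upgrade-preflight-abc12",
                 "k8s-upgrade-nightly-report-29012345"):
        print(name, alert_selects(OLD_PATTERN, name), alert_selects(NEW_PATTERN, name))
```

The old pattern selects both names, while the scoped one selects only the chain phase, so a failed report run no longer looks like a wedged chain.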

Manual state repair (no git artifact): imported the orphaned `alert-digest`
CronJob into the monitoring stack state
(`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`).
Root cause: when alert_digest was added (2026-06-12) the apply recorded its
ConfigMap + Secret but not the CronJob, so every full monitoring apply since has
failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline
298 today) — surviving only via targeted prometheus applies. Now in state, so
monitoring CI applies cleanly again.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 16:57:44 +00:00
Viktor Barzin
7270e2be3b monitoring: K8sUpgradeChainJobFailed must not double-fire on a compat-gate block
Some checks failed
ci/woodpecker/push/default Pipeline failed
Last night (2026-06-20) the detector + compat-gate fixes worked: the chain
resolved target 1.35.6 and the gate correctly REFUSED it (ESO 0.12 + kyverno
1.16 don't support 1.35), pushing k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
fired as designed. But the refusal also made the preflight Job exit 1
(block() exits 1 on purpose so the Failed Job re-spawns nightly), which tripped
K8sUpgradeChainJobFailed too — a duplicate, misleading "pipeline wedged" alarm
for what is the intended halt-and-alert outcome.

Fix: gate the alert with `unless on() k8s_upgrade_blocked == 1`. A deliberate
block sets that gauge (and it stays 1 until the next preflight resets it), so
the chain-job-failed alert is suppressed for the blocked period; a genuine
wedge / crash / halt-on-alert exits 1 WITHOUT setting it, so it still fires
(preserving the alert's original purpose — catching the pre-in_flight preflight
failure that hid the 5-day 1.34.9 wedge). Runbook + automated-upgrades docs
updated to match.
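
The intended truth table of the `unless on() k8s_upgrade_blocked == 1` gate can be sketched as a small predicate. This only illustrates the logic; the function and argument names are hypothetical and the real rule lives in the Prometheus alert expression.

```python
from typing import Optional


def chain_job_failed_fires(job_failed: bool, blocked_gauge: Optional[float]) -> bool:
    """Model of the gated alert.

    job_failed    -- a chain Job is in the Failed state
    blocked_gauge -- current k8s_upgrade_blocked value, or None if the
                     series is absent (assumption: absent means not blocked)
    """
    if not job_failed:
        return False
    # A deliberate compat-gate block sets the gauge to 1 and suppresses the alert.
    return blocked_gauge != 1


if __name__ == "__main__":
    print(chain_job_failed_fires(True, 1))     # deliberate block
    print(chain_job_failed_fires(True, 0))     # genuine wedge or crash
    print(chain_job_failed_fires(True, None))  # gauge never pushed
```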

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 16:35:35 +00:00
Viktor Barzin
b0ccaf1c65 state(vault): update encrypted state 2026-06-21 15:07:01 +00:00
Viktor Barzin
f84e6818b2 state(vault): update encrypted state 2026-06-21 15:07:01 +00:00
Viktor Barzin
cc4bb8ffe8 wealth dashboard: show price freshness for all 3 holdings, not just worst
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor wanted the freshness tile to cover all three main holdings
(META, VUAG, VUSA), not only the single stalest one. Dropped LIMIT 1 so
the stat renders one value per held position (worst-first), switched the
tile to horizontal orientation so the three values sit side-by-side, and
updated the description. Each value is coloured by its own age threshold
(META red ~2mo, the Vanguard ETFs green ~2d). No threshold or datasource
change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 14:49:33 +00:00
6c2c56ab3b Merge pull request 'docs: CrowdSec enforcement = firewall-bouncer + CF WAF (plugin removed)' (#11) from wizard/crowdsec-docs into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 13:40:41 +00:00
Viktor Barzin
ceae4d5f06 docs: rewrite CrowdSec enforcement architecture (firewall-bouncer + CF WAF; Yaegi plugin removed)
The Traefik Yaegi CrowdSec bouncer plugin was dead on Traefik 3.7.5 (handler
never invoked) and has been removed. Document the replacement: in-kernel
nftables drop via cs-firewall-bouncer on direct hosts, and a Cloudflare IP-List
+ zone WAF block rule (fed by a LAPI->CF-list sync CronJob) on proxied hosts.
Both add zero per-request latency and fail open.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:39:26 +00:00
4df741f6de Merge pull request 'traefik/crowdsec: delete dead Yaegi plugin + middleware CRD + captcha (PR2/2)' (#10) from wizard/cs-deplugin-crd into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 13:36:03 +00:00
Viktor Barzin
c23b03864e traefik/crowdsec: delete dead Yaegi plugin + middleware CRD + captcha (PR2/2)
Zero live ingresses reference traefik-crowdsec@kubernetescrd (PR1 + a
cluster-wide targeted ingress re-apply confirmed 0), so the crowdsec Middleware
CRD and the broken Yaegi bouncer plugin can be removed without orphaning any
router. Removes: the `crowdsec` Middleware, the crowdsec-bouncer plugin (static
config + initContainer download + state.json entry), the captcha template
ConfigMap + volume + captcha.html, the Turnstile widget + data.cloudflare_accounts,
and the 3 now-unused module vars. Also drops the `crowdsec` middleware from the
catch-all error-pages IngressRoute chain (the one remaining CRD-level reference,
which an Ingress-annotation grep does not surface) so that router is not orphaned
when the Middleware is deleted; it keeps rate-limit. Enforcement is fully handled
out-of-band now: cs-firewall-bouncer (in-kernel nftables, direct hosts) +
Cloudflare IP-List/WAF (proxied hosts). The api-token-middleware plugin is
deliberately preserved (still used by paperless-mcp).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:35:13 +00:00
df86075c3d Merge pull request 'cleanup: fully remove orphaned council-complaints app' (#9) from wizard/council-cleanup into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 13:33:23 +00:00
Viktor Barzin
68d9058f85 cleanup: fully remove orphaned council-complaints app
The council-complaints app (Islington civic-reporting pilot) has been
abandoned. It was already dead in the cluster (deployments scaled 0/0,
image only on the decommissioned registry.viktorbarzin.me which 404s),
and it was never in Terraform — only docs + a kyverno comment referenced
it. Its live cluster resources (namespace, both NFS-backed PVs, ingresses)
were torn down out-of-band via kubectl (nothing in TF to drift from); the
DB-dump PVC was backed up to NFS first.

This removes the remaining repo references that still describe it as a live app:
- service-catalog.md: drop the council-complaints row
- ci-cd.md + .claude/CLAUDE.md: drop it from the GHA->ghcr app list
- kyverno require-trusted-registries: the registry.viktorbarzin.me/*
  allowlist comment claimed council-complaints as the last referencer;
  rewrite it (no live workload pulls from that registry now; only stale
  completed Job records still carry the ref). The allowlist line itself
  is kept (registry-scoped, not app-specific).

Historical point-in-time plan docs (docs/plans/2026-05-16-auto-upgrade-
apps-{design,plan}.md) still mention it inside a frozen "10 GHA-migrated
repos (memory id=388)" snapshot; left as-is so the dated record stays
accurate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:32:10 +00:00
Viktor Barzin
6dc3ce139f wealth dashboard: expand all rows by default + inline the freshness stat
Some checks failed
ci/woodpecker/push/default Pipeline failed
Two follow-ups Viktor asked for on the Price freshness panel:

- Expand every section by default. Grafana's collapsed rows hide their
  child panels; just flipping collapsed=false leaves a non-canonical shape
  (confirmed via the Grafana API that it keeps the panels nested rather
  than hoisting them), so each row is now collapsed=false + panels=[] with
  its children hoisted to top-level -- the exact form Grafana writes when
  you expand-and-save. Row headers revert to their original y (the child
  y-coords were already expanded-layout coordinates).

- Stop the freshness stat from taking its own line. It's now the 6th tile
  in the existing returns row (1d/7d/30d/90d/12mo + freshness), all width 4
  at y=5; the collapsed-row y-shift from the previous commit is undone.

No query or threshold changes. The large diff is mechanical: 12 child
panels re-indent from nested to top-level.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:29:25 +00:00
Viktor Barzin
92ff0b92f1 Merge remote-tracking branch 'forgejo/master' into wizard/t3-idle-migrate
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 12:41:33 +00:00
Viktor Barzin
5a136c7d53 docs: t3-migrate-idle runbook section + service-catalog + design status
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:40:46 +00:00
Viktor Barzin
334d8fee5d setup-devvm: install + enable t3-migrate-idle (lib, script, units, timer)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:36:13 +00:00
Viktor Barzin
3cf09a0fe3 t3-migrate-idle: systemd oneshot + overnight timer (01:00-05:40, /20)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:35:19 +00:00
Viktor Barzin
af9f7be297 t3-migrate-idle: drain deferral markers when safe
For each /var/lib/t3-autoupdate/deferred/<user> marker: skip+clear if the unit is
gone or was already restarted after the deferral; otherwise, when the idle gate is
satisfied, take a pre-restart backup and restart via the shared safe_restart_unit,
clearing the marker on verified success. DRY_RUN logs decisions without acting.
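
The per-marker decision above, sketched as a pure function. The real script is shell; the argument names and the returned action strings here are hypothetical, and timestamps are assumed to be comparable epoch seconds.

```python
from typing import Optional


def drain_decision(unit_exists: bool,
                   restarted_at: Optional[float],
                   deferred_at: float,
                   idle_gate_ok: bool) -> str:
    """Decide what to do with one deferral marker.

    Returns "clear"   -> marker is stale, remove it without restarting
            "restart" -> take a backup and restart, clear on verified success
            "wait"    -> leave the marker for a later run
    """
    if not unit_exists:
        return "clear"
    if restarted_at is not None and restarted_at > deferred_at:
        # The unit already restarted after the deferral was recorded.
        return "clear"
    if idle_gate_ok:
        return "restart"
    return "wait"


if __name__ == "__main__":
    print(drain_decision(True, 100.0, 200.0, True))
```

A DRY_RUN mode would log the returned action instead of executing it.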

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:34:44 +00:00
Viktor Barzin
06e400522f t3-migrate-idle: idle gate (no in-flight turn + quiet buffer), TDD
The gate reads t3's state.sqlite: safe to restart only when zero threads have an
active_turn_id AND the most-recent thread activity is older than the quiet buffer
(default 15m). Fail-closed on any parse/query error. Pure-bash unit tests cover
the boundaries against fixture DBs (no root/bats/Docker).
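
A rough sketch of the gate's logic, in Python rather than the bash the commit ships. Only `active_turn_id` and the 15-minute default come from this message; the `threads` table name, the `updated_at` column (assumed epoch seconds), and treating an empty table as idle are assumptions made for illustration.

```python
import sqlite3
import time

QUIET_BUFFER_SECONDS = 15 * 60  # default quiet buffer from the commit message


def is_safe_to_restart(db_path, now=None, quiet_buffer=QUIET_BUFFER_SECONDS):
    """Return True only when no turn is in flight and activity is old enough.

    Fails closed: any open, query, or parse error yields False.
    """
    now = time.time() if now is None else now
    try:
        # Open read-only so the gate can never modify the state DB.
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        try:
            active = conn.execute(
                "SELECT COUNT(*) FROM threads WHERE active_turn_id IS NOT NULL"
            ).fetchone()[0]
            latest = conn.execute("SELECT MAX(updated_at) FROM threads").fetchone()[0]
        finally:
            conn.close()
        if active != 0:
            return False
        if latest is None:
            # Assumption: no threads at all counts as idle.
            return True
        return (now - float(latest)) > quiet_buffer
    except (sqlite3.Error, TypeError, ValueError):
        return False
```

Passing `now` explicitly keeps the boundary cases testable against fixture databases without touching the clock.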

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:34:11 +00:00
Viktor Barzin
de97696ff0 t3-autoupdate: source the shared safe-restart lib + record deferrals
Behavior-preserving refactor: the per-unit restart/recover body and small helpers
now come from t3-safe-restart.sh (one audited copy). Additionally, when a unit is
deferred for an active agent, write a marker under /var/lib/t3-autoupdate/deferred/
so the new idle migrator can drain it later; clear the marker on a successful
restart. Install/health-gate/canary logic is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:32:57 +00:00
Viktor Barzin
2ab5b94748 t3-safe-restart: extract shared safe-restart library from t3-autoupdate
Pull the per-unit backup->restart->verify->recover routine (and the small
helpers it needs) out of t3-autoupdate.sh into a sourced library, so a second
job (the upcoming idle migrator) can reuse the exact same audited recovery path
instead of forking safety-critical code. safe_restart_unit returns non-zero on
failure (after recovery+freeze) rather than exiting, so callers control flow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:28:53 +00:00
Viktor Barzin
0cebeeb0ee t3-idle-migrate: implementation plan
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:26:05 +00:00
Viktor Barzin
ddbdbca7e9 wealth dashboard: add "Price freshness" stat for stalest held quote
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor was worried about stale prices silently distorting net worth.
Confirmed it's real: META's quote has been frozen at 2026-04-17 (65 days
old) while the dashboard keeps valuing the ~55-share position at that
stale close; the Vanguard ETFs are current. Nothing flagged it.

Adds one compact stat to the Overview row showing the most out-of-date
HELD position's quote age (symbol + humanised age), colour-coded: green
<=4d (weekend/bank-holiday tolerant), amber 5-9d, red >=10d. Pure read of
the quote_latest mirror via the wealth-pg datasource, held positions
only, LEFT JOIN so a held symbol with no quote at all sorts as max-stale.
The six collapsed rows below shift down 4 grid units to make room; no
other panel touched.
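
The colour bands above, sketched as a function. The thresholds are the ones stated in this message; whole-day ages and colouring a missing quote red are assumptions (the message only says a missing quote sorts as max-stale).

```python
from typing import Optional


def freshness_colour(age_days: Optional[int]) -> str:
    """Map a held position's quote age in whole days to a tile colour."""
    if age_days is None:
        # Assumption: a held symbol with no quote at all is shown as red.
        return "red"
    if age_days <= 4:   # tolerant of weekends and bank holidays
        return "green"
    if age_days <= 9:
        return "amber"
    return "red"


if __name__ == "__main__":
    for days in (2, 7, 65, None):
        print(days, freshness_colour(days))
```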

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:23:45 +00:00
Viktor Barzin
9503bed589 t3-idle-migrate: design for graceful overnight restart of deferred t3-serve instances
Viktor hit the t3 'Client and server versions differ' warning. Root cause: the daily gated autoupdate defers a user's t3-serve restart whenever that user has an active agent at the 04:00 window, so anyone busy every night (long-lived/AFK sessions) never migrates and the client/server version skew persists for days.

This design adds a small idle-gated overnight job that drains those deferrals -- restarting a deferred instance onto the current binary only when no turn is in flight (state.sqlite active_turn_id) and it's been quiet for a buffer, so the migration lands in a real quiet gap instead of killing in-flight agent turns. Reuses the autoupdate's proven backup->restart->verify->recover path via a shared helper (approach C from the brainstorm). Design doc only; no behavior change yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:04:22 +00:00
Viktor Barzin
b1bbe42821 homelab ha token: dedicated openclaw/ha-tokens secret + least-priv RBAC for emo
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
`ha token` originally read openclaw/openclaw-secrets -> skill_secrets, which only
cluster admins can read — so it hung/failed for the non-admin operator it was
built for (emo = emil.barzin@gmail.com, OIDC group "Home Server Admins", whose
identity is deliberately barred from secrets in the openclaw namespace).

Split the HA tokens into a dedicated secret openclaw/ha-tokens (keys sofia/london)
with a Role + RoleBinding granting `get` on JUST that secret to the Home Server
Admins group (k8s RBAC can't scope to a JSON sub-key, hence a separate object).
emo now resolves the HA token with their own identity, WITHOUT gaining the rest
of skill_secrets (slack_webhook, uptime_kuma_password). openclaw's own deployment
keeps reading openclaw-secrets — purely additive.

- stacks/openclaw/ha_tokens.tf: new secret + least-privilege Role/RoleBinding
- cli/cmd_ha.go: read openclaw/ha-tokens (raw base64 per-instance key); drop JSON parse
- README + ADR-0012 updated; VERSION -> v0.7.1

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 10:45:32 +00:00
a091689603 Merge pull request 'traefik/crowdsec: remove dead plugin middleware reference (PR1/2)' (#8) from wizard/cs-deplugin-refs into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-21 00:17:51 +00:00
Viktor Barzin
71d0af084e traefik/crowdsec: remove 6 hard-coded middleware refs the variable sweep missed (PR1/2)
The first PR1 commit only dropped the ingress_factory reference + the 8
exclude_crowdsec call sites. But the crowdsec middleware is ALSO hard-coded
(not via the variable) in 6 more ingresses that build their middleware chain by
hand: owntracks, the monitoring Helm values (grafana + prometheus +
alertmanager), and the reverse-proxy module + its own separate ingress factory.
Remove all 6 so that after the full-cluster apply NO live ingress references
traefik-crowdsec@kubernetescrd — the precondition for PR2 deleting the CRD.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:17:40 +00:00
Viktor Barzin
7bd4612edf ci: scripts/tg waits out a contended state lock (-lock-timeout)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The infra CI pipeline was failing often — ~38% of the last 50 runs didn't
succeed. The single biggest cause (8 of 19 non-successes) was Tier-1 stack
applies dying instantly with "Error acquiring the state lock".

Tier-0 stacks already degrade gracefully (Vault advisory lock → the pipeline
skips a locked stack). Tier-1 stacks have no such fallback: they rely on
terraform's pg-backend pg_advisory_lock, and scripts/tg ran terragrunt with
no -lock-timeout, so any concurrent lock holder was fatal — a Woodpecker-killed
run whose PG lock wasn't reaped yet (PL266 killed → PL267 failed the same
second), a human/agent applying locally, or the daily drift `plan`.

Fix: scripts/tg now passes -lock-timeout (default 5m, override TG_LOCK_TIMEOUT)
on every state-locking verb (plan/apply/destroy/refresh), so a contended lock
WAITS for the holder to finish instead of failing. -auto-approve behaviour for
non-interactive applies is unchanged. Central wrapper change → covers CI, plus
local human/agent applies; no CI image rebuild (tg is read from the repo).
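
A sketch of the argument injection, under stated assumptions: the verb list, the 5m default, and `TG_LOCK_TIMEOUT` come from this message, while the function name, the placement of the flag directly after the verb, and skipping injection when the caller already passed `-lock-timeout` are illustrative choices, not a description of the actual wrapper.

```python
import os

LOCKING_VERBS = {"plan", "apply", "destroy", "refresh"}
DEFAULT_LOCK_TIMEOUT = "5m"


def inject_lock_timeout(args, env=None):
    """Return a copy of args with -lock-timeout added for state-locking verbs."""
    env = os.environ if env is None else env
    args = list(args)
    if not args or args[0] not in LOCKING_VERBS:
        return args
    if any(a.startswith("-lock-timeout") for a in args):
        # Assumption: an explicit caller-supplied value wins.
        return args
    timeout = env.get("TG_LOCK_TIMEOUT", DEFAULT_LOCK_TIMEOUT)
    return [args[0], f"-lock-timeout={timeout}", *args[1:]]


if __name__ == "__main__":
    print(inject_lock_timeout(["apply", "-auto-approve"], env={}))
    print(inject_lock_timeout(["output"], env={}))
```

Taking `env` as a parameter is what makes a hermetic test possible: the test supplies its own mapping instead of relying on the process environment.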

Adds a hermetic pytest (stub terragrunt + preset PG_CONN_STR) pinning the
arg-injection. Docs updated in AGENTS.md + .claude/CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:15:39 +00:00
Viktor Barzin
84a18a5529 traefik/crowdsec: remove dead Yaegi-plugin middleware reference (PR1/2)
The Traefik CrowdSec (Yaegi) bouncer plugin enforces nothing on Traefik 3.7.5
(handler never invoked) and is fully superseded by the cs-firewall-bouncer
(in-kernel nftables drop on direct hosts) + the Cloudflare IP-List/WAF rule
(proxied hosts). Drop the `traefik-crowdsec@kubernetescrd` middleware from the
ingress_factory chain and the 8 explicit `exclude_crowdsec = true` call sites,
and delete the now-unused `exclude_crowdsec` variable.

This is PR1 of a 2-phase removal: the reference is removed FIRST (a shared-module
change → full-cluster apply re-renders every ingress without the middleware) so
that PR2 can delete the `crowdsec` Middleware CRD + the plugin itself WITHOUT
leaving any ingress pointing at a missing middleware (which would error those
routers). PR2 MUST NOT land until this has fully applied and zero live ingresses
reference traefik-crowdsec@kubernetescrd.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:15:12 +00:00
9774ae3d19 Merge pull request 'crowdsec: firewall-bouncer cluster-wide (remove node2 pin)' (#7) from wizard/cs-fw-allnodes into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 00:08:15 +00:00
Viktor Barzin
c92590ae85 crowdsec: roll firewall-bouncer cluster-wide (remove node2 validation pin)
One-node validation on k8s-node2 passed: kernel nftables sets created in both
input and forward chains (policy accept), ~31k decisions loaded, a known banned
scanner confirmed in the drop set, pod stable 4h+ with no collateral. Remove the
nodeSelector so the DaemonSet runs on every node — direct-host enforcement now
survives a MetalLB VIP failover to any worker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:07:45 +00:00
4f1c998468 Merge pull request 'rybbit sync: exclude CAPI + per_page=500 fix' (#6) from wizard/crowdsec-syncfix into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 00:05:50 +00:00
Viktor Barzin
f55bb6c422 rybbit: sync excludes CAPI blocklist + fix CF items per_page (500)
The edge CF IP List can't hold the ~31k CAPI community blocklist (already
enforced in-kernel by the firewall-bouncer), so the sync now skips origin=CAPI
and carries only high-signal local/curated decisions (+ a 9000 safety cap).
Also fixes the list-items GET: per_page=1000 returned a misleading CF 400
'invalid or expired cursor' (10027); the endpoint max is 500. Verified live:
crowdsec_ban populates (4 IPs) and the sync exits 0.
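
The selection step can be sketched as below. The `origin=CAPI` exclusion, the 9000 cap, and the 500 page-size limit come from this message; the `origin` and `value` field names follow the shape of CrowdSec decision objects as assumed here, and the function name is hypothetical.

```python
MAX_EDGE_ITEMS = 9000    # safety cap from the commit message
CF_ITEMS_PER_PAGE = 500  # endpoint maximum; 1000 returned a misleading 400


def select_decisions_for_edge(decisions, cap=MAX_EDGE_ITEMS):
    """Pick the IPs to push to the edge list.

    Skips the CAPI community blocklist (already enforced in-kernel),
    de-duplicates while preserving order, and truncates to the cap.
    """
    ips = []
    seen = set()
    for decision in decisions:
        if decision.get("origin") == "CAPI":
            continue
        ip = decision.get("value")
        if ip and ip not in seen:
            seen.add(ip)
            ips.append(ip)
    return ips[:cap]


if __name__ == "__main__":
    sample = [
        {"origin": "CAPI", "value": "198.51.100.1"},
        {"origin": "crowdsec", "value": "203.0.113.7"},
        {"origin": "cscli", "value": "203.0.113.7"},
    ]
    print(select_decisions_for_edge(sample))
```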

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:05:05 +00:00
Viktor Barzin
6d5d3726d6 Merge remote-tracking branch 'origin/master' into wizard/ha-cli-verbs
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
2026-06-20 23:46:29 +00:00
Viktor Barzin
48225f2dea homelab CLI v0.7: add ha token + ha ssh for Home Assistant
Mined another devvm user's Claude sessions for repeated, hand-rolled command
patterns worth absorbing into the shared CLI. The dominant signal was Home
Assistant "Sofia" work: a `kubectl | base64 | jq` token-extraction pipeline
re-derived ~420x, and a bespoke non-interactive `ssh -o …` invocation reinvented
~30x — every session. The existing `home-assistant-sofia.py` already covers the
API but goes unused from an arbitrary cwd (needs an env var set + a cwd-relative
path), so agents bypassed it and hand-rolled everything.

Add two verbs covering exactly the gaps the `ha` MCP can't (entity state/control
stays with the MCP):
- `ha token [--instance sofia|london]` (read): resolves the long-lived API token
  live from k8s secret openclaw/openclaw-secrets via the ambient kubeconfig — no
  pre-set env var. Composes as `curl -H "Authorization: Bearer $(homelab ha token)"`.
- `ha ssh [--instance sofia|london] -- <cmd>` (write): deterministic
  non-interactive ssh to the HA host using the invoking user's key.

Also fix the root cause: `home-assistant-sofia.py` now falls back to
`homelab ha token` when its env var is unset (works from any directory), and the
home-assistant skill points agents at these verbs + `homelab metrics query`
instead of hand-rolled curls. README + ADR-0012 + AGENTS.md updated per the
per-verb-group convention.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 23:46:09 +00:00
Viktor Barzin
46166c63b2 fix(authentik): long-lived social-login sessions + shield auth from CrowdSec lockout
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor's passkeys all vanished and he was suddenly being asked to log in
multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc
tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned
up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE
each device of users[0] — but the fuzzy search returned the REAL account, so it
wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and
the social-login stage (default-source-authentication-login) had the provider
default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h —
hence the constant re-logins. (Password + passkey logins were already weeks=4.)

Changes:
- authentik: adopt default-source-authentication-login into Terraform (import)
  and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long
  as password/passkey. Immediate relief without re-enrolling.
- authentik: document the provider-schema gotcha — authentik_stage_identification
  exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in
  ignore_changes (commit 4e882989 removed them for this reason; re-adding breaks
  every apply). The passkey break was purely the missing device records, not drift.
- edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login —
  carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule,
  make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block),
  and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting).
- docs: record the session-duration change, the edge enforcement + auth carve-out
  (previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob
  (CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert).

Passkey re-enrollment is a manual user action (devices are gone from the DB);
nothing auto-re-deletes them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 23:40:22 +00:00
Viktor Barzin
600f1f933c Create Claude auth state directories
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The first live renewal run showed systemd could not create state beneath a read-only home sandbox. Provision each user's writable state directory before enabling the timer so automatic renewal can run.
2026-06-20 20:25:55 +00:00
Viktor Barzin
7f1788a106 Merge remote-tracking branch 'origin/master' into wizard/claude-auth-renew
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-20 20:22:20 +00:00
Viktor Barzin
ff67e9d422 Fix workstation package manifest parsing
The approved Claude token renewal deployment could not run because setup-devvm passed inline package comments to apt as package names. Strip inline comments so the persisted all-user setup remains reproducible.
2026-06-20 20:22:05 +00:00
Viktor Barzin
524b874036 state(vault): update encrypted state
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
2026-06-20 20:14:53 +00:00
Viktor Barzin
7050b0441e Merge remote-tracking branch 'origin/master' into wizard/claude-auth-renew
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 20:11:09 +00:00
Viktor Barzin
bc2fbc712c Merge remote-tracking branch 'origin/master' into wizard/claude-auth-renew 2026-06-20 20:10:48 +00:00
Viktor Barzin
02d14796cc feat(mailserver): add trips@ send-as alias for TripIt native auth email (ADR-0028)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
TripIt's native signup-verification + account-recovery mail (ADR-0028) sends From: trips@viktorbarzin.me while authenticating SMTP as spam@. With SPOOF_PROTECTION on, Postfix smtpd_sender_login_maps requires an EXPLICIT alias (the @domain catch-all doesn't satisfy it) — mirrors the existing plans@->spam@ grant. Must be applied + verified before TripIt flips SMTP_FROM to trips@, else every verification/recovery send is rejected 550.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 20:10:47 +00:00
Viktor Barzin
5549fc3672 Add per-user Claude auth renewal
Each workstation user needs a continuously valid Claude token under their own Enterprise identity. Store only that user's OAuth state in an isolated Vault path, renew and verify it automatically, recover from Vault when possible, and alert when interactive SSO is required.
2026-06-20 20:10:40 +00:00
Viktor Barzin
3278588325 chore(authentik): tear down obsolete tripit-enrollment (ADR-0020 superseded by ADR-0028)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
TripIt external users are now LOCAL TripIt accounts (ADR-0028 native passkey + Authentik OIDC), so the Authentik-side self-enrollment machinery is dead. Removes the tripit-enrollment + tripit-recovery flows and all their stages/prompts/policies/bindings, the tripit-email-stages blueprint (+yaml), and the 'TripIt External' group; reverts the admin-services-restriction fence branch that contained those users (its sole member, the leftover tripit-demo@ test account, was deleted first, so the revert affects zero live principals). Real external collaborators (type=external) are untouched. tg plan: 0 add, 1 change (the policy expression), 20 destroy (all tripit_*). Closes tripit#97; moots the B2 per-app OIDC fences.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 20:04:24 +00:00
834c5e6a2a Merge pull request 'CrowdSec proxied: single CF list (block-only) + firewall-bouncer re-apply' (#5) from wizard/crowdsec-1list into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 19:31:01 +00:00
Viktor Barzin
7cf93a0587 crowdsec+rybbit: proxied edge to single CF list (block-only) + retrigger firewall-bouncer apply
CF account hard-limits to 1 Rules List, so proxied enforcement uses one crowdsec_ban
list + one WAF block rule; the sync writes both ban and captcha decisions into it
(captcha downgraded to block at the edge). Drops the second list + managed_challenge
rule. Trivial touch to firewall_bouncer.tf to make CI re-apply crowdsec and recreate
the DaemonSet (tar fix already in master; stale orphan was cleared).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 19:29:43 +00:00
1406d8a391 Merge pull request 'Fix CF ruleset import id + depends_on' (#4) from wizard/crowdsec-fix2 into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 19:13:03 +00:00
Viktor Barzin
f2b089e267 rybbit: fix cloudflare_ruleset import id (zone/ 3-part form) + depends_on lists
v4.52.7 import id must be zone/<zone_id>/<ruleset_id>; add depends_on so the
crowdsec_ban/captcha lists exist before the WAF rules reference them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 19:12:29 +00:00
58fc6d5061 Merge pull request 'Fix CrowdSec firewall-bouncer tar + CF WAF ruleset import' (#3) from wizard/crowdsec-fixes into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 19:06:15 +00:00
Viktor Barzin
a351a66843 crowdsec+rybbit: fix firewall-bouncer tar extraction (busybox) + import existing CF WAF ruleset
- initContainer used GNU tar --wildcards which fails on the busybox curl image (pod Init:Error); switch to extract-all + cp via shell glob.
- cloudflare_ruleset hit the per-zone singleton conflict; import the existing 'default' http_request_firewall_custom ruleset and manage all rules — CrowdSec ban/captcha first, the pre-existing disabled skip rule preserved verbatim.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 19:04:30 +00:00
70e8ce1021 Merge pull request 'CrowdSec real enforcement: edge WAF (proxied) + firewall-bouncer (direct)' (#2) from wizard/crowdsec-enforcement into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 09:42:41 +00:00
Viktor Barzin
ca8d617e72 rybbit: use 'Account Rule Lists' permission group for the CF sync token (v4)
tg plan showed that the agent's guess, 'Account Filter Lists Edit/Read', is not a key in the v4.52.7 permission-group map; the live CF API lists the correct account-scoped groups as 'Account Rule Lists Read'/'Write'.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 09:41:41 +00:00
Viktor Barzin
0c56290af0 chore(forgejo): re-trigger apply of git.timeout/gc.auto (changed-stack skip)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
910d5892 landed the [git.timeout] + [git.config] env in master, but the CI apply
skipped stacks/forgejo (the changed-stack-diff race after a sync-merge), so the
Forgejo deployment never picked it up. A trivial comment touch to force a clean
apply of the stack so the durable push-mirror fix actually takes effect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 09:19:53 +00:00
Viktor Barzin
cc4bfb593b rybbit: proxied CrowdSec enforcement via Cloudflare IP Lists + WAF rule
Replaces the Worker+KV approach (which only covered the ~27 routed hosts) with a
zone-wide mechanism that covers ALL proxied hosts: two CF account IP Lists
(crowdsec_ban, crowdsec_captcha) + one zone WAF custom rule that blocks
`(ip.src in $crowdsec_ban)` and managed-challenges `(ip.src in $crowdsec_captcha)`.
No per-request Worker, no cookie machinery — the rybbit Worker stays
analytics-only. lapi_kv_sync.py now full-reconciles the two lists from LAPI
(fail-safe: a LAPI blip skips the run and freezes the last-known-good block set;
serializes CF bulk ops since CF allows one pending op per account). A
least-privilege CF API token (Account Filter Lists Edit) is minted in TF.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 09:18:33 +00:00
Viktor Barzin
7e646e1c7c crowdsec: add cs-firewall-bouncer DaemonSet (direct-host nftables enforcement)
Drops banned source IPs in-kernel via nftables (hooks input+forward, so DNAT'd
LoadBalancer traffic is caught before reaching Traefik) for DIRECT hosts — the
direct-side replacement for the dead Traefik plugin, zero per-request hop.

No published image exists, so an initContainer fetches the pinned official
static binary (v0.0.34) onto a stock debian-slim base (nftables backend uses
netlink directly, no nft CLI needed). hostNetwork + NET_ADMIN/NET_RAW (not
privileged). Config (with api_key) in a Secret, Reloader-annotated. crowdsec ns
is already in the Kyverno wave-1 exclude list, so the privileged/hostNetwork pod
is admitted. Pinned to k8s-node2 (runs a Traefik pod) for one-node validation
before the nodeSelector is removed to roll cluster-wide. Fail-open by element
timeout if the bouncer stops.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 09:11:08 +00:00
Viktor Barzin
53117b193a portal-realtime: deploy the v2 full-duplex voice agent (Pipecat)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
New stack for the realtime voice agent — v2 of the portal-assistant brain
path. One persistent WebSocket per conversation: continuous mic audio ->
Silero VAD turn-taking -> Whisper STT (portal-stt) -> streaming Claude brain
(claude-agent-service) -> edge-tts (portal-tts) -> audio out, with barge-in.
Reuses all three upstream cluster services; nothing new is spun up.

Public Cloudflare ingress (proxied, WebSocket) at portal-realtime.viktorbarzin.me
with the app's own DEVICE_TOKEN as the edge gate (auth="app" — Authentik would
break the native Portal client). No buffering middleware: it would break the
streaming WebSocket. Image ghcr.io/viktorbarzin/portal-assistant-realtime
(private ghcr, pulled with ghcr_pull_token). Sibling to the v1 portal-assistant
gateway, which stays live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:23:17 +00:00
Viktor Barzin
44cac6f4e2 gitignore: ignore Python test artifacts (__pycache__, *.pyc, .pytest_cache)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Introduced the first pytest file in the tree
(stacks/k8s-version-upgrade/scripts/test_compat_gate.py); running it leaves an
untracked __pycache__/ dir. Ignore the standard Python build artifacts so test
runs don't show up as working-tree noise or get committed by accident.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:17:03 +00:00
Viktor Barzin
b58fe8cb1a docs(k8s-upgrade): record detector Packages-probe -L fix + compat-gate patch scope
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Two corrections to the runbook matching today's code fixes:
- The next-minor *patch* probe (GET .../Packages) also needs `-L`; it lacked it
  until 2026-06-20 and silently no-op'd the 2026-06-19 nightly run. Both probes
  now follow the 302.
- The compat gate's addon check is scoped to minor jumps — patches within the
  running minor are never addon-blocked (target_minor <= running_minor returns
  early), so a conservative ceiling like ESO 0.12 -> 1.31 no longer false-blocks
  a 1.34.x patch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:16:20 +00:00
Viktor Barzin
e5250f417e k8s-version-upgrade: compat gate must not false-block patch upgrades
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The compat gate compared every addon's matrix ceiling against the target
k8s minor unconditionally. That is correct for a minor JUMP, but it also
blocked patch upgrades within the minor the cluster is ALREADY running:
ESO v0.12's matrix ceiling is 1.31, the cluster runs 1.34.9, so a target of
1.34.10 (a patch) was refused with "external-secrets supports k8s <= 1.31;
target 1.34 exceeds it" — even though the running cluster is itself proof ESO
0.12 works on 1.34. That silently defeats autonomous patching (it would have
bitten the moment a 1.34.10 was published).

Fix: a target at or below the running minor crosses into no new k8s minor, so
every installed addon is already empirically proven on it — check_addons now
returns no reasons when target_minor <= running_minor. Added running_minor()
(oldest kubelet across nodes, mirroring the detector; RUNNING_K8S env override
for tests) and pass it in. Minor jumps are unchanged: 1.34->1.35 still blocks
on ESO 0.12 + kyverno 1.16. removed-API + containerd checks are naturally
inert for patches (no API removal / containerd floor inside a minor) and keep
running as defence. Added test_compat_gate.py (8 cases) covering both paths.

Verified end-to-end against live Prometheus: target 1.34.10 -> EXIT 0 (safe),
target 1.35.6 -> EXIT 2 (blocked on ESO+kyverno).
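
The early return described above can be sketched roughly as follows. This is a
minimal illustration, not the real compat-gate.py: the function signatures, the
matrix shape, and the kyverno ceiling value are assumptions made for the example
(only the ESO ceiling of 1.31 comes from this message).

```python
# Hypothetical sketch; names and matrix shape are illustrative only.

def minor_of(version):
    """Return (major, minor) for a version such as '1.34.10' or 'v1.34.9'."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def check_addons(target, running, ceilings):
    """Return block reasons; an empty list means the addon check passes."""
    target_minor, running_minor = minor_of(target), minor_of(running)
    # A target at or below the running minor enters no new k8s minor, so
    # every installed addon is already proven on it: nothing to block.
    if target_minor <= running_minor:
        return []
    reasons = []
    for addon, ceiling in sorted(ceilings.items()):
        if target_minor > minor_of(ceiling):
            reasons.append(
                f"{addon} supports k8s <= {ceiling}; "
                f"target {target_minor[0]}.{target_minor[1]} exceeds it"
            )
    return reasons


# Ceilings: ESO value is from the commit message, kyverno value is assumed.
CEILINGS = {"external-secrets": "1.31", "kyverno": "1.34"}

print(check_addons("1.34.10", "1.34.9", CEILINGS))  # patch within running minor
print(check_addons("1.35.6", "1.34.9", CEILINGS))   # minor jump
```

Comparing (major, minor) tuples rather than strings keeps 1.9 below 1.10, which
a plain string comparison would get wrong.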

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:14:50 +00:00
Viktor Barzin
38675b7922 crowdsec: register kvsync + firewall bouncer keys in LAPI
Seeds two new bouncers at LAPI startup (BOUNCER_KEY_kvsync, BOUNCER_KEY_firewall)
from Vault secret/platform, mirroring the existing BOUNCER_KEY_traefik wiring.
These are the two halves of the real enforcement that replaces the dead Yaegi
plugin: kvsync authenticates the LAPI->Cloudflare-KV sync (proxied edge Worker),
firewall authenticates the cs-firewall-bouncer DaemonSet (direct-host nftables).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:12:38 +00:00
Viktor Barzin
a9384a4067 Merge remote-tracking branch 'origin/master'
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 08:09:16 +00:00
Viktor Barzin
44a98d408e k8s-version-upgrade: detector next-minor probe must follow 302 (curl -sfL)
The next-minor Packages query used `curl -sf` without -L. pkgs.k8s.io
302-redirects every request to a backing host, so without -L curl returned
an empty body, NEXT_MINOR_PATCH came back empty, and the detector fell
through to "No upgrade needed". That is exactly why last night's 23:00 chain
no-op'd instead of resolving the 1.35 next-minor target (1.35.6) and handing
it to the compat gate. `curl -sfL` follows the redirect and returns the
Packages file (verified: -sf -> empty, -sfL -> 1.35.6). Mirrors the same
-L fix already applied to the Release availability probe (-sILo) above.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:09:08 +00:00
Viktor Barzin
910d589205 fix(forgejo): raise git-op timeouts + lower gc.auto to stop push-mirror timeouts
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
The tripit Forgejo->GitHub push-mirror silently stalled: `git cat-file
--batch-all-objects` over the NFS-backed repo exceeded the default git deadline
once ~4500 loose objects accumulated (gc.auto's 6700 threshold hadn't fired), so
pushes stopped reaching GitHub and prod deploys stalled. Raise [git.timeout]
(DEFAULT/MIRROR/GC) so a slow object enumeration can't abort the mirror, and set
[git.config] gc.auto=1000 so post-push autogc + the git_gc_repos cron keep repos
packed (the real fix). A one-off forced gc already unblocked tripit; this prevents
recurrence across all repos.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:08:50 +00:00
Viktor Barzin
45bed1c133 Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-20 08:07:23 +00:00
Viktor Barzin
e1736d2e5c calico: hop 3.28.5->3.30.7 (operator v1.38.13); restores a SUPPORTED Calico/k8s-1.34 pairing. Disabled new-in-3.30 Goldmane/Whisker (their CRs render before crds/ install on helm upgrade; we use Prometheus/Loki). calico-node 7/7 on quay/v3.30.7, tigerastatus green. Applied manually + verified overnight.
2026-06-20 08:07:08 +00:00
Viktor Barzin
4d9fdbc7f7 rybbit: add CrowdSec LAPI -> Cloudflare KV sync script (proxied edge control plane)
Pure-stdlib script (alert_digest pattern, runs on stock python:3.12-alpine) that
projects CrowdSec Ip-scope ban/captcha decisions into the Workers KV namespace
the edge Worker reads on each proxied request. Full-reconcile per run so an
un-ban clears from the edge within one interval; fail-safe (a LAPI read error
skips the run and leaves existing bans to expire by TTL = fail-open, never a
stale all-block). TF wiring (KV namespace + CronJob + key registration) follows.
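
The full-reconcile and fail-safe behaviour above could look something like this
sketch. The function name and data shapes are assumptions for illustration; the
actual script's structure is not shown in this commit message.

```python
# Hypothetical sketch: plan one sync run as pure data, no network calls.
from typing import Dict, Optional, Set, Tuple


def plan_reconcile(
    lapi_decisions: Optional[Dict[str, str]], kv_keys: Set[str]
) -> Tuple[Dict[str, str], Set[str]]:
    """Return (puts, deletes) that make the KV namespace mirror LAPI.

    lapi_decisions maps IP -> remediation ("ban" or "captcha");
    None stands for a failed LAPI read.
    """
    if lapi_decisions is None:
        # Fail-safe: skip the run and touch nothing, so existing keys
        # simply expire by TTL instead of being replaced by stale data.
        return {}, set()
    puts = dict(lapi_decisions)              # rewrite every active decision
    deletes = kv_keys - set(lapi_decisions)  # un-banned IPs leave the edge
    return puts, deletes


puts, deletes = plan_reconcile(
    {"203.0.113.7": "ban"}, {"203.0.113.7", "198.51.100.9"}
)
print(puts, deletes)
```

Deriving deletes from the set difference is what lets an un-ban clear within one
interval: any key in KV that LAPI no longer reports is removed on the next run.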

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:05:11 +00:00
Viktor Barzin
0ac176da01 crowdsec: whitelist internal/LAN/tailnet CIDRs at the decision layer
Preparing for real CrowdSec enforcement (edge Cloudflare Worker for proxied
hosts + cs-firewall-bouncer for direct hosts). Both enforce by dropping the
real source IP, so if an internal/RFC1918 address ever ended up in a ban
decision it could blackhole legitimate internal traffic. Whitelisting the
cluster/LAN/tailnet ranges (10/8, 172.16/12, 192.168/16, 100.64/10) at the
CrowdSec parser layer makes that structurally impossible — a trusted source
can never produce a decision in the first place. Public IP already whitelisted.
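
As a quick way to sanity-check which addresses fall inside the whitelisted
ranges, the membership test can be sketched in Python. This is only an
illustration of the CIDR logic using the four ranges named above; it is not the
CrowdSec whitelist configuration itself, and `is_trusted` is a hypothetical name.

```python
import ipaddress

# The four ranges listed in this commit message, written out in full.
TRUSTED = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "100.64.0.0/10")
]


def is_trusted(ip: str) -> bool:
    """True if ip falls inside any whitelisted cluster/LAN/tailnet range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in TRUSTED)


for candidate in ("10.0.20.203", "100.100.1.1", "172.32.0.1", "8.8.8.8"):
    print(candidate, is_trusted(candidate))
```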

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:03:46 +00:00
Viktor Barzin
3e3fdb34f0 homelab: v0.6.0 — usage telemetry (usage top), evidence-driven verb prioritization
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Answers the question that drove the whole CLI — which verbs to add next — with
data instead of one maintainer's habits, and resolves the cross-user-usage ask
in-bounds (no reading anyone's home).

- emit on dispatch: every verb fire-and-forgets one Loki line {job,user,verb} +
  "exit=N ver=X". ONLY the verb path + exit code — never args, paths, flags, or
  secrets (the emit never sees arguments). Best-effort: 800ms timeout, errors
  swallowed, never affects the command; opt-out HOMELAB_TELEMETRY=0. Discovery
  verbs (manifest/version/help) and usage itself don't self-record.
- usage top [--since 30d] [--user U] [--json]: ranks verbs via
  sum by (verb)(count_over_time({job="homelab-usage"}[…])) against the shared
  Loki. Cross-user analytics WITHOUT touching ~/.claude — the privacy-preserving
  answer to "what does the team use".
- Loki sink (zero new infra, dogfoods v0.5 logs path); push verified HTTP 204 no
  auth. ADR docs/adr/0011.

Live-verified: ran 4 verbs, usage top ranked them correctly (metrics query=2).
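
For reference, the emit-on-dispatch behaviour can be sketched in Python (the CLI
itself is Go, so this is an illustration, not the shipped code). The JSON body
follows Loki's push API shape; the function names and the push URL parameter are
assumptions, while the labels, line format, 800ms timeout, and opt-out variable
are taken from the description above.

```python
import json
import os
import time
import urllib.request


def build_push_body(user, verb, exit_code, version, now_ns=None):
    """Build a Loki push body: labels on the stream, one (ts, line) value."""
    ts = str(now_ns if now_ns is not None else time.time_ns())
    return {
        "streams": [
            {
                "stream": {"job": "homelab-usage", "user": user, "verb": verb},
                # Only the exit code and version go in the line; no arguments.
                "values": [[ts, f"exit={exit_code} ver={version}"]],
            }
        ]
    }


def emit(push_url, user, verb, exit_code, version):
    """Best-effort send: short timeout, every error swallowed."""
    if os.environ.get("HOMELAB_TELEMETRY") == "0":
        return  # opt-out
    body = json.dumps(build_push_body(user, verb, exit_code, version)).encode()
    req = urllib.request.Request(
        push_url, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        urllib.request.urlopen(req, timeout=0.8).close()
    except Exception:
        pass  # telemetry must never affect the command


print(json.dumps(build_push_body("alice", "ci watch", 0, "0.6.0", now_ns=1)))
```

Keeping the payload builder separate from the sender makes the "never sees
arguments" property easy to check: the builder only accepts the verb path, exit
code, and version.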

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 22:29:01 +00:00
Viktor Barzin
666fefd22b calico: hop 3.26->3.28.5 (operator v1.34.13); calico-node 7/7 healthy, tigerastatus green, kube-controller-manager restarted (3.28 UID change). Applied manually + verified.
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-19 22:09:23 +00:00
Viktor Barzin
8ed5368be9 calico: bring tigera-operator under Terraform via Helm (adopt at 3.26.1)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Base for the stepped 3.26->3.28->3.30->3.32 upgrade (k8s 1.36 prereq; 3.26 is
already unsupported on k8s 1.34). Manage ONLY the operator via the official
tigera-operator Helm chart (chart ver == Calico ver); installation.enabled=false
keeps the live Installation CR operator-managed so Helm never touches calico-node.
Adopted in place: existing operator Deployment/SA/ClusterRole/ClusterRoleBinding
pre-stamped with Helm ownership metadata (transient migration step), then the
release imported via a plan-verified create (1 to add, 0 destroy). Verified clean:
calico-node 7/7 unchanged, tigerastatus green. Closes the year-deferred adoption
(code-3ad) for the operator without adopting the Installation CR.
2026-06-19 21:50:34 +00:00
Viktor Barzin
dd029ca7fb traefik/crowdsec: switch bouncer to live mode (stream cache doesn't enforce under Yaegi)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
After bumping to v1.6.0 (stream goroutine runs) and disabling redis (in-memory
cache), the plugin logs `handleStreamCache:updated` but still does NOT enforce:
a ban present in the LAPI stream AND pulled by the plugin still let the banned IP
through. Stream-mode decision matching is unreliable under Traefik's Yaegi
interpreter here. Switch crowdsecMode stream->live: the plugin queries LAPI
synchronously per request (result cached per-IP for defaultDecisionSeconds), which
enforces reliably and picks up new decisions immediately. LAPI is 3-replica +
in-cluster so per-request latency is small; fail-open preserved (updateMaxFailure=-1).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:49:26 +00:00
Viktor Barzin
0cc48d83ac traefik/crowdsec: disable bouncer redis cache (broken under Yaegi → in-memory)
With the plugin on v1.6.0 the stream goroutine finally runs, and its slog output
revealed the real blocker: `handleStreamTicker ... isCrowdsecStreamHealthy:true
cache:unreachable`. The LAPI stream is healthy, but the plugin's redis client
cannot reach the cache under Traefik's Yaegi interpreter — even though
redis-master.redis.svc is reachable AND writable from the traefik namespace
(SET/GET verified via busybox; no NetworkPolicies; no auth). Same
interpreter-incompat class as the stream goroutine itself. With
redisCacheUnreachableBlock=false the bouncer then failed open and enforced nothing.

Disable the redis cache so the plugin uses its in-memory decision store (works
under Yaegi). Removes redisCacheHost/redisCacheUnreachableBlock. Trade-off:
captcha already-solved grace is per-pod across the 3 Traefik replicas (at worst
an occasional re-solve) — acceptable; bans/captcha decisions enforce correctly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:49:26 +00:00
Viktor Barzin
531efb218d traefik: bump crowdsec-bouncer plugin v1.4.2 -> v1.6.0 (fix stream not pulling)
The crowdsec-bouncer Yaegi plugin pinned at v1.4.2 loads on Traefik 3.7.5 but
its decision-stream goroutine never runs — no Traefik pod ever calls the LAPI
stream (verified: no traefik-pod bouncer entry / no @pod-ip auto-registration),
and it logs nothing. All deps are healthy (LAPI 200 + full ban list reachable
from the traefik ns, key valid, redis PONG, config correct, no NetworkPolicies),
so CrowdSec enforced nothing despite the bouncer now being registered. This is
the Traefik-v3 / Yaegi plugin-incompat class that already killed rewrite-body
here. v1.4.2 predates Nov 2025; latest is v1.6.0.

Bump to v1.6.0 (initContainer download URL + state.json + experimental.plugins
version). Config-verified compatible: every key we use survives (crowdsecMode,
crowdsecLapiKey/Host, updateMaxFailure, redisCache*, clientTrustedIPs, all
captcha* incl. turnstile); v1.6.0 also moves logging to slog/trace for future
diagnosis. Pinned, not auto-updated (Keel can't manage a Yaegi plugin, and
plugin bumps must be tested against the running Traefik/Yaegi).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:49:26 +00:00
78095aa273 docs(forgejo): runbook reflects Authentik disabled + zero-click GitHub
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Authentik OAuth2 source is now disabled (login_source.is_active=0) and GitHub
auto-registration (zero-click sign-up) is on. Document why (global auto-reg +
Authentik's email-as-username 500; Forgejo/Authentik email mismatch blocks
account-linking) and how to re-enable Authentik later.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:37:46 +00:00
7d99203fc6 forgejo: re-enable ENABLE_AUTO_REGISTRATION for zero-click GitHub sign-up
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Per Viktor: GitHub sign-up must work zero-click (account created on first login,
no form). This global [oauth2_client] setting enables it. It conflicts with
Authentik (preferred_username is an email → invalid Forgejo username → 500 on
auto-create), and Viktor's Forgejo email (me@viktorbarzin.me) doesn't match his
Authentik email (vbarzin@gmail.com) so account-linking can't bridge it — so the
Authentik OAuth2 source is DISABLED (login_source.is_active=0; DB-managed,
out-of-band) per his directive. Forgejo sign-in is now GitHub + native login.

Committed via API to land on origin without pushing a concurrent agent's unpushed
local commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:34:17 +00:00
ef530b7d38 forgejo: drop ENABLE_AUTO_REGISTRATION — it broke Authentik sign-in
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ENABLE_AUTO_REGISTRATION is a global [oauth2_client] setting (all OAuth sources).
On Authentik sign-in, Forgejo auto-created an account and derived the username
from Authentik's preferred_username claim — which is the user's email
(vbarzin@gmail.com), invalid as a Forgejo username (no '@') → CreateUser failed
→ 500 on the OAuth callback. (GitHub's username claim is valid, so only Authentik
broke.) Reverting to the standard link/register flow fixes both; GitHub sign-up
still works via a one-step register form. Committed via API to touch only main.tf
(forgejo-only CI apply) so it doesn't collide with concurrent crowdsec work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:24:29 +00:00
Viktor Barzin
a5bb4db9c5 crowdsec: register the Traefik bouncer with LAPI (fix fail-open)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The Traefik bouncer plugin's API key was never registered with LAPI — the
crowdsec stack reads many keys from Vault but not ingress_crowdsec_api_key, and
the chart registers no bouncer. So LAPI returned 403 to the plugin, which with
updateMaxFailure=-1 failed open and enforced NOTHING: no community-blocklist
bans, and the (now-Turnstile-wired) captcha never fired. cscli bouncers list was
empty; the registration was likely lost in the MySQL->PostgreSQL DB migration
with no IaC to recreate it.

Seed the bouncer at LAPI startup via BOUNCER_KEY_traefik, valued from the same
Vault key the middleware presents — so they match by construction, and the
bouncer re-registers automatically on every LAPI start (survives DB wipes).

- stacks/crowdsec/main.tf: read ingress_crowdsec_api_key, pass to module.
- module main.tf: new sensitive var + thread into the values templatefile.
- values.yaml: BOUNCER_KEY_traefik on lapi.env.
- docs/architecture/security.md: document registration + fail-open history and
  the proxied-app coverage caveat.

Activates enforcement (community blocklist bans + captcha) on non-proxied apps;
internal IPs stay bypassed (clientTrustedIPs), fail-open-on-LAPI-down preserved.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:08:28 +00:00
Viktor Barzin
56dadda453 traefik: pin helm chart to 40.2.0 (deployed version)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The traefik helm_release had no chart version pin, so a refreshed helm repo
index resolves `chart = "traefik"` to the latest (41.0.0), whose values schema
rejects this stack's `logs` block ("Additional property logs is not allowed") —
an unpinned apply attempts that upgrade and fails (atomic rollback). Pin to the
deployed 40.2.0 (release rev 57, since 2026-05-30) so applies are deterministic;
chart bumps must be deliberate with a values migration.

Follow-up to fd0c7493 (Turnstile captcha), which was applied with this pin
already in live TF state — this lands the pin in git to remove the drift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:58:33 +00:00
Viktor Barzin
4a66377425 forgejo: add "Sign in with GitHub" (OAuth2 source + auto-registration)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted people to be able to sign up with GitHub, not just the
native form or Authentik SSO.

- Added a GitHub OAuth2 login source via `forgejo admin auth add-oauth
  --provider github` (name "github", matching the callback registered on
  the GitHub OAuth App). Like the existing Authentik source, it lives in
  Forgejo's DB rather than Terraform — there's no clean TF resource for
  login sources. Client id/secret mirrored to Vault secret/viktor
  (forgejo_github_oauth_client_id / _secret) for recovery.
- This commit's TF change: ENABLE_AUTO_REGISTRATION=true in
  [oauth2_client], so a first GitHub sign-in creates the account directly
  ("sign up with GitHub") instead of a link-to-existing detour. The
  GitHub identity is the trust gate for this path; Turnstile + email
  confirmation still gate the native form.

Verified: GitHub recognises the client id, Forgejo's /user/oauth2/github
redirects to GitHub's authorize URL with the correct client id +
callback, and the login page renders the button. Final browser
click-through is the user's to do.

Runbook updated: docs/runbooks/forgejo-open-signups.md (GitHub section +
secret-rotation + DB-loss recreate steps).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:41:49 +00:00
Viktor Barzin
fd0c7493c3 traefik/crowdsec: serve Cloudflare Turnstile for captcha remediation
CrowdSec LAPI already issues `captcha`-type decisions for lower-severity abuse
(http-429-abuse, http-403-abuse, http-crawl-non_statics, http-sensitive-files),
but the Traefik bouncer plugin had no captcha provider configured — so those
decisions silently fell through to a 403 ban (traced in the plugin's bouncer.go
@ v1.4.2: captchaClient.Valid==false => handleBanServeHTTP). Flagged users had
no way to self-unblock, contradicting the profile's stated intent.

Wire Cloudflare Turnstile as the bouncer's captcha provider so a captcha
decision now renders a solvable challenge instead of a hard block:

- New cloudflare_turnstile_widget.crowdsec_captcha (managed mode), scoped to
  viktorbarzin.me so one widget covers every subdomain the bouncer fronts.
  Mirrors the existing Forgejo-signup Turnstile pattern; sitekey + secret are
  passed into the traefik module.
- middleware.tf: captchaProvider=turnstile + site/secret keys + grace 1800s +
  captchaHTMLFilePath=/captcha/captcha.html.
- Vendor the plugin's captcha.html and mount it into the Traefik container at
  /captcha via the chart `volumes` value — the pulled Yaegi plugin does not
  expose its bundled template to Traefik.
- docs/architecture/security.md: document the ban-vs-captcha remediation split.
- Remove the dead crowdsec-ingress-bouncer.yaml (unused nginx bouncer with
  placeholder reCAPTCHA keys; referenced by zero .tf).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:38:38 +00:00
Viktor Barzin
963e4fcdde forgejo: open native self-signups, gated by Turnstile + email confirmation
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wants Forgejo open for anyone to sign up, but without bot/spam
account floods. Flip the deployment from OAuth-only registration
(ALLOW_ONLY_EXTERNAL_REGISTRATION=true) to allowing native local
sign-up, and add two bot gates on the registration form:

  - Cloudflare Turnstile captcha (CAPTCHA_TYPE=cfturnstile). The widget
    is managed in Terraform (turnstile.tf) via the CF Global API key, so
    the sitekey/secret are IaC, not a dashboard artifact.
  - Mandatory email confirmation (REGISTER_EMAIL_CONFIRM=true). Wire the
    Forgejo mailer to the cluster mailserver as noreply@viktorbarzin.me
    (mail.viktorbarzin.me:587 STARTTLS), reusing the same Vault-sourced
    credential Authentik uses (email-secret.tf ESO -> secret/authentik
    smtp_password).

Existing Authentik OAuth2 login is unchanged (additive). Deployment env
appended (not inserted) so the diff stays purely additive; a reloader
annotation rolls the pod on secret rotation.

Verified live: signup page renders the Turnstile widget, mailer delivers
a test message end-to-end, Forgejo healthy, plan-to-zero after apply.

Runbook: docs/runbooks/forgejo-open-signups.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:05:07 +00:00
Viktor Barzin
21dbd79ae4 Merge remote-tracking branch 'origin/master' into wizard/homelab-obs
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
2026-06-19 11:27:44 +00:00
Viktor Barzin
e91e1612dd homelab: v0.5.0 — net/dns/metrics/logs probes (endpoint resolution)
The remaining verbs that pass the "saves reasoning, not just typing" test the
user posed mid-session: each encodes the non-obvious which-endpoint-reached-how
resolution otherwise re-derived every time. (Same test deprioritized node-ssh
and secret-get aliasing — thin wrappers over commands already known.)

- net check <host> [path]: two-legged reachability — external (public DNS→CF)
  vs internal (Traefik LB) — so you see WHERE a break is, not just that one path
  works. (live: surfaced the LB at 6ms vs CF 77ms.)
- dns lookup <name> [type]: Technitium (10.0.20.201) vs public (1.1.1.1) diff.
- metrics query "<promql>" / metrics alerts: Prometheus via the LB
  (prometheus-query.viktorbarzin.lan); alerts uses the synthetic ALERTS series
  since the query frontend has no /api/v1/alerts and Alertmanager has no ingress.
- logs query "<logql>" [--since 1h] [--limit N]: Loki range query via the LB.

All reach auth-free internal ingresses through the LB (Go form of
curl --resolve host:443:10.0.20.203), so no port-forward and no kubectl.
In-cluster-only endpoints (Alertmanager v2) are deliberately out of scope. Verified live before
building; all five smoke-tested green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 11:27:31 +00:00
Viktor Barzin
6cb823e431 k8s-version-upgrade: complete autonomy P0 — blocked alert + deeper postflight + runbook
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Builds on the compat gate (prev commit) to finish "auto-upgrade when safe, halt +
alert when not":
- monitoring: K8sUpgradeBlocked alert (k8s_upgrade_blocked==1, for 10m, warning)
  in the Upgrade Gates group — the clean "a k8s auto-upgrade was refused, see
  Slack for why" signal. (Until monitoring is applied, a block still surfaces via
  the already-live K8sUpgradeChainJobFailed.)
- upgrade-step.sh phase_postflight: deeper post-upgrade smoke tests —
  apiserver /readyz + /livez, in-cluster DNS (resolve kubernetes.default), and
  core kube-system pods (apiserver/controller-manager/scheduler/etcd/coredns)
  Running. Any failure halts + alerts (exit 1; no rollback — kubeadm can't
  downgrade). Catches a "pods look Running but cluster is broken" upgrade.
- runbook: documents the compat gate, the blocked alert, how to clear a block,
  matrix maintenance, and the detector minor-probe fix.

After deploy, the nightly chain detects 1.35 (minor detection now works) and
correctly BLOCKS on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), alerting
via K8sUpgradeBlocked — the autonomy working as designed until the catch-up
clears those addons.
2026-06-19 11:27:17 +00:00
Viktor Barzin
cecd9fe247 k8s-version-upgrade: compat gate — auto-upgrade when safe, halt + alert when not
Make k8s upgrades (patch AND minor) autonomous without being reckless: the chain
attempts every upgrade but refuses unless it can prove the target is safe. A
refusal is a BLOCK (not a crash) — it halts the chain and signals for attention.

- compat-gate.py: read-only preflight check. Blocks if (a) a critical addon's
  running version doesn't support the target k8s minor, (b) an in-use deprecated
  API (apiserver_requested_deprecated_apis) is removed at/before the target, or
  (c) a node's containerd is below the target's floor. Validated against the live
  cluster: correctly blocks 1.35/1.36 today on Calico 3.26 / ESO 0.12 / kyverno
  1.16 (all behind), which is exactly the auto-halt we want until they're bumped.
- addon-compat.json: curated addon -> max-supported-k8s matrix (Calico, ESO,
  kyverno, gpu-operator + containerd floor), sourced from each project's compat
  docs (2026-06-19). The keystone data the gate reads; keep current.
- upgrade-step.sh: phase_preflight runs the gate FIRST (before any mutation);
  block() pushes k8s_upgrade_blocked=1 + Slacks the reasons + halts.
- main.tf: detector minor-probe fix (curl -sILo so the 302 from pkgs.k8s.io
  resolves to 200 — minors were never being detected). Gated behind the compat
  gate above, so enabling minor detection can't roll an unsafe minor.

Not pushed yet: deploys with the K8sUpgradeBlocked alert + deeper postflight +
runbook (next commit) so the detector fix only goes live with the full net.
2026-06-19 11:23:30 +00:00
Viktor Barzin
9189560ac3 homelab: v0.4.0 — ci/deploy verbs (watch what you trigger)
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Adds the verb-group that kills the single biggest reasoning sink in agent
sessions: watching a build/deploy to completion (proven in the session that built
it, which spent hours hand-rolling Woodpecker polling + DB-schema spelunking for
one CI incident).

- ci status/watch: Woodpecker REST API (version-stable, not its DB schema),
  reached via the internal Traefik LB (dial 10.0.20.203, SNI=ci.viktorbarzin.me
  so the cert verifies — the Go form of the house `curl --resolve` pattern),
  token from WOODPECKER_TOKEN/Vault, repo id resolved from the cwd remote, with
  retries that ride Woodpecker's intermittent empty responses. watch matches the
  HEAD/given commit (avoids the post-push race) and exits non-zero on failure.
- deploy wait: image-sha match THEN rollout status (rollout status alone returns
  success on the old ReplicaSet); kubectl-based.
- work land now auto-watches CI to green on the landed commit (--no-ci-watch to
  skip), closing the v0.1 gap.
- ci logs deferred to v0.4.1 (Woodpecker detail/log endpoints were the least
  reliable; status/watch use the working list endpoint).

Live-verified ci status/watch against the live API.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 10:59:14 +00:00
Viktor Barzin
787ce4edfa homelab: v0.3.1 — fix k8s db PG target (resolve CNPG primary pod, not the Service)
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
`k8s db <app>` (Postgres path) execed `pg-cluster-rw`, which is the CNPG
read-write SERVICE, not a pod — so kubectl exec failed with
`pods "pg-cluster-rw" not found`. The unit test only checked the plan; the verb
was never fired at live state (the gap flagged in v0.2), so it shipped broken.

Fix: the PG plan now carries a label selector (cnpg.io/instanceRole=primary)
instead of a pod name, and k8s db resolves the actual primary POD via
`kubectl get pod -l <selector>` before exec. MySQL path (real pod
mysql-standalone-0) unchanged. Live-verified both paths (psql + mysql).
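
The resolve-then-exec flow can be sketched like this in Python (the CLI is Go;
this only illustrates the kubectl calls). The helper names and the namespace
used in the example are hypothetical; the label selector is the one named above.

```python
import subprocess

PRIMARY_SELECTOR = "cnpg.io/instanceRole=primary"


def primary_pod_cmd(namespace, selector=PRIMARY_SELECTOR):
    """kubectl argv that prints the name of the first pod matching selector."""
    return [
        "kubectl", "get", "pod", "-n", namespace, "-l", selector,
        "-o", "jsonpath={.items[0].metadata.name}",
    ]


def exec_cmd(namespace, pod, argv):
    """kubectl argv that execs argv inside the resolved pod."""
    return ["kubectl", "exec", "-it", "-n", namespace, pod, "--", *argv]


def resolve_primary(namespace):
    """Run the lookup; fails if kubectl errors or no pod name comes back."""
    result = subprocess.run(
        primary_pod_cmd(namespace), check=True, capture_output=True, text=True
    )
    pod = result.stdout.strip()
    if not pod:
        raise RuntimeError("no primary pod matched " + PRIMARY_SELECTOR)
    return pod


# "dbaas" is a placeholder namespace for the example.
print(" ".join(primary_pod_cmd("dbaas")))
print(" ".join(exec_cmd("dbaas", "pg-cluster-1", ["psql"])))
```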

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 09:09:34 +00:00
Viktor Barzin
90c944a265 woodpecker: disable partial clone (partial: false) — fix intermittent git exit-128
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Infra pipelines were failing intermittently across all authors (e.g. #241-244,
#247) with the git clone step exiting 128:

  git fetch --depth=1 --filter=tree:0 ...   (partial/treeless clone)
  git reset --hard <sha>
  fatal: could not fetch <tree-sha> from promisor remote
  remote: 404 page not found

The plugin-git clone defaulted to a partial (treeless) clone. The initial ref
fetch carries credentials, but the lazy *promisor* object fetch triggered by
`git reset --hard` hits the PRIVATE Forgejo repo without creds -> 404 -> exit
128. Whether it fired was luck-of-the-draw, hence the ~50% intermittent failures
fleet-wide (not specific to any commit).

Fix: set `partial: false` on every clone block so all objects for the (still
shallow) commit are fetched upfront with creds — no fragile lazy promisor fetch.

Diagnosed against the woodpecker Postgres DB (steps/log_entries) since the
Woodpecker HTTP API was itself flapping. Earlier "permission for ViktorBarzin"
log lines were an unrelated cross-forge red herring.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 09:06:44 +00:00
Viktor Barzin
fd77c0dc4f monitoring: RpiSofiaUndervoltage alerts on new brown-out, not until reboot
Some checks failed
ci/woodpecker/push/default Pipeline failed
The rpi-sofia under-voltage alert keyed off the sticky firmware bit
(rpi_under_voltage_occurred == 1), which latches on the first brown-out and
stays 1 until the Pi reboots. With alert-on-change routing it re-paged on every
boot cycle and sat firing for ~211h of the last 14d — Viktor reported "getting a
few of these lately" — and it disagreed with the HA-sofia dashboard, which shows
the live state and reads OK once voltage recovers.

Can't just switch to the live bit: rpi_under_voltage_now never registered once in
14d (brown-outs are sub-second and fall between the 1-min textfile-collector
samples), so the sticky bit is the only reliable detector.

Fix: edge-trigger on a NEW latch via increase(rpi_under_voltage_occurred[1h]) > 0.
Fires once per brown-out and auto-resolves ~1h later (~2h active over the same
14d instead of ~211h); counter-reset handling makes a clean reboot a no-op. Both
real brown-out events in the window are still caught. Docs updated in the same
commit (monitoring.md).
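
To see why the edge trigger behaves differently from the sticky bit, here is a
simplified model of increase() in Python. It is a sketch under stated
assumptions: it sums rises between consecutive samples and, on a drop, counts
the post-reset value the way Prometheus's counter-reset handling does, but it
omits the extrapolation to window boundaries that real increase() performs.

```python
def simple_increase(samples):
    """Simplified increase() over one window of samples (no extrapolation)."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev  # normal rise (or flat)
        else:
            total += cur         # counter reset: count from zero again
    return total


def fires(samples):
    """Alert condition: increase(...) > 0 over the window."""
    return simple_increase(samples) > 0


print(fires([0, 0, 1, 1]))  # new latch inside the window
print(fires([1, 1, 1, 1]))  # bit still latched from an older brown-out
print(fires([1, 1, 0, 0]))  # clean reboot clears the bit
```

In this model a window where the bit is already 1 throughout contributes
nothing, which is what stops the alert from firing until reboot.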

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 08:45:39 +00:00
Viktor Barzin
fbf6f11038 feat(tripit): #96 cutover — /api self-authenticates (remove forward-auth, add strip-auth-headers)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ADR-0028 #96 (website half): /api drops Authentik forward-auth so the browser can carry a TripIt session cookie (the outpost 302'd cookie-only requests). The app self-authenticates (TripIt-session-first in get_current_user); no session -> 401 -> SPA landing. strip-auth-headers is REQUIRED now: with forward-auth gone, the hybrid forward-auth arm would otherwise trust a client-injected X-authentik-email — stripping inbound X-authentik-* closes that. /metrics split into its own still-gated ingress. Shell keeps Authentik bearers on tripit-api.* until #94; full AUTH_MODE collapse follows then. Verified live: no-session->401, valid TripIt cookie->200, injected header->401, Shell->200.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 08:27:39 +00:00
Viktor Barzin
8559c4574a fix(tripit): pin Authentik invalidation_flow literal (data source flakes null in CI under provider skew)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Pipeline 244 failed: data.authentik_flow.default_provider_invalidation resolved null in CI (goauthentik 2024.x provider vs 2026.2 server), silently blocking every tripit-stack apply incl. the ADR-0028 #90 signing-key + redirect-URI delivery. Pin the literal UUID (what the slug resolves to) — matches the data-source-skew workaround used for the Vault binding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 08:10:25 +00:00
Viktor Barzin
e5bb16e02a feat(tripit): activate TripIt-native session auth — signing key + Authentik web redirect (ADR-0028 #90)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Adds SESSION_SIGNING_KEY (Vault secret/tripit -> tripit-secrets ExternalSecret -> env_from) so TripIt's own session JWTs are signed with a real key (the app fails closed under the dev default until this lands), and adds the website OIDC redirect URI https://tripit.viktorbarzin.me/api/auth/callback/authentik to the public tripit-app provider so 'Log in with Authentik' works. Reuses the Shell's existing public OAuth2 app.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 06:06:43 +00:00
Viktor Barzin
077ac97df5 k8s-version-upgrade: auto-restore apiserver OIDC after control-plane bumps
Some checks failed
ci/woodpecker/push/default Pipeline failed
kubeadm upgrade apply regenerates the apiserver static-pod manifest and drops
the --authentication-config flag, silently breaking SSO (kubectl/kubelogin + the
k8s dashboard) until someone manually re-applies the rbac stack. That manual step
was needed after every control-plane upgrade and was the one thing keeping
autonomous patch upgrades from being truly hands-off (it bit us this cycle: an
earlier master bump left SSO broken until we noticed).

Automate it: the rbac stack now publishes its existing OIDC restore script (the
same one its null_resource runs) to a kube-system/apiserver-oidc-restore
ConfigMap, and the upgrade chain's phase_master re-runs it on master right after
the kubeadm upgrade — while tigera-operator is still quiesced so the flag-add
apiserver restart can't crashloop it. The script is idempotent and health-gates
/livez with auto-rollback; the step is non-fatal (a failure only lags SSO until
the next rbac apply, it won't abort the upgrade). phase_master already self-skips
when master is at target, so this only fires when master was actually upgraded.

The chain SA gets a name-scoped get on that one ConfigMap. Runbook updated: the
manual restore is now a documented fallback (command corrected — it needs
-replace, since the null_resource trigger hash never changes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 06:04:30 +00:00
Viktor Barzin
48b63ffa6f homelab: add memory verb-group (v0.3.0) — direct claude-memory HTTP client
Some checks failed
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline failed
Lets agents search/navigate memory via the CLI, as the first step toward
deprecating the memory MCP. claude-memory is a FastAPI service (the MCP is just
one frontend); homelab memory is a thin Bearer-auth HTTP client over the same
API, using the env the hooks already set (CLAUDE_MEMORY_API_URL/KEY). It works
even when the MCP frontend is down — the recurring disconnect that took the MCP
offline for this whole session.

Verbs: recall (server-side semantic search), list, categories, tags, stats,
secret (read); store, update, delete (write). Validated against the live API
including a store→recall→delete round-trip — full data-plane parity with the MCP.

The deprecation itself (rewiring the per-prompt auto-recall + auto-learn hooks to
the CLI, then uninstalling the MCP) is a deliberate follow-up, sequenced after
the CLI is proven in the hooks — see docs/adr/0008.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 05:56:25 +00:00
Viktor Barzin
3594485f77 homelab: v0.2.0 — docs + version for the k8s verb-group
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Bump cli/VERSION to v0.2.0; document the k8s verbs (README table + resolver
note), add docs/adr/0007 (resolver, read/write split, config-mutation stays
raw, db dbaas pattern), and extend the AGENTS.md discovery pointer with the
Kubernetes surface.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 22:30:41 +00:00
Viktor Barzin
1f7438bb18 homelab: add k8s verb-group (v0.2) — the biggest remaining surface
Mining the post-v0.1 corpus showed kubectl is the dominant remaining domain by
far: 11,291 commands across 243 sessions (more than everything else combined).
This adds the full k8s verb-group built on an app→namespace→pod resolver (most
namespaces hold one app, so <app> defaults to the namespace and the target
defaults to deploy/<app>, letting kubectl resolve the pod; -n/--pod/-c/-l/--tty
override).
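
The defaulting rule can be sketched as below. This is Python for illustration
only (the CLI itself is Go); Target, resolve, and the "pod/<name>" form used
for the --pod override are assumptions, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Target:
    namespace: str
    resource: str  # what kubectl is pointed at


def resolve(app: str, namespace: Optional[str] = None,
            pod: Optional[str] = None) -> Target:
    """Default the namespace to the app name and the target to deploy/<app>."""
    ns = namespace or app                       # -n overrides the default
    resource = f"pod/{pod}" if pod else f"deploy/{app}"  # --pod overrides
    return Target(ns, resource)
```

So a bare `<app>` resolves to namespace `<app>` and target `deploy/<app>`, and
kubectl is left to pick the pod behind the deployment.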

Read: status (pods + non-Normal events), get, logs, describe, debug (one-shot
triage), pf, rollout-status. Write/operational: db (the dbaas psql/mysql exec
pattern — PG via pg-cluster-rw -c postgres, MySQL via mysql-standalone-0 with the
env-password bash wrapper, never inline), exec, rm-pod (pods/jobs ONLY), restart.
Config-mutation verbs (apply/edit/patch/scale/create) are deliberately NOT
exposed — they stay raw per the Terraform-only policy.

Smoke-verified read verbs against the live cluster (get/logs/rollout-status);
write verbs are unit-tested (resolver, db-plan, shell-quoting) but not fired at
live state.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 22:29:51 +00:00
Viktor Barzin
66caa0bf7f homelab: v0.1 docs, distribution wiring, and version
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Completes v0.1: documentation, build/install path, and version stamping.

- cli/VERSION (v0.1.0) stamped into the binary via ldflags.
- cli/README.md rewritten as the homelab overview (verbs + tiers, manifest,
  build, the preserved legacy webhook use-cases).
- docs/adr/0004-0006: why homelab exists (grown in place from infra/cli, not a
  separate repo), v0.1 scope + everything-allowed/tiers-recorded, and the
  work/tf behaviour (native worktree entry, verification-gated auto-land,
  presence-coupled apply).
- setup-devvm.sh builds cli/ -> /usr/local/bin/homelab each provisioning run
  (t3-dispatch pattern), so every devvm user gets the current binary.
- AGENTS.md: discovery pointer under Common Operations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 19:25:51 +00:00
Viktor Barzin
087b415f73 homelab: add work verbs (start/land/clean) with a land verification gate
Completes the infra-loop verb surface. work start creates .worktrees/<topic>
on <user>/<topic> off <remote>/master (git-crypt-aware, ensures .worktrees is
ignored) and prints the path for native EnterWorktree entry. work land fetches,
merges master in, verifies, pushes HEAD:master with non-fast-forward retry, and
falls back to pushing the feature branch for a PR when the direct push is
rejected (branch protection). work clean removes the worktree + branch.

Safety: work land REFUSES to push when it cannot verify (no --verify-cmd and no
auto-detected suite) unless --no-verify is passed. This was added after an
accidental smoke-test invocation pushed unverified WIP to master (benign — the
infra CI applied 0 stacks since the diff was cli/-only — but the gate makes an
unverified land a deliberate choice, not the default).
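
A sketch of the gate's decision, in Python for illustration (the CLI is Go).
land_decision and its return values are hypothetical, and the precedence of
--verify-cmd over an auto-detected suite is an assumption.

```python
from typing import Optional, Tuple


def land_decision(verify_cmd: Optional[str], detected_suite: Optional[str],
                  no_verify: bool) -> Tuple[str, Optional[str]]:
    """Decide what `work land` does before pushing."""
    command = verify_cmd or detected_suite  # assumed precedence
    if command is not None:
        return ("verify", command)          # run it; push only if it passes
    if no_verify:
        return ("push-unverified", None)    # explicit, deliberate opt-out
    return ("refuse", None)                 # nothing to verify with: do not push
```

The point of the ordering is that the unverified path is only reachable by
passing the flag, never by omission.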

Known v0.1 limitation: land does not yet wait for CI to go green; that arrives
with the ci/deploy watch verbs. Until then it prints a reminder to follow the
pipeline manually.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 19:24:08 +00:00
Viktor Barzin
36d562c15c homelab: add tf verbs + stack/git-crypt substrate
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Adds the tf verb-group and the resolver substrate beneath it, continuing the
v0.1 infra-loop build.

- substrate: findInfraRoot (walk up to terragrunt.hcl + stacks/), stack→dir
  resolver, and repo/remote/git-crypt detection (preferRemote forgejo>origin,
  hasGitCryptAttr, gitCryptFlags) — the last is for `work` next.
- tf plan/validate/fmt/force-unlock/apply, resolving the stack from cwd and
  delegating to scripts/tg (which owns state decrypt/encrypt, the Vault lock,
  and the ingress auth-comment check) rather than calling terragrunt directly.
- tf apply is presence-coupled: claims stack:<name>, ALWAYS releases on exit
  (normal, error, or SIGINT/SIGTERM via sync.Once + signal handler) — fixing
  the documented ~200-claim leak — and prints an out-of-band reminder since CI
  applies canonically on push.
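
The always-release behaviour can be sketched as follows. The CLI does this in
Go with sync.Once plus a signal handler; this Python analogue uses a lock and a
flag instead, and Claim, claimed, claim_fn and release_fn are hypothetical
names. A SIGTERM handler would also have to call release(); that part is
omitted here.

```python
import contextlib
import threading


class Claim:
    """Holds a claim label and releases it at most once."""

    def __init__(self, label, release_fn):
        self.label = label
        self._release_fn = release_fn
        self._lock = threading.Lock()
        self._released = False

    def release(self):
        with self._lock:
            if self._released:
                return              # a second caller is a no-op
            self._released = True
        self._release_fn(self.label)


@contextlib.contextmanager
def claimed(label, claim_fn, release_fn):
    claim_fn(label)
    claim = Claim(label, release_fn)
    try:
        yield claim
    finally:
        claim.release()             # runs on normal exit and on exceptions
```

Because release is once-only, an explicit release and the exit path can both
call it without double-releasing the claim.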

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 19:16:33 +00:00
Viktor Barzin
ed6f22fd53 homelab: scaffold unified CLI (registry, manifest, claim/release) in infra/cli
Begin evolving the existing infra/cli into the agent-facing "homelab" CLI
decided in the design/grilling session: one composable, JSON-capable surface
for the operations agents run over and over (mined from 51k commands across
2,225 past sessions; the infra inner-loop is ~29% of them). v0.1 targets that
loop — work/tf/claim — and ships here, in place, in infra/cli.

This first slice:
- command registry + dispatcher (longest-prefix verb matching) and a
  `manifest`/`manifest --json` progressive-discovery entrypoint; every verb
  declares a read|write tier so write-gating can be added later (everything is
  allowed for now).
- claim/release verbs wrapping the existing presence script (not reimplemented),
  with label-taxonomy validation.
- main() front-dispatches the homelab verb surface but falls through to the
  legacy webhook -use-case path verbatim, so the in-cluster infra-cli image is
  unaffected.
- fix a pre-existing vet error (glog.Infof missing format directive) that
  blocked `go test`.
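
Longest-prefix verb matching, sketched in Python for illustration (the
dispatcher itself is Go). The registry entries and their tiers below are
made-up examples, and dispatch is a hypothetical name.

```python
def dispatch(registry, argv):
    """Return (verb, remaining_args) for the longest registered verb prefix.

    Falls back to (None, argv) when no prefix of argv is a registered verb.
    """
    for length in range(len(argv), 0, -1):   # try the longest prefix first
        verb = tuple(argv[:length])
        if verb in registry:
            return verb, argv[length:]
    return None, argv


# Hypothetical registry: verb words -> declared tier.
REGISTRY = {
    ("manifest",): "read",
    ("tf",): "read",
    ("tf", "apply"): "write",
}
```

Trying the longest prefix first lets a two-word verb such as `tf apply` win
over a one-word `tf` entry when both are registered.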

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 19:12:57 +00:00
Viktor Barzin
70e217db24 k8s-version-upgrade: preflight skips kubeadm-plan gate when master already at target
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The autonomous 1.34.9 version-upgrade chain has been failing its preflight every
night. A prior run left k8s-master + k8s-node1 on 1.34.9 while node2-6 stayed on
1.34.8, and preflight's gate-4 runs `kubeadm upgrade plan` on master. On an
already-at-target master, kubeadm prints no "kubeadm upgrade apply vX.Y.Z" line,
so the parsed target came back empty and the `!= requested` check aborted the
whole chain before any worker was touched. Deterministic — it self-cleaned and
re-failed identically each night, so it would have failed again tonight, leaving
node2-6 stuck on the old patch.

Fix: skip the kubeadm-plan-target gate when master is already on TARGET_VERSION
— the same at-target self-skip that phase_master and phase_worker already do.
The remaining workers are still validated by their own per-node phases, and the
detector already confirmed the target is installable via apt-cache. This lets
tonight's unattended chain resume and finish node2-6 -> 1.34.9.
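
A sketch of the gate logic, in Python rather than the chain's shell. The regex
and the names parse_plan_target and plan_gate are illustrative assumptions;
only the "kubeadm upgrade apply vX.Y.Z" line and the at-target skip come from
the change itself.

```python
import re

PLAN_LINE = re.compile(r"kubeadm upgrade apply (v\d+\.\d+\.\d+)")


def parse_plan_target(plan_output):
    """Extract the target from `kubeadm upgrade plan` output, or '' if absent."""
    match = PLAN_LINE.search(plan_output)
    return match.group(1) if match else ""


def plan_gate(master_version, target_version, plan_output):
    if master_version == target_version:
        return "skip"    # an at-target master prints no apply line, so don't parse
    if parse_plan_target(plan_output) != target_version:
        return "abort"   # plan disagrees with the requested target
    return "pass"
```

Before the fix, the at-target case fell through to the comparison with an
empty parsed target and aborted.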

Runbook updated: node count 5 -> 7, the gate skip note, and a Past Incidents
writeup (incl. the collateral apiserver OIDC wipe, restored via the rbac stack).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 09:17:46 +00:00
Viktor Barzin
8787d361dc claude-memory: HA (replicas 2 + PDB) to stop recurring MCP disconnects
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The claude-memory MCP backend ran as a single replica with no PDB, so every
voluntary disruption took it to zero for ~30-90s — which surfaced as the
memory MCP "keeps getting disconnected" problem. Disruption sources hitting
the lone pod: the descheduler (every-5-min CronJob, LowNodeUtilization —
caught evicting it live), Keel image bumps, Reloader restarts on the 7-day
DB-password rotation, node drains, and CI deploys.

The local stdio MCP subprocess itself was proven healthy (fast non-blocking
startup, stderr suppressed, graceful degradation), so the fault was purely
backend availability, not the MCP plumbing.

Fix: run 2 replicas (the backend is stateless FastAPI over shared CNPG
Postgres and already has hostname anti-affinity) + restore the PDB at
minAvailable=1 (safe now — the drain deadlock that justified removing it
only existed at 1 replica) + descheduler evict=false to stop the needless
5-min churn. All five disruption sources become zero-downtime rolling events.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 09:13:36 +00:00
Viktor Barzin
48b7be3b14 feat(tripit): live lodging-price scrape — LODGING_PROVIDER=playwright
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to turn lodging prices on and stop using the fake provider.
Mirrors the existing FARE_PROVIDER wiring: point the Booking.com/Airbnb lodging
scraper at the shared chrome-service browser over CDP (the namespace is already
admitted through chrome-service's NetworkPolicy for the fare scrape). The lodging
code (ADR-0025, tripit #78) is live in tripit 03973b5, so the env lands after
that rollout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 06:53:19 +00:00
Viktor Barzin
d709d338c6 service-catalog: add paperless-ai (RAG semantic search + auto-tagging)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Document the new paperless-ai service and the two non-obvious operational
facts: runtime config lives in the PVC .env (not TF env, which would shadow
it), and Qwen3 needs /no_think for parseable tagging output.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 06:44:00 +00:00
Viktor Barzin
4977153dfb paperless-ai: make the PVC .env the single source of config truth
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Auto-tagging silently no-op'd: the container env vars set in the deployment
shadowed the app's own /app/data/.env, because paperless-ai's dotenv loader
does not override process.env. A stale PROCESS_PREDEFINED_DOCUMENTS=yes (with
no TAGS) made the scan select zero documents.
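
The shadowing can be reproduced with a small model of a non-overriding dotenv
loader. This is a Python sketch, not paperless-ai's code: the function name
load_dotenv_no_override is hypothetical and the .env values in the usage note
are assumptions.

```python
def load_dotenv_no_override(dotenv_text, environ):
    """Merge KEY=VALUE lines into a copy of environ without overriding it."""
    merged = dict(environ)
    for line in dotenv_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue                      # skip blanks, comments, malformed lines
        key, _, value = line.partition("=")
        merged.setdefault(key.strip(), value.strip())  # existing env wins
    return merged
```

With PROCESS_PREDEFINED_DOCUMENTS=yes already in the container env, whatever
the .env on the PVC says for that key is ignored; keys absent from the env
(such as SYSTEM_PROMPT) still load from the file.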

Strip the wizard-owned behavioural config (Paperless URL, AI provider, model,
scan interval, tagging flags) from the container env, keeping only
infrastructural env (PUID/PGID/port/RAG/HF cache) and the Vault-sourced
secret refs. The app's setup-written .env on the PVC is now authoritative,
so processing runs and tags all documents. Qwen3 thinking is disabled via
SYSTEM_PROMPT=/no_think in that .env to keep the model's JSON output parseable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 06:41:29 +00:00
Viktor Barzin
aeee0d02e2 paperless-ai: deploy clusterzx/paperless-ai for semantic doc search + AI tagging
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor wanted real semantic search over his ~300 Paperless documents and
preferred a ready-made solution over building one. paperless-ai provides
local-embedding RAG (ChromaDB + sentence-transformers, GPU-free) plus
LLM-driven auto-analysis/tagging.

Wiring:
- LLM (chat answers + tagging) -> in-cluster llama-swap qwen3-8b
  (OpenAI-compatible); embeddings + vector store are local on the PVC.
- Reads Paperless over the internal service via a dedicated `paperless-ai`
  superuser token (Vault secret/paperless-ai); app-admin creds also in Vault.
- Encrypted PVC for /app/data (SQLite + ChromaDB + model cache).
- Ingress paperless-ai.viktorbarzin.me behind Authentik (auth=required).
- Third-party image pinned (docker.io/clusterzx/paperless-ai:3.0.9), no Keel.

Runtime config persists to the PVC .env via the app's one-time setup; the
deployment env vars are pre-fill/documentation only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 06:23:00 +00:00
Viktor Barzin
605cf99a1b portal-tts: docker.io/ prefix on edge-tts image (Kyverno trusted-registries)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The edge-tts apply was blocked by the require-trusted-registries Kyverno policy —
a bare `travisvn/openai-edge-tts` isn't in the allowlist. The policy blanket-
trusts `docker.io/*`, so prefixing the image with `docker.io/` passes admission
with no policy change. Verified live: bg synth round-trips through Whisper
verbatim and a full gateway /v1/talk bg turn returns a coherent spoken Bulgarian
reply ("Добър ден! Добре съм, благодаря!...").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 21:24:34 +00:00
Viktor Barzin
ab55cb5dcd portal-stt: drop setup_tls_secret module (ClusterIP-only, no fullchain.pem)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The landed portal-stt source still declared the setup_tls_secret module +
tls_secret_name variable, which file()-reads secrets/fullchain.pem — a file this
stack does not ship. portal-stt is ClusterIP-only (no ingress; the Gateway is the
sole externally-exposed component, ADR-0001), so it needs no TLS secret. The live
deployment never had it (removed during the original apply); this aligns the
source with reality so CI applies cleanly. Fixes the pipeline-229 portal-stt
apply failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 20:29:31 +00:00
Viktor Barzin
e7b9a74756 portal-assistant: land voice stacks + switch TTS to edge-tts (intelligible Bulgarian)
Some checks failed
ci/woodpecker/push/default Pipeline failed
The portal-assistant voice-assistant stacks (portal-tts, portal-stt,
portal-assistant) were applied to the live cluster from feature branches but
never landed on master — the GitOps source of truth. This lands all three and,
in portal-tts, fixes Bulgarian speech.

Bulgarian was unintelligible: the local Piper voice (bg_BG-dimitar-medium via
espeak-ng) mangles Bulgarian consonants — a synth->Whisper round-trip turned
"Добър ден" into "Обърден", and a user heard pure gibberish. English was fine.

portal-tts now runs openai-edge-tts (Microsoft edge-tts neural voices) for BOTH
languages instead of Piper — ADR-0003 always named edge-tts as the online
Bulgarian-quality fallback. Validated before landing: edge bg round-trips
through Whisper verbatim ("Добър ден! Как сте днес? ..."). The gateway maps
detected language bg/en to the edge voice names via new TTS_VOICE_BG /
TTS_VOICE_EN env (bg-BG-KalinaNeural / en-US-AvaNeural). No GPU, no NFS model
store, no secrets — edge fetches voices from Microsoft per request (egress
verified). The assistant already needs the internet for the Claude brain, so an
online TTS adds no new failure mode.

The brain stays Sonnet with no extended thinking (already the default — a live
turn answers directly in ~3.4s), per the latency-over-smartness ask.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 20:25:29 +00:00
Viktor Barzin
677a181d49 reverse-proxy: dedicated rate limit for ha-london; bump ha-sofia (cold-client 429s)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
New, empty-cache clients (the repurposed Meta Portal running the HA companion
app) cold-load the whole HA frontend at once - dozens of frontend_latest/*.js +
MDI icon chunks. ha-london had no per-service rate limit, so it fell back to the
global 10/s burst 50 and 429'd those chunks, leaving every dashboard blank
(Settings, which loads less, worked). Give ha-london its own 200/500 middleware
(skip_global_rate_limit, mirroring ha-sofia, with depends_on to avoid the
dangling-middleware 404 window) and bump ha-sofia 100/200 -> 200/500 so a cold
Portal load of Sofia doesn't hit the same wall.
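
A simplified token-bucket model shows why the cold load tripped the global
limit. This is a Python sketch and not the proxy's actual limiter;
TokenBucket and count_429s are hypothetical names, and the 120 simultaneous
requests in the usage note are an assumed figure for "dozens of chunks".

```python
class TokenBucket:
    """Refills at `average` tokens per second up to `burst` tokens."""

    def __init__(self, average, burst):
        self.rate = float(average)
        self.capacity = float(burst)
        self.tokens = float(burst)   # the bucket starts full
        self.last = 0.0

    def allow(self, now):
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                 # would be answered with a 429


def count_429s(arrival_times, average, burst):
    bucket = TokenBucket(average, burst)
    return sum(1 for t in arrival_times if not bucket.allow(t))
```

In this model, 120 requests arriving at the same instant against 10/s burst 50
leave 70 rejected, while 200/500 admits all of them.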

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 19:53:47 +00:00
Viktor Barzin
9565ff1ce5 state(infra): update encrypted state
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-17 19:50:30 +00:00
Viktor Barzin
6518e54154 create-template-vm: add k8s-upgrade pipeline SSH key to node cloud-init
Some checks failed
ci/woodpecker/push/default Pipeline failed
New k8s nodes were only getting the personal `wizard` key in authorized_keys —
not the automated k8s-version-upgrade pipeline's key (Vault
secret/k8s-upgrade/ssh_key_pub). So a freshly provisioned node is invisible to
the upgrade chain (it SSHes in as `wizard` to drain+upgrade): node4/5/6 all hit
"Permission denied (publickey)" on 2026-06-17 and had to have the key pushed by
hand. Bake the public key into the cloud-init template so every new node gets it
on first boot.

(unattended-upgrades is already in this template — node4/node5 missed it only
because the LIVE PVE cloud-init snippet lagged this source: it deploys via a
Tier-0 `stacks/infra` apply that hadn't run since before their 2026-05-26
provision. Same lesson applies to THIS change — it reaches new nodes only after
`stacks/infra` is applied to refresh the snippet on the PVE host.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 18:59:59 +00:00
Viktor Barzin
aac7121ccc t3-afk: scale to 0 — park the in-cluster T3 AFK executor (no current plans)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor has no near-term plans to use the autonomous AFK pipeline's in-cluster T3
cockpit/executor, so stop its pod to free node resources while keeping it
trivially revivable. Only the deployment replica count changes (1 -> 0); the SSD
PVC (state.sqlite + repo checkouts), Service, Ingress, and ExternalSecret are all
left in place — reviving is just setting replicas back to 1 and applying.

Already applied live via scripts/tg (PG state now 0 replicas, pod terminated);
this commit syncs git so drift-detection / the next apply won't re-scale it up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 18:55:35 +00:00
Viktor Barzin
b931d9fb20 k8s-version-upgrade: make tigera-operator restore crash-safe (EXIT trap)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
phase_master quiesces tigera-operator (Calico's config reconciler) to 0 replicas
around the master upgrade so it can't crashloop during the apiserver blip and
I/O-storm kubeadm's static-pod-hash watch (which would roll the upgrade back). The restore
was a plain line at the end of the phase, so any abort AFTER quiescing left the
operator at 0 — and the idempotent retry then skipped the already-on-target
master phase and never restored it. Observed 2026-06-17: a post-upgrade gate
aborted the master attempt; the operator sat scaled to 0 for ~1.5h (data plane
fine — calico-node keeps running — but no Calico reconciliation).

Fix:
  - Drain first (drain doesn't blip the apiserver), THEN quiesce right before
    `kubeadm upgrade apply`, and install an EXIT trap that restores the operator
    no matter how the phase exits (gate abort, set -e on ssh/kubeadm, success).
    Trap is set AFTER drain_node so its own EXIT trap can't clobber it; cleared
    after the explicit happy-path restore.
  - postflight also force-restores replicas=1 as a final guarantee (covers the
    skip-on-retry path that never quiesces or restores).

Long-term fix remains HA control plane (apiserver never goes down) — bead code-n0ow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 18:25:54 +00:00
Viktor Barzin
c04efa3d3a k8s-version-upgrade: move detection to nightly 23:00 UTC (overnight upgrades)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Disruptive node drains should run when the cluster is idle. Move the
k8s-version-check detection CronJob from 12:00 UTC (noon) to 23:00 UTC
(00:00 London) — overnight, low usage, and clear of the kured OS-reboot window
(01:00-05:00 UTC) so the two drain pipelines never overlap. (Viktor, 2026-06-17.)

  - stacks/k8s-version-upgrade/main.tf: var.schedule default 0 12 * * * → 0 23 * * *.
  - scripts/upgrade_state.sh: next_scheduled_run_utc now computes the 23:00 slot
    (was next_daily_noon_utc).
  - docs (runbook, architecture) + upgrade-state SKILL: schedule references
    updated to 23:00 UTC nightly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 18:16:32 +00:00
Viktor Barzin
ed53b34bf4 k8s-version-upgrade: dynamic worker enumeration + IP-based SSH (auto-cover all/new nodes)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The chain hardcoded master→node4→node3→node2→node1→postflight and SSHed by
FQDN. It silently SKIPPED node5/node6 (added 2026-05-26), so postflight would
have failed even if they had been reachable. They were not: node5/node6 had no
.viktorbarzin.lan DNS records, so the chain couldn't SSH to them at all.

Refactor (upgrade-step.sh):
  - Worker set + order derived live from `kubectl get nodes` (worker_nodes /
    next_pending_worker), so EVERY worker still off-target is upgraded and a
    newly-joined node is covered with zero script change.
  - SSH targets are node InternalIPs (ssh_target), removing the dependency on
    node DNS records entirely — a new node is reachable the moment it joins.
  - The two remaining hardcoded loops (containerd skew, apt-repo rewrite) now
    enumerate workers/all-nodes dynamically too.
  - Topology preserved: master-drain Job runs on the first worker; every
    worker-drain Job runs on the already-upgraded k8s-master (self-preemption
    invariant intact).
  - next_pending_worker returns 0 explicitly on the no-match path — the
    `while read … done < <(…)` loop exits 1 at EOF, which under set -e would
    abort the LAST worker's Job before it spawns postflight (cluster upgraded
    but no cleanup / in_flight reset). Caught in review.
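
The dynamic selection can be sketched as below, in Python rather than the
script's shell. The node dict shape (name/role/version/internal_ip) is an
assumption for the example; only the function names mirror upgrade-step.sh.

```python
def next_pending_worker(nodes, target_version):
    """Return the first worker still off-target, or None when all are done.

    The explicit None mirrors the script's explicit `return 0` on the
    no-match path: "nothing left to upgrade" is a normal result, not an error.
    """
    for node in nodes:
        if node["role"] == "worker" and node["version"] != target_version:
            return node
    return None


def ssh_target(node):
    """SSH by InternalIP so no DNS record is needed for a new node."""
    return node["internal_ip"]
```

Because the list comes from the live node set, a node that joins later shows up
in `nodes` and is picked up without any change to the selection logic.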

Docs (runbook + architecture + headers) updated to the dynamic topology.

NOTE: nodes still need the k8s-upgrade SSH public key in authorized_keys; it was
deployed to node4/5/6 by hand this session. Baking it into node provisioning
(so new nodes get it automatically) is the remaining follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 16:56:02 +00:00
Viktor Barzin
0c5a9b5f44 k8s-version-upgrade: grant pods/log so preflight can verify the etcd snapshot
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Preflight step 6 confirms the pre-upgrade etcd snapshot is non-empty by parsing
the backup Job's log (`kubectl -n default logs job/pre-upgrade-etcd-...`). The
k8s-upgrade-job ClusterRole granted `pods` get/list/delete but NOT the `pods/log`
subresource, so the read failed with Forbidden in the default ns and aborted
preflight — after step 5 had already set k8s_upgrade_in_flight=1. A stale
out-of-band grant had masked this until a `terragrunt apply` in this session
reconciled the role back to its TF definition. Codify pods/log:get.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:52:52 +00:00
Viktor Barzin
bfb86e653f k8s-version-upgrade: ignore CoreDNS preflight on kubeadm upgrade plan too
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The prior commit added the CoreDNS ignore/skip flags only to `kubeadm upgrade
apply`, but `kubeadm upgrade plan` runs the SAME CoreDNS preflight. Once master's
kubeadm binary is on the target version (the first attempt's apt step already
bumps it), both plan calls fail on the Keel-drifted CoreDNS 1.12.4 under
set -euo pipefail and abort:
  - preflight Job step 4 (upgrade-step.sh) — `plan` output is grepped for the
    target version; the failing pipeline killed the whole preflight.
  - update_k8s.sh master path line 85 — bare `plan` before the apply.

Both now pass --ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins.
Verified read-only on master: plan exits 0 and still emits
"kubeadm upgrade apply v1.34.9".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:49:06 +00:00
Viktor Barzin
037a609f27 k8s-version-upgrade: unblock 1.34.9 — skip kubeadm CoreDNS addon + busybox-date fix
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The 1.34.9 master upgrade hard-failed `kubeadm upgrade apply` preflight: CoreDNS
is at v1.12.4 (Keel auto-bumped it 1.12.1 -> 1.12.4 on 2026-05-26 via a stale
kube-system out-of-band annotation), and 1.12.4 is ahead of kubeadm 1.34.9's
bundled corefile-migration table ("start version not supported").

- scripts/update_k8s.sh: master `kubeadm upgrade apply` now runs with
  `--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins
  --skip-phases=addon/coredns`. A dry-run proved --ignore ALONE would overwrite
  our custom split-horizon Corefile with kubeadm's default AND downgrade the
  image; --skip-phases leaves CoreDNS 100% untouched while the control plane
  upgrades. CoreDNS is pinned off Keel (keel.sh/policy=never) to stop the drift.
- stacks/k8s-version-upgrade/scripts/upgrade-step.sh: fix the preflight
  quiet-baseline (settle-window) check, which silently no-op'd on the ghcr
  claude-agent-service image's busybox `date` (can't parse ISO8601). Now tries
  GNU then busybox `-D`, and warns+skips on parse failure (no silent fail-open).
- docs: runbook + architecture document the CoreDNS handling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:45:05 +00:00
Viktor Barzin
042d1ce1ac k8s-version-upgrade: CI-retrigger to apply D1 (missed by two-commit diff-base)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
fb638cd8 landed as two commits; the apply pipeline diffed against HEAD~1 (the
monitoring-only commit) and never applied stacks/k8s-version-upgrade, so the
retry-on-failure logic isn't live yet. This single-commit retrigger forces it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:28:58 +00:00
Viktor Barzin
fb638cd8ec k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs
Some checks failed
ci/woodpecker/push/default Pipeline failed
Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to
the terminal job-condition reasons (BackoffLimitExceeded|DeadlineExceeded). A
phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every
firing alert also halts kured, so a bare-count false-positive would block all
OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics:
the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0
for the terminal reasons.

Docs updated to match the behaviour change (per the same-commit docs rule):
  - docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the
    "kill a stuck Job" recovery now leads with retry-on-failure self-heal.
  - docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert;
    retry-on-failure note on the deterministic-naming paragraph.
  - .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend
    entry, and drill-down (also copied to the active ~/.claude copy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:10:18 +00:00
Viktor Barzin
dfa1a12a86 k8s-version-upgrade: retry failed phases + surface wedged chain (fix 5-day silent stall)
The 1.34.9 patch auto-upgrade sat stuck for 5 days without anyone knowing.
On 2026-06-12 a transient critical alert (the ttyd web-terminal probe on the
devvm) was firing when the daily detection ran; the preflight's "halt on any
critical alert" gate aborted it, so the preflight Job Failed (backoffLimit=1).
Two design gaps then turned that blip into a multi-day wedge:

  * the detection guard and spawn_next only checked whether the phase Job
    EXISTED, not whether it succeeded — and the Failed Job lingers 7 days via
    ttlSecondsAfterFinished, so every daily run skipped re-spawning it;
  * the abort happens before the in-flight metric is pushed, so neither
    K8sUpgradeStalled nor upgrade_state.sh could see it — the pipeline reported
    "never ran" while actually being stuck.

Fixes:
  D1 retry-on-failure: detection CronJob (main.tf) and spawn_next
     (upgrade-step.sh) now delete + re-spawn a terminally-Failed phase Job
     instead of skipping it, so a transient gate self-corrects next cycle
     rather than wedging the pipeline for a week.
  D2 WebterminalTtydUnreachable critical -> warning: a devvm developer
     web-terminal is not cluster infrastructure and must not block upgrades.
  D3 observability: new K8sUpgradeChainJobFailed alert
     (kube_job_status_failed in k8s-upgrade ns) and upgrade_state.sh now flags
     a Failed chain Job as "chain failed" — closing the pre-in-flight blind
     spot so a wedge is visible immediately.
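
The D1 decision can be sketched roughly as below. This is a Python illustration only; the real logic lives in main.tf and upgrade-step.sh, and the function names and return labels here are hypothetical.

```python
# Hypothetical sketch of the spawn decision before and after D1.

def old_spawn_decision(job_exists):
    """Pre-fix: only existence was checked, so a Failed Job blocked re-spawn."""
    return "skip" if job_exists else "spawn"


def new_spawn_decision(job_exists, terminally_failed):
    """Post-fix: a terminally-Failed Job is deleted and re-spawned."""
    if not job_exists:
        return "spawn"
    if terminally_failed:
        return "delete-and-respawn"
    return "skip"
```

With the Failed Job lingering for its 7-day TTL, the old decision returns "skip" on every daily run, while the new one recovers on the next cycle.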

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:07:36 +00:00
Viktor Barzin
7e7e41cbef fix(authentik): derive username from email in tripit-enrollment (user_write needs it)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The passwordless enrollment prompt collects only email+name, so user_write aborted with 'Aborting write to empty username' (ak-stage-access-denied). Add an expression policy on the user_write binding (evaluate_on_plan=false + re_evaluate_policies=true, like guest.tf) that sets prompt_data['username'] = the entered email before the write. Verified the failure live via the flow executor API.
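
A minimal sketch of what the expression policy does, modelling the flow context as a plain dict. The function name and the dict shape are assumptions for illustration; the real policy runs inside Authentik's expression environment, which is not reproduced here.

```python
# Hypothetical stand-in for the policy body: copy the entered email into
# prompt_data['username'] so the later user_write stage has a username.

def set_username_from_email(prompt_data):
    """Mutate and return prompt_data with username set to the entered email."""
    prompt_data["username"] = prompt_data["email"]
    return prompt_data


# Usage: the enrollment prompt collected only email + name.
data = set_username_from_email({"email": "a@example.com", "name": "A"})
```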

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 07:35:23 +00:00
Viktor Barzin
e4512f3566 fix(authentik): deliver tripit email-verify stages via blueprint (provider token_expiry too old)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Pipeline 214 failed: the pinned goauthentik 2024.x provider models EmailStage.token_expiry as an integer, but the live 2026.2.x server requires a duration string ('hours=24') and rejects any number with HTTP 400 (even the provider default 30). Bumping the provider is a global terragrunt.hcl change that would re-apply every platform stack and break 3 other authentik-using stacks' lockfiles, which is disproportionate. Instead, the two email-verification stages + their flow bindings move into an Authentik blueprint (tripit-email-stages.yaml) applied server-side via authentik_blueprint; the server parses token_expiry natively. Validated on the live server + terraform validate. Restores the ADR-0020 email-verification security gate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 07:30:05 +00:00
Viktor Barzin
89eb090be3 feat(authentik): tripit-enrollment + tripit-recovery flows (passwordless signup, ADR-0020)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Makes the WebLanding 'Sign up' button work (it was 404ing — the tripit-enrollment flow didn't exist). Open passwordless registration: prompt(email,name) -> user_write(INACTIVE, external, group 'TripIt External') -> email verification (activates) -> passkey -> login. The inactive-until-verified gate is the security boundary: tripit trusts X-authentik-email, so activation must require proving inbox ownership. Passwordless login already works via the built-in webauthn flow. tripit-recovery (email -> new passkey) is built but intentionally NOT wired into the global brand recovery, so admin recovery is unchanged. Schema validated with terraform validate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 07:20:11 +00:00
Viktor Barzin
4bf3f504ea fix(authentik): SMTP host = mail.viktorbarzin.me (svc name fails wildcard-cert verify)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The in-cluster svc name mailserver.mailserver.svc.cluster.local fails Authentik's strict STARTTLS hostname verification (CERTIFICATE_VERIFY_FAILED): the mailserver serves the *.viktorbarzin.me wildcard cert, which doesn't cover the svc DNS name. Use the public name mail.viktorbarzin.me, which resolves in-cluster (10.0.20.1) and matches the cert. Verified end-to-end from an authentik pod (verified TLS + SASL auth + send) before this change.
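
Why the wildcard cert covers one name and not the other can be sketched with a simplified left-most-label matcher. This is an illustration only: real TLS libraries apply additional checks, and the function name is hypothetical.

```python
# Simplified sketch of wildcard hostname matching: "*" stands for exactly
# one left-most DNS label and never spans dots.

def wildcard_matches(pattern, hostname):
    p = pattern.lower().split(".")
    h = hostname.lower().rstrip(".").split(".")
    if len(p) != len(h):
        return False  # "*" cannot absorb extra or missing labels
    if p[0] == "*":
        return h[0] != "" and p[1:] == h[1:]
    return p == h
```

The svc DNS name has a different suffix and label count, so it cannot match the *.viktorbarzin.me wildcard, whereas mail.viktorbarzin.me does.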

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 07:13:53 +00:00
Viktor Barzin
c3d0c121bb feat(authentik): wire SMTP (noreply@) for TripIt signup verification + recovery email (ADR-0020)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Authentik email was unconfigured (localhost), so the TripIt enrollment flow's email-verification stage couldn't send. Add AUTHENTIK_EMAIL__* to server.env + worker.env pointing at the in-cluster mailserver as noreply@viktorbarzin.me (587/STARTTLS), with the SASL password synced from Vault secret/authentik.smtp_password via a new authentik-email ExternalSecret (reloader-annotated). Image pin unchanged (2026.2.4 == live). Prereq for the tripit-enrollment flow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 07:04:52 +00:00
Viktor Barzin
8a2a3d9eca Merge remote-tracking branch 'origin/master' into wizard/reconcile-mirror
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
# Conflicts:
#	scripts/t3-provision-users.sh
2026-06-16 22:32:43 +00:00
Viktor Barzin
63e714782c immich: remove one-shot anca-elements-import Job + its PVC
All of Anca's photos are imported. The Job was declared as
kubernetes_job_v1.anca_elements_import — meaning every `terragrunt apply` of
the immich stack re-created it, despite the 2026-05-25 in-code comment saying
"After successful completion: REMOVE this resource block + apply again."
Nobody noticed for 22 days; the re-trigger today (2026-06-16) was the
6th IO-pressure incident — it scanned all 21,643 assets in pure read-scan
mode for 51 min, saturated sdc, starved etcd, crash-looped kube-apiserver.

Recovery actions taken before this commit:
  - Throttled nfsd 64→8 on PVE host to give apiserver headroom
  - `kubectl delete job -n immich anca-elements-import` + force-delete pod
  - Restored nfsd to 64; cluster healthy

Code change here:
  - Remove `kubernetes_job_v1.anca_elements_import` block
  - Remove `module.nfs_anca_elements_host` (PVC `immich-anca-elements-host` —
    no live consumer; videos batch deferred per user, source dump remains on
    PVE at /srv/nfs/anca-elements, browseable via Nextcloud admin)
  - Update 2026-05-25 post-mortem: 6th-incident section + new lesson that
    one-shot Jobs do NOT belong in kubernetes_job_v1 (use a suspended CronJob
    or a runbook-captured `kubectl create job` ad-hoc invocation instead).
2026-06-16 22:11:27 +00:00
Viktor Barzin
88717c61fd immich-frame: whole library (last 2y), Ken Burns, weather, 30s interval
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Per Viktor: show the whole Immich library from the last 2 years instead of the
single 'china' album, enable Ken Burns pan/zoom, slow the interval to 30s, and
add the weather overlay (London, metric). OpenWeatherMap key is read from Vault
(secret/immich -> frame_weather_api_key), not hardcoded.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 21:07:39 +00:00
Viktor Barzin
cffa32fae3 Merge remote-tracking branch 'forgejo/master' into wizard/tripit-ingest-model
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-16 20:39:30 +00:00
Viktor Barzin
14476bfbd7 tripit: mail-ingest extracts with the qwen3-8b text model, not the vision model
Forwarded schedule-change emails were being parsed by qwen3vl-4b (a 4B *vision*
model) for text extraction, which reliably dropped the flight number — so the
matcher had no key to link on and a forwarded flight update created a duplicate
instead of amending the existing segment.

Point the ingest-plans CronJob's text extraction at qwen3-8b (verified live: it
emits flight_number + a clean PNR, 3/3 on the failing email) and keep qwen3vl-4b
for boarding-pass image attachments (LLM_VISION_MODEL). llama-swap loads each on
demand; the GPU swap cost is accepted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:39:29 +00:00
Viktor Barzin
0a6ed4b2fe workstation: per-user playwright browser MCP for all users, reproducible from git
Viktor asked that the playwright browser MCP be available for every devvm user
in every directory, with each user running their own server and multiple
concurrent sessions per user.

Before this, playwright was hand-set-up per user (~/.config/systemd/user/
playwright-mcp.service on 8931/8932/8933) and only wizard was actually wired —
emo's and anca's servers ran but their ~/.claude.json had no playwright entry,
so their Claude never connected. None of it was reproducible from git (units,
refresh script, and the Vault snapshot token lived only in user homes), so a
devvm rebuild would silently lose it.

This makes it reproducible and fixes the unwired users:

- roster_engine.py: sticky per-user PLAYWRIGHT_PORT (PLAYWRIGHT_BASE_PORT=8931,
  allocated for every roster user incl. the admin), emitted in the derive JSON.
- scripts/workstation/playwright/: system-level TEMPLATE units
  (playwright-mcp@.service + playwright-snapshot-refresh@.{service,timer},
  User=%i — system manager, so no systemd --user / linger) + the refresh script.
  @playwright/mcp pinned to 0.0.76 (avoids the @latest silent-fleet-roll
  footgun, same rationale as T3_PIN).
- setup-devvm.sh: install the templates + script (9e); stage the chrome-service
  snapshot bearer token from Vault to a root file (8c) — the hourly root
  reconcile has no Vault token, mirrors the Claude OAuth staging in 8a.
- t3-provision-users.sh: install_playwright() (ALL tiers incl. admin) writes
  PLAYWRIGHT_PORT, seeds the token if-absent, wires the user-scope ~/.claude.json
  by running `claude mcp add` AS the user (clobber-proof + if-absent, so it fixes
  existing/new/admin without rewriting a populated config), and runs `systemctl
  enable --now` on the instances (idempotent, never restarts a running server). Also hardened the
  section-1 *.env scan to skip the new playwright-*.env files (no T3_PORT -> grep
  no-match would abort under set -e -o pipefail).
- Docs: chrome-service-snapshot runbook (new Provisioning section + system-unit
  commands), multi-tenancy.md, and the 2026-06-07 plan Task 2.3.
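
The sticky per-user port idea can be sketched as below. This is not the roster_engine.py implementation; the function name, the input shapes, and the user ordering are assumptions for illustration, and only PLAYWRIGHT_BASE_PORT=8931 comes from the commit.

```python
# Hypothetical sketch of sticky port allocation: a user who already has a
# port keeps it; new users get the lowest free port at or above the base.
PLAYWRIGHT_BASE_PORT = 8931


def allocate_ports(users, existing=None):
    ports = dict(existing or {})      # previously assigned ports stay put
    used = set(ports.values())
    next_port = PLAYWRIGHT_BASE_PORT
    for user in users:
        if user in ports:
            continue                  # sticky: never reassign
        while next_port in used:
            next_port += 1
        ports[user] = next_port
        used.add(next_port)
    return ports
```

Keeping earlier assignments untouched means adding a user never shifts someone else's port, so existing ~/.claude.json entries stay valid.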

Supersedes the hand-made per-user --user units (one-time idle-gated migration to
follow on the live host).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:33:47 +00:00
Viktor Barzin
c6a5cbe227 feat(tripit): serve the SPA publicly, keep /api + /metrics forward-auth-gated (ADR-0020 landing)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
The website 302'd unauthenticated visitors straight to Authentik. Split the tripit.viktorbarzin.me ingress: the SPA shell (everything else) becomes auth=none so the app shows its own Log in / Sign up landing page, while a new tripit-app-api ingress keeps /api + /metrics behind forward-auth — the security boundary, since /api trusts the outpost-injected X-authentik-email. The public SPA gets strip-auth-headers (no spoofed headers can reach the backend) and anti_ai_scraping=false (it's an installable PWA). The existing auth=none carve-outs (calendar, emails/confirm, planner/slack) are longer prefixes and keep winning. Pairs with the tripit landing-page deploy (commit 3fe4da1).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:30:58 +00:00
github-actions[bot]
eb47eb1d10 priority-pass: bump image_tag to 63e118c3 [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 63e118c334
2026-06-16 17:45:33 +00:00
github-actions[bot]
d1f2e50736 priority-pass: bump image_tag to 4ce9e8e8 [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 4ce9e8e894
2026-06-16 17:44:40 +00:00
github-actions[bot]
46b5f04f67 priority-pass: bump image_tag to 63e118c3 [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 63e118c334
2026-06-16 17:20:08 +00:00
github-actions[bot]
29ad200026 priority-pass: bump image_tag to 4ce9e8e8 [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 4ce9e8e894
2026-06-16 17:19:55 +00:00
Viktor Barzin
044444d328 cluster-health: helm check #18 catches pending/failed releases (helm list -a)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
check_helm_releases used `helm list` without -a, which HIDES pending-upgrade and
failed releases — so on 2026-06-16 check #18 reported "All deployed" while the
prometheus release sat in pending-upgrade for ~4 days, silently blocking every
monitoring terragrunt apply (frozen alert/rule config). Add -a to surface them
and flag pending-* (FAIL, blocks applies) + failed (WARN); deployed/uninstalled/
superseded stay green.
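
The status mapping described above, as a small Python sketch. The function name is hypothetical, and the handling of statuses the commit does not mention is an assumption.

```python
# Hypothetical sketch of the per-release verdict for `helm list -a` output.
GREEN = {"deployed", "uninstalled", "superseded"}


def classify_release(status):
    if status.startswith("pending-"):
        return "FAIL"      # blocks applies
    if status == "failed":
        return "WARN"
    if status in GREEN:
        return "OK"
    return "UNKNOWN"       # assumption: the commit does not cover other statuses
```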

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 15:39:06 +00:00
Viktor Barzin
e74f4208f5 t3-backup-state: retention 14 -> 6 (bound devvm root fs)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
wizard's state.sqlite grew to ~1.1GB and the new gated nightly tracker adds a
pre-bump snapshot per bump on top of this daily one; 14 x ~1.1GB would fill the
devvm root fs (was trending to ~16GB of wizard backups on a disk with ~9GB
free). 6 is ample — rollback only ever needs the most recent pre-bump backup.
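
The sizing argument as a worked calculation, using the figures from this message (about 1.1GB per snapshot, about 9GB free). The sketch covers the daily backups only; the per-bump snapshots come on top and are not modelled.

```python
# Worked arithmetic for the retention change; names are illustrative.
SNAPSHOT_GB = 1.1   # approximate size of wizard's state.sqlite
FREE_GB = 9.0       # approximate free space on the devvm root fs


def retained_gb(count, size_gb=SNAPSHOT_GB):
    """Disk used by `count` retained daily backups."""
    return count * size_gb


old_usage = retained_gb(14)   # roughly 15.4 GB, more than the free space
new_usage = retained_gb(6)    # roughly 6.6 GB, fits within the free space
```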

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 14:26:03 +00:00
Viktor Barzin
cdd9ecd199 t3: docs for the gated nightly tracker (runbook, post-mortem, service-catalog)
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Phase 4 docs for the enforcer -> gated-tracker change:
- runbook t3-version-bump.md: rewritten around the tracker — how each bump is
  gated, plus freeze/revert/pin/dry-run/manual-rollback ops.
- post-mortem 2026-06-09: append the deliberate 2026-06-16 reversal and how the
  gates close each named root-cause/lesson (historical sections left intact).
- service-catalog t3 row: "PINNED 0.0.24 enforcer" -> gated nightly tracker;
  replace the stale "auto-pair 401-broken on 0.0.26" note (re-verified healthy
  2026-06-16, cookieless -> 302 + t3_session).
- t3-provision-users.sh step 5b comment: enforcer -> tracker; note Persistent dropped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 11:33:49 +00:00
Viktor Barzin
f4f7705127 monitoring: adopt orphaned alert-digest resources into TF state (unblocks apply)
The monitoring stack apply was create-failing on every push with `configmaps
"alert-digest-script" already exists` + `secrets "alert-digest" already exists`
(modules/monitoring/alert_digest.tf) — both resources exist in-cluster but fell
out of Terraform state, so apply tried to CREATE them and errored. Pre-existing
(failed on pipelines 203 AND 204, NOT caused by the t3 alert-rules change). Add
import {} blocks (TF 1.5+ adoption per AGENTS.md) so apply imports + reconciles
instead of failing. Idempotent once imported; safe to remove after a green apply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 11:31:17 +00:00
Viktor Barzin
36521839fc t3: gated nightly tracker (replaces pinned enforcer) + drop timer Persistent
Phase 2 of "track t3 nightly, accept the risk, but make sure session auth works
and revert if it breaks". Rewrites the daily t3-autoupdate from a pinned-version
enforcer into a NIGHTLY TRACKER that gates every bump so a bad build self-heals
instead of repeating 2026-06-09:

- follows the t3@nightly npm dist-tag (T3_TRACK; T3_PIN still works as a hard
  freeze; /etc/t3-autoupdate.freeze is the manual revert switch);
- downgrade-guard (the nightly tag is mutable — never move backward) + channel
  sanity (target must be a -nightly. build);
- pre-bump per-user state.sqlite backup (online VACUUM INTO) BEFORE install, so
  rollback is a restore not sqlite surgery;
- health-check now SEEDS a throwaway instance with a COPY of a real POPULATED
  state.sqlite, exercising the forward MIGRATION (the actual 2026-06-09 failure
  class) + the real mint->exchange->t3_session pairing handshake before trusting
  a build. Scratch dir is on /var/tmp (disk), not the 2G tmpfs /tmp;
- canary rollout: restart idle instances ONE AT A TIME, verify pairing through
  the real dispatch after each, and on the first failure roll back (binary +
  that user's DB from the pre-bump backup) AND self-freeze so it can't re-flap
  onto bad builds. Active-agent instances are deferred, never killed. Rollback
  target is the recorded LAST-GOOD, not "whatever was installed";
- DRY_RUN mode (T3_DRY_RUN=1) previews the gate against a temp-prefix install —
  validated: 0.0.28-nightly.20260616.571 PASSES the populated-DB migration gate.

timer: drop Persistent=true (a missed 04:00 must not fire a real bump on boot
mid-day with users active — a 2026-06-09 contributing factor).
setup-devvm.sh: install t3@nightly on fresh boxes (no state to break), in sync.
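
The downgrade guard and channel sanity check can be sketched as below. This is a Python illustration, not the t3-autoupdate script. The version layout is inferred from the single example above (0.0.28-nightly.20260616.571), the function names are hypothetical, and refusing when the installed version is not a nightly is an assumption of this sketch.

```python
import re

# Layout inferred from one example: MAJOR.MINOR.PATCH-nightly.DATE.BUILD
NIGHTLY_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-nightly\.(\d+)\.(\d+)$")


def nightly_key(version):
    """Return a sortable tuple for a nightly version, or None otherwise."""
    m = NIGHTLY_RE.match(version)
    return tuple(int(part) for part in m.groups()) if m else None


def may_bump(installed, target, frozen=False):
    if frozen:
        return False                   # manual freeze switch wins
    target_key = nightly_key(target)
    if target_key is None:
        return False                   # channel sanity: must be a -nightly. build
    installed_key = nightly_key(installed)
    if installed_key is None:
        return False                   # assumption: only nightly-to-nightly is compared
    return target_key > installed_key  # downgrade guard: never move backward
```

Comparing integer tuples rather than strings avoids ordering mistakes when a build number gains a digit.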

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 10:08:12 +00:00
Viktor Barzin
994d305d04 t3: session-auth detection for the gated nightly tracker (dispatch fallback logging + Loki alerts)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Before auto-tracking t3 nightly builds (Viktor's call, risk accepted), stand up
the detection that was missing on 2026-06-09 — when an auto-pulled nightly broke
pairing for ALL users and nothing alerted. Viktor's explicit requirement: make
sure session auth keeps working and revert if the pairing fallback/failure rate
climbs. This is phase 0 (detection) of that work.

- t3-dispatch: exchangeCredential now reports WHICH pairing endpoint answered,
  and autoPair logs every outcome (paired user=.. endpoint=.. fallback=..) — so
  the real-user browser-session->bootstrap fallback rate is observable. A
  non-zero rate flags that a build moved the pairing API (the 2026-06-09 class).
- Loki ruler alerts (devvm journal -> Alertmanager -> Slack): T3PairingBroken
  (real users failing to pair), T3PairFallbackHigh (build moved the pairing API),
  T3AutoUpdateRolledBack / RollbackFailed / Frozen (enforcer outcomes). Closes
  the post-mortem's open "nothing monitors end-to-end pairing" detection gap.

The existing t3-probe only checks GET /api/auth/session==200, which stays 200
even when pairing is dead, so it never caught the outage class.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:56:55 +00:00
Viktor Barzin
e783cae2cb chrome-service + mam-farming: doc clarifications (+ re-trigger CI apply missed earlier)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Two small doc additions that also re-include these stacks in Woodpecker's
changed-stack detection. The earlier 2-commit push left chrome-service out of the
HEAD~1..HEAD diff so its ignore_changes fix never applied; the monitoring apply was
separately blocked by a stuck prometheus pending-upgrade (now cleared).

- chrome-service: note the live pod's container order had drifted from this file's
  order, so a TF apply reorders them (containers[0] differs live-vs-TF until the
  apply lands) -- documents the confusion this caused during diagnosis.
- mam-farming: cross-ref the grabber script that emits mam_grabber_last_run_timestamp.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:34:23 +00:00
Viktor Barzin
b0e8e3599f nfs-mirror: exclude SQLite WAL/SHM sidecars + treat rsync exit 24 as success
NfsMirrorFailing fired ~13% of nights (3/23 runs, all rsync exit 24). Root cause:
calibre-web-automated keeps a WAL-mode SQLite queue.db on /srv/nfs, whose -wal/-shm
sidecars are created/checkpointed/deleted constantly and vanish between rsync's
file-list scan and the transfer ("file has vanished" -> exit 24). The mirror
actually completes every run; only transient files disappear.

Two fixes: (1) exclude *-wal/*-shm/*-journal -- these must never be in a raw mirror
anyway (a WAL without an atomic .db snapshot is useless to restore; daily-backup
makes the consistent SQLite copies). (2) Treat rsync exit 24 as success-with-warning
so the run still appends to the offsite manifest (a code-24 night previously skipped
that, delaying those changes to the monthly full sync) and the alert stops
false-firing.
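
Both fixes can be sketched in Python as below. The deployed nfs-mirror is a host script, so this is an illustration only: the function names and outcome labels are hypothetical, and rsync's own exclude matching is more involved than fnmatch.

```python
import fnmatch

# Sidecar patterns named in this commit.
EXCLUDES = ("*-wal", "*-shm", "*-journal")


def is_excluded(filename):
    """True if the file is a SQLite sidecar that should not be mirrored."""
    return any(fnmatch.fnmatchcase(filename, pat) for pat in EXCLUDES)


def classify_rsync_exit(code):
    """Map an rsync exit code to the mirror's outcome."""
    if code == 0:
        return "success"
    if code == 24:
        # rsync 24: some source files vanished before they could be transferred
        return "success-with-warning"
    return "failure"
```

Treating 24 as a warning lets the run continue to the manifest append instead of stopping as a failure.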

Deployed to the PVE host via scp to /usr/local/bin/nfs-mirror (host script, not TF).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:34:22 +00:00
Viktor Barzin
2479560fa2 mam-farming: make MAMFarmingStuck a grabber heartbeat, not a grab-count check
Some checks failed
ci/woodpecker/push/default Pipeline failed
MAMFarmingStuck fired whenever the freeleech grabber added 0 torrents in 4h, but
grabbing 0 is normal: the grabber searches a random catalogue offset each run and
legitimately finds nothing when freeleech is dry (account ratio was a healthy
37.5; the alert even misreported it as "0.00" because $value was the grabbed
count, not the ratio). The alert's real intent was to catch the grabber not
running at all (CronJob Forbid-blocked / wedged), but increase(grabbed[4h])==0
cannot distinguish "didn't run" from "ran, nothing to grab" since Pushgateway
serves the last pushed value forever.

The grabber now heartbeats mam_grabber_last_run_timestamp on every completed run
(main success, ratio/mouse skip, and qBittorrent-unreachable paths). The alert
fires only when that heartbeat is >4h stale — the true stuck condition. Cookie
expiry and qBittorrent-down keep their own dedicated alerts.
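
The heartbeat condition, sketched in Python instead of the actual alert expression. The function and constant names are hypothetical; the 4h threshold is from this commit.

```python
# Hypothetical sketch: the grabber is "stuck" only when its last completed
# run is older than the threshold, regardless of how many torrents it grabbed.
STALE_AFTER_SECS = 4 * 3600


def grabber_stuck(last_run_ts, now_ts, stale_after=STALE_AFTER_SECS):
    return (now_ts - last_run_ts) > stale_after
```

A run that grabbed nothing still refreshes the timestamp, so a dry freeleech period no longer looks like a wedged CronJob.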

Surfaced by /cluster-health as a false-firing alert.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 08:18:33 +00:00
Viktor Barzin
a0725ede57 chrome-service: stop ignoring container[0].image so TF re-asserts the pinned browser image
The chrome-service container (container[0]) runs the pinned Microsoft Playwright
image, which ships chromium under /ms-playwright. Its image was still listed in
the deployment's lifecycle ignore_changes — a leftover KEEL_IGNORE from before
ADR-0002 #29 moved the novnc container to TF management. With that field ignored,
a stray overwrite of container[0] with ghcr chrome-service-novnc:latest (which ships no
chromium under /ms-playwright) stuck permanently: the container crash-looped ~12h on "chromium
binary not found under /ms-playwright" (273 restarts) and TF could not revert it.

Remove container[0].image from ignore_changes so Terraform pins it to local.image
and re-asserts it on every apply. Both containers are TF-managed now (novnc since
ADR-0002 #29); Keel is inert (policy=never), so nothing should fight TF here.

Surfaced by /cluster-health. Live state was already restored transiently via
kubectl set image; this commit makes the fix durable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 08:18:32 +00:00
1ba453c65d fan-control docs: sync runbook/env/service/design to the HA-actuator + anti-flap model
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The committed docs still described the 2026-06-04 presence-aware daemon. Bring
them in line with what is actually deployed: HA computes the setpoint, the host
is a thin actuator (COMMAND_ENTITY/STALE_SECS/HA_GRACE_SECS), additive bias,
anti-flap hold-last, and the new HA readout sensors (command/equilibrium/
cpu_load/fan_speed_avg/fan_power_avg). Earlier doc edits were made in a clone
lost in the workstation reshuffle; re-created here.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 08:11:48 +00:00
5bc3d27d1b Merge remote-tracking branch 'forgejo/master' into emo/fan-control-ha-actuator
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-16 08:08:27 +00:00
2cfe338419 fan-control: hold last command through transient HA losses (stop fan flapping)
The actuator dumped the fans to Dell auto on every brief loss of the HA command
(~14% of the time, every few minutes) — crashing them to the ~7100 rpm floor and
bouncing back: the "fans surge then crash then surge" the owner reported. Causes:
the command sensor's last_updated going >120s old whenever CPU temp sat flat
(mis-read as stale), plus occasional unavailable blips. Fix: on a missing/stale
command, HOLD the last applied % for up to HA_GRACE_SECS (300s) instead of
falling back, and loosen STALE_SECS 120->1800 (staleness only happens at flat
temp, where the held value is still valid). The 83C CPU CEILING on our own IPMI
read stays the real overheat safety. Verified live: fallback 14% -> 0% over 8h,
command std 16 -> 3, no more rpm floor crashes.
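
A rough sketch of the hold-last decision, in Python rather than the actuator's own code. STALE_SECS and HA_GRACE_SECS carry the values from this commit; the function name, its parameters, and measuring the grace window from the last valid command are assumptions. The 83C CPU ceiling is not modelled.

```python
STALE_SECS = 1800     # command older than this counts as stale
HA_GRACE_SECS = 300   # how long the last applied value may be held


def choose_action(command_pct, command_age_secs, last_applied_pct, secs_since_last_valid):
    """Return (action, percent) for one actuator cycle."""
    fresh = (
        command_pct is not None
        and command_age_secs is not None
        and command_age_secs <= STALE_SECS
    )
    if fresh:
        return ("apply", command_pct)
    if last_applied_pct is not None and secs_since_last_valid <= HA_GRACE_SECS:
        return ("hold", last_applied_pct)   # ride through a transient loss
    return ("fallback", None)               # hand control back to Dell auto
```

Holding the previous percentage through short gaps is what removes the surge-and-crash pattern; only a loss longer than the grace window falls back.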

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 08:07:52 +00:00
Viktor Barzin
57d45d8d8f fix(authentik): pin Vault binding UUIDs as literals (provider has no authentik_application data source)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
CI pipeline 198 failed: the pinned goauthentik/authentik provider has no data "authentik_application" source, so terraform failed the whole authentik plan and applied NOTHING (state unchanged). Replace the data-source lookups with the live pbm_uuid (Vault app) and group_uuid (Allow Login Users) as literals; authentik_policy_binding is supported (used in guest.tf).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 22:01:29 +00:00
Viktor Barzin
aa461b95bc feat(authentik): bind Vault OIDC app to Allow Login Users (close ADR-0020 OIDC gap)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Audit found the Vault Authentik application had no authorization binding, so any authenticated identity (incl. a future self-enrolled TripIt External user) could complete Vault OIDC login and get a built-in default-policy token. Bind it to 'Allow Login Users' — existing homelab users inherit that group via its children (verified User.all_groups() includes the parent), parentless TripIt External users are excluded. Closes the only OIDC app the forward-auth fence does not cover.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 21:48:04 +00:00
Viktor Barzin
cbca281aaa feat(authentik): TripIt external self-signup group + forward-auth fence (ADR-0020)
Viktor wants people outside the homelab to self-register to TripIt with email + a passkey (no password), kept separate from the rest of the homelab. Adds the empty, parentless 'TripIt External' Authentik group and a first-position branch in the catch-all policy that admits those users to tripit.viktorbarzin.me only and denies every other forward-auth host. Inert on apply (group empty => matches no existing user => no lockout). An adversarial review found the fence is forward-auth-only, so the runbook records the OIDC-app containment audit (every sensitive app already requires a trusted group External users won't hold), the Vault->Allow Login Users binding that closes the one open OIDC app, the SMTP prerequisite for email verification, and the before/after access-matrix verification. Flows/SMTP/Vault binding are UI steps per the runbook; the push that applies the catch-all edit must be human-watched (CI auto-applies the authentik stack).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 21:48:04 +00:00
Viktor Barzin
cf51cb45de docs(adr-0003): keep Forgejo canonical, complete the GitHub mirror (reject swap)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Grilled the 'swap Forgejo for GitHub' idea. Root cause of the divergence pain
is an incomplete push-mirror rollout (14 repos dual-pushed, push_mirrors=0),
not Forgejo itself — and CONTEXT.md already documents Forgejo-canonical +
one-way GitHub mirror. Decision: don't swap; finish the mirror, name the
GitHub-first exceptions, reconcile infra, enforce one-remote-per-clone. Adds
ADR-0003 + the GitHub-first repo glossary term + dual-push/force-overwrite
warnings on Canonical repo / GitHub mirror.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-15 21:32:28 +00:00
Viktor Barzin
5d3a166b94 t3-afk: fix agent Bash — stop mounting into ~/.claude
Some checks failed
ci/woodpecker/push/default Pipeline failed
Root cause of "the agent never commits": the issue-implementer CLAUDE.md was
subPath-mounted at /home/node/.claude/CLAUDE.md, which made /home/node/.claude
root-owned. The agent (uid 1000) then couldn't create its Bash session-env
there, so EVERY Bash/git call failed (Write/Edit worked, so it silently edited
but never committed). Found by reading the agent transcripts from
state.sqlite -> projection_thread_messages.

Fix: don't mount anything into ~/.claude (it's not honored by T3's SDK anyway).
Behaviour is injected via the dispatch message preamble by the control plane;
files/issue-implementer-CLAUDE.md kept as the canonical source text.

Verified post-fix: a preamble-dispatched task edited README and COMMITTED
(073ab28) unattended.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 20:49:34 +00:00
Viktor Barzin
34c30ac2bf t3-afk: auto-pair dispatcher sidecar — no manual pairing
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The bare `t3 serve` behind Authentik showed the manual /pair#token screen, which
didn't connect. Mirror the devvm t3-dispatch: a small stdlib-Node sidecar fronts
t3 serve, and on a cookieless (already Authentik-gated) document load it mints a
pairing credential (`t3 auth pairing create`) and exchanges it at
/api/auth/browser-session for the t3_session cookie, then 302s back. Everything
else — including WebSocket upgrades for the live cockpit — reverse-proxies to
:3773. The Service now targets the sidecar (:8080).
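
The sidecar's routing decision can be sketched as below. The real sidecar is stdlib Node; this Python version only illustrates the branch. Detecting a "document load" via the Accept header is an assumption of this sketch, and the function names are hypothetical.

```python
SESSION_COOKIE = "t3_session"


def cookie_names(cookie_header):
    """Names of all cookies in a Cookie header value."""
    return {
        part.split("=", 1)[0].strip()
        for part in cookie_header.split(";")
        if part.strip()
    }


def route(method, headers):
    h = {k.lower(): v for k, v in headers.items()}
    is_websocket = h.get("upgrade", "").lower() == "websocket"
    has_session = SESSION_COOKIE in cookie_names(h.get("cookie", ""))
    wants_document = "text/html" in h.get("accept", "")  # assumed heuristic
    if method == "GET" and wants_document and not is_websocket and not has_session:
        return "pair-and-redirect"   # mint credential, exchange, 302 back
    return "proxy"                   # everything else goes to t3 serve on :3773
```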

Verified: cookieless GET -> 302 + Set-Cookie t3_session; cookied GET -> 200 SPA.
Matches the t3.viktorbarzin.me experience (Authentik login -> straight into the
cockpit).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 20:19:39 +00:00
Viktor Barzin
92c5b24975 docs: ghcr_pull_token is now a scoped read:packages PAT, not the admin alias
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Minted a dedicated classic GitHub PAT scoped to read:packages and stored it in
Vault secret/viktor/ghcr_pull_token (2026-06-15), replacing the previous alias
of the broad admin github_pat. Propagated via targeted apply of
module.kyverno.kubernetes_secret.ghcr_credentials (Kyverno re-syncs the
allowlisted namespaces). Document the new cred + the manual rotation recipe.

Closes: code-h2il

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-15 20:19:17 +00:00
Viktor Barzin
ef555c7e02 workstation: put ~/.local/bin on PATH so the launcher finds native claude
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor hit "~/.local/bin is not part of the PATH". Root cause: the native claude
binary lives in ~/.local/bin, but the terminal launcher (start-claude.sh) runs in
tmux's NON-login bash env, which doesn't source the user's shell rc where the native
installer put ~/.local/bin on PATH. So `command -v claude` failed there → the
launcher's bootstrap re-ran the native installer → the installer printed the PATH
warning. (Interactive zsh already had ~/.local/bin via the per-user installer rc edit,
and t3-serve sets PATH in its unit — so only the terminal launcher was affected.)

- skel/start-claude.sh: prepend ~/.local/bin to PATH near the top (guarded/idempotent),
  before the launch logic — so `claude` is found, no reinstall, no warning.
- setup-devvm.sh: install /etc/profile.d/10-local-bin.sh — adds ~/.local/bin to PATH for
  all LOGIN shells machine-wide (SSH etc.), independent of the per-user installer rc edit
  (fresh-user-safe). zsh login picks it up via /etc/zsh/zprofile -> /etc/profile.
- docs/architecture/multi-tenancy.md: documented the three PATH-injection points.

Verified: guard adds-when-missing / no-dup-when-present; all scripts pass bash -n.
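
The guard's adds-when-missing / no-dup-when-present behaviour, sketched in Python. The actual guard is shell inside start-claude.sh; the function name and example paths are illustrative.

```python
# Hypothetical sketch of an idempotent PATH prepend.

def prepend_once(path_value, entry):
    """Prepend entry to a colon-separated PATH unless it is already present."""
    parts = [p for p in path_value.split(":") if p]
    if entry in parts:
        return path_value            # already there: leave PATH untouched
    return ":".join([entry] + parts)
```

Because the second call is a no-op, the launcher can run the guard on every start without growing PATH.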

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:20:03 +00:00
Viktor Barzin
eecd78233b workstation: standardize on the native claude install (drop npm-global + npx)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Question from Viktor: should claude run via the binary or npx? Answer: the native
install is the recommended runtime (self-contained, self-updating ~/.local/bin/claude;
installMethod=native) — and every existing user had already auto-migrated to it, leaving
the npm-global copy empty and the npx fallback dead. "Leave only the recommended setup":

- setup-devvm.sh: node is now installed ONLY for the t3 CLI; dropped the machine-wide
  `npm install -g @anthropic-ai/claude-code` (npm/npx is not the recommended runtime and
  just shadowed the per-user native installs).
- t3-provision-users.sh: new per-user `install_user_claude_native` (runs the official
  https://claude.ai/install.sh AS the user, idempotent/skip-if-present) — provisions native
  claude for BOTH the terminal launcher and each t3-serve instance, replacing the npm bootstrap.
- skel/start-claude.sh: launcher runs the native `claude` only; if missing it bootstraps via
  the native installer (was an `npx @anthropic-ai/claude-code` fallback).
- docs/architecture/multi-tenancy.md: documented the native-only runtime model.

node stays (the pinned t3 CLI is npm-global). Verified: native installer reachable +
produces ~/.local/bin/claude 2.1.177; all three scripts pass bash -n + shellcheck.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:12:05 +00:00
Viktor Barzin
4a48f065e9 mcp: drop project-scoped paperless from .mcp.json (paperless is now wizard-only)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Paperless is a personal tool for wizard, not shared. It was project-scoped in the
infra repo's .mcp.json (the in-cluster paperless-mcp proxy), so every user whose
~/code IS an infra clone (emo, ancamilea) auto-loaded it. Per request, paperless
should be wizard-only: wizard now runs his own direct, token-based paperless MCP in
his user-scope config (a local barryw/paperlessmcp container -> paperless-ngx).
Removing the shared entry so emo and other infra-clone users no longer get it; the
`ha` MCP stays project-scoped. emo's clone drops it on next freshen.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:03:37 +00:00
Viktor Barzin
bb3f5f2329 workstation: stop the Claude Code onboarding wizard reappearing for terminal users
All checks were successful
ci/woodpecker/push/default Pipeline was successful
emo reported being "logged out" on terminal.viktorbarzin.me: every new shell
dropped him at the first-run "Choose the text style" wizard, even though he'd
used many sessions and is in fact fully authenticated. Root cause is NOT a
logout — ~/.claude.json is a single file that all of a user's concurrent claude
processes (the ttyd terminal + their t3-serve instance + agent sessions)
read-modify-write, and a stale writer periodically drops top-level keys,
including hasCompletedOnboarding. That bounces the next interactive session back
to onboarding; credentials are safe in the separate ~/.claude/.credentials.json
(which is why T3 kept working). wizard's own ~/.claude.json showed the same key
loss, so this hits any heavy multi-session user.

Fix:
- skel/start-claude.sh: ensure_onboarding() idempotently re-asserts
  hasCompletedOnboarding (+ lastOnboardingVersion) in ~/.claude.json right before
  launching claude. Merge-only (never clobbers other keys), runs as the user, and
  no-ops if jq is missing or the file is empty/corrupt. So even if the race drops
  the flag, the next launch restores it before claude reads it.
- t3-provision-users.sh: deploy_user_launcher() re-copies skel/start-claude.sh
  into every non-admin home (copy-if-changed) on the hourly reconcile. /etc/skel
  only seeds the launcher at account creation, so without this the fix (and any
  future launcher edit) would never reach existing users. .tmux.conf is
  deliberately not re-copied — terminal-lobby appends a managed section to it.
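
A minimal sketch of the merge-only re-assert, written in Python for illustration
(the launcher itself is a shell script using jq). The function name, the
`version` parameter, and the choice to leave an existing lastOnboardingVersion
untouched are assumptions, not taken from the launcher.

```python
import json
from typing import Optional


def ensure_onboarding(raw: str, version: str) -> Optional[str]:
    """Return updated JSON text, or None when nothing should be written."""
    if not raw.strip():
        return None  # empty file: no-op
    try:
        config = json.loads(raw)
    except json.JSONDecodeError:
        return None  # corrupt file: no-op, never overwrite it
    if not isinstance(config, dict):
        return None
    updated = dict(config)  # merge-only: every other key is preserved
    updated["hasCompletedOnboarding"] = True
    # Assumption: only fill the version in when the key is missing.
    updated.setdefault("lastOnboardingVersion", version)
    if updated == config:
        return None  # flag already present: nothing to rewrite
    return json.dumps(updated, indent=2)
```

Returning None for the empty, corrupt, and already-correct cases keeps the
helper from adding one more writer to the read-modify-write race it works around.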

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:37:59 +00:00
Viktor Barzin
82a0c5aedf t3-afk: fix crashloop — exclude from Keel at the deployment level
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Keel "patch"-downgraded the image docker.io/library/node:24 -> library/node:24.0.2,
which is below t3@0.0.27's required node >=24.10, so `t3 serve` exited silently and
the pod crash-looped (~160 restarts / 13h).
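
Why 24.0.2 fails a ">=24.10" requirement can be checked with a component-wise
numeric comparison. A small sketch in Python; the helper names are hypothetical
and this is not how t3 or Keel evaluate versions internally.

```python
def parse_version(text: str) -> tuple:
    """Split a dotted version such as '24.0.2' into integer components."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(actual: str, minimum: str) -> bool:
    """True when actual >= minimum, comparing components numerically."""
    a, m = parse_version(actual), parse_version(minimum)
    width = max(len(a), len(m))
    # Pad the shorter version with zeros so '24.10' compares as '24.10.0'.
    a += (0,) * (width - len(a))
    m += (0,) * (width - len(m))
    return a >= m
```

Here `satisfies_minimum("24.0.2", "24.10")` is False because the minor
component 0 is below 10, even though the patch-level tag looks "more specific".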

Root cause: keel.sh/policy=never was on the POD-TEMPLATE labels, but Keel reads the
policy at the DEPLOYMENT level. The cluster's Kyverno inject-keel-annotations is
opt-out, so it stamped policy=patch and Keel acted on it.

Fix: set keel.sh/policy=never as a deployment-level annotation; ignore_changes the
Kyverno-injected keel.sh/pollSchedule + keel.sh/trigger annotations; the image stays
TF-owned (apply reverted Keel's downgrade). Pod now 1/1, t3 serve 200.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 10:32:38 +00:00
Viktor Barzin
214638216b fix(anisette): wait_for_rollout=false so a slow first start can't strand the deploy out of state
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The docker.io fix created the deployment, but wait_for_rollout (default true)
then hung on the OOMing pod and the apply failed — leaving the deployment in
the cluster but NOT in terraform state, so every later apply hit
'deployments.apps "anisette" already exists'. Deleted that orphan and set
wait_for_rollout=false (mirrors tts/llama-cpp slow-start services); readiness
probe still gates Service traffic.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:56:30 +00:00
Viktor Barzin
d8c60d7ab8 t3-afk: dedicated in-cluster T3 Code instance (AFK executor + cockpit)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Slice #2 of claude-agent-service PRD #1 (AFK implementation pipeline). Dedicated
in-cluster T3 Code instance the control plane dispatches issues into; runs the
issue-implementer agent in a git worktree with a live cockpit. Applied + live
2026-06-14 (9 resources).

Pilot-fast: stock docker.io/library/node:24 + install pinned t3@0.0.27 + Claude
CLI at startup onto an SSD-NFS PVC. Authentik-gated ingress. issue-implementer
behaviour ships as a user-level ~/.claude/CLAUDE.md (T3 hardcodes the system
prompt; settingSources loads it) and forbids plan-mode/clarifying-questions so
unattended threads don't stall. Keel-excluded (ADR 0003). wait_for_rollout=false
(slow first start). Image fully-qualified for the Kyverno trusted-registries
allowlist; container mem limit 4Gi (tier-aux LimitRange cap).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:06:33 +00:00
Viktor Barzin
bc7b28244f fix(anisette): raise memory limit to 512Mi — 128Mi OOMKilled at startup
Some checks failed
ci/woodpecker/push/default Pipeline failed
The pod CrashLooped with OOMKilled (exit 137): anisette downloads and
initializes Apple's CoreADI provisioning library on startup, spiking past the
128Mi limit before it can bind :6969 (empty logs, liveness 'connection
refused'). Bump request 256Mi / limit 512Mi; steady state is much lower.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:54:13 +00:00
Viktor Barzin
96addf65b4 fix(anisette): docker.io/ image prefix to pass Kyverno require-trusted-registries
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
First apply was denied at admission — a bare dadoum/anisette-v3-server@sha256
ref isn't in the trusted-registries allowlist (only enumerated DockerHub
user-repo prefixes are). docker.io/* IS allowlisted, so use the explicit
registry prefix; still pulls via the 10.0.20.10 pull-through cache.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:47:05 +00:00
Viktor Barzin
0bfa6f0774 feat(anisette): self-hosted Apple anisette server for SideStore (infra #40)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Deploy a small stateless anisette-data server so the TripIt iOS Shell can be
sideloaded with SideStore using a free Apple ID, without brokering the
Apple-ID auth dance through a public third-party anisette server (which would
see every login). SideStore points at a stable internal endpoint we control.

- Image: Dadoum/anisette-v3-server, the de-facto standard anisette-v3 server
  for SideStore/AltStore. Upstream ships only a mutable :latest (no GitHub
  releases / semver / sha tags), so pinned by manifest digest instead of a tag
  per the "never :latest" rule. Pulled from DockerHub via the registry-VM
  pull-through cache like echo/cyberchef. Diun watches :latest (notify-only) so
  a new upstream build prompts a digest re-pin.
- Stateless: emptyDir backs the provisioning-library cache dir (regenerable
  download; upstream issue #23 means it doesn't preserve client auth across
  restarts anyway) — no PVC, no Vault secret.
- Internal-only endpoint http://anisette.viktorbarzin.lan (auth=none,
  allow_local_access_only, ssl_redirect off) — SideStore is a native client
  that can't do the Authentik cookie dance, same reasoning as android-emulator's
  adb. The .lan CNAME is auto-created by technitium-ingress-dns-sync; never
  publicly exposed.

Mirrors the echo/networking-toolbox/android-emulator stack pattern. Service
catalog updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:35:57 +00:00
Viktor Barzin
fe1f8d62e7 tripit: re-apply tripit stack to land CITY_IMAGE_PROVIDER=wikipedia
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The commit that enabled real city cover photos (a69847a0,
CITY_IMAGE_PROVIDER=wikipedia, #47) was committed to master but its CI run
skipped the tripit stack apply (changed-stack diff race — same class as the
prior "re-apply after pipeline race" fixes). The env never landed in-cluster,
so the provider stayed on its fake 1x1-PNG default and every trip/stay cover
rendered blank/placeholder in prod. This comment touch forces CI to re-apply
the tripit stack; terraform then reconciles the drift (desired HCL already
has the env) so the deployment picks up CITY_IMAGE_PROVIDER=wikipedia.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 17:45:07 +00:00
Viktor Barzin
2df6ebf305 health: fix middleware ref namespace prefix (restore site from 404)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
My previous commit referenced the new limiter as `health-rate-limit@kubernetescrd`,
omitting the namespace prefix. Traefik CRD middleware refs are
`<namespace>-<name>@kubernetescrd`, and the Middleware lives in the `traefik` ns,
so the router couldn't resolve it — Traefik failed the whole
health.viktorbarzin.me router and returned 404 on every path (the app + pod were
healthy throughout; verified via port-forward).

Correct it to `traefik-health-rate-limit@kubernetescrd`, matching the working
traefik-tripit-rate-limit / traefik-actualbudget-rate-limit references.
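
The reference format can be captured in a one-line helper, sketched here in
Python. The function name is hypothetical; the format string is the
`<namespace>-<name>@kubernetescrd` convention stated above.

```python
def middleware_ref(namespace: str, name: str, provider: str = "kubernetescrd") -> str:
    """Build a Traefik CRD middleware reference: <namespace>-<name>@<provider>."""
    return f"{namespace}-{name}@{provider}"


# The Middleware lives in the traefik namespace, so the prefix is required.
ref = middleware_ref("traefik", "health-rate-limit")
```

Generating the string from the namespace and name, rather than typing it by
hand, makes the missing-prefix mistake harder to repeat.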

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 17:43:08 +00:00
Viktor Barzin
086ff85911 health: dedicated 100/1000 rate limit for the redesigned SPA
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor hit 429s browsing the redesigned health app. The default shared limiter
is 10 req/s / burst 50, but each page load is the shell (JS chunks + two
self-hosted Geist woff2) plus a 5-8 call API burst, so fast tab-to-tab
navigation from one client IP overruns burst 50 — Traefik 429s the tail and the
affected cards/pages render empty.

Give health its own limiter (average 100, burst 1000) and skip the default,
exactly as tripit/immich/actualbudget/ha-sofia already do for the same
parallel-burst pattern. Attached via the ingress_factory escape hatch
(skip_default_rate_limit + extra_middlewares).
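
The overrun can be illustrated with a generic token-bucket simulation. This is
a sketch in Python under stated assumptions: it is a textbook token bucket, not
Traefik's actual limiter, and the navigation pattern (8 page loads, 200 ms
apart, 10 requests each) is hypothetical.

```python
class TokenBucket:
    """Generic token bucket using integer milli-tokens to avoid float drift."""

    def __init__(self, average_per_s: int, burst: int) -> None:
        self.rate = average_per_s        # tokens/s, equal to milli-tokens per ms
        self.capacity = burst * 1000     # bucket size in milli-tokens
        self.tokens = self.capacity      # the bucket starts full
        self.last_ms = 0

    def allow(self, now_ms: int) -> bool:
        elapsed_ms = now_ms - self.last_ms
        self.last_ms = now_ms
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed_ms * self.rate)
        if self.tokens >= 1000:
            self.tokens -= 1000          # spend one whole token
            return True
        return False                     # would be answered with a 429


def count_rejected(bucket: TokenBucket, timestamps_ms) -> int:
    return sum(1 for t in timestamps_ms if not bucket.allow(t))


# Hypothetical pattern: 8 page loads, 200 ms apart, 10 requests each.
navigation = [page * 200 for page in range(8) for _ in range(10)]
default_rejected = count_rejected(TokenBucket(10, 50), navigation)
health_rejected = count_rejected(TokenBucket(100, 1000), navigation)
```

With the 10/50 bucket the later page loads drain the burst allowance faster
than it refills and the tail is rejected; with 100/1000 the same pattern passes.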

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 13:03:51 +00:00
Viktor Barzin
6dc77f4612 uptime-kuma: add CONTEXT.md + ADR-0001 (intentionally lean; sizing/placement review)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Documents the 2026-06-13 right-sizing review: Kuma is already lean (~1 check/s, 227 monitors mostly at 300s, 77MB on shared MySQL, 30d retention); the 'scraping too much' concern traced to a fixed socket.io login-timeout incident, not load. Records the deliberate decisions (keep per-service [External] monitors over canaries; keep datastore on shared mysql.dbaas) with rejected alternatives + rationale, plus the known internal-sync no-prune gap (stale Goldilocks monitor cleaned up by hand).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 09:11:22 +00:00
Viktor Barzin
05bec26d09 health: internal test-access ingress + DEV_AUTH_EMAIL (ADR-0008)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Add health-test.viktorbarzin.lan (auth=none, allow_local_access_only,
anti-AI off) pointing at the same health deployment, plus a
DEV_AUTH_EMAIL=vbarzin@gmail.com env on the container. Lets automated
E2E / Playwright / manual screenshots reach the live app without the
Authentik SSO redirect, for testing — while the public
health.viktorbarzin.me ingress stays auth=required (forward-auth fails
closed, so the public path always carries the real X-authentik-email
header and never hits the DEV_AUTH_EMAIL fallback). LAN-only, no public
exposure. Decision recorded in health repo ADR-0008.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 04:02:34 +00:00
Viktor Barzin
e6699ed20b uptime-kuma: retry Kuma login in monitor-sync jobs (intermittent socket.io timeout)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The internal + external monitor-sync CronJobs intermittently failed with socketio.exceptions.TimeoutError on api.login(), firing JobFailed -> Slack noise (and leaving monitor sync stale). Kuma 2.3.2 itself is healthy (1/1, 30m CPU); its single Node event loop just briefly stalls under ~300 monitors so the socket.io login handshake occasionally exceeds the client timeout. Wrap connect+login in a 5-attempt / 15s-backoff retry (disconnecting the half-open client between tries) so a transient stall no longer fails the whole job. Applied to both sync scripts.
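
A minimal sketch of the retry wrapper, in Python. The `connect`, `login`, and `disconnect` callables are hypothetical stand-ins for the Kuma API client calls, the built-in TimeoutError stands in for socketio.exceptions.TimeoutError, and a fixed (non-growing) 15s delay is an assumption.

```python
import time


def login_with_retry(connect, login, disconnect, attempts=5, backoff_s=15.0,
                     retry_on=(TimeoutError,), sleep=time.sleep):
    """Run connect+login, retrying on timeout; return the attempt that succeeded."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            connect()
            login()
            return attempt
        except retry_on as exc:
            last_error = exc
            disconnect()  # drop the half-open client before the next try
            if attempt < attempts:
                sleep(backoff_s)  # assumption: same delay between all tries
    raise last_error  # every attempt timed out: let the job fail
```

Passing `sleep` as a parameter keeps the wrapper testable without waiting; the final `raise` preserves the JobFailed signal for a sustained outage.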

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 20:54:14 +00:00
Viktor Barzin
a6381b8cf8 forgejo: custom 8Gi ResourceQuota (was pegged at the 4Gi tier cap)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Yesterday's Forgejo 3Gi->4Gi OOM fix pushed its tier-3-edge namespace quota (requests.memory=4Gi) to 100%, firing KubeQuotaAlmostFull + the healthcheck resourcequota check. Forgejo is the git + OCI-registry backbone and legitimately needs ~4Gi, so the edge tier's 4Gi ceiling is too tight. Opt the namespace out of the auto tier quota (resource-governance/custom-quota=true) and define a forgejo-specific ResourceQuota at requests.memory=8Gi, so the 4Gi pod sits at ~50% with headroom. Same opt-out pattern dbaas uses. Re-tiering was rejected: tier 1-cluster is also 4Gi, and 0-core (8Gi) would over-classify Forgejo's priority/eviction.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 17:16:47 +00:00
Viktor Barzin
72982683bc docs(CLAUDE.md): k8s-portal now GHA->ghcr, not a Woodpecker build
All checks were successful
ci/woodpecker/push/default Pipeline was successful
k8s-portal was the last in-cluster image builder. Its .woodpecker/k8s-portal.yml
was deleted; it now builds on GHA (build-k8s-portal.yml) -> PRIVATE ghcr, pulled
via the Kyverno ghcr-credentials allowlist and deployed by Keel. Fix the CI/CD
section: drop k8s-portal from the Woodpecker-pipelines list (stale), move it from
'already on GHA' to the infra-owned private-ghcr images, and add it to the
PRIVATE ghcr allowlist roster. Completes the no-local-builds migration.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 16:10:56 +00:00
Viktor Barzin
25a39fd54e k8s-portal: wire private-ghcr pull (allowlist + imagePullSecrets)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
k8s-portal was the last in-cluster image build; it now builds on GHA and
pushes ghcr.io/viktorbarzin/k8s-portal:latest, which is PRIVATE (infra repo
default). To pull it: add k8s-portal to the sync-ghcr-credentials Kyverno
allowlist (clones the ghcr-credentials Secret into the namespace) and
reference that secret via imagePullSecrets on the deployment — same wiring
as tripit/recruiter-responder. Completes the no-local-builds migration so
nothing builds container images on the cluster anymore (ADR-0002).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 15:38:42 +00:00
Viktor Barzin
a7d33abec9 k8s-portal: commit package.json + lock (force; was gitignored) — unblocks GHA build
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build k8s-portal / build (push) Has been cancelled
Recovered the real manifest + resolved lockfile (lockfileVersion 3, 71 pkgs)
from the running pod. A parent .gitignore force-ignored package.json, so the
git source tree was incomplete and the image only ever built manually. Now
reproducible on GHA (ADR-0002 no-local-builds).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 15:29:27 +00:00
Viktor Barzin
a9b08c03cf fix(k8s-portal): npm install (no committed lockfile) so GHA can build
Some checks are pending
Build k8s-portal / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
package-lock.json was never committed to either lineage — npm ci needs it,
so the build only ever worked from a manual devvm build with a local lock.
npm install resolves from package.json, unblocking the GHA build (ADR-0002).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 15:26:42 +00:00
Viktor Barzin
bdfdf8db72 fix(ci): k8s-portal build context is stacks/k8s-portal/modules/k8s-portal/files (was stale platform/ path)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 15:23:46 +00:00
Viktor Barzin
b906f61ac3 k8s-portal: build off-infra GHA -> ghcr + Keel; remove Woodpecker build (no-local-builds)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
The last in-cluster image build. GHA build-k8s-portal.yml builds
ghcr.io/viktorbarzin/k8s-portal:latest+sha (path-filtered on the Dockerfile
dir); Keel (force/poll/match-tag) rolls the deployment. Stack image repointed
to ghcr (ignore_changed); .woodpecker/k8s-portal.yml deleted.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 15:21:35 +00:00
Viktor Barzin
9501da81a0 dbaas: document postgresql-backup startingDeadlineSeconds rationale
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Inline note on why the four backup CronJobs moved 10s->600s (bda1bdcb): a 10s deadline silently dropped the 2026-06-13 midnight full-backup run, firing PostgreSQLBackupStale. bda1bdcb rode in the same push as a forgejo change that failed CI on a namespace-quota error, so that pipeline failed before the dbaas apply took effect (live deadline was still 10s). This dbaas-only commit re-triggers the dbaas apply at a clean master so the 600s deadline actually goes live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 14:22:24 +00:00
Viktor Barzin
ba72621e52 forgejo: 6Gi exceeded namespace quota, set to 4Gi (quota ceiling)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
The 3Gi->6Gi bump in ff3cc44a was rejected by the forgejo namespace tier-quota (requests.memory capped at 4Gi). With Guaranteed QoS the 6Gi request exceeded quota; FailedCreate left forgejo with 0 pods for ~6 min (git remote + OCI registry outage) until I patched the live Deployment back to a schedulable 4Gi. 4Gi is the most the quota allows and is still a headroom bump over the OOM-prone 3Gi. To go higher the tier-quota must be raised in the same change. This reconciles TF to the live 4Gi so the pending/next apply is a no-op rather than reverting to the quota-busting 6Gi.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 14:13:36 +00:00
Viktor Barzin
ff3cc44a29 forgejo: raise memory limit from 3Gi to 6Gi (OOMKilled at 3Gi)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Forgejo OOMKilled twice on 2026-06-13 at the 3Gi cap (exit 137), briefly taking the git remote and OCI registry down and spiking ingress TTFB to 4.7s and the 4xx rate to 51%. Steady-state is ~2.2Gi but it spiked into the cap (true demand above 3.2Gi). The 2026-06-09 bump to 3Gi was sized for tripit buildkit registry pushes, but that driver is gone now that the Forgejo registry was frozen and emptied today (ADR-0002, images on ghcr), so the spike is git ops / the integrity-probe catalog walk / a possible leak. 6Gi gives headroom on the critical git backbone while we watch whether working-set keeps climbing (which would indicate a leak).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 14:02:55 +00:00
Viktor Barzin
bda1bdcbf3 dbaas: widen backup CronJob startingDeadlineSeconds from 10s to 600s
The daily full PostgreSQL backup silently skipped its 2026-06-13 00:00 run, leaving the last full dump 37h old and firing the critical PostgreSQLBackupStale alert. Root cause: startingDeadlineSeconds was 10s on all four dbaas backup CronJobs, so when the CronJob controller was more than 10s late to the midnight tick (many IO-heavy backups all fire at 00:00, the known etcd-starvation window) the run was dropped entirely instead of starting late. 600s lets a brief controller lag still launch the job. Applied to all four (mysql + pg, full + per-db) since they share the footgun and the midnight contention.
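
The effect of the deadline can be sketched as a simple predicate, in Python. This is a simplified model of the CronJob controller's behaviour, and the 25s controller lag in the example is hypothetical.

```python
from datetime import datetime, timedelta


def run_is_started(scheduled: datetime, observed: datetime,
                   starting_deadline_seconds: int) -> bool:
    """True when the controller notices the tick within the allowed window."""
    lateness = (observed - scheduled).total_seconds()
    return 0 <= lateness <= starting_deadline_seconds


midnight = datetime(2026, 6, 13, 0, 0, 0)
late_controller = midnight + timedelta(seconds=25)  # hypothetical lag

dropped = not run_is_started(midnight, late_controller, 10)
started = run_is_started(midnight, late_controller, 600)
```

Under this model the same late tick is dropped with a 10s deadline and launched with a 600s deadline.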

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 14:02:54 +00:00
082bdfcc77 fan-control: thin actuator — HA computes the setpoint, host only applies it
The R730 fan-control logic now lives entirely in Home Assistant: the curve
thresholds, duty %, bias and asymmetric deadband, plus manual/lock, are set on
the dashboard and published as sensor.r730_fan_command_pct. The host daemon is
reduced to a thin actuator — it reads that one number each loop, validates it
(numeric + not older than STALE_SECS) and applies it over IPMI. Removed the
presence-aware two-curve logic and the garage-door coupling.

Safety stays independent on the host: CPU>=CEILING, repeated IPMI failures, or
HA unreachable/stale all hand the fans back to Dell auto. RPM telemetry now
averages all 6 chassis fans. Deployed and verified live on pve (applies the HA
command; fans follow).
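
The thin-actuator decision can be sketched as a pure function, in Python. This
is an illustration, not the host daemon: the function name, the parameter
names, the 0-100 range check, and falling back to Dell auto on a non-numeric
value are all assumptions.

```python
import math


def decide_fan_action(raw_command, age_secs, cpu_temp_c, ipmi_failures, *,
                      stale_secs, ceiling_c, max_ipmi_failures):
    """Return ("manual", pct) to apply the HA command, or ("auto", None)."""
    # Host-side safety is checked first and does not depend on HA.
    if cpu_temp_c >= ceiling_c or ipmi_failures >= max_ipmi_failures:
        return ("auto", None)
    # HA unreachable (no value) or stale value: hand back to Dell auto.
    if raw_command is None or age_secs > stale_secs:
        return ("auto", None)
    try:
        pct = float(raw_command)
    except (TypeError, ValueError):
        return ("auto", None)  # assumption: non-numeric state is unsafe
    if not math.isfinite(pct) or not 0 <= pct <= 100:
        return ("auto", None)  # assumption: reject out-of-range duty
    return ("manual", int(round(pct)))
```

Keeping the decision free of IPMI and HTTP calls means the safety ordering can
be exercised with plain values before the daemon touches hardware.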

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 12:59:57 +00:00
Viktor Barzin
3e82c64a76 docs: sync CI/CD docs to ADR-0002 final state (ghcr + Woodpecker deploy-only) [ci skip]
ADR-0002 is fully landed (issues #11-#32 closed): every owned image now
builds on GitHub Actions and pushes to ghcr.io/viktorbarzin/<name>, with
Woodpecker reduced to deploy-only. The Forgejo container registry is frozen
and emptied; there are no in-cluster image builds or CI test runs anywhere.
The docs still described the old hybrid topology (DockerHub builds,
Woodpecker-native owned-app builds, the per-pattern migration lists, the
tripit-only pilot framing), which would mislead future sessions and
incident response.

This brings the docs to the completed reality (closes #33):

- docs/architecture/ci-cd.md: full rewrite as the canonical CI/CD reference —
  the fleet GHA->ghcr->Woodpecker-deploy pattern, public/private ghcr package
  split, infra-owned image workflows (incl. infra-ci on ghcr), the frozen
  Forgejo registry, what Woodpecker still runs, and the #31 decommissions.
- .claude/CLAUDE.md: rewrite the "CI/CD Architecture" section to the
  fleet-wide final state; FIX the stale claim that claude-memory-mcp builds
  to DockerHub (it is GHA->ghcr); note owned images now live on ghcr and the
  Forgejo registry is frozen/break-glass near the image-registry bullet.
- .claude/reference/service-catalog.md: f1-stream is GHA->ghcr + Woodpecker
  deploy-only (was "Woodpecker-native build->deploy").
- stacks/{tuya-bridge,android-emulator}/variables.tf + stacks/terminal/main.tf:
  cosmetic description/comment updates (forgejo -> ghcr; terminal-lobby has no
  CI pipeline). Description/comment text only — no stack logic changed.

Historical records (docs/post-mortems/*, docs/plans/*) and ADR-0002 itself
are left untouched as point-in-time records.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 12:55:49 +00:00
Viktor Barzin
6e4db0ddc6 openclaw + f1-stream: last forgejo image refs -> ghcr (ADR-0002 #32 prep)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
openclaw's install-nextcloud-todos-plugin init still pulled forgejo
nextcloud-todos (would ImagePullBackOff on restart once the forgejo
registry is wiped) -> ghcr:latest. f1-stream stack base (KEEL_IGNORE'd,
live already ghcr via set-image) repointed for fresh-create correctness.
Clears the last LIVE forgejo viktor/* refs before the registry reclaim.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 12:36:10 +00:00
Viktor Barzin
3c3e6bfc95 ci: retire in-cluster infra-ci build; breakglass becomes manual ghcr pull-and-save (ADR-0002 #30)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
infra-ci now builds on GHA → ghcr and the ghcr-based apply is PROVEN
(pipeline 165 ran terragrunt apply in the ghcr image). Removing the
Woodpecker build-ci-image.yml (clean cut). The breakglass tarball is
preserved as a MANUAL Woodpecker job pulling ghcr (public) → registry VM;
infra-ci on ghcr is external + node-cached, so the Forgejo-down rationale
for the old auto-tarball is moot — this is belt-and-braces DR.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 10:07:58 +00:00
Viktor Barzin
ee25a41c74 ci: apply + drift steps run on ghcr infra-ci (ADR-0002 #30)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The terragrunt apply step (default.yml) and drift-detection now pull
ghcr.io/viktorbarzin/infra-ci:latest (GHA-built, verified toolchain:
tf 1.5.7 / tg 0.99.4 / sops / kubectl 1.34 / vault / git-crypt). ghcr is
public + proven pullable in-cluster. build-ci-image.yml (forgejo build)
KEPT as the fallback copy until this ghcr-based apply is proven, so a
revert restores the working forgejo image if needed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 10:05:34 +00:00
Viktor Barzin
23fc2bf2ec ci: GHA→ghcr build for infra-ci (ADR-0002 #30, bootstrap-safe — woodpecker build kept until proven)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 09:53:43 +00:00
Viktor Barzin
eb8b550521 chrome-service: TF-manage novnc image (ghcr:latest), drop its KEEL_IGNORE (ADR-0002 #29)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
novnc's image was ignore_changed (KEEL_IGNORE) but nothing manages its
tag (keel.sh/policy=never), so the earlier forgejo->ghcr repoint never
took. Removing container[1].image from ignore_changes lets terragrunt
own novnc=ghcr:latest and roll it. container[0]/[2] (pinned playwright)
stay ignored.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 09:49:58 +00:00
Viktor Barzin
94a3d1b870 chrome-service-novnc + android-emulator images -> ghcr (ADR-0002 #29/#30)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Both now built by GHA → public ghcr. Repoint stack image bases
forgejo→ghcr:latest (terragrunt-managed, imagePullPolicy Always picks up
rebuilds). android var default api36-v8 -> latest.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 09:43:40 +00:00
Viktor Barzin
a69847a0f3 tripit: enable Wikipedia city cover photos (CITY_IMAGE_PROVIDER=wikipedia, #47)
Flips the planning workspace's Stay cover photos from the fake provider to live Wikipedia lead-image fetches (downloaded into STORAGE_DIR, served by the backend, editable per Stay). Part of the new-trip flow feature: every picked destination city gets a banner-ready cover. HOLD-ORDER: pushed only after the tripit image containing CityImageMode.wikipedia rolled out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 09:43:40 +00:00
Viktor Barzin
1621f0b204 ci: GHA→ghcr builds for chrome-service-novnc, android-emulator, infra CLI (ADR-0002 #29/#30)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Infra-owned rare-build images move off Woodpecker/manual to GHA (build
from the github checkout — Dockerfiles verified identical on both
remotes). chrome-service-novnc + android-emulator → public ghcr
(dispatch+path). CLI → DockerHub (kept) + ghcr; Woodpecker build-cli.yml
removed. infra-ci handled separately (bootstrap-critical).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 09:38:36 +00:00
Viktor Barzin
f61d707d75 travel_blog: remove decommissioned stack (ADR-0002 infra#31)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Service was already scaled 0/0 and unused (Viktor: 'not used anymore').
Live resources destroyed via scripts/tg destroy (10 resources: deployment,
namespace, service, anubis-travel + PDB/cm/svc/secret, ingress, TLS).
Removing the stack dir; old Woodpecker build (repo 5) deactivated
separately. The harmless legacy 'travel' CNAME->apex in config.tfvars is
left (now 404s; removing it would trigger a full-platform apply).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 09:32:39 +00:00
Viktor Barzin
90fb0685ae traefik: x402-gateway image forgejo -> ghcr + KEEL_IGNORE_IMAGE (ADR-0002 infra#28)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Formalizing x402-gateway CI (was a manual no-CI image). The deployment
lives in the traefik module; its image was NOT in ignore_changes, so a
set-image deploy would be reverted on the next traefik apply — added it
(KEEL_IGNORE_IMAGE). Base repointed to ghcr:latest; the GHA deploy
set-images the :sha8. Public ghcr package = no pull secret. Inert on the
live pod (image now ignored); rolling cutover keeps forwardAuth up.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 02:42:45 +00:00
Viktor Barzin
bdea34b992 offinfra-onboard: --dockerfile flag for non-root Dockerfiles
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
claude-memory-mcp's Dockerfile is at docker/Dockerfile, not repo root
(infra#20 build failed: 'open Dockerfile: no such file or directory').
build.yml template gains file: {{DOCKERFILE}} (default ./Dockerfile).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 02:37:25 +00:00
Viktor Barzin
3960eac716 claude-memory: image base forgejo -> ghcr (ADR-0002 infra#20)
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was canceled
GHA now builds+pushes ghcr.io/viktorbarzin/claude-memory-mcp (public).
Image is KEEL_IGNORE_IMAGE (set-image managed), so this apply is inert
on the live pod; the stale :17 default literal is corrected to :latest.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 02:34:20 +00:00
Viktor Barzin
2f3c58dff1 claude-agent-service image -> ghcr across all five consumer stacks (infra#19)
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
GHA now builds+pushes ghcr.io/viktorbarzin/claude-agent-service (public
package, anonymous pulls). Repointed: claude-agent-service (deployment +
git-init/seed-beads-agent inits), claude-breakglass, ci-pipeline-health,
beads-server CronJobs, k8s-version-upgrade (tag var 2fd7670d -> latest —
the Forgejo registry lost that sha; node caches were the only thing
keeping those CronJobs alive). publish-gate: vendor-contact emails
(licensing@/legal@/security@/sales@) ruled license-boilerplate, not PII.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 01:47:54 +00:00
Viktor Barzin
8aba3a0179 offinfra-onboard --no-deploy; wealthfolio-sync image -> ghcr (ADR-0002 infra#25)
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
broker-sync is a CronJob-only consumer (no deployment): new --no-deploy
mode skips Woodpecker registration and renders build.yml without the
deploy job — :latest+Always CronJobs pick up builds on the next run.
wealthfolio stack: ghcr-credentials pull secret + image base repoint.
The wealthfolio-sync image regains a reproducible rebuild path.

Closes: code-62tm

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 01:39:35 +00:00
Viktor Barzin
2dde480795 openclaw: install-recruiter-plugin init image forgejo -> ghcr :latest (infra#27)
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Second half of the recruiter-responder off-infra migration: the first GHA
build has published ghcr.io/viktorbarzin/recruiter-responder:{1d99a8d5,latest},
so the openclaw plugin-install init container can now follow the ghcr
:latest. The forgejo-side build pipeline was removed by the onboarding
commit, so the old forgejo :latest tag is frozen and would silently serve
stale plugin code. Deferred from the first commit on purpose - flipping it
before the package existed would have wedged the openclaw rollout on
ImagePullBackOff.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:57:30 +00:00
Viktor Barzin
57ff41e47e recruiter-responder: pull image from ghcr + ghcr-credentials on all consumers (ADR-0002, infra#27)
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Migrating recruiter-responder off in-cluster Woodpecker builds: GHA will
build and push ghcr.io/viktorbarzin/recruiter-responder (PRIVATE package).
This commit lands the pull-side prerequisites BEFORE the first off-infra
build fires:

- stacks/recruiter-responder: image base forgejo -> ghcr (inert on the live
  Deployment - both containers are ignore_changes'd; the Woodpecker deploy
  moves the tag) + ghcr-credentials imagePullSecrets on the Deployment
  (covers the recruiter-responder container AND the alembic-migrate init
  container, which share the image).
- stacks/openclaw: ghcr-credentials imagePullSecrets on the openclaw
  Deployment - its install-recruiter-plugin init container consumes the
  :latest tag of this image. The image ref itself flips to ghcr in a
  follow-up once the first GHA build has created the package (flipping now
  would ImagePullBackOff on a not-yet-existing package and wedge the apply).
- stacks/kyverno: allowlist openclaw in sync-ghcr-credentials so the pull
  secret is cloned into that namespace too.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:43:35 +00:00
Viktor Barzin
c594274c83 ci: re-apply fire-planner stack after pipeline race
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Comment-only touch so the changed-stack detection applies
stacks/fire-planner from the current master tree. Pipeline 150 (commit
f18dfa4c — the ghcr image base + ghcr-credentials migration for issue
#26) was auto-killed when the concurrent nextcloud-todos push superseded
it, and pipeline 151 diffed from f18dfa4c onward so the fire-planner
stack changes were never applied (cronjobs still point at the forgejo
image, pod specs lack ghcr-credentials).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:41:20 +00:00
Viktor Barzin
a264a19629 Merge remote-tracking branch 'forgejo/master' into wizard/nextcloud-todos-ghcr
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-13 00:38:27 +00:00
Viktor Barzin
d5c328d23c nextcloud-todos: image base forgejo -> ghcr (ADR-0002, infra#18)
The nextcloud-todos build moved off-infra: GHA builds on the public
GitHub mirror and pushes ghcr.io/viktorbarzin/nextcloud-todos (public
package, anonymous pulls); Woodpecker repo 207 is deploy-only. First
ghcr image (:19c22d8c) is already built, deployed and rolled out, so
this repoint lands after the image exists. Both deployment image refs
(main + alembic-migrate init) are ignore_changes'd — no live churn,
the base matters only on resource (re)create. Old image was pulled
from a Forgejo registry package that no longer exists (pods survived
on node image cache only).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:38:25 +00:00
Viktor Barzin
f18dfa4c8b fire-planner: pull image from ghcr + add ghcr-credentials to all pod specs
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
Migrating fire-planner off in-cluster Woodpecker builds to GitHub
Actions -> ghcr.io (ADR-0002, issue #26). The image base moves
forgejo.viktorbarzin.me/viktor/fire-planner ->
ghcr.io/viktorbarzin/fire-planner (a PRIVATE ghcr package), so the
deployment, all three cronjobs (recompute, col-refresh,
examples-weekly) and the examples bulk job gain the ghcr-credentials
imagePullSecret (the kyverno sync-ghcr-credentials allowlist already
covers the fire-planner namespace). registry-credentials stays
alongside so the currently-running sha-pinned forgejo image can still
be pulled until the first ghcr deploy lands; the cronjob images are TF
literals and flip to ghcr :latest on this apply.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:38:09 +00:00
Viktor Barzin
e696957ebf ci: ancestor guard on DIFF_BASE; gate allowlists the owner's work email [ci skip]
Infra pipelines that were restarted after master had moved diffed in
REVERSE and re-applied stale trees (pipeline 148 reverted payslip-ingest's
fresh ghcr config; repaired by the wave-2 agent). Only trust
CI_PREV_COMMIT_SHA when it is an ancestor of HEAD. publish-gate:
viktorbarzin@meta.com is the owner's own work email (same class as the
allowlisted personal domain), not blockable PII — unblocks infra#18.
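
A minimal sketch of the ancestor guard described above, assuming the check is done with `git merge-base --is-ancestor` and that `HEAD~1` remains the fallback base. The function names and the injectable `is_ancestor` callable are illustrative, not the pipeline's actual code.

```python
import subprocess
from typing import Callable, Optional


def git_is_ancestor(candidate: str, head: str = "HEAD") -> bool:
    """True when `candidate` is an ancestor of `head`.

    `git merge-base --is-ancestor A B` exits 0 if A is an ancestor of B,
    1 if it is not, and with another code on errors (e.g. unknown SHA).
    """
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", candidate, head],
        capture_output=True,
    )
    return result.returncode == 0


def pick_diff_base(
    prev_sha: Optional[str],
    is_ancestor: Callable[[str], bool] = git_is_ancestor,
    fallback: str = "HEAD~1",  # assumption: same fallback as the earlier fix
) -> str:
    """Trust the CI-provided previous SHA only if it lies behind HEAD."""
    if prev_sha and is_ancestor(prev_sha):
        return prev_sha
    # Missing, unknown, or newer-than-HEAD SHA: a diff from it would run
    # in reverse, so fall back instead.
    return fallback
```

In the pipeline 148 case, the restarted job saw a previous SHA that was newer than its own checkout, so a guard like this would have rejected it.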

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:31:33 +00:00
Viktor Barzin
cdd60d9078 ci: re-apply instagram-poster + payslip-ingest stacks after pipeline race
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Comment-only touch of both stacks so the changed-stack detection applies
them from the current master tree. Two pipelines went wrong in sequence
during the parallel ADR-0002 wave-2 migrations (issues #23/#24):

- pipeline 146 (instagram-poster stack prep, commit 29c69250) was
  auto-killed when the concurrent payslip-ingest push superseded it, so
  its apply never ran;
- restarting it as pipeline 148 inherited CI_PREV_COMMIT_SHA = the NEW
  branch head (6928ce0b) with the OLD checkout (29c69250) — a reverse
  diff that re-applied stacks/payslip-ingest from the pre-migration
  tree, stripping the ghcr image base + ghcr-credentials pull secrets
  that pipeline 147 had just applied (2 resources reverted).

This commit restores the committed payslip-ingest config exactly as
issue #24 landed it and finally applies the instagram-poster ghcr prep
from issue #23. Lesson encoded in the comments: do not restart killed
infra pipelines after master has moved — re-trigger with a touch commit
instead.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:11:17 +00:00
Viktor Barzin
6928ce0be5 Merge remote-tracking branch 'forgejo/master' into wizard/payslip-ingest-ghcr
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-13 00:03:29 +00:00
Viktor Barzin
5d236c2352 payslip-ingest: image base forgejo -> ghcr, ghcr-credentials pull secret, cron to :latest+Always
Prep for moving payslip-ingest's image build off-infra to GitHub Actions ->
ghcr.io (ADR-0002 wave 2, issue #24). One stack commit before onboarding:

- image base repointed forgejo.viktorbarzin.me/viktor/payslip-ingest ->
  ghcr.io/viktorbarzin/payslip-ingest (private ghcr package)
- ghcr-credentials imagePullSecrets added on the Deployment AND the
  actualbudget-payroll-sync CronJob pod specs (namespace is already in the
  kyverno sync-ghcr-credentials allowlist; secret verified present)
- the CronJob's SHA pin is retired: terragrunt image_tag 4f70681d -> latest
  plus explicit imagePullPolicy Always on the cron container, per the fleet
  convention for owned-app CronJobs — one less set-image target, and the
  cron can never go back to pulling the dead Forgejo tag

The Deployment keeps KEEL_IGNORE_IMAGE; its concrete :sha8 tag is set by
the Woodpecker deploy pipeline after each GHA build.

Closes: nothing yet — the repo-side onboarding (offinfra-onboard) follows.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:03:11 +00:00
Viktor Barzin
29c6925031 instagram-poster: image base forgejo->ghcr + ghcr-credentials pull secret
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Prep for migrating instagram-poster off in-cluster Woodpecker builds to
GitHub Actions -> ghcr.io (ADR-0002, issue #23, PRIVATE-repo path).
Viktor asked for the wave-2 migration of instagram-poster per the wave-1
retro recipe: before onboarding, the stack must (a) carry the
ghcr-credentials imagePullSecret on the Deployment so the cluster can
pull the private ghcr image, and (b) repoint the image base from
forgejo.viktorbarzin.me/viktor to ghcr.io/viktorbarzin.

The Deployment image is KEEL_IGNORE_IMAGE (ignore_changes), so this
apply does NOT roll the pod to a not-yet-existing ghcr image — the live
forgejo-built :da5b4191 keeps running until the first GHA build POSTs
the Woodpecker deploy. The three CronJobs run curlimages/curl (public
DockerHub), not the app image, so they need neither the pull secret nor
a repoint. registry-credentials stays for the transition window.

Closes: nothing (stack prep only; repo onboarding follows)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 00:02:04 +00:00
Viktor Barzin
72b5843e4b publish-gate: exclude package-lock + beads tracker from email heuristic; beadboard image base -> ghcr
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
infra#17: the gate flagged npm deprecation boilerplate (package-lock.json
escapes the *.lock filter) and the upstream fork author's email in tracked
.beads data — both already-public upstream content, ruled false positives.
Lock files excluded properly; .beads moved to the eyeball inventory.
beads-server stack: beadboard image base repointed (deployment image is
KEEL-ignored; no CronJobs use it).
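
Why package-lock.json slipped through: a glob such as `*.lock` matches on the file name's ending, and package-lock.json ends in `.json`. The sketch below reproduces that with Python's `fnmatch`; the exclusion tuple and the `files_to_scan` helper are hypothetical stand-ins, not the gate's real filter.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Sequence

# Assumption: an explicit name is added next to the glob.
LOCK_PATTERNS = ("*.lock", "package-lock.json")


def files_to_scan(
    paths: Iterable[str], exclude: Sequence[str] = LOCK_PATTERNS
) -> List[str]:
    """Drop paths whose base name matches any exclusion pattern."""
    kept = []
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        if any(fnmatchcase(name, pattern) for pattern in exclude):
            continue
        kept.append(path)
    return kept


# The old filter alone does not cover npm's lock file.
print(fnmatchcase("package-lock.json", "*.lock"))
print(files_to_scan(["web/package-lock.json", "yarn.lock", "src/app.py"]))
```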

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:52:07 +00:00
Viktor Barzin
57ffd0ed8d Merge remote-tracking branch 'forgejo/master' into wizard/freedify-mig
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-12 23:37:19 +00:00
Viktor Barzin
c16fe56180 freedify: image base forgejo registry -> ghcr (ADR-0002)
Freedify builds moved off-infra per issue #22: GitHub Actions on the
ViktorBarzin/freedify mirror now builds and pushes the public image
ghcr.io/viktorbarzin/freedify, and the Woodpecker deploy pipeline
(repo 202) rolls :sha8 via kubectl set image. Both factory deployments
(music-viktor, music-emo) now seed from ghcr instead of the retired
in-cluster Forgejo build, and the container image joins lifecycle
ignore_changes (KEEL_IGNORE_IMAGE) so terraform applies do not revert
the deployed :sha8. Landed after the first GHA push so ghcr :latest
already existed when this repoint applied. Public package - no pull
secret needed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:37:10 +00:00
Viktor Barzin
9f742b544c kms: image base forgejo registry -> ghcr (ADR-0002 infra#21)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
kms-website moves off in-cluster Woodpecker builds to GHA -> ghcr.
The kms-web-page deployment image is ignore_changes'd (CI sets the live
tag), so this repoint only governs future creates; package is PUBLIC so
no pull secret is wired. No CronJobs in this stack.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:30:07 +00:00
Viktor Barzin
fb88440ec4 ci-pipeline-health: billing moved to the enhanced usage endpoint
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The legacy /settings/billing/actions endpoint now returns 410; sum
Minutes usageItems from /settings/billing/usage instead (found during
the infra#16 retro: June-to-date = 420/2000).
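
A rough sketch of the summation, assuming the usage response is a JSON object with a `usageItems` list whose entries carry `unitType` and `quantity` fields. Those two field names and the sample payload are assumptions for illustration; the real sweep is a shell script.

```python
from typing import Any, Dict


def sum_minutes(usage: Dict[str, Any]) -> float:
    """Add up the quantity of every usage item measured in minutes."""
    total = 0.0
    for item in usage.get("usageItems", []):
        # Compare case-insensitively: the unit label's casing is not
        # something this sketch wants to depend on.
        if str(item.get("unitType", "")).lower() == "minutes":
            total += float(item.get("quantity", 0))
    return total


# Made-up payload: two minute-based items and one storage item.
sample = {
    "usageItems": [
        {"unitType": "Minutes", "quantity": 100},
        {"unitType": "minutes", "quantity": 20.5},
        {"unitType": "GigabyteHours", "quantity": 3},
    ]
}
print(sum_minutes(sample))
```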

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:24:18 +00:00
Viktor Barzin
12bdd06f74 kyverno: force_new on sync-ghcr-credentials — generate rules are immutable
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Pipeline 138: the validate-policy webhook denies in-place edits of a
generate rule (allowlist additions). force_new = delete+recreate;
generated secrets survive and generateExisting re-adopts.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:18:15 +00:00
Viktor Barzin
6b0d42c7bc publish-gate + tuya-bridge ghcr cutover prep (ADR-0002 infra#15)
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline failed
publish-gate: gitleaks + trufflehog (full history) + PII heuristics;
CLEAN verdict gates any public flip, DIRTY = stays private. tuya-bridge:
ghcr-credentials pull secret + image base -> ghcr; namespace added to
the ghcr-credentials allowlist as a safety net (new ghcr packages
default PRIVATE even from public repos — prune after visibility flip).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:12:02 +00:00
Viktor Barzin
54dfaf6edc job-hunter: image base forgejo registry -> ghcr (ADR-0002)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
CronJobs track :latest via the TF literal (unlike the ignore_changes'd
deployment), so they kept pulling the dead Forgejo image after the
GHA/ghcr cutover — repoint the stack's image base.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:06:54 +00:00
Viktor Barzin
51682ee939 offinfra-onboard: require clean clone + ff to forgejo master first [ci skip]
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 23:00:55 +00:00
Viktor Barzin
09bb0b50a1 offinfra-onboard: forgejo token fallback to ~/.git-credentials [ci skip]
job-hunter's clone uses the credential-store helper (no token embedded
in the remote URL, unlike f1-stream).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 22:59:32 +00:00
Viktor Barzin
1c41781996 job-hunter: ghcr-credentials pull secret on deployment + CronJobs
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
ADR-0002 wave 1 (infra#14): job-hunter's image moves to private ghcr;
the deployment AND both :latest CronJobs need the Kyverno-cloned pull
secret.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 22:56:48 +00:00
Viktor Barzin
6f41de71fa offinfra-onboard: normalize Woodpecker repo to untrusted [ci skip]
Trusted repos get netrc injected into every step container; the
non-root bitnami/kubectl deploy step dies with '//.netrc: Permission
denied' (hit live on f1-stream's reactivated old-era repo 10, which
carried trusted=true; tripit 167 is untrusted and works).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 22:32:08 +00:00
Viktor Barzin
beac1b57a3 offinfra-onboard: re-activate inactive Woodpecker registrations [ci skip]
Hit live on f1-stream: the old GHA-era ViktorBarzin/f1-stream
registration (repo 10) existed but was deactivated; the lookup matched
it and skipped registration, leaving the deploy POST pointed at an
inactive repo. Now checks .active and re-activates in place via
forge_remote_id.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 22:28:03 +00:00
Viktor Barzin
baff3d7477 offinfra-onboard: per-repo GHA->ghcr migration tool + f1-stream ghcr pull secret
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
ADR-0002 tracer bullet (infra#13), per Viktor's go-ahead. Idempotent
script: GitHub mirror repo (create/unarchive/visibility), GHA secrets
via gh, Forgejo push-mirror (sync_on_commit) + initial sync, Woodpecker
mirror registration, renders build.yml/deploy.yml from templates
(single-manifest provenance:false, svu semver to Forgejo, ghcr keep-10
retention, Slack notify-failure, manual-event deploy), removes the old
in-cluster build pipeline, commits on the canonical side. f1-stream
stack gains the ghcr-credentials imagePullSecret (first consumer).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 22:21:22 +00:00
Viktor Barzin
3138a0a040 Merge remote-tracking branch 'forgejo/master' into wizard/breakglass
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline failed
2026-06-12 21:41:58 +00:00
Viktor Barzin
32cf75635f claude-breakglass: in-cluster warm break-glass UI for the devvm
Stand up the infra for Viktor's break-glass: when the devvm is wedged (cluster
healthy), open breakglass.viktorbarzin.me, have Claude SSH in to diagnose/fix,
and power-cycle VM 102 via the Proxmox host if needed. App half landed in the
claude-agent-service repo.

New stack stacks/claude-breakglass/ — own namespace + SA, NO Vault role (ESO
syncs only its key, so the pod has zero direct Vault access). Hardened to
survive the pressure it exists to fix: priorityClassName tier-0-core, broad
node-pressure tolerations, anti-affinity off node1, imagePullPolicy Always.
auth="required" ingress so it rides the Authentik resilience proxy and stays
reachable via the basic-auth fallback during an auth-stack outage. Runs the
shared claude-agent-service image with the breakglass entrypoint.
files/breakglass-pve is the PVE forced-command (status|forensics|reset|stop|
start|cycle on VM 102, forensics-first).

Isolation: the shared claude-agent pod's terraform-state Vault policy is
explicitly DENIED secret/claude-breakglass/* (stacks/vault/main.tf) so a
prompt-injected agent on that pod can't read the root-on-devvm key.

traefik: add a checksum/auth-proxy-htpasswd annotation so the auth-proxy rolls
when the emergency basic-auth password rotates (it's a subPath mount that
doesn't auto-update) — regenerated this session so Viktor has a known
emergency credential, which the auth-stack-outage failure domain requires.

Docs: docs/runbooks/breakglass-ui.md (full incident + bootstrap procedure,
incl. the per-host from= NAT quirks) and a security.md note recording the two
new privileged footholds.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 21:40:17 +00:00
Viktor Barzin
1eee2d6eb6 Merge remote-tracking branch 'forgejo/master' into wizard/tripit-sub-mode
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-12 21:17:09 +00:00
Viktor Barzin
42cd7d8272 tripit: flip AUTH_MODE to hybrid + OTA bundle env (Android Shell live)
The 81a816f7 image (hybrid auth + OTA endpoints) is rolled out, so the
env can flip: AUTH_MODE=hybrid with the tripit-app OIDC knobs makes the
bearer-only tripit-api host actually authenticate Shell logins (browser
cookie path unchanged); BUNDLE_PUBLIC_BASE pins the signed OTA zip URLs
to that host; BUNDLE_TOKEN_SECRET joins the tripit-secrets ES (value
already written to Vault secret/tripit). Part of the Android APK work
(tripit #50/#51).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 21:16:14 +00:00
Viktor Barzin
02785987dd ci-pipeline-health: image :latest+Always — registry lost the 2fd7670d tag
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The sha tag other claude-agent-service CronJobs pin no longer exists in
the Forgejo registry (node caches mask it); fresh pulls 404. Follow the
owned-app CronJob convention until infra#19 moves this image to ghcr.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 21:06:20 +00:00
Viktor Barzin
765cfe803f tripit: tripit-app provider issues sub = user email (hybrid-auth identity fix)
Review of tripit slice #50 caught that the provider's default
sub_mode (hashed_user_id) would make Shell JWTs carry a sub that
never matches the email-keyed prod user rows - first app login
would either 500 in placeholder reconciliation or split the user's
identity. sub_mode = user_email makes bearer and forward-auth
resolve the same row. Part of the Android APK work (tripit #50).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 21:00:33 +00:00
Viktor Barzin
624747cc46 workstation: default Claude model fable-5 → opus-4-8 for all devvm users
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Viktor asked to make Opus the default for new Claude sessions — his own,
Emo's, and Anca's — because Fable 5 is overkill for most daily tasks.

The org-wide default lives in the managed-settings `model` key, which
overrides each user's personal ~/.claude/settings.json model (and no
per-user launcher passes --model anymore). So flipping this one value
makes every user's NEXT session default to Opus 4.8; current sessions
keep their model, and a per-session /model still overrides as before.
The hourly t3-provision-users reconcile deploys it to
/etc/claude-code/managed-settings.json within the cycle.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 20:59:03 +00:00
Viktor Barzin
bd0cb71f17 tts: TCP probes — http liveness killed the server mid-synthesis
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The devnen server runs chunked synthesis as a blocking call inside its
async handler, so the event loop (and every HTTP probe) hangs for the
whole multi-minute story. Kubelet's HTTP liveness probe (1s timeout)
then killed the container mid-story (exit 137, twice within 10 min of
the first real drain), which reset the engine, so every following pass
started cold and tripit's 120s synthesis budget could never be met —
the queue would never drain.

TCP probes keep the meaning that matters: uvicorn binds 8004 only
after the model finishes loading in the lifespan hook, so readiness
still gates 'model loaded', while a GPU-busy server is left alive.
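
For illustration, a TCP check only needs the handshake to complete, which the kernel handles for a listening socket even while the application's event loop is blocked. The sketch below is a stand-in for that behaviour, not the kubelet's implementation; the function name and default timeout are assumptions.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Succeed as soon as a TCP connection is established; send nothing."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable: the port is not accepting.
        return False
```

Against the server in this commit, `tcp_probe(pod_ip, 8004)` would start passing once uvicorn binds the port, and would be expected to keep passing during a long synthesis as long as the listen backlog has room.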

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:57:28 +00:00
Viktor Barzin
30ff8f2db3 ci: diff changed stacks against CI_PREV_COMMIT_SHA, not HEAD~1
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
HEAD~1 on a merge commit is the feature-branch parent, so the
changed-stack detection diffed the WRONG side and silently skipped the
stacks the push actually changed — pipeline 128 'succeeded' without
applying the new ci-pipeline-health stack. Use the push's true
before-state (CI_PREV_COMMIT_SHA) when it resolves, HEAD~1 as fallback
(first build / shallow edge cases). Also touches the ci-pipeline-health
stack so THIS push applies it.
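
Once the diff base is right, the detection itself reduces to mapping changed paths to stack directories. A minimal sketch, assuming the paths come from something like `git diff --name-only <base> HEAD` and that stacks live under `stacks/<name>/`; the function name and the file names in the example are hypothetical.

```python
from typing import Iterable, List


def changed_stacks(paths: Iterable[str], root: str = "stacks") -> List[str]:
    """Return the sorted, de-duplicated stack names touched by `paths`."""
    stacks = set()
    for path in paths:
        parts = path.strip().split("/")
        # Expect "<root>/<stack>/<file...>"; ignore anything shallower.
        if len(parts) >= 3 and parts[0] == root and parts[1]:
            stacks.add(parts[1])
    return sorted(stacks)


print(changed_stacks([
    "stacks/ci-pipeline-health/main.tf",
    "stacks/ci-pipeline-health/files/sweep.sh",
    "docs/readme.md",
    "stacks/tts/main.tf",
]))
```

With the wrong base the path list is simply empty or unrelated, which is how a pipeline can report success without applying anything.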

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:50:43 +00:00
Viktor Barzin
fb8b6aa2f3 Merge remote-tracking branch 'forgejo/master' into wizard/ci-pipeline-health
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
2026-06-12 20:45:30 +00:00
Viktor Barzin
d02ca4f2db ci-pipeline-health: daily sweep of the off-infra CI chain (ADR-0002)
Viktor asked to monitor the pipelines closely as builds move off-infra
(PRD infra#10). New aux stack: daily 07:30 UTC CronJob on the
claude-agent-service image running a deterministic shell sweep —
GitHub Actions failures/stuck runs across owned repos, Woodpecker
pipeline failures, GHA free-tier minutes burn. Healthy = one quiet
Slack line; issues = Slack alert + comment on infra#10. In-cluster
(not a cloud routine) because Vault + the Woodpecker token are
LAN-only. Secrets via ExternalSecret (github_pat deliberately, not the
ghcr_pull_token alias — a scoped packages-only rotation couldn't read
Actions runs).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:45:28 +00:00
Viktor Barzin
5bcad2bf34 Merge forgejo/master into wizard/emu-window
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
2026-06-12 20:44:40 +00:00
Viktor Barzin
e5291f97c8 android-emulator: api36-v8 — auto-fit emulator window to the display
noVNC scaled correctly but the emulator's Qt window opened small (~411x914)
and floated inside the 1080x2280 Xvfb, so the user saw a tiny phone in a sea
of black. v8 bakes a background fitter (wmctrl+xdotool) that, after boot,
auto-OKs the one-shot nested-virtualization warning dialog, fills the phone
window to the display, and parks the control strip off the right edge —
re-running to catch window/dialog timing then maintaining every 30s. Applied
live to the running pod already; this makes it survive the next wake.
2026-06-12 20:44:29 +00:00
Viktor Barzin
98f1f7fc24 tts: seed extension-less voice copies so tripit's bare stems resolve
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
First live drain failed all 27 queued narrations with 404 'Voice file
'Emily' not found': tripit's catalog sends bare stems (Emily) but the
devnen server resolves the voice as a literal filename (Emily.wav) in
predefined_voices_path then reference_audio — no stem fallback exists
upstream (HEAD == our pinned sha), and symlinks can't bridge it because
safe_resolve_within() resolves them out of the containment check.

New initContainer on the chatterbox deployment copies the 28 bundled
voices to /data/reference_audio/<stem> on the PVC (second lookup path).
Same image as the main container so no extra pull; idempotent; ~15 MB.
Verified live before committing: an extension-less copy synthesizes
200 audio/mp3 (5.3s warm) where voice=Emily 404'd.
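
The initContainer's job can be sketched as follows, assuming the bundled voices are `.wav` files in one directory. The function name, the size-based idempotency check and the directory arguments are illustrative; the real container may do this differently.

```python
import shutil
from pathlib import Path
from typing import List


def seed_stem_copies(src_dir: Path, dst_dir: Path, suffix: str = ".wav") -> List[str]:
    """Copy each <stem><suffix> in src_dir to dst_dir/<stem>; return new stems."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for src in sorted(src_dir.glob(f"*{suffix}")):
        dst = dst_dir / src.stem
        # Idempotent: skip when a same-size copy is already in place.
        if dst.is_file() and dst.stat().st_size == src.stat().st_size:
            continue
        # A real copy, not a symlink, so a containment check that
        # resolves links still finds the file inside dst_dir.
        shutil.copyfile(src, dst)
        created.append(src.stem)
    return created
```

A second run over the same directories returns an empty list, which is the property that makes it safe to run on every pod start.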

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:41:51 +00:00
Viktor Barzin
bb0f9f59ef docs: CI-compute doctrine — leverage external infra for builds AND tests [ci skip]
Viktor's standing instruction (2026-06-12): lean on external infra as
much as possible for CI — builds, running tests, lint, releases all on
GitHub Actions hosted runners, never on cluster nodes; in-cluster
pipelines only for cluster-touching steps (deploys, terragrunt,
certbot). Also: watch any triggered pipeline chain to completion and
fix failures immediately. Added to AGENTS.md + .claude/CLAUDE.md
CI sections (ADR-0002 companions).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:39:27 +00:00
Viktor Barzin
97dcf49b8e monitoring: reduce Slack alert noise (alert-on-change + daily digest)
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was canceled
Reviewed the last 24h of Slack alerts after the midday node-pressure blip:
the volume came far less from the outage than from (a) alerts re-pinging
every few hours while nothing changed and (b) a pod cascade that fired
uninhibited. This hardens the alerting *system* so recurrences are quiet,
rather than just clearing today's broken services.

Changes (all in the monitoring module):

* Alert-on-change routing. warning/info repeat_interval -> 8760h (notify
  once, then only on a membership change or resolve); critical 1h -> 6h
  (a slow nag, not an hourly drip). send_resolved stays on. The bulk of
  the 24h volume was these re-pings (RpiSofiaUndervoltage alone fired
  continuously for ~24h, re-notifying every 4h).

* Daily digest CronJob (alert_digest.tf + alert_digest.py) -> #alerts at
  08:00 Europe/London: the full current board grouped by severity + what
  resolved in the last 24h. This is the standing-state safety net for the
  alert-on-change model. Stock python:3.12-alpine, pure-stdlib script
  (no pip/apk at runtime -> none of the per-run disk-write footprint that
  disabled status-page-pusher). Reuses the existing Alertmanager Slack
  webhook via a namespaced Secret; reads Alertmanager v2 + Prometheus.

* Cascade inhibition. NodeConditionBad/NodeDiskPressure now suppress the
  downstream pod-churn alerts (PodCrashLooping, PodImagePullBackOff,
  PodsStuckContainerCreating, ScrapeTargetDown, *ReplicasMismatch, ...).
  The midday DiskPressure event on 4 nodes fired 25 PodCrashLooping + 14
  PodImagePullBackOff uninhibited because only NodeDown was a source.

* T3 probe de-duplication. T3ProbeLegDown now inhibits T3ProbeDropBurst
  for the same leg — two alerts described one condition and were the #1
  noise source (~3,400 alert-minutes over 24h).

* ScrapeTargetDown false positives. Scrape only Ready endpoints, so
  completed CronJob pods that linger in EndpointSlices as NotReady
  addresses stop firing phantom "down" alerts (tts/tripit/beads). A Ready
  pod with a genuinely broken metrics endpoint still fires.

* for: 0m -> 5m on the flappy backup-status flags (LVM/Weekly/Offsite/
  NfsMirror/Vzdump *Failing) and DNS spike detectors, so a single
  transient Pushgateway/scrape blip no longer fires-and-resolves.

* Added an Alertmanager scrape target: it carried no prometheus.io/scrape
  annotation, so notification volume was unmeasurable — now we can verify
  this change worked (alertmanager_notifications_total et al.).
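
To make the digest idea concrete, here is a stdlib-only sketch of the "group the current board by severity" step. It assumes alerts shaped like Alertmanager v2 API objects (a `labels` dict with `alertname` and `severity`); the function names, output format and the severities in the example are assumptions, not the contents of alert_digest.py.

```python
from collections import defaultdict
from typing import Any, Dict, List

SEVERITY_ORDER = ("critical", "warning", "info")


def group_by_severity(alerts: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map severity label -> alert names; missing labels go to 'unknown'."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for alert in alerts:
        labels = alert.get("labels", {})
        groups[labels.get("severity", "unknown")].append(
            labels.get("alertname", "?")
        )
    return groups


def render_digest(alerts: List[Dict[str, Any]]) -> str:
    """One line per severity, known severities first."""
    groups = group_by_severity(alerts)
    ordered = [s for s in SEVERITY_ORDER if s in groups]
    ordered += sorted(s for s in groups if s not in SEVERITY_ORDER)
    lines = [
        f"{sev} ({len(groups[sev])}): {', '.join(sorted(groups[sev]))}"
        for sev in ordered
    ]
    return "\n".join(lines) if lines else "No firing alerts."


print(render_digest([
    {"labels": {"alertname": "RpiSofiaUndervoltage", "severity": "warning"}},
    {"labels": {"alertname": "NodeDiskPressure", "severity": "critical"}},
    {"labels": {"alertname": "PodCrashLooping", "severity": "warning"}},
]))
```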

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 20:35:56 +00:00
Viktor Barzin
87a8a393fe tts: demand gate treats a failed queue probe as no-action, not queue-empty
Some checks failed
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was canceled
The demand-gate script defaulted an unreadable/unparseable tts-queue
response to QUEUED=0, which the scale-down arm reads as 'queue empty'.
One transient curl failure at 20:30 UTC today idled chatterbox-tts to 0
the very minute the pod first went Ready, with 27 narrations still
queued (tripit kept logging tts_unreachable). Probe failure now exits
without touching replicas: scale-up still needs a real count > 0, and
scale-down now needs an explicitly parsed 0. Worst case after this
change is a stale-up deployment idling until the 06:00 window-down.
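
The decision rule after this change can be sketched as a three-way outcome, with an unreadable probe kept distinct from a parsed zero. This is an illustrative model; the real gate is a shell script, and the function names and the plain-integer probe format are assumptions.

```python
from typing import Optional


def parse_queue_count(raw: Optional[str]) -> Optional[int]:
    """Return the queued count, or None when the probe output is unusable."""
    if raw is None:
        return None
    text = raw.strip()
    if not (text.isascii() and text.isdigit()):
        return None  # empty, error page, negative, garbage
    return int(text)


def gate_action(raw_probe: Optional[str], replicas: int) -> str:
    """Decide what the gate does for one probe result."""
    queued = parse_queue_count(raw_probe)
    if queued is None:
        return "noop"  # probe failed: never touch replicas
    if queued > 0 and replicas == 0:
        return "scale_up"  # needs a real count above zero
    if queued == 0 and replicas > 0:
        return "scale_down"  # needs an explicitly parsed zero
    return "noop"
```

Under the old default, a failed curl would have looked like `gate_action("0", 1)`; here it is `gate_action(None, 1)` and leaves the deployment alone.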

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:35:02 +00:00
Viktor Barzin
18f524c265 docs: ghcr-credentials is now Kyverno-synced to allowlisted namespaces [ci skip]
Same-change doc sync for infra#12: the tripit-ns-scoped interim secret
paragraph described the pre-ClusterPolicy state.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:31:55 +00:00
Viktor Barzin
68c7be8653 traefik: non-merge apply trigger (error-pages buffer fix)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-12 20:31:24 +00:00
Viktor Barzin
f3cb5661a6 Merge forgejo/master into wizard/errorpages-buffer
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
ci/woodpecker/push/build-cli Pipeline was canceled
2026-06-12 20:31:22 +00:00
Viktor Barzin
aa1fccb883 traefik/error-pages: READ_BUFFER_SIZE 5KB -> 128KB — 431s for cookie-heavy users
Viktor hit 'Too big request header' (fasthttp 431 from error-pages) on a
routed host during a brief 503 window, and sees it periodically across
ingresses: Authentik forward-auth accumulates one authentik_proxy_*
cookie per protected service on .viktorbarzin.me, so established
browsers carry multi-10KB Cookie headers — over error-pages' 5120-byte
default read buffer, which doubles as its max header size. Any error-
middleware dispatch then 431'd instead of rendering the styled page.
Same root cause class as the 2026-06-01 large_client_header_buffers
fixes on bot-block-proxy and auth-proxy-config; error-pages was the
remaining small-buffer backend on the shared chain.
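
A rough size estimate shows why the old buffer was too small. This is a sketch only: the per-cookie value length and the cookie-name suffix are assumptions chosen for illustration, 128KB is read as 128 * 1024 bytes, and the real buffer must hold the whole request head, not just the Cookie line.

```python
OLD_BUFFER = 5120         # error-pages' default read buffer, per the commit
NEW_BUFFER = 128 * 1024   # the new READ_BUFFER_SIZE, assuming KB = 1024 bytes


def cookie_header_bytes(services, value_len):
    """Size of a Cookie header carrying one proxy cookie per protected service."""
    cookies = [
        f"authentik_proxy_{index:08x}={'x' * value_len}"  # hypothetical name/value
        for index in range(services)
    ]
    return len(("Cookie: " + "; ".join(cookies)).encode("ascii"))


def fits(header_bytes, buffer_bytes):
    return header_bytes <= buffer_bytes


size = cookie_header_bytes(services=20, value_len=600)
print(size, fits(size, OLD_BUFFER), fits(size, NEW_BUFFER))
```

With these illustrative numbers a browser that has visited twenty protected services overflows the 5120-byte buffer several times over but sits well inside the new one.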
2026-06-12 20:31:01 +00:00
Viktor Barzin
523e18c127 kyverno: sync-ghcr-credentials to private-ghcr namespaces; tripit consumes the clone
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Viktor asked to unblock the ADR-0002 ghcr pull-secret work (infra#12)
without waiting on a UI-minted token: GitHub has no token-mint API, so
the admin PAT (aliased in Vault as secret/viktor/ghcr_pull_token —
swap the alias value when a scoped token is ever minted) becomes the
platform credential. Because the PAT is broad, the new ClusterPolicy
clones ghcr-credentials ONLY to an explicit allowlist of namespaces
running private ghcr images (tripit, f1-stream, job-hunter,
instagram-poster, payslip-ingest, wealthfolio, fire-planner,
recruiter-responder) — NOT cluster-wide like registry-credentials.
generateExisting+synchronize so existing namespaces get the clone.
tripit's hand-declared ns-scoped secret is removed in favour of the
clone (imagePullSecrets now reference the name literally).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:28:11 +00:00
Viktor Barzin
12fd1fcbc9 android-emulator: api36-v7 — noVNC defaults: scaled view, autoconnect, reconnect
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was canceled
Viktor's screen rendered unscaled on a bare /vnc.html. The entrypoint
now writes /usr/share/novnc/defaults.json (resize=scale, autoconnect,
reconnect with 2s delay, shared) so every load behaves right without URL
params, and viewers self-heal across pod restarts/wakes. Already applied
live to the running pod; this makes it survive the next wake.
2026-06-12 20:18:26 +00:00
Viktor Barzin
ff08c685cd tts: image is TF-owned — drop the copied KEEL ignore so the GHCR switch applies
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The deployment's lifecycle.ignore_changes still ignored the container
image (copied from the keel-managed tripit pattern), which would have
made the previous commit's GHCR switch a silent no-op on apply. Keel
cannot poll the private GHCR repo anyway; the pinned sha tag is
terraform's to manage.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:13:50 +00:00
Viktor Barzin
dbb4572112 tts: pull Chatterbox from GHCR — the Forgejo-registry copy is unpullable
Some checks are pending
ci/woodpecker/push/build-cli Pipeline is pending
ci/woodpecker/push/default Pipeline is pending
Viktor reports the voice still isn't from the TTS service — correct:
zero story_audio rows exist; the pod has sat in ImagePullBackOff since
the first window because the 2026-06-09 Forgejo-registry push has a
corrupt layer blob (HEAD 500s; pushed from a 94%-full disk) and re-pushing
identical digests can't heal corrupt registry storage. The off-infra GHA rebuild
(tripit build-chatterbox.yml, devnen 915ae289, succeeded 03:23 UTC) now
lives in private GHCR: switch the image there, pin the upstream-sha tag,
and add the vault-backed ghcr-credentials pull secret (mirrors
stacks/tripit). tripit's drain loop has 27 narrations queued and picks
them up the moment the pod goes Ready.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:13:19 +00:00
Viktor Barzin
8919835c5d beads-server: track claude-agent-service :latest (was pruned SHA → ImagePullBackOff)
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
cluster-health found beads-dispatcher + beads-reaper CronJobs in ImagePullBackOff
for ~7h: they pinned claude-agent-service:2fd7670d, a SHA tag that Forgejo
retention (keeps newest 10) pruned. claude-agent-service itself runs :latest
(KEEL_IGNORE_IMAGE). Point the beads tag at :latest so it tracks the live image
and can't go stale again — the dispatcher/reaper only need bd+curl+jq, which the
image ships.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 20:12:24 +00:00
Viktor Barzin
0491fc43f2 android-emulator: README — final measured profile; honest GL story
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
ci/woodpecker/push/build-cli Pipeline was canceled
Trues the runbook up to reality: guest GL stays software (llvmpipe)
under Xvfb by deliberate choice (NVIDIA headless GL would need a
different streaming architecture), the GPU slice costs ~100MiB VRAM only
while awake, and the awake steady-state is ~0.5-1.3 cores / ~5Gi with
scale-to-zero covering idle.
2026-06-12 20:11:55 +00:00
Viktor Barzin
10a52a2683 gitignore: timestamped terraform.tfstate.*.backup (plaintext Tier-0 secrets) [ci skip]
Viktor's off-infra-builds wave 0 (infra#11): two untracked
terraform.tfstate.<ts>.backup files with live plaintext Tier-0 secrets
were sitting in stacks/infra/ unmatched by the existing *.tfstate.backup
patterns — one stray git add away from the public repo. Pattern added;
the on-disk files are deleted separately.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:11:41 +00:00
Viktor Barzin
3802967290 android-emulator: api36-v6 — cap RLIMIT_NOFILE; x11vnc -nolookup
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Viktor's noVNC sat at 'Connecting…' forever: the WebSocket traversed
Cloudflare/Authentik/websockify fine, but x11vnc never sent the RFB
banner — strace showed it sweeping the container's fd table with one
fcntl per fd, and containerd grants RLIMIT_NOFILE=2147483584 here, so
each connection effectively never completed. The entrypoint now sets
ulimit -n 65536 for everything it launches (verified live: banner
answers instantly under the capped limit); x11vnc also gets -nolookup
so client reverse-DNS can never stall handshakes.
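
The same cap can be expressed outside a shell entrypoint. The sketch below is a Python analogue of the ulimit -n 65536 step, not the entrypoint itself; the function names are hypothetical.

```python
import resource
import subprocess

CAP = 65536  # the value the commit's ulimit -n applies


def cap_nofile(cap=CAP):
    """Clamp soft and hard RLIMIT_NOFILE to at most `cap`; never raise them."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    def clamp(value):
        # Treat "unlimited" and anything above the cap the same way.
        if value == resource.RLIM_INFINITY or value > cap:
            return cap
        return value

    limits = (clamp(soft), clamp(hard))
    resource.setrlimit(resource.RLIMIT_NOFILE, limits)
    return limits


def launch(cmd):
    """Start a child with the cap applied just before exec."""
    return subprocess.Popen(cmd, preexec_fn=cap_nofile)
```

A program that walks every possible descriptor up to the limit then has at most 65536 entries to visit instead of roughly two billion.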
2026-06-12 20:04:42 +00:00
Viktor Barzin
623d34628a docs: ADR-0002 — all owned image builds move off-infra to GHA + ghcr [ci skip]
Viktor asked to evaluate fully external image builders because in-cluster
CI builds keep destabilising the homelab (Forgejo OOM under registry-push
load, hairpin push timeouts, build IO on the shared sdc HDD, registry PVC
at its 50Gi ceiling). The evaluation was grilled to a decision set:

- every owned image builds on GitHub Actions and lives on ghcr.io
  (extends the 2026-06-09 tripit pilot to the whole fleet)
- per-repo visibility: 9 public mirrors + images (gated on a clean
  gitleaks/PII history scan), the personal/finance/gray ones stay private
- clean cut: no in-cluster fallback build pipelines; existing
  build-fallback.yml files are deleted
- Woodpecker becomes deploy-only; Forgejo registry freezes to one
  last-known-good tag per Service after a manual cleanup pass
- dead builders (terminal-lobby, webhook-handler, hmrc-sync, trading-bot,
  travel-agent, trip-planner) are decommissioned, not migrated;
  travel_blog is decommissioned outright; manual images (x402-gateway,
  chrome-service-novnc, chatterbox-tts, android-emulator) get formalized
  GHA builds; infra-ci + CLI builds move to GHA on the public infra repo

CONTEXT.md: updated 'GHA build + Woodpecker deploy', added 'Canonical
repo', 'GitHub mirror', 'Forgejo registry' terms, image-path relationship,
and a 'registry' ambiguity entry.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 19:55:47 +00:00
Viktor Barzin
3978eec53a Merge forgejo/master into wizard/emu-gpu
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-12 19:45:06 +00:00
Viktor Barzin
b2bd859a8e android-emulator: NVIDIA_DRIVER_CAPABILITIES=all — graphics libs for -gpu host
First GPU boot verified qemu attached to the T4, but the guest GL
translator reported llvmpipe: the GPU operator injects only
compute,utility by default, so the NVIDIA EGL/GL vendor libraries were
absent and gfxstream silently fell back to software GL. The graphics
capability completes the hardware rendering path.
2026-06-12 19:43:25 +00:00
Viktor Barzin
0216e993dc etcd-load-reduction: remove VPA/Goldilocks, disable kyverno reporting, descheduler hourly
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline failed
The control-plane flap (etcd lease-renewal timeouts) recurred. Rather than move
etcd to SSD (code-oflt, deferred again), the chosen direction is to REDUCE etcd
load enough that the leader-election-timeout band-aid (renew 10s->30s) becomes
removable. These are the big, clean cuts:

1. Remove VPA/Goldilocks (stacks/vpa emptied). All 349 VPAs ran updateMode=Off
   (no auto-right-sizing) yet cost ~800 etcd objects + continuous recommender
   writes + a pod-creation admission webhook, purely to feed a dashboard. krr
   (Dockerized, on-demand) replaces it. Reverses the re-add after memory 2431.

2. Disable kyverno reporting (admission/aggregate/background). policyReports were
   already off, so the pipeline generated ephemeralreports + an hourly
   all-resource etcd re-scan for NO user-facing output. Admission enforcement
   (deny-* policies) and Keel mutation are unaffected; violations surface via
   Loki->Slack.

3. descheduler */5 -> hourly (fewer list/evict cycles; rebalancing isn't urgent).

Deferred (poor ROI / unsafe as planned): ESO refreshInterval 15m->1h is a
~20-stack sprawl for ~0.1 writes/s; keel background=false is invalid for a
mutate-existing policy and its churn is apply-time not steady-state. Both filed
as follow-up beads.

Post-apply: delete the chart-orphaned VPA CRDs to cascade-clean leftover CRs.
Then measure etcd apply-latency and revert the timeouts. Docs updated
(VPA/Goldilocks -> krr). See memory 5402-5407.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 19:41:22 +00:00
Viktor Barzin
16adda2c48 android-emulator: gate reaches the kube API via env vars, not DNS
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
First real wake attempt 500'd: kubernetes.default.svc does not resolve
from the gate's alpine pod (musl + injected dns_config ndots quirk), so
every kube call failed with 'Name does not resolve'. Use the injected
KUBERNETES_SERVICE_HOST/PORT env vars — the canonical in-cluster
endpoint, no DNS dependency. ConfigMap checksum annotation rolls the
gate automatically.
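
A minimal sketch of building the API base URL from the injected variables, with no name resolution involved. The function name is hypothetical, and the gate's actual code may differ.

```python
import os


def kube_api_base(env=os.environ):
    """Build the in-cluster API base URL from the env vars kubelet injects."""
    host = env["KUBERNETES_SERVICE_HOST"]
    port = env["KUBERNETES_SERVICE_PORT"]
    if ":" in host:          # bare IPv6 literals need brackets inside a URL
        host = f"[{host}]"
    return f"https://{host}:{port}"


example = {"KUBERNETES_SERVICE_HOST": "10.96.0.1", "KUBERNETES_SERVICE_PORT": "443"}
print(kube_api_base(example))
```

Because the host variable already holds the Service's cluster IP, the call path no longer depends on the pod's resolver configuration.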
2026-06-12 19:32:34 +00:00
Viktor Barzin
b1b9de90e4 tripit: tripit-api ingress joins the dedicated 100/1000 rate-limit
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Follow-up to eef4dc7f: the Android Shell's dedicated bearer-auth host
(tripit-api, ADR-0017) serves the same thumbnail-proxy traffic and was
still on the default 10/50 limiter — the shell's photo grid would have
hit the identical 429 wall Viktor just reported on the PWA host.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 19:18:40 +00:00
Viktor Barzin
eef4dc7f63 tripit: dedicated 100/1000 rate-limit — photo grid 429s on the default 10/50
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
Viktor hit a wall of 429s scrolling the new trip Photos tab: every Immich
thumbnail proxies through tripit's /api, so a few-hundred-photo trip is
that many parallel GETs from one IP — far past the shared Traefik
limiter's average 10 / burst 50. Fourth instance of the parallel-asset
pattern (ha-sofia, ActualBudget, noVNC); same cure: dedicated
tripit-rate-limit middleware (average 100, burst 1000) +
skip_default_rate_limit on the main tripit ingress only. The token-gated
calendar/email/slack carve-outs keep the strict default.
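
The difference between the two limiter settings can be illustrated with a simple token-bucket model. This is a sketch of the general technique, not Traefik's implementation; the 300-request burst stands in for "a few-hundred-photo trip" and is an assumption.

```python
def rejected_in_burst(requests, average, burst, spread_seconds=0.0):
    """Count requests a token bucket would reject (answer with 429).

    average: tokens refilled per second; burst: bucket capacity.
    spread_seconds: time over which the requests arrive, evenly spaced.
    """
    tokens = float(burst)
    step = spread_seconds / requests if requests else 0.0
    rejected = 0
    for _ in range(requests):
        if tokens >= 1.0:
            tokens -= 1.0
        else:
            rejected += 1
        # Refill for the gap before the next request, capped at the bucket size.
        tokens = min(float(burst), tokens + average * step)
    return rejected


print(rejected_in_burst(300, average=10, burst=50))     # shared default limiter
print(rejected_in_burst(300, average=100, burst=1000))  # dedicated tripit limiter
```

In this model an instantaneous 300-request burst loses everything past the first 50 under the default limiter, and nothing under the dedicated one.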

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 19:15:56 +00:00
Viktor Barzin
e8a4eb0f05 tripit: satisfy the auth-comment lint on the tripit-api ingress
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The previous commit (c5631cff) failed CI's ingress_factory guard: the
'# auth = "none": <why>' justification must sit directly above the auth
line inside the module, not above the module block. Same content, moved
to where the lint looks; no functional change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 08:53:02 +00:00
Viktor Barzin
c5631cff74 tripit: Shell auth surface — tripit-app OAuth2 provider + bearer-only tripit-api host
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
Viktor is adding the Android APK (Capacitor Shell) for TripIt. The Shell
cannot use the browser's forward-auth cookie dance, so per tripit ADR-0017
it logs in with OIDC Code+PKCE and calls the API with bearer JWTs:

- authentik.tf: tripit-app OAuth2 provider (public client + PKCE — an APK
  holds no secret), custom-scheme redirect me.viktorbarzin.tripit://callback,
  RS256, 1h access / 90d refresh (offline_access mapping attached so refresh
  tokens are issued), plus the TripIt App application.
- main.tf: new ingress host tripit-api.viktorbarzin.me -> same tripit
  Service, no forward-auth (backend validates the JWTs itself once tripit
  AUTH_MODE=hybrid lands — slice 2), inbound X-authentik-* deleted via the
  existing traefik strip-auth-headers middleware so the header fallback can
  never be spoofed through this host.

Closes nothing here; tracked as viktor/tripit#49.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 08:47:46 +00:00
Viktor Barzin
b985686661 android-emulator: non-merge apply trigger (GPU + wake gate)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-12 07:53:38 +00:00
Viktor Barzin
18ccd57b63 Merge forgejo/master into wizard/emu-gpu
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
2026-06-12 07:53:12 +00:00
Viktor Barzin
f4dd515fd7 android-emulator: GPU rendering on node1 + scale-to-zero wake gate
Viktor's direction (2026-06-12): the emulator is dev-only, so it should
be on-demand, and it should use the T4 where applicable. (1) api36-v5
runs '-gpu host' on the GPU node (nodeSelector + time-slice + EGL libs;
automatic swiftshader fallback if GPU init dies) — screen-on rendering
moves off the CPU (~5 cores → expected 1-2). (2) The wake gate (stdlib
python, owns / on both hostnames) scales the deployment 0→1 on visit and
hands the browser to noVNC when ready; agents GET /wake + /status. The
idle-sleeper CronJob counts established adb/noVNC connections via
/proc/net/tcp (excluding the in-container loopback adb client) and scales
to zero after 4 idle checks (~1h). TF ignores replicas drift. VRAM cost
(~0.5-1GiB) is held only while awake, protecting llama-swap headroom.
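
The idle check can be sketched as a parser over /proc/net/tcp text. This is an illustration under stated assumptions: the noVNC port (6080), the function names, and the little-endian address encoding are assumptions, and only the adb port 5555 and the four-check threshold come from the commits.

```python
ESTABLISHED = "01"            # TCP state code for ESTABLISHED in /proc/net/tcp
WATCHED_PORTS = {5555, 6080}  # adb, plus an assumed noVNC/websockify port
IDLE_CHECKS_BEFORE_SLEEP = 4


def count_external_established(proc_net_tcp, ports=WATCHED_PORTS):
    """Count established connections to watched ports from non-loopback peers."""
    count = 0
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4 or fields[3] != ESTABLISHED:
            continue
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        remote_ip_hex = fields[2].rsplit(":", 1)[0].upper()
        # On little-endian hosts the first octet is the last hex pair,
        # so 127.x.y.z ends in "7F" (the in-container adb client).
        if local_port in ports and not remote_ip_hex.endswith("7F"):
            count += 1
    return count


def next_idle_streak(streak, connections):
    """Reset on any activity, otherwise count one more idle check."""
    return 0 if connections else streak + 1


def should_sleep(streak):
    return streak >= IDLE_CHECKS_BEFORE_SLEEP
```

Keeping the streak outside the parser means one active check is enough to restart the countdown to scale-to-zero.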
2026-06-12 07:52:50 +00:00
Viktor Barzin
b598c61c61 android-emulator: scale to 0 — its CPU burn was starving etcd
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
The cluster-health check found the control plane flapping: kube-scheduler
and kube-controller-manager were crashlooping (220+ restarts) on lost
leader-election leases, with "etcdserver: request timed out" in the logs.

Root cause: the android-emulator pod's ~4.7-core swiftshader (software-GPU)
CPU burn on node3, together with frigate on node1, saturated the single
Proxmox host (load ~64) and starved etcd's disk/CPU on the k8s-master VM —
so etcd timed out and the leader-election controllers died and restarted in
a loop.

The emulator is a shared *test* instance, not a 24/7 service, so scaling it
to 0 is the right relief: spin it back to replicas=1 on-demand for a testing
session. Confirmed recovery after scaling down: node3 CPU 83%->28%, PVE load
64->51, control-plane restarts frozen. Durable structural fix (etcd/critical
VM disks off the shared sdc HDD; PVE CPU weighting) is tracked as code-oflt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 07:31:46 +00:00
Viktor Barzin
39a22b352e tts: bootstrap the chatterbox NFS subdir — first-window mount failed forever
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
First real window (2026-06-12 02:00): the chatterbox pod sat in
ContainerCreating with MountVolume exit 32 x19 — /srv/nfs-ssd is exported
whole-tree but the chatterbox SUBDIR never existed on the host (the
go-live runbook step needed NFS-host shell nobody doing the apply had).
One-shot busybox Job mounts the export root and mkdir -p's the subtree;
kubelet's mount retry then self-heals the pod. Audio queue (27 items)
drains as soon as the model loads.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 02:51:14 +00:00
Viktor Barzin
db63cd7501 android-emulator+traefik: non-merge apply trigger for the rate-limit fix
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline failed
Pipeline 102 applied nothing — the rate-limit commit entered master under
a merge head and the changed-stack detector is blind to merge diffs.
Plain commit touching both stacks so they apply.
2026-06-12 00:33:10 +00:00
Viktor Barzin
4d844d6fd4 Merge forgejo/master into wizard/emu-ratelimit
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline failed
2026-06-12 00:26:05 +00:00
Viktor Barzin
152dad0a40 android-emulator: dedicated rate-limit — noVNC's module storm tripped the shared 10/50 limiter
Viktor's 'VNC stuck loading forever' (remote network): noVNC 1.3 is
unbundled and fetches ~60 ES modules in parallel on page open; the shared
Traefik rate-limit (average 10, burst 50) 429s the tail and noVNC's
loader waits on the missing modules indefinitely (reproduced: 38x429 in
a 90-request burst through the ingress). Adds a dedicated 50/300
android-emulator-rate-limit middleware (actualbudget/immich pattern) and
opts both emulator ingresses out of the shared limiter.
2026-06-12 00:25:44 +00:00
Viktor Barzin
d3d37a15ec tts: GPU-gated live narration — demand-gate CronJob + all-day VRAM guard
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
ci/woodpecker/push/build-cli Pipeline was canceled
Viktor asked 'can't we make it live? why the cronjob?' — the overnight
window guaranteed VRAM room on the shared T4, but immich/frigate models
idle-unload during the day so the card often has room (measured 10.3 GiB
free at 01:20). New 'demand' action every 3 min: scale Chatterbox up when
tripit's audio queue is non-empty AND free VRAM >= floor; idle it back to
0 when the queue empties (also frees the card early inside the nightly
window). Failed metrics scrape fail-safes to no-scale-up, same as the
window preflight. The guard moves to all-day */5 — live synthesis can
hold the card at any hour, so the yield-on-pressure watchdog must watch
at any hour. tripit exposes the unauthenticated in-cluster queue count;
a 404 from an older image reads as queued=0 (no-op). The 02:00 window-up
stays as the guaranteed nightly catch-up.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 00:25:35 +00:00
Viktor Barzin
d818f7ed3b android-emulator: README — measured resource profile + remote access + screen-off etiquette
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-12 00:10:03 +00:00
Viktor Barzin
9af3e8860e Merge origin/master (CI state-sync commits) into wizard/android-emulator-public
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
ci/woodpecker/push/build-cli Pipeline was canceled
2026-06-12 00:08:14 +00:00
Viktor Barzin
43d2107760 android-emulator: public Authentik-gated ingress for the noVNC screen
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
Viktor wants the emulator screen reachable over the web: adds
android-emulator.viktorbarzin.me (Cloudflare-proxied) behind Authentik
forward-auth — same-origin WebSockets through forward-auth are proven by
the terminal/ttyd stack. The LAN .lan view stays, and adb:5555 remains
LAN-only since it is unauthenticated.
2026-06-12 00:07:49 +00:00
Viktor Barzin
9a2124f105 tripit: flip Research agent live (RESEARCH_PROVIDER=claude_agent, #23)
Switches the planning workspace's 'Research this' from the deterministic Fake to the live claude-agent-service Researcher. Behaviour-reviewed via a prod-pod country_when call (proposed Morocco/Georgia/Peru/Iceland with real 2026 UK bank-holiday leave windows + rough fares). Opt-in, budget-capped ~$2/run, wall-clock-bounded → degrades to 'found nothing' on slow/failed/quota-exhausted runs. Reuses CLAUDE_AGENT_TOKEN already in tripit-secrets. Completes the 12-slice Trip-Planning-Decisions feature.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 23:53:49 +00:00
Viktor Barzin
02ed3062f6 android-emulator: non-merge apply trigger for v4 image rollout
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Pipeline 96 applied only tripit: the v4 bump (577267cd) entered master
inside a merge whose first-parent diff hid stacks/android-emulator from
the stack detector — same failure mode as the tts 798b0255 trigger. This
plain commit touches the stack so the detector picks it up.
2026-06-11 23:48:16 +00:00
Viktor Barzin
2f8addc63b Merge forgejo/master into wizard/android-emulator
Some checks failed
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline failed
2026-06-11 22:53:11 +00:00
Viktor Barzin
577267cd97 android-emulator: api36-v4 — pin emulator 36.1.9; bind socat to pod IP
Two final fixes from the live debugging session: (1) sdkmanager-latest
emulator 36.6.11 hangs before executing a single guest instruction in
this pod (KVM and TCG alike, every gpu mode, crash-reporting on or off)
while 36.1.9 boots Android in ~107s — the entrypoint now pins build
13823996 on the PVC; (2) the emulator already listens on 127.0.0.1:5555,
so socat's wildcard bind died with EADDRINUSE and its exit restarted the
pod right after a successful boot — socat now binds the pod IP only.
2026-06-11 22:52:54 +00:00
Viktor Barzin
fba1659611 tripit: enable LLM sight discovery + real place resolver (image 2a965ca0 is live)
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Viktor's tour-redo (tripit#29): the new image is rolled out, so the two
new provider knobs can flip — discovery becomes wikipedia+llm (GeoSearch
merged with claude-agent-service proposals, Focus-steered) and the
Wikipedia place resolver (manual sight search + LLM-proposal resolution)
leaves its fake default. Env-after-image hold order, same as FARE_PROVIDER.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 22:30:24 +00:00
Viktor Barzin
f74e421283 tripit: overnight tour-audio fill CronJobs (02:20 + 04:30 retry, Europe/London)
All checks were successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Viktor's tour-guide redo (tripit#30/#31): narration audio is baked-audio-only
now — the fill-tour-audio worker synthesizes the queued (story, telling,
voice) audio while the tts stack's off-peak window (02:00-06:00) has
Chatterbox scaled up. Two idempotent passes: 02:20 after scale-up + model
load, 04:30 insurance against a skipped window or guard yield. Daytime runs
record tts_unreachable and exit quietly by design.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 22:24:29 +00:00
Viktor Barzin
85dbec6108 android-emulator: api36-v3 — avdmanager must run from inside the SDK root
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline failed
v2's marker fix proved the install completes, but avdmanager still saw
no system images: it IGNORES ANDROID_SDK_ROOT (and has no --sdk_root),
deriving the SDK root from its own toolsdir — /opt/android in our image,
while packages live on the PVC at /sdk. v3 seeds cmdline-tools into
/sdk/cmdline-tools/latest once and runs avdmanager from there, so it
resolves the PVC as the SDK root.
2026-06-11 21:15:50 +00:00
Viktor Barzin
5e8a988858 android-emulator: api36-v2 — marker-file install idempotency + retries
Some checks failed
ci/woodpecker/push/k8s-portal Pipeline failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/pve-nfs-exports-sync Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/registry-config-sync Pipeline was successful
ci/woodpecker/push/build-ci-image Pipeline was successful
ci/woodpecker/push/default Pipeline failed
First boot crashed mid-SDK-install, and the dir-existence check then
skipped reinstall forever: avdmanager saw the partial tree and died with
'Valid system image paths are: null' (CrashLoopBackOff). v2 tracks
install completion with a marker file written only after sdkmanager
succeeds + package.xml exists, wipes partial system-image trees before
reinstalling, and retries sdkmanager 3x.
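
The marker-file pattern can be sketched as below. This is not the entrypoint's code: the marker name, the directory layout, and the install callable (standing in for the sdkmanager invocation, raising on failure) are assumptions for illustration.

```python
import shutil
from pathlib import Path


def ensure_installed(sdk_root, install, retries=3):
    """Install once; treat anything without a completion marker as partial."""
    sdk_root = Path(sdk_root)
    marker = sdk_root / ".install-complete"   # hypothetical marker name
    image_dir = sdk_root / "system-images"    # simplified layout
    manifest = image_dir / "package.xml"

    if marker.exists():
        return "skipped"  # a marker only exists after a verified install

    for _ in range(retries):
        # A directory without a marker may be a crashed install: wipe it.
        shutil.rmtree(image_dir, ignore_errors=True)
        try:
            install(sdk_root)
        except Exception:
            continue  # retry from a clean tree
        if manifest.exists():
            marker.write_text("ok\n")  # written last, only after verification
            return "installed"
    raise RuntimeError(f"install did not complete after {retries} attempts")
```

Checking the marker rather than the directory is what prevents a half-written tree from being mistaken for a finished install on the next boot.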
2026-06-11 20:59:08 +00:00
Viktor Barzin
3fac45febc android-emulator: drop applied import stanzas; deployment recreates fresh
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was canceled
The five imports from the last recovery commit are in state now (verified
serial 4: everything except the deployment). The deployment kept falling
out of state between runs, so instead of a third import round the broken
0-replica deployment object was deleted live (transient recovery step,
presence-claimed) and this apply recreates it Terraform-owned with the
quota-fitting 3Gi requests. Import stanzas must go because TF 1.5 errors
on importing already-managed addresses.
2026-06-11 20:49:37 +00:00
Viktor Barzin
6b7efcd2d6 android-emulator: import the five resources still missing from state
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
Pipeline 88 imported the namespace but its refresh dropped the PVC, both
services, the ingress and the tls secret from state (PG-backend state
races on this new stack's first applies), so the apply again died on
'already exists' conflicts. State now holds namespace+deployment; adopt
the missing five with import blocks (TF 1.5 errors on importing
already-managed addresses, so only the missing set is listed). Stanzas
come out once applied.
2026-06-11 20:44:09 +00:00
Viktor Barzin
b948224008 android-emulator: import orphaned namespace into state (lock-race recovery)
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
Pipeline 85 created the namespace but a Terraform pg-backend
workspace-creation lock race (new stack schema initializing while other
stacks applied concurrently) left it out of the recorded state — every
later apply then died with 'namespaces android-emulator already exists'.
Adopt it with an import block per the house recovery pattern; stanza
gets removed once it has applied.
2026-06-11 20:38:46 +00:00
Viktor Barzin
99c19584f7 android-emulator: fit pod inside the tier-1 ResourceQuota (Burstable memory)
Some checks failed
ci/woodpecker/push/k8s-portal Pipeline failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/pve-nfs-exports-sync Pipeline was successful
ci/woodpecker/push/registry-config-sync Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/build-ci-image Pipeline was successful
First deploy hit 'exceeded quota: tier-quota, requested requests.memory=8Gi,
limited 4Gi' — the generated tier-1 quota caps memory REQUESTS at 4Gi but
allows 32Gi of limits, so go Burstable (requests 3Gi, limits 8Gi) like
tiers 3/4 do, instead of opting the namespace out via custom-quota.
2026-06-11 19:56:09 +00:00
Viktor Barzin
6bf216751b Merge forgejo/master (tts stack) into wizard/android-emulator
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
# Conflicts:
#	stacks/tripit/main.tf
2026-06-11 19:53:07 +00:00
Viktor Barzin
8b7c77c794 android-emulator: new stack — shared in-cluster Android 16 testing instance
Viktor is setting up an Android app development pipeline (tripit is the
first app) and wants agents to natively test changes on Android before
shipping. This adds the testing environment: an API-36 Google emulator
under KVM as a privileged pod (namespace joins the Kyverno exclude list),
SDK/system-image/AVD on a proxmox-lvm PVC, adb on the shared MetalLB IP
10.0.20.200:5555 (LAN only), noVNC screen view at
android-emulator.viktorbarzin.lan. Image is built manually from the
stack's docker/ dir (rare rebuilds; off-infra-CI rule targets repeated
builds). First infra ADR records the trade-offs (devvm/VM/redroid/budtmo
rejected).
2026-06-11 19:51:57 +00:00
Viktor Barzin
798b025580 tts+kyverno: non-merge apply trigger (merge-commit diff hid stacks/tts from the stack detector)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The Woodpecker default pipeline selects stacks via git diff HEAD~1 HEAD;
on a merge commit that is the first-parent diff, which contained only the
concurrently-landed files — stacks/tts never got applied (namespace still
absent) and the kyverno re-trigger push got no pipeline at all. Single
non-merge commit touching both stacks so the detector sees them; the
sorted loop applies kyverno before tts, the order tripit#26 requires.
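
The selection step can be sketched as a function over the paths a diff reports. This is an illustration, not the pipeline's script; the function name and the example file names are hypothetical.

```python
def changed_stacks(paths):
    """Return the sorted stack names touched by a list of changed paths."""
    stacks = set()
    for path in paths:
        parts = path.split("/")
        if len(parts) >= 3 and parts[0] == "stacks":  # a file inside stacks/<name>/
            stacks.add(parts[1])
    return sorted(stacks)


# A merge commit's first-parent diff may list only the other side's files.
print(changed_stacks(["stacks/tripit/main.tf"]))
# A plain commit touching both stacks exposes them, in apply order.
print(changed_stacks(["stacks/tts/main.tf", "stacks/kyverno/main.tf"]))
```

Sorting the names is what yields the kyverno-before-tts order mentioned above.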

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 19:08:23 +00:00
Viktor Barzin
a66aeac3b8 Merge remote-tracking branch 'forgejo/master' into wizard/tour-redo-env
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-11 18:27:53 +00:00
Viktor Barzin
4a8c4f9a14 tts: first apply of Chatterbox stack; predefined voices from the image, not the unseeded PVC
Viktor's tour-guide redo (tripit#26): 87702bdc committed this stack with
[ci skip] so it was never applied — prod tripit has been pointing at a
nonexistent chatterbox-tts service since. This commit triggers the apply
and fixes the voices path: config pointed predefined_voices_path at the
NFS PVC (/data/voices), which nobody can seed without NFS-host shell
access and which would leave /v1/audio/voices empty (it gates readiness).
Use the 28 voices bundled in the image at /app/voices instead; /data
keeps reference audio (future cloning) and the HF model cache.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:27:44 +00:00
Viktor Barzin
318ce9b909 Merge remote-tracking branch 'forgejo/master' into wizard/breakglass-redesign
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-11 18:23:40 +00:00
Viktor Barzin
df332b59e6 break-glass SSH: drop port-knock for exposed key-only :52222; version host config
Viktor got locked out of the break-glass path (forgot the port-knock setup) and
deleted the edge-router forwards, then asked to review and redesign it from
scratch.

Root cause of the lockout: the knock added no real security (key-only SSH is
already brute-force-proof) and its only benefit — hiding the port — came at the
cost of a circular dependency. The knock sequence lived only in in-cluster
Vault, which is unreachable in the exact away/cold scenario break-glass exists
for. So the unlock secret was unavailable precisely when needed.

New model (self-contained, nothing to remember): plain key-only SSH on the
Proxmox host's :52222, openly reachable. The edge router forwards WAN tcp/52222
-> 192.168.1.127:52222 (external port MUST equal internal on the TP-Link AX6000
- it rejects remaps; port 22 itself is reserved). The exposed port trusts only a
dedicated break-glass key via `Match LocalPort` (a leak of any other root key
does not grant internet access), rate-limited (iptables hashlimit) + fail2ban.

- Removed knockd (package + config) and the legacy Synology SSH forward
  (ext 3333 -> .13:22, a needless WAN exposure the original plan wanted gone).
- Fixed the fail2ban jail for Debian 13 (auth logs under sshd-session, not sshd
  - the stock journalmatch silently never banned).
- Versioned the host config in scripts/ (it was applied ad-hoc, never committed)
  and recorded the deliberate Wave-1 "no public-IP" exception in security.md +
  .claude/CLAUDE.md. Superseded the 2026-05-30 port-knock design docs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:23:39 +00:00
Viktor Barzin
7a1cc64898 kyverno: re-trigger apply of tts GPU-priority exclusion (87702bdc was [ci skip]'d)
Viktor's tour-guide redo (tripit#26): the Chatterbox TTS go-live commit
87702bdc carried [ci skip], so CI never applied the kyverno change that
keeps the tts namespace out of low-GPU-priority injection. This comment-only
commit makes CI apply the already-committed change — step 1 of the
kyverno -> tts -> tripit apply order.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:23:29 +00:00
Viktor Barzin
50eff3ca39 tripit: enable real tour-guide content providers (wikipedia discovery, web sources, chat writer)
Some checks failed
ci/woodpecker/push/build-cli Pipeline was canceled
ci/woodpecker/push/default Pipeline was canceled
Viktor's tour-guide redo (tripit#24, slice tripit#25): the feature shipped
dark on 2026-06-08 because these three env vars were never set, so prod ran
the fake test-fixture providers — the only sight users ever saw was the
placeholder 'Sight 1' narrated by browser TTS. Flips discovery to Wikipedia
GeoSearch, story material to the five real web sources, and script-writing
to claude-agent-service (token already present in tripit-secrets).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:22:10 +00:00
Viktor Barzin
5486b9d438 tripit: wire calendar-conflict column to Nextcloud CalDAV (#19)
CALENDAR_CONFLICT_PROVIDER=nextcloud + CalDAV base/user on the deployment, and the read-only app-password via tripit-secrets (seeded in Vault secret/tripit). Lets the planning workspace's calendar_check column flag date clashes against the owner's Nextcloud calendar. Same image-first hold-order as the fare scrape — pushed only after the #19 image is live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 18:13:01 +00:00
Viktor Barzin
e2788d1b2d workstation: lean managed-settings claudeMd — org red-lines + pointers [ci skip]
Viktor's agent-rules cleanup: the org claudeMd now carries only
governance red-lines (RBAC tiers, per-user secrets, Terraform-only,
git audit-trail rules, code-layout detection) and points to
~/.claude/rules/execution.md for the worktree lifecycle, which was
previously duplicated here in full. Settings precedence and the
model key are unchanged. Also refreshes a .gitignore comment that
cited the old execution.md section numbering.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:02:43 +00:00
Viktor Barzin
c3a63fcd38 apply-mbps-caps: compare normalized option sets (true idempotency) + devvm I/O-stall post-mortem [ci skip]
The raw string compare never matched qm config's canonical key order, so
the hourly timer re-issued 'qm set' against every running capped VM,
live-rewriting QEMU throttle state via QMP 24x/day. Implicated in today's
devvm freeze (15:21-16:48 UTC): the guest's disk I/O stalled inside QEMU
(blockstats frozen at 0 while QMP stayed responsive) on the legacy lsi
controller path with no iothread.
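
A minimal sketch of the order-insensitive comparison, written in Python for illustration only (the script itself is not shown here). The volume name, cap values, and helper names are hypothetical; `mbps_rd` and `mbps_wr` stand in for whatever throttle options the script manages.

```python
def normalize_disk_options(value):
    """Split a disk config value into (volume, frozenset of key=value options)."""
    volume, _, rest = value.partition(",")
    options = frozenset(opt.strip() for opt in rest.split(",") if opt.strip())
    return volume.strip(), options


def caps_match(desired, live):
    """True when both strings describe the same volume and option set."""
    return normalize_disk_options(desired) == normalize_disk_options(live)


# Hypothetical values: same options, different key order.
desired = "local-lvm:vm-102-disk-0,mbps_rd=100,mbps_wr=100,size=64G"
live = "local-lvm:vm-102-disk-0,size=64G,mbps_wr=100,mbps_rd=100"

print(desired == live)            # False: the raw compare always sees a diff
print(caps_match(desired, live))  # True: nothing to re-apply
```

With a comparison like this, the timer would only issue a change when an option value actually differs.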

Viktor asked to root-cause the freeze before choosing fixes, then approved
mitigating via VM settings: this commit fixes the hourly trigger and
documents the incident; the controller swap (virtio-scsi-single +
iothread=1 + aio=threads) is staged on VM 102 separately, pending his
cold stop/start.

Adds docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md (evidence chain,
ruled-out causes, capture-before-kill autopsy steps) and syncs compute.md
+ proxmox-inventory.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:00:08 +00:00
Viktor Barzin
2e0cebff87 docs: sync compute/storage/proxmox-inventory with live state (memory audit) [ci skip]
Viktor asked to go through the agent's stored infra facts and straighten out anything wrong about what-is-where. Cross-checking docs against the live cluster surfaced doc drift alongside the stale memories:

- compute.md: add k8s-node5/6 (joined 2026-05-26) to diagram + node table; totals 48 vCPU / ~176GB -> 64 vCPU / ~240GB; cluster version v1.34.2 -> v1.34.8 (live-verified)
- storage.md: the nfs-proxmox StorageClass no longer exists (removed 2026-04-25, commit 484b4c71) — nfs-truenas is the only NFS SC; fixed three spots that told readers to use nfs-proxmox
- proxmox-inventory.md: k8s VM RAM rows live-verified via kubectl (master 32G, node1 48G, node2-4 32G — the old 16/32/24G figures predated the 2026-04-02 resize), added node5/6 rows, devvm swap 8G -> 14G (grown 2026-06-10), recomputed total (~288GB nominal of 272GB physical, overcommitted)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 17:50:43 +00:00
Viktor Barzin
81e01ec1c4 tripit: label namespace as chrome-service CDP client
The fare scrape's first E2E test was blocked by chrome-service-ws-ingress (9222 admits only namespaces labeled chrome-service.viktorbarzin.me/client=true). Label the tripit namespace per that policy's opt-in design so the planning workspace's live fare fetches reach the shared browser.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 14:42:53 +00:00
Viktor Barzin
980ec55418 tripit: enable live flight-fare scrape via shared chrome-service CDP
Sets FARE_PROVIDER=playwright + FARE_CDP_URL on the tripit deployment so the planning workspace's flight_fare cells auto-fetch live Google Flights quotes through the existing in-cluster headed browser (tripit issue #18, ADR-0007 — rate-limited, cached, degrades to manual entry). Viktor asked to complete the trip-planning tickets; this is the infra leg of the fare-scrape slice. Docs: chrome-service architecture + service catalog updated (tripit is now the second active CDP caller; catalog's legacy :3000 WS pool line corrected to CDP :9222). HOLD-ORDER NOTE: pushed only after the tripit image containing FareMode.playwright rolled out (older images crash-loop on the unknown enum).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 14:23:53 +00:00
Viktor Barzin
9b19caff47 t3: connection logging across the path for drop attribution
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Viktor asked to add connection logs (Traefik/Cloudflare) to catch the
real-path t3 WS drops: a direct-to-t3-serve browser ran 40 min clean
while real tunnel sessions cycle every 15-35s, so the drop originates
above t3-serve and we need to see which layer cuts the socket.

Traefik (/ws duration) and cloudflared (WS close events) already ship to
Loki; the gap was the devvm side. This adds:

- t3-dispatch logs every /ws open/close with dur_ms + cause:
  downstream_closed (client/CF/Traefik hung up = last-mile/network),
  upstream_closed (t3-serve closed/reset), or graceful. Graceful closes
  previously left no trace (default ReverseProxy only logs on error), so a
  watchdog-driven reconnect was invisible. Helpers unit-tested.
- devvm-promtail.{yaml,service}: ships devvm journald (t3-dispatch +
  t3-serve@<user>) to cluster Loki as job=devvm-journal, mirroring the
  pve/rpi-sofia shippers. devvm was never in Loki (standalone VM).

Joined in Loki, the three layers attribute any future drop to a segment
with no repro needed. Runbook + service-catalog updated.
2026-06-11 13:48:10 +00:00
Viktor Barzin
933e4649fb Merge remote-tracking branch 'forgejo/master' into wizard/authentik-signin-speed
Some checks failed
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/k8s-portal Pipeline failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/pve-nfs-exports-sync Pipeline was successful
ci/woodpecker/push/registry-config-sync Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/build-ci-image Pipeline was successful
2026-06-11 00:35:56 +00:00
Viktor Barzin
b3ef0dba76 authentik: ignore Keel-managed image_pull_policy on pgbouncer
Keel flip-flops the pgbouncer container's imagePullPolicy, so the
declared Always kept re-diffing on every plan. Ignore it like the
image tag (KEEL_IGNORE pattern) — plan-to-zero restored.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 00:34:44 +00:00
Viktor Barzin
4e88298976 authentik: incident hardening after the signin-speedup rollout storm
The first apply of the signin-speedup change triggered a ~50min authentik
outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2)
silently DOWNGRADED the Keel-managed live image (2026.2.4) against an
already-migrated DB, default liveness probes kill-looped pods queuing on
authentik's migration advisory lock, and kills mid-migration left ghost
idle-in-transaction sessions holding that lock. Full analysis in
docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md.

Hardening (all root causes):
- values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4)
  so helm applies can never downgrade under Keel again
- values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s)
- values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode
  pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits)
- pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders;
  pgbouncer.tf gets a config-checksum annotation so ini changes roll pods
- authentik_provider.tf: drop the completed import stanza (adoption rule)
- traefik: suppress pre-existing keel.sh annotation/tier-label drift on
  auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1
  pattern) so applies stop stripping live Keel state

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 00:26:52 +00:00
Viktor Barzin
bd60c3d5e0 pve-host/dns: register loki.viktorbarzin.lan CNAME, drop the /etc/hosts pin
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Follow-up to the pve-host Loki shipper (aac807fb). The host reached Loki via an
/etc/hosts pin of the Traefik LB IP — Viktor flagged that as the wrong solution
(no hardcoding; the DNS infra should handle it). Registered loki.viktorbarzin.lan
in Technitium as a CNAME -> ingress.viktorbarzin.lan (the anchor whose A record
auto-tracks the live Traefik LB IP, so it's renumber-proof), via the Technitium
API + zone-sync to all 3 instances. Removed the /etc/hosts pin from the PVE host;
promtail now resolves the name purely via DNS (verified still shipping to Loki).
insecure_skip_verify stays — the internal .lan cert isn't publicly trusted.

Docs (monitoring.md) + the pve-promtail.yaml header updated to drop the pin
references. The DNS record is API-managed (the viktorbarzin.lan zone convention),
not in this repo; auto-managing .lan CNAMEs in technitium-ingress-dns-sync
remains a noted follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 22:55:20 +00:00
Viktor Barzin
97ccdbecb8 authentik: speed up first-time signin (single-screen login, live env tuning, asset caching, outpost+nginx hot path)
Viktor asked to review Authentik and the web tier and make first-time
signin to apps faster. Review found the slowness is screens and round
trips, not server time. Changes:

- values.yaml: the authentik.* Helm values (gunicorn workers, cache
  timeouts, conn_max_age) were silently INERT because existingSecret
  skips chart env rendering — pods ran defaults (2 workers, 300s
  caches, no persistent DB conns). Moved all tuning into
  server.env/worker.env, which actually reaches the pods.
- authentik_provider.tf: adopt the identification stage and pin
  password_stage so username+password render on ONE screen (the
  separate order-20 password binding is deleted via API — authentik
  requires that when embedding). Outpost log_level trace->info and
  1->2 replicas (it is on the hot path of every forward-auth request;
  PG-backed sessions make 2 replicas safe).
- authentik module: /static ingress carve-out with immutable
  Cache-Control (assets are version-fingerprinted but served with no
  max-age — internal split-horizon users got zero caching).
- traefik auth-proxy nginx: upstream keepalive 32 + HTTP/1.1 (was
  opening a fresh TCP connection to the outpost per subrequest) +
  config-checksum annotation so config changes roll the pods.
- docs: authentication.md + authentik-state.md updated; fixed stale
  'postgresql.dbaas has no endpoints' claim in CLAUDE.md/CONTEXT.md
  (it is a live CNPG primary-selector compatibility service).

Done via API in the same change (UI-managed objects): 6 OIDC providers
(Vault, Forgejo, Immich, Headscale, linkwarden, Cloudflare Access)
switched from explicit to implicit consent — all first-party, the
4-weekly consent screen only slowed first-time signin.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 21:58:10 +00:00
Viktor Barzin
93ba67c84a devvm: install prometheus-node-exporter (was never installed)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The monitoring stack now scrapes devvm (job 'devvm') for the t3 drop
attribution work, but the box had no node_exporter at all — installed
via apt and persisted here so reprovisioning keeps it.
2026-06-10 21:29:17 +00:00
Viktor Barzin
046a4a32f3 Merge remote-tracking branch 'forgejo/master' into wizard/t3-disconnect-fixes
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-10 21:26:10 +00:00
Viktor Barzin
70442ccdc6 t3-probe: fix aiohttp 3.9 compat (ClientWSTimeout is 3.10+)
Bound connection establishment via session ClientTimeout(total=None,
connect=15) instead — works on 3.9 through current; total must stay None
or the session timeout would kill the long-lived probe WS. Verified by a
local 14s smoke run: cloudflare + internal legs both connect.
2026-06-10 21:26:09 +00:00
Viktor Barzin
4af5eff043 docs(multi-tenancy): note the on-demand web restore button
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The tmux-persist paragraph only described the boot-time restore. Document the
new manual path — the web terminal's "Restore sessions" button (tmux-api
POST /restore -> tmux-restore-user wrapper -> `tmux-persist restore <user>`) —
and why it exists: an OOM that kills a user's tmux server WITHOUT a reboot
never triggers the boot-only restore service, which is the common case under
multi-user memory pressure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 21:22:41 +00:00
Viktor Barzin
a734155fb5 Merge remote-tracking branch 'forgejo/master' into wizard/t3-disconnect-fixes
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-10 21:11:30 +00:00
Viktor Barzin
9b55d53be0 t3: differential drop-attribution probe + devvm metrics
Closes the loop on Viktor's ask to find the t3 disconnect root cause and
definitively rule infra in or out. Server logs alone cannot separate
'client network broke' from 'Cloudflare/tunnel broke' from 't3-serve
stalled' — every cause collapses into the same 20s-watchdog reconnect.

The t3-probe (stacks/t3code) holds three permanent legs that differ only
in path segment: 'cloudflare' (WS via DoH-resolved public DNS -> WAN ->
CF edge -> tunnel -> Traefik -> dispatch), 'internal' (same WS pinned to
the Traefik LB, no Cloudflare), 't3serve' (HTTP straight to the serve
process). Whichever leg drops convicts its segment; all legs clean while
a user drops exonerates infra with data. Dispatch gains an
unauthenticated /probe/ws echo + /probe/healthz (gorilla/websocket,
test-first) behind an auth=none path carve-out, guarded by the
authentik-walloff probe.
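
The differential reasoning can be sketched as a small decision function. This is an illustrative Python sketch, not the probe's code: the leg names come from the message above, while the precedence (narrowest path first) and the segment labels are assumptions.

```python
def attribute_drop(legs_up):
    """Map the health of the three probe legs to the segment a drop implicates."""
    # Assumed precedence: check the leg with the shortest path first.
    if not legs_up["t3serve"]:
        return "t3-serve process"
    if not legs_up["internal"]:
        return "Traefik or dispatch"
    if not legs_up["cloudflare"]:
        return "WAN, Cloudflare edge, or tunnel"
    # Every leg is clean, so a user-side drop is not explained by infra.
    return "infra exonerated"


print(attribute_drop({"cloudflare": False, "internal": True, "t3serve": True}))
print(attribute_drop({"cloudflare": True, "internal": True, "t3serve": True}))
```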

Also starts scraping devvm's node_exporter (job 'devvm') — it ran
unscraped, so the box whose memory/IO stalls cause the drops had zero
pressure history. Alerts T3ProbeLegDown + T3ProbeDropBurst; runbook
docs/runbooks/t3-drop-attribution.md.
2026-06-10 21:11:29 +00:00
Viktor Barzin
ecef09ab87 tmux-persist: add single-user restore mode (restore [user])
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The web-terminal will get a "Restore sessions" button (common ask after an
OOM kills a user's tmux server without a reboot, which the boot-only restore
service doesn't catch). The button needs to restore ONE user's saved sessions
on demand, so teach `restore` an optional <user> argument: with no arg it
restores every terminal user (unchanged — the boot service path), with a
<user> arg it validates the name against /etc/ttyd-user-map and restores only
that user. Reuses the existing restore loop (single source of restore truth).

The terminal-lobby tmux-api will invoke this as root via a validated
tmux-restore-user sudo wrapper. Verified: bad user exits 2 (won't fall back to
restoring everyone), no-arg path unchanged, shellcheck clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 21:08:57 +00:00
Viktor Barzin
b5c6639272 t3-serve@: contain agent memory storms; survive child OOM kills
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Same t3-disconnect root-cause work: a runaway claude agent child grew to
10.8G anon RSS inside t3-serve@wizard's cgroup, swap-thrashed devvm off
its spinning disk (system-wide multi-10s freezes = every t3 client's 20s
watchdog firing = the 'frequent disconnects that self-recover'), then
the global OOM at 2026-06-10 19:56 took the whole unit down for 8.5min
because the default OOMPolicy=stop fails the unit when ANY cgroup child
is OOM-killed. Cap the cgroup (MemoryHigh=12G, MemoryMax=16G), forbid
swap so stalls can't smear into minute-long freezes, and OOMPolicy=continue
so a runaway agent dies alone while the WS server keeps serving.
2026-06-10 21:00:06 +00:00
Viktor Barzin
d5fdc7ffe9 cloudflared: disable in-place autoupdate (--no-autoupdate)
Viktor asked to root-cause the frequent t3 code disconnects and rule
infra in or out. The tunnel pods ran bare 'cloudflared tunnel run':
every Cloudflare release made the binary self-update and exit (code 11),
restarting all 3 pods and severing every WebSocket riding the tunnel —
one of the confirmed infra-side drop causes (pods cycled 2026-06-09
20:55/21:00 and 2026-06-10 02:31). Updates belong to pod image rollouts,
not in-place binary swaps.
2026-06-10 21:00:05 +00:00
Viktor Barzin
ac6f19dd3b tmux-persist: never let an empty snapshot clobber a saved manifest
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
emo's 5 web-terminal tmux sessions were OOM-killed (the server died, no
reboot), and the 5-minute save tick then overwrote his session manifest with
0 bytes — wiping the record that restore needs. Root cause: the save guard
only checked that the tmux socket *file* existed, but an OOM-killed server
leaves a stale /tmp/tmux-<uid>/default behind; list-panes then returns
nothing and that empty capture was installed over the good manifest. Because
the restore service only runs at boot, an OOM (not a reboot) skips restore
entirely, so the clobbered manifest was the only record left — and it was
already gone.

Fix: only overwrite <user>.tsv when the snapshot captured >=1 live session;
otherwise keep the last good manifest (now covers no-server AND
stale-socket/dead-server). Verified by reproducing the 0-byte clobber on the
old script and confirming the new one preserves the manifest, plus a live
save that still captures every active session.
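
A minimal Python sketch of the guard, for illustration only (the actual script is not shown). The manifest row format and the function name are hypothetical, and the temp-file-plus-rename step is an added assumption rather than something this commit describes.

```python
import os
import tempfile


def save_manifest(manifest_path, snapshot_lines):
    """Install the snapshot over the manifest only if it holds >= 1 session."""
    live = [line for line in snapshot_lines if line.strip()]
    if not live:
        # Empty capture (no server, or a stale socket): keep the last good file.
        return False
    directory = os.path.dirname(manifest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(live) + "\n")
    # Assumption: rename into place so a crash never leaves a half-written file.
    os.replace(tmp_path, manifest_path)
    return True
```

Calling `save_manifest(path, [])` leaves the existing file untouched and returns `False`, which is the behaviour the fix is after.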

emo's 5 sessions were recovered from their transcripts and are back; this
keeps the next OOM from destroying the manifest again.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 20:38:59 +00:00
Viktor Barzin
9fff77cbea Merge branch 'wizard/budget-rate-limit'
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-10 19:42:19 +00:00
Viktor Barzin
acb847b858 actualbudget: dedicated traefik rate-limit (50/300) for budget ingresses
The Actual web app boots with ~70 near-parallel requests (55
/data/migrations/*.sql + statics, all served cache-control max-age=0 so
every page load re-validates them). The shared rate-limit middleware
(average 10, burst 50) 429s the tail of that storm, so every cold boot
shows 'Server returned an error while checking its status' and every
load stalls in retry backoff — measured up to 5min stalls when two
loads from one IP overlap. Viktor asked to relax the limit after the
anca slow-load investigation (beads code-7zv).
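
The arithmetic behind the 429s can be checked with a toy token bucket. This Python sketch is a simplified model, not Traefik's implementation: it assumes `average` is requests per second, that "50/300" in the subject means average 50 and burst 300, and that the ~70 boot requests arrive 1 ms apart.

```python
class TokenBucket:
    """Toy token bucket: refills at `average` tokens/s up to `burst` tokens."""

    def __init__(self, average, burst):
        self.rate = float(average)
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        # Refill for the elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def count_rejected(bucket, arrivals):
    return sum(1 for t in arrivals if not bucket.allow(t))


arrivals = [i * 0.001 for i in range(70)]  # assumed: 70 requests over ~70 ms
shared = count_rejected(TokenBucket(10, 50), arrivals)
dedicated = count_rejected(TokenBucket(50, 300), arrivals)
print(shared, dedicated)  # 20 0
```

Under these assumptions the shared limit refuses the last 20 requests of a single cold boot, while the dedicated limit admits all 70.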

Same pattern as immich: dedicated actualbudget-rate-limit middleware in
the traefik stack, budget-* ingresses opt out of the default via
skip_default_rate_limit + extra_middlewares.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 19:36:42 +00:00
Viktor Barzin
8304ef0f70 Merge origin/master (pfsense SNI-routed internal 443) into forgejo/master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Reconciles the two live infra remotes after the pve-host logging change landed
on forgejo (which was a commit behind origin). Non-destructive merge — keeps both
eae35c51 (pfsense webmail SNI routing) and aac807fb (pve-host Loki shipping).
2026-06-10 19:35:55 +00:00
Viktor Barzin
aac807fb3a pve-host: ship journal to Loki (snoopy command audit + sshd-pve) for emo's root SSH
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Emo's Claude agent was given root SSH to the Proxmox host (`ssh pve`, dedicated
shared-root key emo-pve-agent@devvm) so he can manage the host — e.g. the R730
fan daemon — through his agent. To keep an audit trail of what that agent does,
and to feed the long-pending Wave-1 S1 security rule, the PVE host now ships its
systemd journal to cluster Loki:

- snoopy logs every execve() to journald (identifier=snoopy), enabled via
  /etc/ld.so.preload; config scripts/pve-snoopy.ini.
- promtail v3.5.1 (amd64) ships /var/log/journal to Loki as {job="pve-journal"}
  (full host journal; filter identifier="snoopy" for the command audit), and
  relabels sshd auth to {job="sshd-pve"} — which ACTIVATES S1 (it was PENDING
  only for lack of this shipper). Config/unit: scripts/pve-promtail.{yaml,service}.

S1 won't false-fire on legitimate access: the devvm SNATs through pfSense to
192.168.1.2, which is already in the S1 source-IP allowlist.

Loki is reached via an /etc/hosts pin (10.0.20.203 loki.viktorbarzin.lan);
follow-up noted to register a Technitium CNAME so it auto-tracks LB renumbers.

Host pieces are hand-managed (not Terraform), like fan-control and the rpi-sofia
promtail — these files are the source of truth. Docs updated: security.md
(S1 LIVE) and monitoring.md ("External host: pve").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 19:31:45 +00:00
Viktor Barzin
eae35c511a pfsense: SNI-routed internal 443 — mail.viktorbarzin.me serves webmail everywhere
Completes the internal port table of the mail front door (10.0.20.1):
443 was squatted by the pfSense webGUI (self-signed cert expired 2022),
so internal webmail and the kuma [External] mail probe hit the firewall
login instead of Roundcube — the last leg of the mail split-brain name.

Design (Viktor): route by what the client asked for. New HAProxy
frontend internal_https_443 (binds 10.0.20.1+10.0.10.1 :443, mode tcp):
SNI present -> Traefik .203 with send-proxy-v2 (trusted, IPv6-bridge
pattern, no health check per the PROXY-probe gotcha); SNI of
pfsense.viktorbarzin.{lan,me} or NO SNI (bare-IP admin access) -> webGUI,
which moved to :8443 (invisible to habits — https://10.0.20.1 still
lands on the login page; :8443 doubles as direct fallback). The
reverse-proxy pfsense ingress now targets :8443 directly.
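
The routing rule in the design paragraph above reduces to a small decision on the SNI value. A hedged Python sketch follows; the real rule lives in the HAProxy frontend, and the backend labels returned here are hypothetical.

```python
WEBGUI_NAMES = {"pfsense.viktorbarzin.lan", "pfsense.viktorbarzin.me"}


def route_by_sni(sni):
    """Pick the backend for a TLS connection on :443 from its SNI (None if absent)."""
    # No SNI (bare-IP admin access) or a pfsense name goes to the moved webGUI.
    if sni is None or sni.lower() in WEBGUI_NAMES:
        return "webgui:8443"
    # Any other SNI is handed to Traefik with the PROXY v2 header.
    return "traefik (send-proxy-v2)"


print(route_by_sni(None))                     # webgui:8443
print(route_by_sni("mail.viktorbarzin.me"))   # traefik (send-proxy-v2)
```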

Declared idempotently in pfsense-haproxy-bootstrap.php; config.xml
backed up on-box (config.xml.bak-2026-06-10-pre-sni443). Verified:
bare IP -> GUI login; pfsense.viktorbarzin.lan -> GUI;
pfsense.viktorbarzin.me -> 302 via ingress; mail.viktorbarzin.me ->
Roundcube with STRICT cert validation; :993 IMAPS untouched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 18:41:07 +00:00
Viktor Barzin
176a65d3d2 plotting-book: TF baseline image follows what CI actually builds
Viktor asked to verify the book-plotting push->build->deploy chain.
The chain itself is healthy, but the Terraform baseline image said
ancamilea/book-plotter:latest while CI (GHA on
PassionProjectsAnca/Plotting-Your-Dream-Book) builds and deploys
viktorbarzin/book-plotter:<sha8> + :latest — a from-scratch apply
would have resurrected a stale March image. Baseline now
viktorbarzin/book-plotter:latest. No live change: the running tag is
CI-owned via ignore_changes, plan confirms the image attr is ignored.

[ci skip] deliberately: plan shows UNRELATED pre-existing drift on
this stack (live ns labels managed-by=vault-user-onboarding +
resource-governance/custom-quota=true would be stripped; deployment
keel.sh/policy=patch annotations removed) — auto-applying that needs
its own reviewed pass.
2026-06-10 18:37:14 +00:00
Viktor Barzin
5f7c2964ac workstation: session-launch freshen follows the checked-out branch (not just master)
Viktor asked to log Anca into her GitHub account so she can develop on
the devvm and deploy her apps through the existing CI/CD. Her GitHub
repos (Plotting-Your-Dream-Book, travel, My-Wardrobe — now cloned into
her ~/code workspace) default to main, and the launcher freshen only
fast-forwarded master, silently skipping them. ff the current branch's
upstream instead — same safety gates (on a branch, clean tree, upstream
configured, ff-only). Single-layout infra clones behave identically.
[ci skip]
2026-06-10 18:20:59 +00:00
Viktor Barzin
de1d8b7bf3 technitium: add Brevo DKIM selector CNAMEs to internal zone [ci skip]
The roundtrip probe kept failing after the SPF/MX fix: rspamd's actual
junk-score driver was R_DKIM_PERMFAIL(+4.5) on selector brevo2 — Brevo
signs with brevo1/brevo2._domainkey, which are CNAMEs to
b{1,2}.viktorbarzin-me.dkim.brevo.com in public DNS and were absent
from the internal zone (the earlier existence check used ANY queries,
which Cloudflare refuses per RFC 8482 — false negative). The DKIM
permfail also cascaded into DMARC_POLICY_SOFTFAIL(+1.5), totalling the
6.09/6.0 junk threshold; sieve filed probes into \Junk where the INBOX
poll never finds them.

ingress-dns-sync now maintains both selector CNAMEs. Ops notes: rspamd
caches DNS (restart to flush after zone fixes); CoreDNS denial cache
holds NXDOMAINs up to 300s. Verified: roundtrip SUCCESS in 20.5s.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 18:07:38 +00:00
Viktor Barzin
2825cb1703 workstation: per-user code_layout — workspace puts project repos under ~/code (ancamilea + tripit)
Viktor asked to restructure Anca's setup: her ~/code WAS the infra clone
itself; he wants ~/code to be the directory where all her project repos
(tripit etc.) live side by side, with infra moved to a subdirectory.

- roster.yaml gains per-user 'code_layout: single|workspace' + 'repos',
  validated + derived by roster_engine.py (12 new tests, 40 total).
- t3-provision-users reconcile: auto-migrates a single-layout ~/code to
  ~/code/infra (running processes follow the moved inode), hoists nested
  project clones to the workspace root, clones roster repos from Forgejo
  AS the user (their PAT makes private repos work), and wires the
  documented forgejo remote + forgejo/master upstream into clones that
  predate that contract.
- Fixed a latent TSV bug: empty jq @tsv fields collapse under tab-IFS
  read, shifting later fields left (groups was only safe by being the
  last field) — emit '-' sentinels instead.
- start-claude.sh session freshen is layout-aware (freshens each repo
  under ~/code for workspace users).
- managed claudeMd + AGENTS.md non-admin recipe + multi-tenancy.md
  updated in the same change.
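
The TSV bug in the list above is a shell `read` behaviour, but the effect can be modelled in Python. In this sketch, which is an analogy rather than the script's code, `naive_read` imitates the way consecutive tab separators collapse, and the `-` sentinel keeps empty fields in position. The row contents are hypothetical.

```python
import re


def naive_read(line):
    """Imitate tab-IFS `read`: runs of tabs act as one separator."""
    return re.split(r"\t+", line.strip("\t"))


def encode_row(fields):
    """Emit a TSV row with '-' standing in for empty fields."""
    return "\t".join(field if field else "-" for field in fields)


def decode_row(line):
    """Read a sentinel-encoded row back, restoring empty fields."""
    return ["" if field == "-" else field for field in naive_read(line)]


row = ["anca", "", "workspace"]  # hypothetical: user, empty field, layout
print(naive_read("\t".join(row)))   # ['anca', 'workspace'], layout shifted left
print(decode_row(encode_row(row)))  # ['anca', '', 'workspace']
```

One trade-off of this approach is that a field whose real value is `-` would read back as empty.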

Applied live: ancamilea = workspace (infra at ~/code/infra, her existing
tripit clone hoisted to ~/code/tripit, master upstream switched to
forgejo/master); emo stays single layout, untouched. [ci skip]
2026-06-10 18:05:31 +00:00
Viktor Barzin
3b6a5c6737 workstation: worktree-first feature work for all agents [ci skip]
Viktor asked that every feature task be developed in its own git worktree
and merged into master when done, enabling multiple agents to work the
same project concurrently. Encode the org rule in the managed claudeMd
(self-deploys to /etc via the hourly reconcile), add the worktree-first
paragraph to the AGENTS.md non-admin landing recipe, and gitignore
.worktrees/ so per-feature worktrees can live at the repo root. Full
lifecycle: ~/.claude/rules/execution.md §3.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:49:43 +00:00
Viktor Barzin
daddafd279 docs: superset rule for the internal viktorbarzin.me zone (mail-auth records) [ci skip]
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:47:31 +00:00
Viktor Barzin
00bc1e052d technitium: mirror mail-auth records into internal zone; fix redfish check [ci skip]
Two fixes from the post-DNS-internalization health sweep:

1. The internal viktorbarzin.me zone served only ingress A/CNAME records.
   Since the mailserver pods now resolve the domain through it (CoreDNS
   viktorbarzin.me:53 -> Technitium, 59a531b8), rspamd's SPF checks on
   inbound @viktorbarzin.me mail saw SPF=none and quarantined it — the
   Brevo email-roundtrip probe failed from the 16:20 run onward
   (EmailRoundtripFailing/Stale). The ingress-dns-sync CronJob now also
   maintains the static mail-auth records (SPF, brevo-code TXT, MX;
   DMARC + DKIM were already present), idempotently. Principle: the
   internal zone must be a SUPERSET of the public zone for every record
   type internal clients consume. Verified in-pod: all four types
   resolve; roundtrip re-probe green.

2. cluster_healthcheck #30 queried instant `up`, which goes stale for
   ~5 of every 10 minutes on the deliberate 10m redfish-idrac remnant
   job -> intermittent false "redfish-idrac=missing". Now uses
   last_over_time(up[15m]) — same answers for fast jobs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:46:37 +00:00
Viktor Barzin
e7fbf986fb workstation: rename tmux persistence out of the t3 namespace [ci skip]
Viktor's correction: this feature is about the tmux web-terminal
sessions, not t3 — t3 auto-saves its own threads (~/.t3 state +
daily t3-backup-state). Renamed t3-tmux-sessions -> tmux-persist
(units tmux-persist-save.timer / tmux-persist-restore.service, state
/var/lib/tmux-persist), header rescoped to say exactly that. Same
mechanism, correct taxonomy. Old units removed, state migrated,
re-verified live (5 emo + 3 wizard sessions snapshotted).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:42:52 +00:00
Viktor Barzin
2e4f48f3fc workstation: tmux sessions survive devvm reboots (save timer + boot restore)
Viktor: emo's open web-terminal sessions must persist across reboots.
Claude conversations were already durable on disk; the volatile part
was the tmux wiring (which named session runs which conversation).

t3-tmux-sessions save (5-min timer) snapshots every roster user's
sessions to /var/lib/t3-tmux-state/<user>.tsv — conversation uuid
taken from argv --resume (self-sustaining once restored) or the
newest transcript in the cwd-slug project dir created after process
start (fresh launcher sessions; claude does NOT hold its transcript
fd open, so fd-sniffing was a dead end). t3-tmux-sessions restore
(boot oneshot, also safe after partial loss) recreates missing
sessions with claude --resume <uuid>. Reconciler self-heals both
units' enablement.

Verified live: emo's 5 sessions snapshotted with correct uuids;
killed R730-cooling -> restore brought it back resuming the same
conversation (context meter identical); other sessions untouched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:39:32 +00:00
Viktor Barzin
59a531b8e0 coredns: pods get internal split-horizon answers for viktorbarzin.me [ci skip]
Forward the viktorbarzin.me:53 pod block to the Technitium ClusterIP
(10.96.0.53, same as the .lan block) instead of 8.8.8.8/1.1.1.1. Pods
become ordinary internal clients (CNAME -> apex -> live Traefik LB;
mail -> 10.0.20.1), fixing the 27 non-proxied [External] uptime-kuma
monitors that rode the TP-Link NAT loopback (hard-down since 06-09;
loopback refuses flows whose source equals the reflection target, which
all pfSense-SNAT'd cluster traffic does).

Enabled by re-testing a stale premise: on k8s 1.34 pods DO reach the
ETP=Local Traefik LB IP (kube-proxy short-circuits in-cluster traffic
to LB IPs; verified from pods on three non-Traefik nodes) — re-verify
after major k8s upgrades; canary = [External] fleet going red. The
NAT-layer alternatives (pfSense rdr, SNAT-drop) were rejected: both
fight return-path asymmetry and deepen TP-Link dependency.

Verified in-pod: immich -> .203 + HTTPS 200, mail -> 10.0.20.1,
forgejo -> Traefik ClusterIP (pin kept for Technitium-outage
resilience). Proxied [External] monitors now test the internal path —
true edge fidelity moves to the external vantage (ha-london, next fix).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 16:21:34 +00:00
Viktor Barzin
35c89fa90c workstation: managed Claude config self-deploys from the repo [ci skip]
Viktor's claudeMd edits must keep reaching every user now that emo is
out of the shared tree. Two reconciler additions:
- sync_managed_config: installs scripts/workstation/managed-settings.json
  to /etc/claude-code whenever the repo copy changes — editing the
  org claudeMd is now edit + commit, no manual install step
- refresh_codex_mirror: regenerates each user's ~/.codex/AGENTS.md
  (static mirror of the claudeMd; header-guarded so user-customized
  files are never clobbered)

Verified live: corrupted emo's mirror -> reconcile restored it;
wizard's stale mirror refreshed; in-sync managed config no-ops.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 16:03:24 +00:00
Viktor Barzin
8cfd0e5e5c Merge forgejo/master: reconcile diverged lineages [ci skip]
Local checkout carried the 2026-06-10 DNS/registry architecture series
(pfSense forward-zone, CoreDNS viktorbarzin.me:53 carve-out, nodes
stock) + vzdump/nfs-mirror/workstation-rebuild commits that never
reached the canonical remote, while forgejo master received the
emo-access series via isolated worktrees. Viktor asked to merge.

Conflict resolutions (newest iteration wins in each file):
- stacks/forgejo/cleanup.tf: LOCAL — dry_run=true (2026-06-10 revert
  after live retention orphaned OCI indexes; remote had 06-09 enable)
- .claude/CLAUDE.md, docs/architecture/backup-dr.md: LOCAL — final
  registry/DNS architecture + implemented vzdump alerts
- scripts/workstation/setup-devvm.sh: LOCAL — pinned-version,
  reproducible-rebuild refactor (kubelogin pin, restructured staging)
- scripts/workstation/managed-settings.json: FORGEJO — the
  allow-then-audit claudeMd (matches /etc deployment byte-for-byte)
- scripts/t3-provision-users.sh: FORGEJO comment; refresh_locked_clone
  intact

[ci skip]: all stack changes in the local lineage were applied live
this morning — CI would re-walk 100+ stacks via the modules/ fallback
for zero state change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:21:50 +00:00
Viktor Barzin
a34f9ff3b8 docs: infra Woodpecker repo-82 ops — in-cluster webhook, secret parity, empty-commit gotcha [ci skip]
Emo's first direct pushes surfaced three latent CI issues, all fixed
out-of-band today and recorded here: webhook deliveries to
ci.viktorbarzin.me timing out on the public-IP hairpin (hook now
targets the in-cluster woodpecker-server service), repo 82 registered
without the repo-scoped secret set (cloned from repo 1 in the DB), and
empty commits compiling every workflow so missing secrets hard-error.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:09:17 +00:00
63161ef3a5 test: final audit-pipeline verification
Repo-82 Woodpecker secrets were missing (repo-1 set cloned over) and
the webhook now targets the in-cluster service. This push should run
the full pipeline: Slack audit ping + no-op apply.
2026-06-10 15:07:15 +00:00
619b7608fa test: verify audit pipeline fires on emo push
Second verification: the Forgejo->Woodpecker webhook was timing out on
the public-IP hairpin (first test push fired no pipeline), so it now
targets the in-cluster Woodpecker service. This push should produce a
pipeline with the notify-nonadmin-push Slack step.
2026-06-10 15:03:48 +00:00
0f45585b53 test: verify emo direct master push (allow-then-audit)
Viktor granted emo direct push to master on 2026-06-10 — any change
allowed, tracked via commit messages + the Slack audit feed. This
empty commit verifies the whitelist and exercises the new
notify-nonadmin-push CI step end-to-end.
2026-06-10 14:54:04 +00:00
Viktor Barzin
a49d1eadf6 workstation: emo direct master push — allow-then-audit [ci skip]
Viktor: emo may make any change; what matters is tracking what changed
and why. ebarzin added to master push+merge whitelists (force-push
stays disabled — append-only history). Tracking enforced three ways:
- agent instructions (managed claudeMd + AGENTS.md): commit body MUST
  carry the user's plain-language intent; commits land on master
  directly; [ci skip] forbidden for non-admins
- new notify-nonadmin-push step in .woodpecker/default.yml: Slack
  message for every non-admin master push (admin pushes silent)
- PR flow remains the fallback for non-whitelisted users

Accepted consequence (informed): emo's pushes auto-apply changed
stacks via CI. Offboard runbook gains whitelist-removal step.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:53:43 +00:00
Viktor Barzin
6d8773573c workstation: agent-driven contribute flow for non-technical users [ci skip]
emo can't use git — his agent must do all VCS mechanics invisibly.
Managed claudeMd (every session, top precedence) now instructs agents:
commit -> push <os-user>/<topic> branch -> open PR via Forgejo API
(user's PAT from ~/.git-credentials) -> back to clean master -> tell
the user in plain words it's submitted for review. AGENTS.md carries
the full recipe with the curl call.

Verified live as emo: PR #1 opened (HTTP 201, write:repository scope
suffices) and closed via his PAT. Deployed to
/etc/claude-code/managed-settings.json; codex AGENTS.md mirrors for
emo + ancamilea regenerated from the new claudeMd.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 10:12:26 +00:00
Viktor Barzin
2e5af5dc0e workstation: keep non-admin infra clones fresh (hourly + at launch) [ci skip]
Non-admins (emo) need current master without manual pulls. Two layers:
- t3-provision-users reconcile gains refresh_locked_clone: fetch all
  remotes + ff-only master, guarded (on master, clean tree, upstream
  set); dirty/diverged clones are left alone with a WARN.
- start-claude.sh freshens ~/code at session launch, 15s-capped so an
  offline remote never delays the session.
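
The guarded fast-forward described above can be sketched roughly as follows. This is an assumption-labeled illustration, not the real refresh_locked_clone: the function name refresh_clone, the WARN wording, and the exact order of the checks are hypothetical. Only the guards (on master, clean tree, upstream set) and the fetch plus ff-only step come from the commit message.

```shell
#!/bin/sh
# Hypothetical sketch of a guarded refresh; returns non-zero when it skips.
refresh_clone() (
  dir=$1
  cd "$dir" 2>/dev/null || { echo "WARN: $dir not accessible" >&2; exit 1; }

  # Guard 1: only touch a clone that is checked out on master.
  branch=$(git symbolic-ref --quiet --short HEAD 2>/dev/null) || branch=""
  if [ "$branch" != "master" ]; then
    echo "WARN: $dir is not on master, leaving it alone" >&2
    exit 1
  fi

  # Guard 2: never touch a dirty working tree.
  if [ -n "$(git status --porcelain 2>/dev/null)" ]; then
    echo "WARN: $dir has uncommitted changes, leaving it alone" >&2
    exit 1
  fi

  # Guard 3: an upstream must be configured for the current branch.
  if ! git rev-parse --abbrev-ref '@{u}' >/dev/null 2>&1; then
    echo "WARN: $dir has no upstream, leaving it alone" >&2
    exit 1
  fi

  git fetch --all --quiet || exit 1
  # --ff-only refuses to create a merge commit, so a diverged clone is kept.
  git merge --ff-only --quiet '@{u}' || {
    echo "WARN: $dir has diverged from upstream, leaving it alone" >&2
    exit 1
  }
)
```

A launcher that must not block on an offline remote could run the fetch under coreutils `timeout 15`, matching the 15s cap mentioned above; that wiring is left out of the sketch.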

Verified live on emo's clone: stale clone ff'd to tip by the
reconciler; launcher snippet ff's when clean and refuses while a
dirty file exists. Deployed to /usr/local/bin/t3-provision-users,
/etc/skel/start-claude.sh, and emo's launcher.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 09:41:38 +00:00
Viktor Barzin
5d9417fbaa workstation: emo contribute access + Phase-5 cutover done; gate master (push=apply) [ci skip]
ADR-0004's premise was wrong: pushing master fires the Woodpecker apply
pipeline (require_approval=forks only), so master pushes ARE deploys.
Added Forgejo branch protection on master (push/merge whitelist=viktor,
deploy keys allowed); non-admins contribute via branches + PRs.

emo (ebarzin): write collaborator on viktor/infra, PAT in
~/.git-credentials, forgejo remote + upstream in his locked clone.
Phase-5 finished: code-shared removed; ~/.claude symlinks kept (they
ARE the skel shared-base mechanism — plan step 4c obsolete).
Offboard runbook: revoke PAT + collaborator + group steps added.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 09:30:41 +00:00
Viktor Barzin
a1b7b0ca53 forgejo retention: revert to DRY_RUN — first live run orphaned OCI indexes [ci skip]
The keep-set (newest 10 versions + latest + *cache* tags) treats
multi-arch/attestation index CHILDREN — separate untagged sha256
versions — as deletable: for images not rebuilt recently they sort
outside the newest-10 window and were pruned while their kept parent
index survived. kms-website :latest and :dfc83fb children 404'd
(RegistryManifestIntegrityFailure, caught by forgejo-integrity-probe
within hours; deployed tag a794d1a unaffected).

Healed: :latest re-pointed at the intact a794d1a index (also the
newest commit), corrupt :dfc83fb version deleted, probe re-run clean
(0 failures / 22 repos / 63 tags / 59 indexes). DRY_RUN=true applied
live. Re-enable only with a container-aware keep-set — options in the
post-mortem.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 09:22:47 +00:00
Viktor Barzin
e49c91e60c monitoring: VzdumpBackup{Stale,NeverRun,Failing} alerts for the new VM-image backup
vzdump-vms pushes vzdump_last_{run,success}_timestamp + vzdump_last_status to
Pushgateway job vzdump-backup, but nothing alerted on them — a stopped/failing VM
backup would be silent (exactly how the nfs-mirror reaping went unnoticed until I
re-verified). Add the trio to the 3-2-1 group in prometheus_chart_values.tpl,
mirroring the LVM/pfSense/nfs-mirror alerts. Stale = >~50h since last success.

NOT [ci]-applied: this is a Terraform stack change — arms on the next
`scripts/tg apply` of the monitoring stack (metrics already flow, so it arms
immediately once applied). Admin-gated apply per org policy.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 09:10:46 +00:00
Viktor Barzin
05f928931f workstation: packages.txt — add provisioner build deps + uncaptured core tools
setup-devvm.sh now needs golang-go (builds t3-dispatch in section 9) and uses unzip
(kubelogin extraction); neither was in the manifest, so a fresh box would skip the
t3-dispatch build. Also add build-essential (cgo / npm native modules) + core tools
that were manually installed but uncaptured (rsync, wget, tree, shellcheck). Noted
gh as non-apt (GitHub's own repo). All verified to resolve in apt.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 09:08:53 +00:00
Viktor Barzin
312c418a9a workstation: setup-devvm.sh installs the systemd service layer (reproducible rebuild)
The t3 system units (t3-serve@, t3-autoupdate, t3-backup-state, t3-provision-users,
t3-dispatch) + the t3-dispatch Go binary + t3-mint + the sudoers grant were all
hand-scp'd and would NOT survive a fresh devvm. setup-devvm.sh now installs + enables
them: build-if-absent for the Go binary, visudo-validated sudoers (a malformed
/etc/sudoers.d file breaks all sudo), timers self-heal, t3-dispatch system account
created if absent. t3-serve@ stays a per-user template enabled by the provisioner;
the ttyd terminal-lobby chain ships from its own repo (viktor/terminal-lobby).

Verified: shellcheck clean, go build compiles, visudo parses the sudoers, units parse.
NOT run live (would re-assert apt/npm on the shared host) — exercised on next rebuild.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 09:07:20 +00:00
Viktor Barzin
d9ea7812f5 nfs-mirror: exclude /vzdump/ — it was reaping the new VM-image backups nightly
nfs-mirror does `rsync -rlt --delete /srv/nfs/ -> /mnt/backup/`; any /mnt/backup
dir with no /srv/nfs counterpart is an orphan and gets --delete'd. vzdump-vms
(added yesterday) writes /mnt/backup/vzdump/, which wasn't excluded — so the
02:00 nfs-mirror run silently deleted both successful 40G devvm images
(verified: dir gone, 40G freed, despite status=0 success logs). Add
--exclude='/vzdump/' alongside the existing pvc-data/pfsense/pve-config/
sqlite-backup excludes that exist for exactly this reason. TDD-proven with an
isolated rsync --delete -n -v. backup-dr.md notes the dependency.
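
The reaping and the fix can be reproduced on throwaway directories. This is a minimal sketch under stated assumptions: the temp directories stand in for /srv/nfs and /mnt/backup, the file names are invented, and demo_mirror is a hypothetical helper. The rsync flags are the ones quoted above.

```shell
#!/bin/sh
# Hypothetical reproduction; paths and file names are illustrative only.
demo_mirror() (
  src=$(mktemp -d)   # stands in for /srv/nfs
  dst=$(mktemp -d)   # stands in for /mnt/backup

  echo data > "$src/share.txt"
  mkdir "$dst/vzdump" "$dst/stale"
  echo image > "$dst/vzdump/vm-102.img"   # receiver-only, must survive
  echo old > "$dst/stale/file"            # receiver-only orphan, should go

  # The leading slash anchors the pattern at the transfer root; an excluded
  # path on the receiver is also protected from --delete.
  rsync -rlt --delete --exclude='/vzdump/' "$src/" "$dst/"

  ls "$dst"
  rm -rf "$src" "$dst"
)
```

Calling demo_mirror lists share.txt and vzdump: the orphaned stale directory is deleted while the excluded one is left in place. Dropping the --exclude flag makes vzdump disappear as well, which is the behaviour the commit describes.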

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 09:04:57 +00:00
Viktor Barzin
2b8c0def30 dns: pfSense forward-zone for viktorbarzin.me, nodes fully stock [ci skip]
Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node
customization — split-brain lives in the DNS infra):

- pfSense Unbound domain override viktorbarzin.me -> Technitium
  10.0.20.201 (applied via php write_config, backup on-box). Every
  Unbound client on every VLAN now gets the internal split-horizon
  answers (live Traefik IP via apex CNAME) with zero per-host config.
- CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block —
  forgejo pinned to Traefik ClusterIP via data source (pods cannot reach
  the ETP=Local LB IP pfSense now returns), all other .me names kept on
  public resolvers (pods' pre-existing behavior). Replaces the .:53
  forgejo rewrite.
- Removed the same-day systemd-resolved routing-domain drop-ins from all 7 nodes;
  node5/6 link DNS repointed Technitium -> pfSense (netplan + qm 205/206)
  for fleet parity; cloud-init no longer writes any DNS drop-ins.
- Docs: dns.md, pfsense-unbound runbook (override + rollback), registry
  bullet, post-mortem final-architecture addendum.

Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK,
pods resolve forgejo -> ClusterIP / others -> public, mail record works,
.lan zone unaffected.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 08:32:34 +00:00
Viktor Barzin
1ee1bf0817 forgejo pulls: route *.viktorbarzin.me to Technitium, drop /etc/hosts pins [ci skip]
Supersedes this morning's per-node /etc/hosts pin (no hardcoded service
IPs on nodes, per Viktor). Technitium's split-horizon zone already
resolves forgejo.viktorbarzin.me -> CNAME apex -> live Traefik LB IP
(ingress-dns-sync auto-CNAMEs every ingress host; apex drift probe
alerts) -- the nodes just never queried it. Rolled the devvm's
systemd-resolved routing-domain pattern (~viktorbarzin.me ->
10.0.20.201) to all 7 nodes, removed the pins, verified getent +
crictl pull via pure DNS.

Also demoted node5/6's cloud-init global-dns.conf (DNS=8.8.8.8 1.1.1.1)
to FallbackDNS-only: public servers in the global set race the routing
domain. Its justification ("Technitium NXDOMAINs forgejo") was obsolete
-- exactly the stale comment that pointed new nodes at the hairpin.

hosts.toml mirror kept but documented as vestigial (Traefik 404s
bare-IP requests; registry auth realm is an absolute URL).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 07:56:31 +00:00
Viktor Barzin
b6976ce014 forgejo pulls: pin registry name to internal Traefik in node /etc/hosts [ci skip]
tuya-bridge was down 7.5h (ImagePullBackOff on k8s-node3): fresh kubelet
pulls of forgejo.viktorbarzin.me images depended on the intermittently
broken public-IP hairpin. The containerd hosts.toml mirror cannot keep
pulls internal on its own — Traefik 404s its bare-IP requests (no
Host/SNI match) and the registry Bearer realm is an absolute public URL
fetched outside the mirror. Third incident of this class (buildkit
06-04, tripit/devvm 06-09).

Fix: /etc/hosts pin 10.0.20.203 forgejo.viktorbarzin.me on every node —
covers resolve + token + blob legs with correct SNI and valid cert.
Applied live to all 7 nodes; persisted in the cloud-init bootstrap and
the existing-node rollout script. Docs updated (registry bullet, dns.md
hairpin scope + stale .200 literals, runbook) + post-mortem.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 07:15:24 +00:00
Viktor Barzin
eb8695743b workstation: fix setup-devvm.sh provisioner correctness (claude detect, kubelogin pin, codex auth, t3-serve dir)
- claude-code: detect via `npm ls -g` not `command -v claude` — the admin's
  personal ~/.local/bin/claude shadowed the PATH check, so the system-wide
  install never ran (/usr/lib/node_modules/@anthropic-ai empty, no /usr/bin/claude;
  fresh non-admins had no claude). Found during the devvm reproducibility audit.
- kubelogin: pin v1.36.2 instead of releases/latest/download, so two fresh boxes
  built weeks apart are byte-identical.
- /etc/t3-serve: mkdir before the token writes (install -m doesn't create the
  parent — section 8 would fail on a fresh box).
- codex shared auth: stage /opt/codex-shared/auth.json from Vault
  secret/workstation.codex_shared_auth_json (key already existed but nothing
  consumed it — was a manual step lost on rebuild), mirroring the Claude token.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:54 +00:00
Viktor Barzin
8886ac7763 backup: fix vzdump-vms exit code — EXIT-trap && short-circuit falsely failed OK runs
First live run produced a valid 40G dump and logged status=0, but the service
exited 1/FAILURE: cleanup() used `[ -n "$KILLED" ] && push_metrics 2 0`, and a
bash EXIT trap whose LAST command returns non-zero overrides the script's
`exit 0`. With KILLED empty the && short-circuits -> returns 1 -> a successful
backup is marked failed (would trip a vzdump staleness/failure alert). Switch to
daily-backup's `if…fi` idiom (returns 0 when not killed). Bug reproduced + fix
verified locally; redeployed to PVE + reset-failed.
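
The difference between the two idioms comes down to the return status of the cleanup function. A minimal sketch, assuming a stub push_metrics and an empty KILLED as in a successful run; it shows only the function return status and does not model how that status reaches the service exit code.

```shell
#!/bin/sh
# Hypothetical stand-ins for the real script's variable and metrics helper.
KILLED=""
push_metrics() { echo "push_metrics $*"; }

# && form: when KILLED is empty the test fails, the list short-circuits,
# and the function returns the test's non-zero status.
cleanup_and() { [ -n "$KILLED" ] && push_metrics 2 0; }

# if form: an if statement whose condition is false and which has no else
# branch returns 0, so a clean run stays clean.
cleanup_if() { if [ -n "$KILLED" ]; then push_metrics 2 0; fi; }

cleanup_and || echo "&& form returned $?"
cleanup_if && echo "if form returned $?"
```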

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:54 +00:00
Viktor Barzin
7330cb6a0b backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap
The hand-managed Linux VMs (not in Terraform) were never imaged: the
PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost
devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has
no remote).

vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of
VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the
monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent
enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd.
Pushgateway job vzdump-backup.

Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image
layer + protection matrix), infra CLAUDE.md, AGENTS.md.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:54 +00:00
Viktor Barzin
3e7093947d t3: bump pin 0.0.24 -> 0.0.26 (fable-5) [ci skip]
Completes the 0.0.26 adoption prepared in fcb84ce0 (version-agnostic
dispatch browser-session/bootstrap fallback + Gate-2 real pairing
health-check + per-user state.sqlite backup). 0.0.26 verified
end-to-end on the devvm: emo + ancamilea auto-pair via t3-dispatch
(302 + Set-Cookie t3_session) after migrating state.sqlite 30->32;
pre-cutover backups in /var/backups/t3-state. Brings claude-fable-5
into the t3 model picker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
dacd9d2d8a t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip]
Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):

- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
  /api/auth/bootstrap on 404 — one binary pairs across both versions and any
  rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
  before, green after). Built, deployed, verified live on 0.0.24 (all three
  users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
  handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad
  build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
  state.sqlite (was the only copy, unbacked) -> the one-way forward schema
  migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
baac46415f t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip]
The t3-autoupdate timer (re-enabled by the provisioner's step 5b with
`--now`, which fires the missed daily job immediately on a Persistent
timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema
migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions
role->scopes, +proof_key_thumbprint) AND changed the bootstrap API,
breaking t3-mint/pairing for ALL devvm users (pair prompt, no session).

- t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a
  nightly tracker -- re-asserts the pin (a no-op when correct).
- t3-provision-users.sh step 5b: drop `--now` (it triggered the
  immediate missed-job run that pulled the bad build).
- setup-devvm.sh: install pinned t3@0.0.24 at machine setup.
- unit Descriptions + service-catalog reflect the pin.
- post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md.

Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled
the (now-pinned) enforcer, reset the 2 new users' disposable DBs,
surgically reverted wizard's auth tables to level-30 (96 threads + live
session preserved). All users verified 302 + t3_session.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
41c11216da t3-dispatch: re-pair on present-but-invalid t3_session cookie
The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09
auth-schema rollback wiped all server-side sessions, browsers kept dead
30-day t3_session cookies; the dispatcher proxied them straight through
and t3 rendered its pair page ("all users must pair again").

Now a present cookie on a top-level document navigation is validated via
the instance's /api/auth/session and re-paired on authenticated:false.
Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html)
so XHR/asset/WebSocket sub-requests are never answered with a 302; fails
open (proxy through) on any validation error. Unit + handler tests added.

[ci skip]

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
e0ab621cb2 workstation: fix new-user .env clobber — env_set preserves CLAUDE_CODE_OAUTH_TOKEN
The port-write used '>' (overwrite), wiping the token injected earlier in the same run for a NEW user (existing users like anca survived only because their .env already had the T3_PORT line). New env_set() does update-or-append per key, preserving others. Verified end-to-end: throwaway t3probe provisioned from scratch -> .env has both T3_PORT + CLAUDE_CODE_OAUTH_TOKEN -> claude -p AUTHOK. So all new non-admins now authenticate automatically. NOT pushed (shared-tree divergence hold).
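
An update-or-append helper in the spirit of env_set might look like the sketch below. This is not the repo's implementation: the body, the argument order (file, key, value), and the sample values are assumptions. One visible design choice is that an updated key moves to the end of the file.

```shell
#!/bin/sh
# Hypothetical update-or-append for KEY=VALUE env files.
env_set() (
  file=$1
  key=$2
  value=$3
  touch "$file"
  tmp=$(mktemp)
  # Keep every line that does not define this key; grep exits non-zero when
  # nothing is left to keep, which is fine here.
  grep -v "^${key}=" "$file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Rewrite in place so the original file's owner and mode are retained.
  cat "$tmp" > "$file"
  rm -f "$tmp"
)
```

Calling `env_set "$envfile" T3_PORT 3001` after a token line has been written leaves the token in place, unlike a plain `>` redirect, which truncates the file first.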

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
39e35ca8c9 workstation: share admin Claude subscription with non-admins (CLAUDE_CODE_OAUTH_TOKEN)
Non-admins without their own ~/.claude login get the shared long-lived sk-ant-oat01 token injected into their t3-serve env, so their agent authenticates against the admin's subscription. setup-devvm.sh stages it from Vault secret/workstation.claude_oauth_token (root-readable); the provisioner's install_user_claude_token injects per-user, if-absent (never clobbers emo's own login). Live-fixed anca (verified AUTHOK); this codifies it for reproducibility + future users. NOT pushed (shared-tree divergence hold).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
1edccedb1f workstation: v2 membership implementation plan [ci skip]
8 tasks: engine derive_os_user + roster_from_members (TDD); read-only Authentik token (TF); setup-devvm.sh stages it; provisioner sources T3 Users members from the Authentik API (replaces roster.yaml); Authentik-managed membership + legacy os_user attributes; retire roster.yaml; e2e add/remove smoke. Pairs with the 2026-06-09 design doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
87702bdce8 feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip]
New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible
/v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on
8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/.

Option A off-peak control (no VRAM isolation on the time-sliced T4 — see
post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London
CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00
ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from
gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields
the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at
06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a
skipped/aborted window backfills next time. SA+Role+RoleBinding grant the
CronJobs deployments/scale (nextcloud-watchdog pattern).
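
The free-VRAM preflight reduces to one comparison. A minimal sketch, assuming the used-bytes figure has already been summed from gpu_pod_memory_used_bytes; the function name vram_preflight is hypothetical, and the 6GiB default floor is the value this commit gives for var.vram_free_floor_bytes.

```shell
#!/bin/sh
# Hypothetical preflight: succeed only when free VRAM is at or above the floor.
GIB=$((1024 * 1024 * 1024))

vram_preflight() (
  used=$1                       # bytes in use, summed across GPU pods
  floor=${2:-$((6 * GIB))}      # default floor, per the commit message
  total=$((16 * GIB))           # T4 capacity used in the formula above
  free=$((total - used))
  [ "$free" -ge "$floor" ]
)

if vram_preflight $((9 * GIB)); then
  echo "preflight passed: scale to 1"
else
  echo "preflight failed: stay at 0"
fi
```

With 9GiB in use, 7GiB is free and the gate opens; with 11GiB in use, only 5GiB is free and the window is skipped.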

Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the
`tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox
keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure
— never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses
the base exclude list (tts untouched there).

tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to
local.app_env (no token — ClusterIP only). No tripit code change.

Image build is documented in stacks/tts/README.md (devnen cu128 target ->
forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline
(large CUDA image + needs the upstream repo). NOT APPLIED — review branch only.
Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the
measured chatterbox-multilingual T4 peak during the first bake.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
edaee13be3 docs(ci-cd): tripit auto-deploy (GHA->Woodpecker 167) + svu semver in GHA [ci skip]
2026-06-09 21:41:53 +00:00
Viktor Barzin
4b44db36da workstation: skel start-claude.sh inherits managed default model (drop hardcoded --model)
The per-user launcher hardcoded --model claude-opus-4-8; an explicit --model flag overrides the managed default in /etc/claude-code/managed-settings.json (claude-fable-5). Dropping it lets emo and all new accounts inherit the org default (per-session /model still works). Deployed to /etc/skel and emo live copy in the same change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
64413c76ce workstation: default Claude model = claude-fable-5 for all devvm users
Adds a model key (claude-fable-5) to the machine-wide managed-settings.json (installed to /etc/claude-code/ by setup-devvm.sh). Sets the default model for every Claude Code session on the devvm (CLI + t3 web) at top settings precedence; per-session /model and explicit --model flags still override. The org claudeMd block is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
93ec0c66fd docs(ci-cd): add off-infra GHA->GHCR build pattern for private Forgejo repos (tripit pilot) [ci skip]
2026-06-09 21:41:53 +00:00
90b8312a29 tripit: build off-infra via GHA -> GHCR (private), pull via scoped ghcr-credentials
Switches tripit image from forgejo.viktorbarzin.me (in-cluster buildkit, sdc load) to ghcr.io/viktorbarzin/tripit built by GitHub Actions (Forgejo push-mirror -> private GitHub -> GHA -> GHCR). Adds a tripit-ns-scoped GHCR pull secret (github_pat, interim). Verified: deploy on :c8dfb5cb ready, ingest-plans CronJob pulled :latest + Succeeded. [ci skip]
2026-06-09 21:41:53 +00:00
Viktor Barzin
e0452611b5 forgejo: survive CI-build registry-push storms (mem 3Gi + working retention)
Heavy in-cluster builds (e.g. tripit buildkit) were taking Forgejo down via
two vectors. Fixes both, without moving Forgejo off the sdc HDD (code-oflt
deferred):

- Memory 1Gi -> 3Gi (requests=limits). Forgejo was OOMKilled (exit 137) under
  registry-push load; VPA upperBound ~1.5Gi was suppressed by the 1Gi cap it
  kept OOMing against. Size for the push spike.

- Activate registry retention (DRY_RUN false). Verified the delete list
  against all running viktor/* images first: 0 running images affected.
  Pruned 478 -> 161 package versions; PVC was at its 50Gi autoresize ceiling.

- FIX broken retention auth: the cleanup PAT was ci-pusher's, but Forgejo
  scopes container packages per-user, so DELETE on viktor/* returned 403 (the
  dry-run only did GETs, hiding it). Repointed forgejo_cleanup_token to
  viktor's write:package PAT. Retention had never actually worked.

- Protect buildkit *cache* tags from retention (cleanup.sh keep-set) so the
  gentler-builds layer cache survives daily pruning.

[ci skip] — already applied via scripts/tg.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
bc37b16815 backup: fix vzdump-vms exit code — EXIT-trap && short-circuit falsely failed OK runs
First live run produced a valid 40G dump and logged status=0, but the service
exited 1/FAILURE: cleanup() used `[ -n "$KILLED" ] && push_metrics 2 0`, and a
bash EXIT trap whose LAST command returns non-zero overrides the script's
`exit 0`. With KILLED empty the && short-circuits -> returns 1 -> a successful
backup is marked failed (would trip a vzdump staleness/failure alert). Switch to
daily-backup's `if…fi` idiom (returns 0 when not killed). Bug reproduced + fix
verified locally; redeployed to PVE + reset-failed.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:30:19 +00:00
Viktor Barzin
83f418159a backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap
The hand-managed Linux VMs (not in Terraform) were never imaged: the
PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost
devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has
no remote).

vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of
VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the
monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent
enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd.
Pushgateway job vzdump-backup.

Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image
layer + protection matrix), infra CLAUDE.md, AGENTS.md.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:30:19 +00:00
Viktor Barzin
7fc4caefe3 t3: bump pin 0.0.24 -> 0.0.26 (fable-5) [ci skip]
Completes the 0.0.26 adoption prepared in fcb84ce0 (version-agnostic
dispatch browser-session/bootstrap fallback + Gate-2 real pairing
health-check + per-user state.sqlite backup). 0.0.26 verified
end-to-end on the devvm: emo + ancamilea auto-pair via t3-dispatch
(302 + Set-Cookie t3_session) after migrating state.sqlite 30->32;
pre-cutover backups in /var/backups/t3-state. Brings claude-fable-5
into the t3 model picker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
bccaa08d8e t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip]
Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):

- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
  /api/auth/bootstrap on 404 — one binary pairs across both versions and any
  rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
  before, green after). Built, deployed, verified live on 0.0.24 (all three
  users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
  handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad
  build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
  state.sqlite (was the only copy, unbacked) -> the one-way forward schema
  migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.
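
The version-agnostic pairing in the t3-dispatch bullet above can be sketched as follows. This is a minimal Python illustration, not the dispatcher's actual code: only the two endpoint paths come from the commit message, while `auto_pair`, the `post` callable and the two fake servers are hypothetical stand-ins.

```python
from typing import Callable, Tuple

# Endpoint paths are taken from the commit message: new name first, old name as fallback.
PAIR_ENDPOINTS = ("/api/auth/browser-session", "/api/auth/bootstrap")


def auto_pair(post: Callable[[str], int]) -> Tuple[str, int]:
    """Try each pairing endpoint in order and fall back only on 404.

    `post` is a hypothetical callable that sends the pairing credential to
    the given path and returns the HTTP status code.
    """
    for path in PAIR_ENDPOINTS:
        status = post(path)
        if status != 404:
            # The endpoint exists on this version; report what it answered.
            return path, status
    raise RuntimeError("no known pairing endpoint on this instance")


# Fake servers standing in for the two t3 versions (status 200 is illustrative).
def server_0_0_24(path: str) -> int:
    return 200 if path == "/api/auth/bootstrap" else 404


def server_0_0_25(path: str) -> int:
    return 200 if path == "/api/auth/browser-session" else 404
```

Because the fallback is driven by the response rather than by a version check, the same client keeps working while instances on both versions coexist during a rolling restart.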

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
5ea238c707 t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip]
The t3-autoupdate timer (re-enabled by the provisioner's step 5b with
`--now`, which fires the missed daily job immediately on a Persistent
timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema
migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions
role->scopes, +proof_key_thumbprint) AND changed the bootstrap API,
breaking t3-mint/pairing for ALL devvm users (pair prompt, no session).

- t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a
  nightly tracker -- re-asserts the pin (a no-op when correct).
- t3-provision-users.sh step 5b: drop `--now` (it triggered the
  immediate missed-job run that pulled the bad build).
- setup-devvm.sh: install pinned t3@0.0.24 at machine setup.
- unit Descriptions + service-catalog reflect the pin.
- post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md.

Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled
the (now-pinned) enforcer, reset the 2 new users' disposable DBs,
surgically reverted wizard's auth tables to level-30 (96 threads + live
session preserved). All users verified 302 + t3_session.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
2125651aaa t3-dispatch: re-pair on present-but-invalid t3_session cookie
The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09
auth-schema rollback wiped all server-side sessions, browsers kept dead
30-day t3_session cookies; the dispatcher proxied them straight through
and t3 rendered its pair page ("all users must pair again").

Now a present cookie on a top-level document navigation is validated via
the instance's /api/auth/session and re-paired on authenticated:false.
Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html)
so XHR/asset/WebSocket sub-requests are never answered with a 302; fails
open (proxy through) on any validation error. Unit + handler tests added.
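
The gating described above can be sketched as below. This is a hedged Python illustration under stated assumptions: the function names are hypothetical, `session_is_authenticated` stands in for the call to the instance's `/api/auth/session`, and reading "else Accept" as "only when `Sec-Fetch-Dest` is absent" is an interpretation of the commit message.

```python
from typing import Callable, Mapping


def is_document_navigation(headers: Mapping[str, str]) -> bool:
    """True for top-level page loads, False for XHR/asset/WebSocket requests."""
    h = {k.lower(): v for k, v in headers.items()}
    dest = h.get("sec-fetch-dest")
    if dest is not None:
        return dest.strip().lower() == "document"
    # No Sec-Fetch-Dest header: fall back to the Accept header.
    return "text/html" in h.get("accept", "").lower()


def should_repair(headers: Mapping[str, str],
                  session_is_authenticated: Callable[[], bool]) -> bool:
    """Decide whether a request that carries a t3_session cookie is re-paired."""
    if not is_document_navigation(headers):
        return False  # never answer sub-requests with a 302
    try:
        return not session_is_authenticated()
    except Exception:
        return False  # fail open: proxy the request through unchanged
```

Failing open keeps a validation outage from turning into a redirect loop; the worst case is the old behavior, where t3 renders its own pair page.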

[ci skip]

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
fad10a8707 workstation: fix new-user .env clobber — env_set preserves CLAUDE_CODE_OAUTH_TOKEN
The port-write used '>' (overwrite), wiping the token injected earlier in the same run for a NEW user (existing users like anca survived only because their .env already had the T3_PORT line). New env_set() does update-or-append per key, preserving others. Verified end-to-end: throwaway t3probe provisioned from scratch -> .env has both T3_PORT + CLAUDE_CODE_OAUTH_TOKEN -> claude -p AUTHOK. So all new non-admins now authenticate automatically. NOT pushed (shared-tree divergence hold).
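
The update-or-append behavior of `env_set()` can be illustrated with a small Python sketch. The real helper lives in the provisioner and is not shown in this log, so the signature below (operating on the file's text rather than on a path) is an assumption made for illustration.

```python
def env_set(text: str, key: str, value: str) -> str:
    """Return .env content with `key` set to `value`, leaving other keys intact."""
    prefix = key + "="
    out = []
    replaced = False
    for line in text.splitlines():
        if line.startswith(prefix):
            if not replaced:
                out.append(prefix + value)  # update the existing entry in place
                replaced = True
            # any later duplicate of the same key is dropped
        else:
            out.append(line)  # every other key is preserved verbatim
    if not replaced:
        out.append(prefix + value)  # key was absent: append it
    return "\n".join(out) + "\n"


# A new user's .env already holds the injected token when the port is written.
before = "CLAUDE_CODE_OAUTH_TOKEN=example-token\n"
after = env_set(before, "T3_PORT", "3001")
```

With a plain `>` redirect the second write would have replaced the whole file; here `after` still contains the token line followed by `T3_PORT=3001`.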

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
eeadf0f85d workstation: share admin Claude subscription with non-admins (CLAUDE_CODE_OAUTH_TOKEN)
Non-admins without their own ~/.claude login get the shared long-lived sk-ant-oat01 token injected into their t3-serve env, so their agent authenticates against the admin's subscription. setup-devvm.sh stages it from Vault secret/workstation.claude_oauth_token (root-readable); the provisioner's install_user_claude_token injects per-user, if-absent (never clobbers emo's own login). Live-fixed anca (verified AUTHOK); this codifies it for reproducibility + future users. NOT pushed (shared-tree divergence hold).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
fbcc330214 workstation: v2 membership implementation plan [ci skip]
8 tasks: engine derive_os_user + roster_from_members (TDD); read-only Authentik token (TF); setup-devvm.sh stages it; provisioner sources T3 Users members from the Authentik API (replaces roster.yaml); Authentik-managed membership + legacy os_user attributes; retire roster.yaml; e2e add/remove smoke. Pairs with the 2026-06-09 design doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
48013a4a92 feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip]
New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible
/v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on
8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/.

Option A off-peak control (no VRAM isolation on the time-sliced T4 — see
post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London
CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00
ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from
gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields
the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at
06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a
skipped/aborted window backfills next time. SA+Role+RoleBinding grant the
CronJobs deployments/scale (nextcloud-watchdog pattern).
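
The free-VRAM preflight reduces to one inequality, sketched here in Python. The 16GiB card size and the `free = total - used >= floor` rule come from the commit message; the function name and the idea of passing per-pod `gpu_pod_memory_used_bytes` samples as a list are assumptions for illustration, and the actual CronJob may compute this differently.

```python
GIB = 1024 ** 3
T4_TOTAL_BYTES = 16 * GIB  # card size as stated in the commit message


def vram_preflight(used_bytes_per_pod, floor_bytes, total_bytes=T4_TOTAL_BYTES):
    """Return True when enough VRAM is free to scale Chatterbox up."""
    used = sum(used_bytes_per_pod)  # sum of gpu_pod_memory_used_bytes samples
    free = total_bytes - used
    return free >= floor_bytes


# Example with the default 6GiB floor (var.vram_free_floor_bytes).
quiet_night = vram_preflight([4 * GIB, 5 * GIB], floor_bytes=6 * GIB)  # 7GiB free
busy_night = vram_preflight([6 * GIB, 5 * GIB], floor_bytes=6 * GIB)   # 5GiB free
```

A failed preflight simply skips the window; as the message notes, the bake is best-effort and backfills on a later run.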

Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the
`tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox
keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure
— never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses
the base exclude list (tts untouched there).

tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to
local.app_env (no token — ClusterIP only). No tripit code change.

Image build is documented in stacks/tts/README.md (devnen cu128 target ->
forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline
(large CUDA image + needs the upstream repo). NOT APPLIED — review branch only.
Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the
measured chatterbox-multilingual T4 peak during the first bake.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
b1a6391a4d docs(ci-cd): tripit auto-deploy (GHA->Woodpecker 167) + svu semver in GHA [ci skip] 2026-06-09 19:41:08 +00:00
Viktor Barzin
68a237faf7 workstation: skel start-claude.sh inherits managed default model (drop hardcoded --model)
The per-user launcher hardcoded --model claude-opus-4-8; an explicit --model flag overrides the managed default in /etc/claude-code/managed-settings.json (claude-fable-5). Dropping it lets emo and all new accounts inherit the org default (per-session /model still works). Deployed to /etc/skel and emo live copy in the same change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 19:35:29 +00:00
Viktor Barzin
64f405db36 workstation: default Claude model = claude-fable-5 for all devvm users
Adds a model key (claude-fable-5) to the machine-wide managed-settings.json (installed to /etc/claude-code/ by setup-devvm.sh). Sets the default model for every Claude Code session on the devvm (CLI + t3 web) at top settings precedence; per-session /model and explicit --model flags still override. The org claudeMd block is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 18:31:27 +00:00
8eb0bb244f docs(ci-cd): add off-infra GHA->GHCR build pattern for private Forgejo repos (tripit pilot) [ci skip] 2026-06-09 18:20:54 +00:00
1f23ba6929 tripit: build off-infra via GHA -> GHCR (private), pull via scoped ghcr-credentials
Switches tripit image from forgejo.viktorbarzin.me (in-cluster buildkit, sdc load) to ghcr.io/viktorbarzin/tripit built by GitHub Actions (Forgejo push-mirror -> private GitHub -> GHA -> GHCR). Adds a tripit-ns-scoped GHCR pull secret (github_pat, interim). Verified: deploy on :c8dfb5cb ready, ingest-plans CronJob pulled :latest + Succeeded. [ci skip]
2026-06-09 18:18:13 +00:00
Viktor Barzin
c5bda77731 forgejo: survive CI-build registry-push storms (mem 3Gi + working retention)
Heavy in-cluster builds (e.g. tripit buildkit) were taking Forgejo down via
two vectors. Fixes both, without moving Forgejo off the sdc HDD (code-oflt
deferred):

- Memory 1Gi -> 3Gi (requests=limits). Forgejo was OOMKilled (exit 137) under
  registry-push load; VPA upperBound ~1.5Gi was suppressed by the 1Gi cap it
  kept OOMing against. Size for the push spike.

- Activate registry retention (DRY_RUN false). Verified the delete list
  against all running viktor/* images first: 0 running images affected.
  Pruned 478 -> 161 package versions; PVC was at its 50Gi autoresize ceiling.

- FIX broken retention auth: the cleanup PAT was ci-pusher's, but Forgejo
  scopes container packages per-user, so DELETE on viktor/* returned 403 (the
  dry-run only did GETs, hiding it). Repointed forgejo_cleanup_token to
  viktor's write:package PAT. Retention had never actually worked.

- Protect buildkit *cache* tags from retention (cleanup.sh keep-set) so the
  gentler-builds layer cache survives daily pruning.

[ci skip] — already applied via scripts/tg.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 14:36:17 +00:00
Viktor Barzin
1e6e5c4ee9 t3code: enable t3-autoupdate.timer from the hourly provisioner
The unit files (t3-autoupdate.{timer,service,sh}) were committed but nothing
ever enabled the timer, so it sat `disabled` and every t3-serve@ instance
silently froze on an old t3 build (all users were on v0.0.24 while nightly was
0.0.25-nightly.20260608). Enable it from the hourly reconciler (not the
once-at-provision setup-devvm.sh) so it self-heals if ever disabled again.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 14:09:55 +00:00
Viktor Barzin
fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00
Viktor Barzin
6d224861c4 stem95su: scheduled Drive->site sync CronJob (every 10m)
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.

Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:42:26 +00:00
475 changed files with 40244 additions and 9211 deletions


@ -7,6 +7,7 @@ Control and query Home Assistant entities on ha-sofia.viktorbarzin.me.
import argparse
import json
import os
import subprocess
import sys
from urllib.parse import urljoin
@ -17,13 +18,29 @@ except ImportError:
    print(" pip install requests")
    sys.exit(1)
# Configuration from environment variables (ha-sofia specific)
HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/")
HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN")
if not HA_URL or not HA_TOKEN:
    print("ERROR: HOME_ASSISTANT_SOFIA_URL and HOME_ASSISTANT_SOFIA_TOKEN environment variables must be set.")
    print("These should be set when activating the Claude venv (~/.venvs/claude)")
def _token_from_homelab():
    """Resolve the token via the homelab CLI when the env var isn't set, so the
    script works from any directory / unprovisioned session (see ADR-0012)."""
    try:
        out = subprocess.run(
            ["homelab", "ha", "token", "--instance", "sofia"],
            capture_output=True, text=True, timeout=30)
        if out.returncode == 0 and out.stdout.strip():
            return out.stdout.strip()
    except Exception:
        pass
    return None
# Configuration: prefer env vars (set by the Claude venv); otherwise fall back to
# defaults + the homelab CLI so the script is not cwd/env dependent (ADR-0012).
HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/") or "https://ha-sofia.viktorbarzin.me"
HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN") or _token_from_homelab()
if not HA_TOKEN:
    print("ERROR: no ha-sofia API token available.")
    print("Set HOME_ASSISTANT_SOFIA_TOKEN, or ensure `homelab ha token` works (kubeconfig reachable).")
    sys.exit(1)
HEADERS = {


@ -5,17 +5,26 @@
## Applications (11)
| Application | Provider Type | Auth Flow |
|-------------|--------------|-----------|
| Cloudflare Access | OAuth2/OIDC | explicit consent |
| Cloudflare Access | OAuth2/OIDC | implicit consent |
| Domain wide catch all | Proxy (forward auth) | implicit consent |
| Forgejo | OAuth2/OIDC | explicit consent |
| Forgejo | OAuth2/OIDC | implicit consent |
| Grafana | OAuth2/OIDC | implicit consent |
| Headscale | OAuth2/OIDC | explicit consent |
| Immich | OAuth2/OIDC | explicit consent |
| Headscale | OAuth2/OIDC | implicit consent |
| Immich | OAuth2/OIDC | implicit consent |
| Kubernetes | OAuth2/OIDC (public) | implicit consent |
| Kubernetes Dashboard | OAuth2/OIDC (confidential) | implicit consent |
| linkwarden | OAuth2/OIDC | explicit consent |
| linkwarden | OAuth2/OIDC | implicit consent |
| Vault | OAuth2/OIDC | implicit consent |
| wrongmove | OAuth2/OIDC | implicit consent |
> **2026-06-10 — every provider now uses implicit consent.** Cloudflare
> Access (pk 9), Forgejo (20), Immich (1), Headscale (13), linkwarden (8)
> and Vault (53) were switched from
> `default-provider-authorization-explicit-consent` via the API (these
> providers are UI-managed, not in TF). All are first-party apps; the
> expiring consent screen (re-shown every 4 weeks per app) only slowed
> first-time signin.
> **Kubernetes Dashboard** (TF-managed in `stacks/k8s-dashboard/authentik.tf`):
> confidential client `k8s-dashboard`, built for seamless dashboard SSO via
> oauth2-proxy. **Currently IDLE** — the apiserver rejects all OIDC tokens (see
@ -60,8 +69,27 @@
- All sources use `invitation-enrollment` as enrollment flow (new users require invitation)
## Authorization Flows
- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen
- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects
- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen — no provider uses it since 2026-06-10
- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects — used by ALL providers
## Authentication Flow (single-screen login, 2026-06-10)
`default-authentication-flow` bindings: identification (order 10) →
mfa-validation (order 30) → user-login (order 100). The identification
stage (`default-authentication-identification`, pk
`32aca5ab-106e-43f4-a4cc-4513d80e57f3`) has `password_stage` set to
`default-authentication-password`, so username + password render on ONE
screen (one round trip instead of two). The previously separate
password-stage binding at order 20 (pk `0fc677db-a23f-4ee7-8648-da342e14573b`)
was DELETED via the API — authentik requires removing it when the
identification stage embeds the password field. `password_stage` is pinned in
Terraform (`authentik_stage_identification.default_identification` in
`stacks/authentik/authentik_provider.tf`); all other stage fields stay
UI-managed via `ignore_changes`. Social-login buttons remain on the same
screen and bypass the password field, so Google/GitHub/Facebook users are
unaffected. If a future authentik upgrade/blueprint re-adds the order-20
binding, users would briefly see a second password prompt — delete the
binding again.
## Invitation Enrollment Flow
Slug: `invitation-enrollment` | PK: `7d667321-2b02-4e16-8161-148078a8dac1`
@ -138,7 +166,8 @@ Pinned via Terraform in `stacks/authentik/`:
| Knob | Value | Surface | Effect |
|------|-------|---------|--------|
| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. |
| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. Used by password login (`default-authentication-flow`) AND passkey login (`webauthn` flow — both terminate on this stage). |
| `UserLoginStage.session_duration` on `default-source-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_source_login` in `authentik_provider.tf` (imported 2026-06-20, id `4c6977d2-…`) | **Social logins** (Google/GitHub/Facebook, via `default-source-authentication-flow`). Was the provider default `seconds=0`, which fell back to `UNAUTHENTICATED_AGE=hours=2` — so social logins expired every **2h** while password/passkey lasted 4 weeks. Pinned `weeks=4` on 2026-06-20 to make all login paths consistent. (Surfaced when the 2026-06-18 passkey wipe forced fallback to Google login → "re-login multiple times daily".) |
| `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. **Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. |
| `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |
@ -149,7 +178,19 @@ Notes:
- The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
- `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
- The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
- The `unauthenticated_age` env var is injected via `server.env` / `worker.env` (not `authentik.sessions.unauthenticated_age`) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. The same applies to the existing `authentik.cache.*`, `authentik.web.*`, `authentik.worker.*` blocks (currently inert; live values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced).
## WebAuthn / Passkeys (2026-06-20)
- **Passkey devices live in the DB, NOT Terraform** (`WebAuthnDevice` model). They are user-owned; no TF resource or blueprint manages them. Re-enroll via the user settings UI (Authentik → Settings → MFA Devices → register a security key / passkey).
- **2026-06-18 wipe (root cause of the "WebAuthn broke" incident):** all 6 of Viktor's passkeys were deleted (`WebAuthnDevice.objects.count()` → 0) at 19:27 by an **ad-hoc tripit passkey E2E test** run from the devvm (`python-httpx/0.28.1`, as `akadmin`). The test cleanup did `GET /core/users/?search={demo}` (a **fuzzy** search) then `DELETE /api/v3/authenticators/admin/webauthn/{pk}/` for each device of `users[0]` — but `users[0]` resolved to the **real** account, not the intended demo user. **Lesson:** any future passkey-test cleanup MUST exact-match the demo user (`username == demo`), never `users[0]` of a fuzzy `?search=`. It was a one-off ad-hoc script (no committed/scheduled copy), so nothing auto-re-deletes — re-enrollment is safe.
- **Passkey login path itself is intact:** the identification stage's `passwordless_flow` points to the `webauthn` flow (UI-managed, in `ignore_changes`); the break was purely the missing device records.
- **Provider-schema gotcha:** the pinned authentik TF provider's `authentik_stage_identification` resource exposes **no** `webauthn_stage` or `enable_remember_me` attribute (they exist on the app *model*, not in the provider schema). Do NOT add them to `ignore_changes`: `tg plan` errors with `Unsupported attribute`. They are purely UI/app-managed. (Commit `4e882989` removed them for exactly this reason; re-adding breaks every apply.)
- ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
- **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
- **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin.
- **Liveness budget (2026-06-10):** `server.livenessProbe` = 6×10s, 5s timeout (chart default 3×10s/3s kill-loops pods that queue on the DB migration advisory lock during rolling restarts).
- **PgBouncer (2026-06-10):** `idle_transaction_timeout=300` reaps ghost `idle in transaction` sessions (a killed pod mid-migration otherwise holds the migration advisory lock forever, serializing all boots); the deployment carries a config-checksum annotation so ini changes roll the pods. Do NOT set `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE` — session-mode PgBouncer pins persistent conns 1:1 (pool saturation).
- **Static assets (2026-06-10):** a second `ingress_factory` (`module.ingress-static`, path `/static` on the authentik host) attaches the `authentik-static-cache-headers` middleware → `Cache-Control: public, max-age=31536000, immutable`. Authentik itself serves no max-age; assets are version-fingerprinted so immutable is safe. Mainly helps split-horizon internal users (no Cloudflare edge cache on the direct path).
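
The passkey-wipe lesson in the WebAuthn notes above (exact-match the demo user, never take `users[0]` of a fuzzy `?search=`) can be sketched in Python. The helper name and the sample records are hypothetical; the only assumption about the API is that each search result carries a `username` field.

```python
def pick_demo_user(search_results, demo_username):
    """Return the single user whose username equals `demo_username` exactly.

    A fuzzy search may return several accounts in any order, so the first
    hit is never trusted; anything other than exactly one match aborts.
    """
    matches = [u for u in search_results if u.get("username") == demo_username]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one user named {demo_username!r}, found {len(matches)}"
        )
    return matches[0]


# Illustrative fuzzy-search response: the real account sorts before the demo one.
results = [
    {"pk": 1, "username": "demo-owner"},
    {"pk": 2, "username": "demo"},
]
target = pick_demo_user(results, "demo")
```

Raising when the match is missing or ambiguous means a cleanup step stops before any `DELETE` is issued, rather than falling through to whichever account happened to be listed first.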
## Upgrade Validation Checklist
@ -161,8 +202,9 @@ Run after **any** of these:
The fragile surfaces are the `kubernetes_json_patches` and the `Outpost.managed` field — both rely on assumptions that can silently break across upgrades. The checklist exercises the same path the alerts watch, so it doubles as a smoke test for the alerts.
```bash
# 1. Service routes to the outpost pod (NOT the server pods).
# Empty endpoints => auth-proxy fallback fires; expected: ONE pod IP, ports 9000/9300/9443.
# 1. Service routes to the outpost pods (NOT the server pods).
# Empty endpoints => auth-proxy fallback fires; expected: TWO pod IPs
# (kubernetes_replicas=2 since 2026-06-10), ports 9000/9300/9443.
kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost
# 2. Service selector still excludes the server pods. Expected: includes


@ -92,19 +92,21 @@ Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB
| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
|------|------|--------|------|-----|---------|------|-------|
| 101 | pfsense | running | 8 | 4GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall |
| 102 | devvm | running | 16 | 24GB | vmbr1:vlan10 | 100G | Development VM + t3code Workstation host. 8G swapfile (swappiness=10). Capacity budget: ~4-5G RAM/active user, max ~3-4 concurrent active Claude sessions. NOT Terraform-managed. |
| 102 | devvm | running | 16 | 24GB | vmbr1:vlan10 | 100G | Development VM + t3code Workstation host. 14G swap (8G /swapfile + 6G /swapfile2, grown 2026-06-10; swappiness=10). Capacity budget: ~4-5G RAM/active user, max ~3-4 concurrent active Claude sessions. NOT Terraform-managed. Disk controller: `virtio-scsi-single` + `scsi0 iothread=1,aio=threads` staged 2026-06-11 after the QEMU I/O stall (was `scsihw: lsi`, the only VM on the legacy path — see `docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md`); applies at next cold stop→start. |
| 103 | home-assistant | running | 8 | 8GB | vmbr0 | 64G | HA Sofia, net0(vlan10) disabled, SSH: vbarzin@192.168.1.8 |
| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup (unused) |
| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
| 201 | k8s-node1 | running | 16 | 32GB | vmbr1:vlan20 | 256G | GPU node, Tesla T4 |
| 202 | k8s-node2 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
| 203 | k8s-node3 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
| 204 | k8s-node4 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
| 200 | k8s-master | running | 8 | 32GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
| 201 | k8s-node1 | running | 16 | 48GB | vmbr1:vlan20 | 256G | GPU node, Tesla T4 |
| 202 | k8s-node2 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker |
| 203 | k8s-node3 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker |
| 204 | k8s-node4 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker |
| 205 | k8s-node5 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker (10.0.20.105, joined 2026-05-26) |
| 206 | k8s-node6 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker (10.0.20.106, joined 2026-05-26) |
| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM |
| ~~9000~~ | ~~truenas~~ | **stopped/decommissioned** | — | — | — | — | NFS migrated to Proxmox host (192.168.1.127) at `/srv/nfs` and `/srv/nfs-ssd` |
**Total VM RAM allocated**: 196 GB of 272 GB (72%) — 76 GB free for future VMs (devvm corrected 8GB→24GB 2026-06-08)
**Total VM RAM allocated**: ~288 GB nominal across running VMs vs 272 GB physical — OVERCOMMITTED (ballooning enabled on K8s workers, host swap in use; see memory id=535/2543). K8s rows live-verified via `kubectl get nodes` capacity 2026-06-11 (master 32G, node1 48G, node2-6 32G; the old 16/32/24GB figures predated the 2026-04-02 resize and node5/6).
## VM Templates
| VMID | Name | Purpose |

File diff suppressed because one or more lines are too long


@ -11,8 +11,8 @@ description: |
There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
Always use Home Assistant for smart home control.
author: Claude Code
version: 2.0.0
date: 2026-02-07
version: 2.1.0
date: 2026-06-24
---
# Home Assistant Control
@ -44,6 +44,12 @@ There are **two** Home Assistant instances:
- Environment variables for each instance:
- **ha-london**: `HOME_ASSISTANT_URL` and `HOME_ASSISTANT_TOKEN`
- **ha-sofia**: `HOME_ASSISTANT_SOFIA_URL` and `HOME_ASSISTANT_SOFIA_TOKEN`
- If those env vars aren't set (e.g. you're not in the infra repo / Claude venv), don't hand-roll a `kubectl | base64 | jq` token pipeline — use the global **`homelab` CLI** instead (on `$PATH` in any directory):
## homelab CLI (preferred — works from any directory)
- **Token**: `homelab ha token [--instance sofia|london]` resolves the long-lived API token live from the cluster. Use it directly in curl: `curl -H "Authorization: Bearer $(homelab ha token)" https://ha-sofia.viktorbarzin.me/api/states`. (The `home-assistant-sofia.py` script also auto-falls-back to this when its env var is unset.)
- **Host shell** (ha-sofia): `homelab ha ssh -- <cmd>` runs a command on the HA host with deterministic non-interactive ssh (no host-key prompt) — e.g. `homelab ha ssh -- "sudo docker ps"`, `homelab ha ssh -- "cat /config/configuration.yaml"`. Replaces bespoke `ssh -o StrictHostKeyChecking=no …` invocations.
- **Cluster metrics/logs** (not HA-specific): prefer `homelab metrics query "<promql>"` / `homelab logs query "<logql>"` over hand-rolled `curl …/api/v1/query`, and `homelab claim`/`release` over calling `scripts/presence` directly.
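A minimal sketch of the token fallback described above, assuming a hypothetical helper named `ha_token` (not part of the repo). It prefers the documented environment variables and only then calls `homelab ha token`. The instance argument is required here because the CLI's default instance is not stated in this section.

```bash
# Hypothetical helper: resolve a Home Assistant API token for one instance.
# Usage: ha_token sofia|london
ha_token() {
  instance="$1"
  case "$instance" in
    sofia)  token="${HOME_ASSISTANT_SOFIA_TOKEN:-}" ;;
    london) token="${HOME_ASSISTANT_TOKEN:-}" ;;
    *) echo "unknown instance: $instance" >&2; return 2 ;;
  esac
  # Env var unset or empty: fall back to the homelab CLI.
  if [ -z "$token" ]; then
    token="$(homelab ha token --instance "$instance")" || return 1
  fi
  printf '%s\n' "$token"
}
```

It could then be used as `curl -H "Authorization: Bearer $(ha_token london)" https://ha-london.viktorbarzin.me/api/states`.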
## API Control
@ -389,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
## ha-london Knowledge Map
### Overview
- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
- **Location**: London, UK
- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
- **Config path**: `/config/` (requires `sudo` for file access)
- **Platform**: Raspberry Pi 4, HA OS
- **Access from the Sofia devvm**: london is **remote**, so `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
- **Config path**: `/config/`
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
- **Zone**: London (home)
### Dashboards (redesigned 2026-06-24)
**Glossary** (HA terms — keep distinct):
- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
- **Card** = a widget inside a view.
- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
- **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
- **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed` → `detailed` path typo fixed 2026-06-24.)
- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
### Key Systems
#### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@ -418,10 +437,15 @@ Named plugs with power/energy tracking:
- PM1.0/2.5/4.0/10 particulate sensors
- VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
#### 3. Cowboy E-Bike
- `sensor.bike_state_of_charge`: Battery %
- `sensor.bike_total_distance`: Total km
- `sensor.bike_total_co2_saved`: CO2 saved (grams)
#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
- `sensor.classic_performance_remaining_range`: Range km
- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
#### 4. Uptime Monitoring (UptimeRobot)
- `sensor.blog`: blog uptime
@ -440,12 +464,17 @@ Named plugs with power/energy tracking:
- Scripts: `script.start_netflix`, `script.start_stremio`
- Scene: `scene.night` (turns off Livia + Michelle plugs)
### Custom Components
- **cowboy**: Cowboy e-bike integration (HACS)
- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
### Custom Components (HACS integrations)
- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
### HACS frontend cards (plugins)
- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
### Integrations
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
### AI / Voice Assistants
- 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@ -460,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL
- Anca arrival/departure notifications
- Night scene: turns off Livia + Michelle
### Docker Setup
```bash
docker run -d --name homeassistant --privileged \
-e TZ=Europe/London \
-v /home/pi/docker/homeAssistant:/config \
-v /run/dbus:/run/dbus:ro \
--network=host --restart=unless-stopped \
homeassistant/home-assistant:2025.9
```
### Platform (HAOS — ignore any legacy `docker run` snippet)
ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~12 min and resets `sensor.uptime` (use that as the "back up" marker).
### SSH Access
```bash


@ -51,7 +51,7 @@ Exit codes: `0` healthy, `1` attention warranted, `2` stalled / broken.
|---|---|---|---|
| **Apps** | Keel polls every watched Deployment's container registry; rolls on new digest | hourly | Prom (`pending_approvals`, `registries_scanned_total`), Keel pod logs |
| **OS** | `unattended-upgrades` in-release patching; `kured` reboots when `/var/run/reboot-required` is set | daily 02:00-06:00 London | SSH fan-out to all 5 nodes |
| **K8s** | `k8s-version-check` CronJob detects new kubeadm patch/minor; spawns the Job-chain that drains+upgrades node-by-node | daily 12:00 UTC | Pushgateway (`k8s_upgrade_*`), `kubectl get nodes` |
| **K8s** | `k8s-version-check` CronJob detects new kubeadm patch/minor; spawns the Job-chain that drains+upgrades node-by-node | nightly 23:00 UTC | Pushgateway (`k8s_upgrade_*`), `kubectl get nodes` |
The K8s pipeline pushes a small set of gauges to the Prometheus
Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`):
@ -61,8 +61,11 @@ Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`):
- `k8s_upgrade_in_flight` — 0/1
- `k8s_upgrade_started_timestamp` — when the current chain started (0 when idle)
`K8sUpgradeStalled` alert fires when `in_flight=1` and the chain has
been running >90 minutes. The script raises `✗` in the same window.
`K8sUpgradeStalled` fires when `in_flight=1` and the chain has been running
>90 minutes. `K8sUpgradeChainJobFailed` fires when a phase Job terminally
failed — including a **preflight that aborted before `in_flight` was set**
(the gates exit pre-metric). The script raises `✗` for either, and reads the
Jobs directly, so it also catches a Failed preflight that left no metric.
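The `K8sUpgradeStalled` condition can be sketched as plain arithmetic. This is an illustration only, not the alert rule itself: the function name `upgrade_stalled` is hypothetical, and it assumes `k8s_upgrade_started_timestamp` holds Unix seconds.

```bash
# Hypothetical check mirroring the alert condition described above.
# Args: in_flight (0/1), started (Unix seconds, 0 when idle), now (Unix seconds).
# Returns 0 when the chain is in flight and has been running > 90 minutes.
upgrade_stalled() {
  in_flight="$1"; started="$2"; now="$3"
  [ "$in_flight" -eq 1 ] || return 1      # nothing running
  [ "$started" -gt 0 ] || return 1        # idle chains report 0
  [ $(( now - started )) -gt $(( 90 * 60 )) ]
}
```

Note that a preflight which aborts before `in_flight` is set never satisfies this condition, which is why the separate `K8sUpgradeChainJobFailed` alert exists.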
## Status-icon legend
@ -72,7 +75,7 @@ been running >90 minutes. The script raises `✗` in the same window.
| `→` | Update available, not yet applied (K8s patch/minor) |
| `…` | In flight — chain currently running |
| `⚠` | Attention: held-with-bumps, recent errors, pending approvals |
| `✗` | Broken: pod down, alert firing, chain stalled |
| `✗` | Broken: pod down, alert firing, chain stalled, or a chain Job failed |
## Drill-down — when a row trips, what to do
@ -177,6 +180,31 @@ kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- sh -
--header='Content-Type: text/plain'"
```
### K8s `✗ chain failed` — a phase Job terminally failed
`K8sUpgradeChainJobFailed` would fire. Most often a **preflight** that aborted
on a gate (a critical alert firing, a node not Ready, a kubeadm-plan mismatch) —
these exit before `in_flight` is set, so `K8sUpgradeStalled` never sees them, and
the deterministic name + 7d TTL blocked re-spawn (the 2026-06-12 5-day wedge).
```bash
kubectl -n k8s-upgrade get jobs
kubectl -n k8s-upgrade describe job <failed-job> # check the Failed reason
# Preflight abort reasons post to Slack ONLY (not stdout), so Loki won't have
# them. Replay the gate instead — which critical alerts were firing at the
# failure time? (ALERTS{severity="critical"} in Prometheus, query at that ts.)
```
Recovery is now mostly automatic: the detection CronJob and `spawn_next`
re-spawn a terminally-Failed Job on the next cycle (retry-on-failure), so a
transient gate clears within ~24h. To expedite, delete the Failed Job and
trigger detection:
```bash
kubectl -n k8s-upgrade delete job <failed-job>
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check manual-detect-$(date +%s)
```
### K8s `✗ detection stale` — last detection >9 days
```bash


@ -0,0 +1,36 @@
name: Build android-emulator
# ADR-0002: infra-owned image built off-infra on GHA → ghcr (public).
# Large image (Android SDK + emulator); on-demand workload (scaled 0). Rebuilds
# rare → dispatch + path trigger.
on:
push:
branches: [master]
paths:
- 'stacks/android-emulator/docker/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/android-emulator/docker
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/android-emulator:latest
ghcr.io/viktorbarzin/android-emulator:${{ github.sha }}


@ -0,0 +1,39 @@
name: Build chrome-service-browser
# ADR-0002: infra-owned image built off-infra on GHA → ghcr. Playwright base +
# real Google Chrome (proprietary H.264/AAC codecs) for the chrome-service
# browser container, so the noVNC view can play H.264 video (Reels). Rebuilds
# are rare → dispatch + path trigger. NOTE: after the first push, set the ghcr
# package `chrome-service-browser` to PUBLIC (same as chrome-service-novnc) so
# the pod pulls it without credentials.
on:
push:
branches: [master]
paths:
- 'stacks/chrome-service/files/chrome/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/chrome-service/files/chrome
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/chrome-service-browser:latest
ghcr.io/viktorbarzin/chrome-service-browser:${{ github.sha }}


@ -0,0 +1,36 @@
name: Build chrome-service-novnc
# ADR-0002: infra-owned image built off-infra on GHA → ghcr (public).
# Source Dockerfile identical on both git remotes, so the github checkout builds
# the current image. Rebuilds are rare (stable noVNC proxy) → dispatch + path.
on:
push:
branches: [master]
paths:
- 'stacks/chrome-service/files/novnc/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/chrome-service/files/novnc
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/chrome-service-novnc:latest
ghcr.io/viktorbarzin/chrome-service-novnc:${{ github.sha }}

.github/workflows/build-cli.yml vendored Normal file

@ -0,0 +1,41 @@
name: Build infra CLI
# ADR-0002: infra CLI built off-infra on GHA. Replaces the Woodpecker
# build-cli.yml. Pushes to DockerHub (public distribution, kept) + ghcr.
# Not a cluster workload — a distributed tool image.
on:
push:
branches: [master]
paths:
- 'cli/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: cli
platforms: linux/amd64
provenance: false
push: true
tags: |
viktorbarzin/infra:latest
ghcr.io/viktorbarzin/infra-cli:latest
ghcr.io/viktorbarzin/infra-cli:${{ github.sha }}

.github/workflows/build-infra-ci.yml vendored Normal file

@ -0,0 +1,37 @@
name: Build infra-ci
# ADR-0002: the infra CI toolbox image (terraform/terragrunt/sops/kubectl/vault)
# built off-infra on GHA → ghcr (public). BOOTSTRAP-CRITICAL: .woodpecker/default.yml's
# apply step runs in this image. The Woodpecker build-ci-image.yml is kept until a
# ghcr-based apply is proven, then removed.
on:
push:
branches: [master]
paths:
- 'ci/Dockerfile'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: ci
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/infra-ci:latest
ghcr.io/viktorbarzin/infra-ci:${{ github.sha }}

.github/workflows/build-k8s-portal.yml vendored Normal file

@ -0,0 +1,36 @@
name: Build k8s-portal
# ADR-0002 / no-local-builds: k8s-portal (infra-owned Go portal) builds off-infra
# on GHA → public ghcr; Keel polls ghcr:latest and rolls the deployment. Replaces
# the in-cluster .woodpecker/k8s-portal.yml build.
on:
push:
branches: [master]
paths:
- 'stacks/k8s-portal/modules/k8s-portal/files/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/k8s-portal/modules/k8s-portal/files
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/k8s-portal:latest
ghcr.io/viktorbarzin/k8s-portal:${{ github.sha }}

.gitignore vendored

@ -103,3 +103,16 @@ stacks/terminal/clipboard-upload/clipboard-upload
# Plaintext terraform state — NEVER commit (use SOPS-encrypted .tfstate.enc only)
terraform.tfstate
terraform.tfstate.backup
# Per-feature git worktrees (worktree-first workflow — execution.md)
.worktrees/
# Timestamped terraform state backups (terraform.tfstate.<ts>.backup) — plaintext Tier-0
# secrets; created by terraform state ops. The patterns above miss the timestamped form.
terraform.tfstate.*.backup
# Python test artifacts (pytest bytecode cache) — e.g. from
# stacks/k8s-version-upgrade/scripts/test_compat_gate.py
__pycache__/
*.pyc
.pytest_cache/


@ -3,10 +3,6 @@
"ha": {
"type": "http",
"url": "${HA_MCP_URL}"
},
"paperless": {
"type": "http",
"url": "http://paperless-mcp.paperless-mcp.svc.cluster.local/mcp"
}
}
}


@ -0,0 +1,31 @@
# Break-glass: save the ghcr infra-ci image to a tarball on the registry VM
# (10.0.20.10) so it can be `docker load`-ed onto a node if ghcr is ever
# unreachable during a recovery. infra-ci now builds on GHA → ghcr (ADR-0002),
# which is external + node-cached, so this is a belt-and-braces DR artifact —
# run MANUALLY after an infra-ci rebuild (or periodically). Pulls from ghcr
# (public, no login). Recovery: docs/runbooks/forgejo-registry-breakglass.md.
when:
- event: manual
steps:
- name: breakglass-tarball
image: alpine:3.20
failure: ignore
environment:
REGISTRY_SSH_KEY:
from_secret: registry_ssh_key
commands:
- apk add --no-cache openssh-client
- mkdir -p ~/.ssh && chmod 700 ~/.ssh
- printf '%s\n' "$REGISTRY_SSH_KEY" > ~/.ssh/id_ed25519
- chmod 600 ~/.ssh/id_ed25519
- ssh-keyscan -t ed25519 10.0.20.10 >> ~/.ssh/known_hosts 2>/dev/null
- |
ssh -n -o BatchMode=yes root@10.0.20.10 "
set -e
mkdir -p /opt/registry/data/private/_breakglass
IMAGE=ghcr.io/viktorbarzin/infra-ci:latest
docker pull \$IMAGE
docker save \$IMAGE | gzip > /opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz
ls -lh /opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz
"


@ -1,88 +0,0 @@
# Build the CI tools Docker image used by all infra pipelines.
# Triggers on push that touches ci/Dockerfile, or manual (API/UI) so
# rebuilds after a registry incident don't need a cosmetic Dockerfile edit.
when:
- event: push
branch: master
path:
include:
- 'ci/Dockerfile'
- event: manual
steps:
- name: build-and-push
image: woodpeckerci/plugin-docker-buildx
settings:
# Phase 4 of forgejo-registry-consolidation 2026-05-07 —
# registry.viktorbarzin.me dropped, Forgejo is the only target.
repo:
- forgejo.viktorbarzin.me/viktor/infra-ci
dockerfile: ci/Dockerfile
context: ci/
tags:
- latest
- "${CI_COMMIT_SHA:0:8}"
platforms: linux/amd64
logins:
- registry: forgejo.viktorbarzin.me
username:
from_secret: forgejo_user
password:
from_secret: forgejo_push_token
# Post-push integrity check is now redundant with the every-15min
# forgejo-integrity-probe in stacks/monitoring/, which walks
# /v2/_catalog + HEADs every blob across the entire Forgejo registry.
# If a corruption pattern emerges that the periodic probe misses,
# restore a verify step similar to the pre-Phase-4 version (see
# commit 49f4956f) but pointed at forgejo.viktorbarzin.me.
# Break-glass tarball: save the just-pushed infra-ci image to disk on the
# registry VM (10.0.20.10) so we can `docker load` it back into a node
# when Forgejo is unreachable. Pulls from Forgejo (the only registry now).
# Best-effort — failure here doesn't fail the pipeline.
# Recovery procedure: docs/runbooks/forgejo-registry-breakglass.md.
- name: breakglass-tarball
image: alpine:3.20
failure: ignore
environment:
REGISTRY_SSH_KEY:
from_secret: registry_ssh_key
FORGEJO_USER:
from_secret: forgejo_user
FORGEJO_PASS:
from_secret: forgejo_push_token
commands:
- apk add --no-cache openssh-client
- mkdir -p ~/.ssh && chmod 700 ~/.ssh
- printf '%s\n' "$REGISTRY_SSH_KEY" > ~/.ssh/id_ed25519
- chmod 600 ~/.ssh/id_ed25519
- ssh-keyscan -t ed25519 10.0.20.10 >> ~/.ssh/known_hosts 2>/dev/null
- SHA=${CI_COMMIT_SHA:0:8}
- |
ssh -n -o BatchMode=yes root@10.0.20.10 "
set -e
mkdir -p /opt/registry/data/private/_breakglass
IMAGE=forgejo.viktorbarzin.me/viktor/infra-ci:$SHA
echo \$FORGEJO_PASS | docker login forgejo.viktorbarzin.me -u \$FORGEJO_USER --password-stdin
docker pull \$IMAGE
docker save \$IMAGE | gzip > /opt/registry/data/private/_breakglass/infra-ci-$SHA.tar.gz
ln -sfn infra-ci-$SHA.tar.gz /opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz
ls -t /opt/registry/data/private/_breakglass/infra-ci-*.tar.gz \
| grep -v 'latest' | tail -n +6 | xargs -r rm -v
ls -lh /opt/registry/data/private/_breakglass/
"
- name: slack
image: curlimages/curl
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"CI image built: forgejo.viktorbarzin.me/viktor/infra-ci:${CI_COMMIT_SHA:0:8} (and registry-private mirror)\"}" \
"$SLACK_WEBHOOK" || true
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
when:
status: [success]


@ -1,42 +0,0 @@
when:
event: push
clone:
git:
image: woodpeckerci/plugin-git
settings:
attempts: 5
backoff: 10s
steps:
- name: build-image
image: woodpeckerci/plugin-docker-buildx
settings:
username: "viktorbarzin"
password:
from_secret: dockerhub-pat
# Phase 4 of forgejo-registry-consolidation 2026-05-07 —
# registry.viktorbarzin.me:5050 decommissioned. Push to DockerHub
# (the public-facing infra image) AND Forgejo (the cluster pull
# source). Same image, two locations.
repo:
- viktorbarzin/infra
- forgejo.viktorbarzin.me/viktor/infra
logins:
- registry: https://index.docker.io/v1/
username: viktorbarzin
password:
from_secret: dockerhub-pat
- registry: forgejo.viktorbarzin.me
username:
from_secret: forgejo_user
password:
from_secret: forgejo_push_token
dockerfile: cli/Dockerfile
context: cli
auto_tag: true
# cache_from/cache_to removed: registry cache corruption causes
# "short read: expected 32 bytes" BuildKit errors. Inline cache
# will be re-populated once a clean image is pushed.
# cache_from: "registry.viktorbarzin.me:5050/infra:latest"
# cache_to: "type=inline"


@ -19,13 +19,34 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 2
attempts: 5
backoff: 10s
steps:
# Audit feed for the allow-then-audit contribution model: any master push by
# a NON-admin author is surfaced in Slack (Viktor's own pushes are not).
# Runs before apply and never blocks it. Note: [ci skip] commits never reach
# this step (Woodpecker skips the whole pipeline) — hence the rule that
# non-admins must not use [ci skip].
- name: notify-nonadmin-push
image: curlimages/curl
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
commands:
- |
case "$CI_COMMIT_AUTHOR" in
viktor|ViktorBarzin|wizard) echo "admin push — no notify"; exit 0 ;;
esac
SUBJECT=$(echo "$CI_COMMIT_MESSAGE" | head -1 | tr -d '"\\')
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"📝 infra master push by *$CI_COMMIT_AUTHOR*: $SUBJECT\n$CI_REPO_URL/commit/$CI_COMMIT_SHA\"}" \
"$SLACK_WEBHOOK" || true
- name: apply
image: forgejo.viktorbarzin.me/viktor/infra-ci:latest
image: ghcr.io/viktorbarzin/infra-ci:latest
pull: true
backend_options:
kubernetes:
@ -115,6 +136,25 @@ steps:
git fetch --deepen=1 origin master 2>/dev/null || true
fi
# Diff base: prefer the push's true before-state (CI_PREV_COMMIT_SHA).
# HEAD~1 is WRONG for merge commits — it is the first parent (the
# feature-branch side), so the diff shows the OTHER lineage's files
# and silently skips the stacks this push actually changed
# (bit ci-pipeline-health on 2026-06-12, pipeline 128).
DIFF_BASE="HEAD~1"
if [ -n "${CI_PREV_COMMIT_SHA:-}" ] && [ "$CI_PREV_COMMIT_SHA" != "$CI_COMMIT_SHA" ]; then
git cat-file -e "$CI_PREV_COMMIT_SHA^{commit}" 2>/dev/null || git fetch --depth=50 origin master 2>/dev/null || true
# Restarted pipelines after master moved produce REVERSE diffs:
# CI_PREV is then ahead of the checked-out HEAD, which re-applied stale
# trees and reverted a sibling apply on 2026-06-12 (pipeline 148). Only
# use CI_PREV when it is an ancestor of HEAD.
if git cat-file -e "$CI_PREV_COMMIT_SHA^{commit}" 2>/dev/null \
&& git merge-base --is-ancestor "$CI_PREV_COMMIT_SHA" HEAD 2>/dev/null; then
DIFF_BASE="$CI_PREV_COMMIT_SHA"
fi
fi
echo "Diff base: $DIFF_BASE"
# If still no parent, apply all platform stacks as a safe fallback
if ! git rev-parse HEAD~1 >/dev/null 2>&1; then
echo "Cannot determine changed files — applying ALL platform stacks"
@ -122,14 +162,14 @@ steps:
> .app_apply
else
# Check if global files changed (triggers full platform apply)
GLOBAL_CHANGED=$(git diff --name-only HEAD~1 HEAD | grep -E '^(modules/|config\.tfvars|terragrunt\.hcl)' || true)
GLOBAL_CHANGED=$(git diff --name-only "$DIFF_BASE" HEAD | grep -E '^(modules/|config\.tfvars|terragrunt\.hcl)' || true)
if [ -n "$GLOBAL_CHANGED" ]; then
echo "Global files changed — applying ALL platform stacks"
echo "$PLATFORM_STACKS" | tr ' ' '\n' > .platform_apply
else
# Detect platform stacks that changed
git diff --name-only HEAD~1 HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u > .all_changed
git diff --name-only "$DIFF_BASE" HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u > .all_changed
> .platform_apply
while read -r stack; do
if echo "$PLATFORM_STACKS" | grep -qw "$stack"; then
@ -140,7 +180,7 @@ steps:
# Detect app stacks that changed
> .app_apply
git diff --name-only HEAD~1 HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u | while read -r stack; do
git diff --name-only "$DIFF_BASE" HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u | while read -r stack; do
if echo "$PLATFORM_STACKS" | grep -qw "$stack"; then
continue # Skip platform stacks
fi

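The diff-base selection in the apply step above can be exercised in isolation. The sketch below wraps the same two checks (the commit exists locally, and it is an ancestor of `HEAD`) in a hypothetical function `pick_diff_base`. The function name and its argument order are assumptions for illustration, and the fetch-on-miss step from the pipeline is left out.

```bash
# Hypothetical helper: choose the commit to diff against.
# Args: prev (CI_PREV_COMMIT_SHA, may be empty), head_sha (CI_COMMIT_SHA).
# Falls back to HEAD~1 unless prev is a known commit that is an ancestor of HEAD.
pick_diff_base() {
  prev="$1"; head_sha="$2"
  base="HEAD~1"
  if [ -n "$prev" ] && [ "$prev" != "$head_sha" ] \
    && git cat-file -e "${prev}^{commit}" 2>/dev/null \
    && git merge-base --is-ancestor "$prev" HEAD 2>/dev/null; then
    base="$prev"
  fi
  printf '%s\n' "$base"
}
```

Run inside a checkout, `git diff --name-only "$(pick_diff_base "$CI_PREV_COMMIT_SHA" "$CI_COMMIT_SHA")" HEAD` would then list the files the push changed. A restarted pipeline whose previous SHA is ahead of `HEAD` fails the ancestor check and keeps `HEAD~1`.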

@ -9,12 +9,13 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 1
attempts: 3
steps:
- name: detect-drift
image: forgejo.viktorbarzin.me/viktor/infra-ci:latest
image: ghcr.io/viktorbarzin/infra-ci:latest
pull: true
backend_options:
kubernetes:


@ -5,6 +5,7 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 2
steps:


@ -1,49 +0,0 @@
when:
event: push
branch: master
path:
include:
- "stacks/platform/modules/k8s-portal/files/**"
clone:
git:
image: woodpeckerci/plugin-git
settings:
attempts: 5
backoff: 10s
steps:
- name: build-and-push
image: woodpeckerci/plugin-docker-buildx
settings:
username: "viktorbarzin"
password:
from_secret: dockerhub-pat
repo: viktorbarzin/k8s-portal
dockerfile: stacks/platform/modules/k8s-portal/files/Dockerfile
context: stacks/platform/modules/k8s-portal/files
platforms:
- linux/amd64
tag: ["${CI_PIPELINE_NUMBER}", "latest"]
cache_from: "viktorbarzin/k8s-portal:latest"
cache_to: "type=inline"
- name: deploy
image: bitnami/kubectl:latest
commands:
- "kubectl set image deployment/k8s-portal portal=viktorbarzin/k8s-portal:${CI_PIPELINE_NUMBER} -n k8s-portal"
- "kubectl rollout status deployment/k8s-portal -n k8s-portal --timeout=120s"
- "echo 'k8s-portal deployed successfully (build ${CI_PIPELINE_NUMBER})'"
- name: slack
image: curlimages/curl
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"K8s Portal: build #${CI_PIPELINE_NUMBER} ${CI_PIPELINE_STATUS}\"}" \
"$SLACK_WEBHOOK" || true
environment:
SLACK_WEBHOOK:
from_secret: slack_webhook
when:
status: [success, failure]


@ -11,6 +11,7 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 5
steps:


@ -5,6 +5,7 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
attempts: 5
backoff: 10s


@ -23,6 +23,7 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 1
attempts: 3


@ -38,6 +38,7 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
depth: 1
attempts: 3


@ -6,6 +6,7 @@ clone:
git:
image: woodpeckerci/plugin-git
settings:
partial: false
attempts: 5
backoff: 10s


@ -9,7 +9,7 @@
- **Ask before `git push`** — always confirm with the user first
## Execution
- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets)
- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets; passes `-lock-timeout`, default `5m` / `TG_LOCK_TIMEOUT`, so a contended state lock waits instead of failing with `Error acquiring the state lock`)
- **Legacy apply**: `cd stacks/<service> && terragrunt apply --non-interactive` (uses terraform.tfvars)
- **kubectl**: `kubectl --kubeconfig $(pwd)/config`
- **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
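How `scripts/tg` builds the lock-timeout flag is not shown here. A minimal sketch of the described behaviour (default `5m`, overridable through `TG_LOCK_TIMEOUT`) might look like this, with `lock_timeout_flag` as a hypothetical name:

```bash
# Hypothetical helper: print the Terraform lock-timeout flag.
# Uses TG_LOCK_TIMEOUT when set and non-empty, otherwise 5m.
lock_timeout_flag() {
  printf '%s\n' "-lock-timeout=${TG_LOCK_TIMEOUT:-5m}"
}
```

For example, `TG_LOCK_TIMEOUT=15m scripts/tg apply --non-interactive` would wait up to 15 minutes for a contended state lock under this reading.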
@ -90,6 +90,7 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
- **Public domain**: `viktorbarzin.me` (Cloudflare) | **Internal**: `viktorbarzin.lan` (Technitium DNS)
- **Onboarding portal**: `https://k8s-portal.viktorbarzin.me` — self-service kubectl setup + docs
- **CI/CD**: Woodpecker CI — PRs run plan, merges to master auto-apply all stacks
- **CI compute is external (ADR-0002, 2026-06-12)**: builds, tests, lint, and release jobs run on GitHub Actions hosted runners via each repo's GitHub mirror — never on cluster nodes. In-cluster pipelines exist only for steps that need cluster access (Woodpecker `kubectl set image` deploys, terragrunt applies, certbot). Never add an in-cluster build or test pipeline to any repo; the fallback-build pattern was deliberately removed. After pushing anything that fires a build chain, watch it end-to-end (GHA run → Woodpecker deploy → rollout) before calling the change done — verify live state, not the checkmark.
## Key Paths
- `stacks/<service>/main.tf` — service definition
@ -109,7 +110,8 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
- **SQLite on NFS is unreliable** (fsync issues) — always use proxmox-lvm or local disk for databases.
- **NFS mount options**: Always `soft,timeo=30,retrans=3` to prevent uninterruptible sleep (D state).
- **NFS export directory must exist** on the Proxmox host before Terraform can create the PV.
- **Backup (3-2-1)**: Copy 1 = live PVCs on sdc. Copy 2 = sda `/mnt/backup` (PVC file backups, auto SQLite backups, pfSense, PVE config). Copy 3 = Synology offsite (two-tier: sda→`pve-backup/`, NFS→`nfs/`+`nfs-ssd/` via inotify change tracking).
- **Backup (3-2-1)**: Copy 1 = live PVCs on sdc. Copy 2 = sda `/mnt/backup` (PVC file backups, auto SQLite backups, pfSense, PVE config, **VM images via `vzdump-vms`**). Copy 3 = Synology offsite (two-tier: sda→`pve-backup/`, NFS→`nfs/`+`nfs-ssd/` via inotify change tracking).
- **vzdump-vms** (Daily 01:00): live `vzdump --mode snapshot` of hand-managed VMs (NOT in TF) → `/mnt/backup/vzdump/`, keep 3/VMID. `VZDUMP_VMIDS` default `102` (devvm) — the only VM imaged today; before this (2026-06-09) no VM was ever imaged. NOT in the incremental offsite manifest; monthly full pass mirrors it. See `docs/architecture/backup-dr.md`.
- **daily-backup** (Daily 05:00): Auto-discovered BACKUP_DIRS (glob), auto SQLite backup (magic number + `?mode=ro`), pfSense, PVE config. No NFS mirror step (NFS syncs directly to Synology via inotify).
- **offsite-sync-backup** (Daily 06:00): Step 1: sda→Synology `pve-backup/`. Step 2: NFS→Synology `nfs/`+`nfs-ssd/` via `rsync --files-from` (inotify change log). Monthly full `--delete`.
- **nfs-change-tracker.service**: inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log`. Incremental syncs complete in seconds.
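
The "magic number + `?mode=ro`" detection mentioned in the daily-backup bullet can be illustrated with a minimal Go sketch. It assumes the check means matching the 16-byte header that every SQLite 3 file starts with, and that `?mode=ro` refers to a SQLite URI filename. The names `isSQLiteHeader`, `looksLikeSQLite`, and `readOnlyURI` are hypothetical and are not taken from the backup script.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// sqliteMagic is the 16-byte header string at the start of a SQLite 3 file.
var sqliteMagic = []byte("SQLite format 3\x00")

// isSQLiteHeader reports whether head begins with the SQLite 3 magic string.
func isSQLiteHeader(head []byte) bool {
	return bytes.HasPrefix(head, sqliteMagic)
}

// looksLikeSQLite reads the first 16 bytes of path and checks the magic.
// Files shorter than the header are reported as "not SQLite", not as errors.
func looksLikeSQLite(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()
	head := make([]byte, len(sqliteMagic))
	if _, err := io.ReadFull(f, head); err != nil {
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return false, nil
		}
		return false, err
	}
	return isSQLiteHeader(head), nil
}

// readOnlyURI builds a SQLite URI filename that opens path read-only.
// Assumption: path contains no characters that need URI escaping.
func readOnlyURI(path string) string {
	return "file:" + path + "?mode=ro"
}

func main() {
	for _, p := range os.Args[1:] {
		ok, err := looksLikeSQLite(p)
		if err != nil {
			fmt.Fprintln(os.Stderr, "skip:", err)
			continue
		}
		if ok {
			fmt.Println("sqlite:", readOnlyURI(p))
		}
	}
}
```

Matching on content instead of file extension is what would let a glob-discovered directory be scanned without a per-service list of database paths.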
@ -225,7 +227,69 @@ Per-workload opt-out: add the label `keel.sh/policy: never` on the Deployment me
4. Viktor reviews → CI applies → Slack notification
5. Portal: `https://k8s-portal.viktorbarzin.me/onboarding` for full guide
### Non-admin workstation users — the AGENT does the git work
Non-admin devvm users (power-user / namespace-owner tiers) may not know git at
all. Their agent handles every version-control step silently — never ask them
to commit, push, pull, or open a PR, and never surface git jargon at them.
Their infra clone arrives preconfigured: git identity, a `forgejo` remote
authenticated via `~/.git-credentials`, and `master` tracking `forgejo/master`
(auto-freshened hourly and at session launch, fast-forward only).
Two per-user layouts exist (`code_layout` in
`scripts/workstation/roster.yaml`): `single` (the default) — `~/code` IS the
locked infra clone — and `workspace`: `~/code` is a plain directory of
per-project clones: the infra clone at `~/code/infra`, plus each roster
`repos` entry (e.g. `~/code/tripit`) cloned from Forgejo `viktor/<name>` with
the user's own PAT. The reconcile auto-migrates a single-layout `~/code` when
a user is flipped to `workspace`, and keeps every clone fresh either way.
The model is **allow-then-audit** (Viktor, 2026-06-10): whitelisted users (emo)
push straight to `master` — no PR gate — and the record of *what changed and
why* is what matters. Force-push is disabled for everyone, so master history
is append-only.
**Feature-sized work is worktree-first** (org rule, 2026-06-10): develop in an
isolated worktree (`.worktrees/<topic>`, branch `<os-user>/<topic>` off
`forgejo/master`) so concurrent agent sessions never collide in the clone, then
land by merging latest master into the branch and pushing it
(`git push forgejo HEAD:master`, or the PR fallback below if not whitelisted) —
the audit-trail rules below apply to the branch's commit messages all the same.
Locked (git-crypt) clones can use plain `git worktree add`. Trivial
single-commit fixes may be committed directly on a clean `master`. Full
lifecycle: `~/.claude/rules/execution.md` §3.
To land a finished change from such a clone:
1. Commit on `master`. **The commit message is the audit trail** — this matters
more than the change itself:
- subject: what changed, specific ("ha-sofia: lower fan curve bias to -5")
- body: WHY, in plain words — paraphrase the user's actual request and any
reasoning ("Emil asked for quieter fans in the evening; curve was
overshooting after the 2026-06-08 redesign")
2. `git push forgejo master`. If rejected non-fast-forward: `git pull --rebase
forgejo master` and push again.
3. **Never use `[ci skip]`** as a non-admin — it hides the change from the
Slack audit feed; a no-op CI apply on a docs-only commit is harmless.
4. Leave the clone on clean `master` so auto-refresh keeps working.
5. Tell the user in plain language what happened. Stack changes are
auto-applied by CI — verify the live result with the user's read-only
kubectl before saying "it's live".
If a push to `master` is rejected by branch protection (user not on the
whitelist — e.g. new users before Viktor grants it), fall back to a
`<os-user>/<short-topic>` branch + PR with the user's own PAT
(`write:repository` suffices — verified 2026-06-10):
```bash
TOK=$(sed -E 's#https://[^:]+:([^@]+)@.*#\1#' ~/.git-credentials)
curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json' \
https://forgejo.viktorbarzin.me/api/v1/repos/viktor/infra/pulls \
-d '{"title":"<title>","head":"<os-user>/<short-topic>","base":"master","body":"<what + why>"}'
```
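
For agents that prefer not to shell out to `sed`, the token extraction above can be sketched in Go. This is a minimal illustration, assuming `~/.git-credentials` holds lines of the form `https://user:token@host` as written by git's `store` credential helper; `tokenFromCredentialLine` is a hypothetical helper name and the sample line is made up.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// tokenFromCredentialLine returns the password component of a single
// https://user:token@host credential line. Unlike the sed expression, it
// also percent-decodes the token, because url.Parse decodes userinfo.
func tokenFromCredentialLine(line string) (string, bool) {
	u, err := url.Parse(strings.TrimSpace(line))
	if err != nil || u.User == nil {
		return "", false
	}
	return u.User.Password()
}

func main() {
	// Sample line only; a real caller would read ~/.git-credentials.
	tok, ok := tokenFromCredentialLine("https://emo:example-token@forgejo.viktorbarzin.me")
	fmt.Println(tok, ok)
}
```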
## Common Operations
- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply <stack>` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release <kind>:<name>`, `homelab work start|land|clean <topic>` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status <app>` (read; `<app>` defaults to the namespace, target to `deploy/<app>`), `homelab k8s db <app> [--mysql] -- "<SQL>"`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall "<context>"` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait <ns>/<deploy> [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check <host> [path]` (external-CF vs internal-LB reachability), `dns lookup <name>` (Technitium vs public diff), `metrics query "<promql>"` / `metrics alerts` (Prometheus via LB), `logs query "<logql>" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Usage telemetry (v0.6): every dispatched verb fire-and-forgets a Loki line (`{user,verb}` + exit only, NO args/secrets; opt-out `HOMELAB_TELEMETRY=0`); `homelab usage top [--since][--user]` ranks verb usage across all users — evidence for what to build next, queryable without reading anyone's home. 
Home Assistant (v0.7): `homelab ha token [--instance sofia|london]` (prints the long-lived API token, resolved live from k8s Secret `openclaw/openclaw-secrets` — use as `curl -H "Authorization: Bearer $(homelab ha token)"`), `homelab ha ssh [--instance sofia|london] -- <cmd>` (run a command on the HA host; deterministic non-interactive ssh, the invoking user's `~/.ssh/id_ed25519`, sofia=`vbarzin@192.168.1.8` default) — entity state/control stays with the `ha` MCP, these cover only what an API-only MCP can't (token + host shell). Full docs: `cli/README.md`.
- **Deploy new service**: Use `stacks/<existing-service>/` as template. Create stack, add DNS in tfvars, apply platform then service.
- **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
- **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf.


@ -117,9 +117,17 @@ The bare-metal load-balancer that assigns external IPs to `type=LoadBalancer` Se
_Avoid_: calling `.200` "the cluster IP" or assuming all ingress shares one LB IP.
**Calico**:
The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs; live flow observability via **Goldmane / Whisker**). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
_Avoid_: "firewall" (it's pod-level policy, not a perimeter); conflating a Calico **NetworkPolicy** (enforced in the data path) with a **Kyverno policy** (enforced at admission) — different layers.
**Service identity**:
How a **Service** is named in flow/audit data — its **namespace** is the primary identity (Goldmane stamps it natively, and "one Service ≈ one namespace" holds for ~87 namespaces), refined by an explicit identity label (e.g. `service-identity`) only in the handful of genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). Deliberately NOT a per-Service **ServiceAccount** (deferred — 56% of pods share `default`; revisit only if principal-based enforcement or mTLS is adopted) and NOT a SPIFFE/mesh identity (rejected — attribution-grade audit on a trusted single-tenant cluster doesn't justify a mesh).
_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
**Goldmane / Whisker**:
Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). Durable history requires emitting Goldmane flows to **Loki**; the in-memory buffer alone is not an audit trail.
_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
### Storage
**proxmox-lvm-encrypted**:
@ -149,7 +157,7 @@ _Avoid_: bare "backup" without saying which copy you mean (a service is "backed
**CNPG** / **pg-cluster**:
**CNPG** is the CloudNativePG operator; **`pg-cluster`** is the Postgres cluster it manages — the shared Postgres substrate. Backs Tier-1 Terraform state (`pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`) and ~12 application databases, reached through **PgBouncer** (a **critical-path Service**) for connection pooling; app credentials rotate via the `vault-database` ClusterSecretStore.
_Avoid_: "the database" (many DBs share one cluster); the legacy `postgresql.dbaas` Service (no endpoints — dead); conflating the CNPG operator with the `pg-cluster` it manages.
_Avoid_: "the database" (many DBs share one cluster); the legacy `postgresql.dbaas` Service for NEW work (it is a live compatibility alias selecting the CNPG primary — authentik's PgBouncer still uses it — but `pg-cluster-rw` is the canonical name); conflating the CNPG operator with the `pg-cluster` it manages.
### Secrets
@ -169,8 +177,24 @@ A user-managed secret committed to a Stack directory as `sealed-*.yaml`. Distinc
### CI/CD
**GHA build + Woodpecker deploy**:
The split where Docker images are built+pushed by GitHub Actions and Woodpecker only runs `kubectl set image` on a deploy-only pipeline. Repos that can't fit GHA limits stay on Woodpecker for build too.
_Avoid_: bare "Woodpecker pipeline" — say "build" or "deploy".
The split where every owned image is built+pushed by GitHub Actions and Woodpecker only runs `kubectl set image` on a deploy-only pipeline (ADR-0002). Woodpecker never builds images.
_Avoid_: bare "Woodpecker pipeline" — say "build" or "deploy"; "fallback build" (the in-cluster fallback path was removed by ADR-0002).
**Canonical repo**:
The Forgejo `viktor/<name>` repo — the only place commits land, workflow files included. Every first-party repo is Forgejo-canonical *except* an explicit set of **GitHub-first repos**. A clone keeps **only** the canonical remote (ADR-0003): the **GitHub mirror** is not a second push target.
_Avoid_: "upstream" (ambiguous); committing anywhere else; keeping both remotes on a clone and hand-pushing to each (the dual-push habit that caused the 2026-06 divergence — ADR-0003).
**GitHub mirror**:
The GitHub repo a **Canonical repo** push-mirrors to, one-way (Forgejo's `push_mirrors`, `sync_on_commit`), so GitHub Actions can build from it; anything committed on the mirror is silently overwritten by the next sync — and enabling the mirror **force-overwrites** the GitHub side, so a diverged GitHub-only commit must be merged back into Forgejo *before* the mirror is turned on or it is lost.
_Avoid_: treating it as a second writable remote; bare "the GitHub repo" without saying mirror.
**GitHub-first repo**:
The deliberate exception to the **Canonical repo** rule — a repo whose canonical home is GitHub, so it sits outside the mirror policy. Two kinds: third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`), and a first-party repo intentionally kept public on GitHub (`health`). Single GitHub remote, never dual-pushed.
_Avoid_: adding a Forgejo remote "for consistency"; treating one as a **Canonical repo**.
**Forgejo registry**:
Forgejo's built-in container registry — since ADR-0002 a frozen archive holding one last-known-good tag per **Service**, not a build target; owned images live on ghcr.io.
_Avoid_: "private registry" (collides with the registry VM's pull-through caches); pushing new images to it.
**Keel**:
The **poll-driven** rollout orchestrator — watches registries for new image tags and rolls the matching Deployments automatically. The actor behind "auto-upgrade" for upstream images, and a redundant net for owned apps (already rolled on push by **Woodpecker deploy**).
@ -192,6 +216,7 @@ A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content
- A **proxmox-lvm-encrypted** PVC binds to one Node at a time (RWO) and requires a Service-level backup CronJob; an **NFS volume** is RWX and is backed up at the host level via rsync.
- **State tier** and **Namespace tier** are orthogonal — a Tier 0 Stack can deploy a Service into any Namespace tier and vice versa.
- A **Service**'s image reaches the cluster via **Woodpecker deploy** (push-driven, on commit) or **Keel** (poll-driven, on a new registry tag); **Diun** only notifies. Operator-managed StatefulSets are rolled by neither.
- An owned **Service**'s image is built by GitHub Actions from the **Canonical repo**'s **GitHub mirror** and hosted on ghcr.io (ADR-0002); the **Forgejo registry** keeps only a frozen last-known-good tag per **Service**.
- Tier-1 **State tier** state and ~12 app databases share one **CNPG** `pg-cluster`, reached through **PgBouncer**; their credentials rotate via the `vault-database` store.
## Example dialogue
@ -211,3 +236,4 @@ A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content
- **"secret"** spans Vault entries, K8s Secret objects, **ExternalSecrets**, and **Sealed Secrets**. Always specify which.
- **"proxied"** / **"non-proxied"** refer to Cloudflare's CDN posture for a DNS record, _not_ Anubis or forward-auth layering.
- **"policy"** spans **Kyverno policy** (admission-time mutate/generate/validate), **Calico NetworkPolicy** (data-path ingress/egress), Vault policy (KV access), and K8s RBAC. Always qualify which engine.
- **"registry"** spans three things: ghcr.io (where owned images live, ADR-0002), the **Forgejo registry** (frozen last-known-good archive), and the registry VM's pull-through caches (read-only proxies of upstream registries). Name which one.


@ -1,2 +1,224 @@
# What is this?
This is a CLI to manipulate files in the terraform repo and commit and push them
# homelab
`homelab` is the unified, agent-facing CLI for operating this homelab — one
composable, JSON-capable surface for the operations agents run over and over,
discovered progressively at runtime. It is grown **in place** from this
directory (the former `infra-cli`), and the legacy webhook use-cases still work
(see below).
It encodes *actions*, never *judgment*: methodology (debugging, TDD, review) and
third-party/owned MCP servers (e.g. phpIPAM) are deliberately out of scope.
## Usage
```
homelab <command> [args]
homelab manifest [--json] # list every verb + its read/write tier (discovery entrypoint)
homelab version
```
### v0.1 verbs — the infra inner-loop
| Command | Tier | What it does |
|---|---|---|
| `claim <kind>:<name> --purpose "…"` | write | claim a shared resource on the presence board (wraps `scripts/presence`) |
| `release <kind>:<name>` | write | release a presence claim |
| `tf plan <stack>` | read | `scripts/tg plan` for a stack (resolved from cwd) |
| `tf validate <stack>` | read | `scripts/tg validate` |
| `tf fmt <stack>` | read | `terraform fmt -recursive` on the stack |
| `tf force-unlock <stack> <lock-id>` | write | release a stuck state lock |
| `tf apply <stack>` | write | `scripts/tg apply` — auto-claims `stack:<name>`, always releases, warns it's out-of-band |
| `work start <topic>` | write | create `.worktrees/<topic>` on `<user>/<topic>` off `<remote>/master`; enter with native `EnterWorktree` |
| `work land [--verify-cmd "…"] [--no-verify]` | write | merge master in → verify → push `HEAD:master` (non-ff retry; PR fallback) |
| `work clean <topic>` | write | remove a task's worktree + branch (run from the main checkout) |
### v0.2 verbs — Kubernetes
Built on an **app→namespace→pod resolver**: `<app>` defaults to the namespace
(most namespaces hold one app); the target defaults to `deploy/<app>` and lets
kubectl resolve the pod. Override with `-n`/`--pod`/`-c`/`-l`/`--tty`. Uses the
ambient kubeconfig.
| Command | Tier | What it does |
|---|---|---|
| `k8s status [ns]` | read | pods (wide) + recent non-Normal events (`-A` if no ns) |
| `k8s get <ns> <resource> […]` | read | `kubectl -n <ns> get …` passthrough |
| `k8s logs <app>` | read | logs for `deploy/<app>` (`--tail` default 200; `-c`/`--previous`/`--since`/`-l`) |
| `k8s describe <app> [resource]` | read | describe the deployment (or an explicit resource) |
| `k8s debug <app>` | read | one-shot triage: pods + workloads + describe + recent logs + events |
| `k8s pf <app> <local:remote> [target]` | read | port-forward to `svc/<app>` (or an explicit target) |
| `k8s rollout-status <app>` | read | `rollout status deploy/<app>` |
| `k8s db <app> [--mysql] [--db N] -- "<SQL>"` | write | exec into the dbaas DB (PG `pg-cluster-rw`, or MySQL with env-password wrapper) |
| `k8s exec <app> [--tty] -- <cmd>` | write | exec in the app's pod |
| `k8s restart <app>` | write | `rollout restart deploy/<app>` then wait for status |
| `k8s rm-pod <name> -n <ns> [--job] [--force]` | write | delete a stuck **pod/job only** |
Config-mutation verbs (`apply`/`edit`/`patch`/`scale`/`create`) are intentionally
**not** exposed — they stay raw `kubectl`, per the Terraform-only policy.
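
The app→namespace→pod defaulting described at the top of this section can be sketched as follows. This is an illustration under stated assumptions, not the CLI's actual code: `target`, `resolveTarget`, and `logsArgs` are hypothetical names, and only the `-n` and `--pod` overrides are modelled.

```go
package main

import (
	"fmt"
	"strings"
)

// target is the namespace plus the kubectl resource reference to act on.
type target struct {
	Namespace string
	Resource  string
}

// resolveTarget applies the documented defaults: the namespace defaults to
// the app name and the resource to deploy/<app>, so kubectl picks the pod.
// Non-empty overrides win.
func resolveTarget(app, nsOverride, podOverride string) target {
	res := target{Namespace: app, Resource: "deploy/" + app}
	if nsOverride != "" {
		res.Namespace = nsOverride
	}
	if podOverride != "" {
		res.Resource = "pod/" + podOverride
	}
	return res
}

// logsArgs builds the kubectl argument list for a logs call on the target.
func logsArgs(tg target, tail int) []string {
	return []string{"-n", tg.Namespace, "logs", tg.Resource, fmt.Sprintf("--tail=%d", tail)}
}

func main() {
	tg := resolveTarget("tripit", "", "")
	fmt.Println("kubectl " + strings.Join(logsArgs(tg, 200), " "))
}
```

Keeping the resolver separate from argument building means every `k8s` verb can share the same defaults while adding only its own flags.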
`tf` resolves the stack dir by walking up from cwd to the infra root and
delegates to `scripts/tg` (which owns state decrypt/encrypt, the Vault lock, and
the ingress auth-comment check). git-crypt filter flags are auto-injected on git
operations in the encrypted infra repo.
**`work land` refuses to push when it cannot verify** (no `--verify-cmd` and no
auto-detected suite) unless you pass `--no-verify` — landing to master unverified
must be deliberate. After pushing it **watches CI to green** (`ci watch` on the
landed commit) and fails if the pipeline does; pass `--no-ci-watch` to skip.
Tiers are recorded per verb so a future PreToolUse classifier can auto-allow
reads / prompt writes; v0.1 allows everything and relies on existing gates
(permission mode, presence claims, plan approval).
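
A future classifier built on the recorded tiers might look like the sketch below. It is hypothetical: `verbTiers` lists only a few verbs copied from the tables above, and treating an unknown verb as a write (prompt) is an assumption, not documented behaviour.

```go
package main

import "fmt"

type tier string

const (
	tierRead  tier = "read"
	tierWrite tier = "write"
)

// verbTiers is a small excerpt of the manifest: verb path -> tier.
var verbTiers = map[string]tier{
	"tf plan":     tierRead,
	"tf apply":    tierWrite,
	"k8s logs":    tierRead,
	"k8s restart": tierWrite,
}

// decide returns "allow" for read-tier verbs and "prompt" for everything
// else, including verbs missing from the manifest (assumed fail-safe).
func decide(verb string) string {
	if verbTiers[verb] == tierRead {
		return "allow"
	}
	return "prompt"
}

func main() {
	for _, v := range []string{"tf plan", "tf apply", "k8s logs", "unknown verb"} {
		fmt.Printf("%-14s %s\n", v, decide(v))
	}
}
```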
### v0.3 verbs — memory
A thin HTTP client over the **claude-memory** service (the same backend the
memory MCP wraps), authed with `CLAUDE_MEMORY_API_KEY` against
`CLAUDE_MEMORY_API_URL` (the env the hooks already set; defaults to the
ingress). Because it hits the HTTP API directly, it **works even when the MCP
frontend is down**.
| Command | Tier | What it does |
|---|---|---|
| `memory recall "<context>" [--query --category --sort --limit]` | read | semantic search (server-side ranking) — the navigate workhorse |
| `memory list [--category --tag --limit]` | read | recent memories |
| `memory categories` / `memory tags` / `memory stats` | read | enumerate the store |
| `memory secret <id>` | read | reveal a sensitive memory's content |
| `memory store "<content>" [--category --tags --keywords --importance --sensitive]` | write | store a memory |
| `memory update <id> [--content --tags --importance]` | write | edit a memory |
| `memory delete <id>` | write | delete a memory |
All read/write paths are validated against the live API (incl. a
store→recall→delete round-trip). This gives full data-plane parity with the MCP;
the eventual deprecation (rewiring the per-prompt auto-recall + auto-learn hooks
to the CLI, then uninstalling the MCP) is a **separate, deliberate follow-up**;
see `docs/adr/0008`.
### v0.4 verbs — ci / deploy
Watch what you trigger, without hand-rolling Woodpecker/kubectl polling. `ci`
talks to the Woodpecker API (token from `WOODPECKER_TOKEN` or Vault
`secret/ci/global`) via the internal Traefik LB, resolving the repo from the cwd
remote, with retries that ride Woodpecker's intermittent empty responses.
| Command | Tier | What it does |
|---|---|---|
| `ci status [commit]` | read | pipeline status for HEAD (or a commit) |
| `ci watch [commit]` | read | poll the pipeline to terminal; exit non-zero on failure |
| `deploy wait <ns>/<deploy> [--sha SHA]` | read | wait for the deployment image to match the sha, *then* rollout status (rollout status alone lies on the old ReplicaSet) |
`work land` now calls `ci watch` on the landed commit automatically (skip with
`--no-ci-watch`), closing the v0.1 "doesn't wait for CI" gap. `ci logs` (failing
step) is deferred to v0.4.1 — Woodpecker's per-pipeline detail/log endpoints were
the least reliable; `status`/`watch` use the list endpoint that works.
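
The "image first, then rollout status" ordering behind `deploy wait` can be sketched as a polling loop. This is a minimal illustration, assuming "match" means the sha appears in the image reference; `waitForImage` is a hypothetical name and the image lookup is injected as a function so the loop runs without a cluster.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// waitForImage polls current until the reported image reference contains
// sha, sleeping interval between attempts. Lookup errors are treated as
// "not yet" and retried.
func waitForImage(current func() (string, error), sha string, attempts int, interval time.Duration) error {
	last := ""
	for i := 0; i < attempts; i++ {
		img, err := current()
		if err == nil {
			if strings.Contains(img, sha) {
				return nil
			}
			last = img
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("image never matched %q (last seen %q)", sha, last)
}

func main() {
	// Fake cluster: reports the old image twice, then the new one.
	calls := 0
	current := func() (string, error) {
		calls++
		if calls < 3 {
			return "ghcr.io/example/app:0000000", nil
		}
		return "ghcr.io/example/app:abc1234", nil
	}
	if err := waitForImage(current, "abc1234", 10, 10*time.Millisecond); err != nil {
		fmt.Println("timed out:", err)
		return
	}
	fmt.Println("image matches; rollout status is now meaningful")
}
```

Only after this loop succeeds does a rollout-status check describe the new ReplicaSet, which is the failure mode the table entry warns about.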
### v0.5 verbs — net / dns / metrics / logs
Reachability + observability probes. Their value is *endpoint resolution* — the
non-obvious "which host, public or LB, what auth, what URL shape" reasoning you'd
otherwise re-derive every time — not the HTTP call itself. All reach internal
ingresses through the Traefik LB (the Go form of `curl --resolve host:443:10.0.20.203`).
| Command | Tier | What it does |
|---|---|---|
| `net check <host> [path]` | read | probes the host two ways — external (public DNS → Cloudflare) vs internal (Traefik LB) — with status + latency, so you can tell *where* a break is (CF? app? the LB path?) |
| `dns lookup <name> [type]` | read | resolves via Technitium (`10.0.20.201`) and public (`1.1.1.1`), diffed — surfaces split-horizon vs propagation gaps |
| `metrics query "<promql>"` | read | Prometheus instant query (`prometheus-query.viktorbarzin.lan`); prints `value {labels}` or `--json` |
| `metrics alerts` | read | currently-firing alerts (via the synthetic `ALERTS` series — the query frontend has no `/api/v1/alerts`) |
| `logs query "<logql>" [--since 1h] [--limit N]` | read | Loki range query (`loki.viktorbarzin.lan`); prints log lines or `--json` |
Quote the PromQL/LogQL. These hit auth-free internal ingresses — no port-forward,
no kubectl. (In-cluster-only endpoints like Alertmanager stay out of scope; the
firing set is reachable via `ALERTS` instead.)
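
The "Go form of `curl --resolve`" mentioned above can be approximated with a custom dialer. This is a sketch under assumptions, not the CLI's code: `pinAddr` and `pinnedClient` are hypothetical names and the timeouts are arbitrary. The request URL keeps the real hostname, so the `Host` header and TLS server name are unchanged; only the TCP destination is overridden.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// pinAddr swaps the host part of "host:port" for ip, keeping the port.
func pinAddr(addr, ip string) (string, error) {
	_, port, err := net.SplitHostPort(addr)
	if err != nil {
		return "", err
	}
	return net.JoinHostPort(ip, port), nil
}

// pinnedClient returns an HTTP client whose connections always go to ip,
// whatever hostname the request URL names.
func pinnedClient(ip string) *http.Client {
	dialer := &net.Dialer{Timeout: 5 * time.Second}
	tr := &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			pinned, err := pinAddr(addr, ip)
			if err != nil {
				return nil, err
			}
			return dialer.DialContext(ctx, network, pinned)
		},
	}
	return &http.Client{Transport: tr, Timeout: 10 * time.Second}
}

func main() {
	// Show the rewrite only; no network traffic is generated here.
	pinned, err := pinAddr("loki.viktorbarzin.lan:443", "10.0.20.203")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(pinned)
	_ = pinnedClient("10.0.20.203") // use: client.Get("https://loki.viktorbarzin.lan/...")
}
```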
### v0.6 — usage telemetry (`usage top`)
Makes "which verbs are actually used, by everyone" a query instead of a guess —
so adding the *next* verb is evidence-driven, not shaped by one person's habits.
Every dispatched verb emits one fire-and-forget Loki line: `{job, user, verb}`
labels + `exit=N ver=X` — **only the verb path and exit code, never args, paths,
flags, or secrets.** It's best-effort (tight timeout, errors swallowed, never
affects the command) and opt-out via `HOMELAB_TELEMETRY=0`. Because the sink is
the shared Loki, aggregate usage is queryable **without reading anyone's home**:
the privacy-preserving answer to "what does the team use."
| Command | Tier | What it does |
|---|---|---|
| `usage top [--since 30d] [--user U] [--json]` | read | rank verbs by invocation count across all users (or one), via `sum by (verb) (count_over_time({job="homelab-usage"}[…]))` |
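
The shape of one telemetry line can be sketched with Loki's JSON push format. This is an illustration only: the label set and the `exit=N ver=X` body follow the description above, while `usageLine`, `telemetryEnabled`, and the struct names are hypothetical, and the sketch prints the payload instead of posting it.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strconv"
	"time"
)

// lokiStream is one labelled stream in a Loki push request.
type lokiStream struct {
	Stream map[string]string `json:"stream"`
	Values [][2]string       `json:"values"` // [unix-nanos, line]
}

// lokiPush is the JSON body accepted by Loki's push endpoint.
type lokiPush struct {
	Streams []lokiStream `json:"streams"`
}

// telemetryEnabled honours the documented opt-out.
func telemetryEnabled() bool {
	return os.Getenv("HOMELAB_TELEMETRY") != "0"
}

// usageLine builds the payload for one dispatched verb. Only the verb path,
// user, exit code, and version are included: no args, paths, or flags.
func usageLine(user, verb string, exit int, ver string, tsNanos int64) lokiPush {
	return lokiPush{Streams: []lokiStream{{
		Stream: map[string]string{"job": "homelab-usage", "user": user, "verb": verb},
		Values: [][2]string{{
			strconv.FormatInt(tsNanos, 10),
			fmt.Sprintf("exit=%d ver=%s", exit, ver),
		}},
	}}}
}

func main() {
	if !telemetryEnabled() {
		return
	}
	body, err := json.Marshal(usageLine("emo", "k8s logs", 0, "v0.8.1", time.Now().UnixNano()))
	if err != nil {
		return // best-effort: telemetry errors are swallowed
	}
	fmt.Println(string(body))
}
```

Because `user` and `verb` are labels and not part of the log line, the `sum by (verb)` query in the table can aggregate without parsing line contents.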
### v0.7 verbs — Home Assistant
Cover exactly the two things the `ha` **MCP server can't**: resolving the
long-lived API token out of the cluster, and SSH to the HA host for host-level
work (config files, docker, add-ons). Entity state and control (`turn_on`,
`get_state`, services) stay with the MCP — *actions an MCP already encodes are
out of scope* (see top of this doc). The value here is the same as `net`/`dns`:
the non-obvious *which secret, which host, which key, which flags* you'd
otherwise re-derive every session — agents were hand-rolling a
`kubectl | base64 | jq` token pipeline and a bespoke `ssh -o …` invocation on
every run because the existing `home-assistant-sofia.py` needs an env var set
and a cwd-relative path, neither of which holds in an arbitrary session.
| Command | Tier | What it does |
|---|---|---|
| `ha token [--instance sofia\|london]` | read | print the long-lived HA API token, resolved live from the dedicated k8s Secret `openclaw/ha-tokens` (key per instance) via the ambient kubeconfig — no pre-set env var. Use as `curl -H "Authorization: Bearer $(homelab ha token)" …`. The secret is a least-privilege carve-out (`stacks/openclaw/ha_tokens.tf`): the `Home Server Admins` group can read *just* it, so non-admin operators get the HA token without the rest of `skill_secrets` (slack webhook, uptime-kuma password) |
| `ha ssh [--instance sofia\|london] [-i KEY] -- <cmd>` | write | run `<cmd>` on the HA host over ssh with deterministic non-interactive flags (explicit key = the invoking user's `~/.ssh/id_ed25519`, no user ssh-config, no known_hosts prompt). sofia (`vbarzin@192.168.1.8`) is reachable from the devvm LAN; london is documented but generally remote |
`--instance` defaults to **sofia** (the devvm shares the Sofia LAN). `ha token`
prints the bare token to stdout so it composes in `$(…)`; it's read-tier like
`memory secret`. `ha ssh` resolves the *invoking user's* key, so it's per-user,
not tied to whoever first wrote the workflow (the user's key must be enrolled on
the HA host).
### v0.8 verbs — browser (headful anti-bot automation)
Drive the cluster's **headful** Chrome (`chrome-service`, real Chrome under Xvfb)
from the devvm over CDP, for sites that detect and block headless automation. The
headless `@playwright/mcp` browser can *load* such a site and fill its forms, but
the gated action (submit/login) silently fails — the motivating case was the
Stirling Ackroyd Fixflo tenant portal, whose pre-submit check returned
`net::ERR_FILE_NOT_FOUND` and hung. This path connects via `connect_over_cdp`,
injects the same `stealth.js` the in-cluster callers use, and submits first try.
The command owns only the *mechanics* (port-forward, stealth, lifecycle); the
agent supplies the Playwright script — judgment stays out of the CLI.
| Command | Tier | What it does |
|---|---|---|
| `browser run <script.js> [--url U] [--shared-context] [--keep-open] [--port N] [--timeout S]` | write | port-forward `svc/chrome-service:9222`, assert it's a real (non-headless) Chrome via `/json/version`, `connect_over_cdp`, `addInitScript(stealth.js)`, then run the script with `page`/`context`/`browser`/`log` in scope (top-level await ok; return a value to print it). Always tears the forward down. |
| `browser open <url> [--shared-context] [--timeout S]` | write | open `<url>` headful and print title + visible text + a screenshot path — a quick check. |
| `browser --help` | read | when-to-use signature + the error-code cheat-sheet (`ERR_FILE_NOT_FOUND` = automation-layer intercept, not egress; `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED` = real egress; one endpoint 500 while siblings 200 = bot rejection). |
Default context is a **fresh incognito** one (closed on exit) — safe for the
shared browser and concurrent callers (e.g. tripit's fare scrape); `--shared-context`
reuses the warmed persistent profile when a pre-logged-in session is needed.
`port-forward` tunnels API-server→pod, so it bypasses the `:9222` NetworkPolicy
that gates in-cluster callers — no namespace label needed. The node CDP client is
pinned to **`playwright-core@1.48.2`** to match the chrome-service image minor
(Chromium 130; protocol changes between minors) and is installed once, lazily,
into `~/.cache/homelab/browser-client/` (no per-user setup). Because the client
runs on the devvm, `setInputFiles` streams local files to the remote browser over
CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md`
and `docs/adr/0013`.
## Build / install
Built from source to `/usr/local/bin/homelab` during devvm provisioning
(`scripts/workstation/setup-devvm.sh`, the `t3-dispatch` pattern); version is
stamped from `cli/VERSION` via ldflags. Manual build:
```
cd cli && go build -ldflags "-X main.version=$(cat VERSION)" -o /usr/local/bin/homelab .
go test ./...
```
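
The `-X main.version=…` flag in the build command works by overwriting a package-level string variable at link time. A minimal sketch of the receiving side, assuming the variable is named `version` in package `main` as the flag implies; the `"dev"` fallback and the `versionString` helper are assumptions for illustration.

```go
package main

import "fmt"

// version is replaced at link time by -ldflags "-X main.version=<value>".
// It must be a package-level string variable (not a constant) for -X to apply.
var version = "dev"

// versionString formats the value reported by a version subcommand.
func versionString() string {
	return "homelab " + version
}

func main() {
	fmt.Println(versionString())
}
```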
## Legacy webhook use-cases (preserved)
This binary is also the in-cluster `infra-cli` image. Invocations starting with
`-use-case=<vpn|setup-openwrt-dns|add-email-alias|...>` fall through to the
original flag-based path unchanged, so the webhook handler is unaffected.
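
The fall-through can be pictured as a prefix check made before verb dispatch. The sketch below is hypothetical: `isLegacyInvocation` is an invented name, and it matches only the single-dash `-use-case=` spelling quoted above.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// isLegacyInvocation reports whether the first argument is the legacy
// -use-case=<name> flag, in which case the old flag-based path should run.
func isLegacyInvocation(args []string) bool {
	return len(args) > 0 && strings.HasPrefix(args[0], "-use-case=")
}

func main() {
	args := os.Args[1:]
	if isLegacyInvocation(args) {
		fmt.Println("legacy path:", strings.TrimPrefix(args[0], "-use-case="))
		return
	}
	fmt.Println("verb dispatch:", strings.Join(args, " "))
}
```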
## Design
See `infra/docs/adr/0004` through `0013` for the architecture decisions.

1
cli/VERSION Normal file

@ -0,0 +1 @@
v0.8.1

388
cli/browser.go Normal file

@ -0,0 +1,388 @@
package main

import (
	_ "embed"
	"encoding/json"
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
	"os/exec"
	"os/signal"
	"path/filepath"
	"strconv"
	"strings"
	"sync"
	"syscall"
	"time"
)

// playwrightVersion pins the node CDP client to the chrome-service image minor
// (mcr.microsoft.com/playwright:v1.48.0-noble → Chromium 130). connect_over_cdp
// speaks the browser's CDP, so the client minor must track the server minor;
// see docs/architecture/chrome-service.md "Image pin".
const playwrightVersion = "1.48.2"

// defaultBrowserTimeout is how long (seconds) to wait for the port-forwarded CDP
// endpoint to become ready before giving up.
const defaultBrowserTimeout = 60

const (
	chromeServiceNamespace = "chrome-service"
	chromeServiceName      = "chrome-service"
	chromeServiceCDPPort   = 9222
)

// stealthJS is vendored verbatim from stacks/chrome-service/files/stealth.js (the
// source of truth the in-cluster callers use). TestStealthJSEmbeddedMatchesCanonical
// guards against drift.
//
//go:embed browser_stealth.js
var stealthJS string

// runnerJS is the node wrapper that connects to the port-forwarded CDP endpoint,
// installs the stealth init script, and runs the user's Playwright script.
//
//go:embed browser_runner.js
var runnerJS string

// browserOpts is the parsed form of `homelab browser run|open` arguments.
type browserOpts struct {
	mode      string // "run" | "open"
	script    string // path to the user Playwright script (run mode)
	url       string // initial URL (run: optional; open: required positional)
	sharedCtx bool   // use the warmed persistent profile instead of a fresh context
	keepOpen  bool   // leave the created context/pages open on exit
	port      int    // explicit local port for the forward (0 = auto)
	timeout   int    // CDP readiness timeout, seconds
	help      bool
}
// parseBrowserArgs parses the args after `browser run` / `browser open`.
func parseBrowserArgs(mode string, args []string) (browserOpts, error) {
o := browserOpts{mode: mode, timeout: defaultBrowserTimeout}
var positionals []string
atoi := func(s, flag string) (int, error) {
n, err := strconv.Atoi(s)
if err != nil {
return 0, fmt.Errorf("%s expects an integer, got %q", flag, s)
}
return n, nil
}
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "-h" || a == "--help":
o.help = true
case a == "--shared-context":
o.sharedCtx = true
case a == "--keep-open":
o.keepOpen = true
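case a == "--url":
// A trailing --url with no value is an error rather than silently ignored.
if i+1 >= len(args) {
return o, fmt.Errorf("--url expects a value")
}
o.url = args[i+1]
i++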
case strings.HasPrefix(a, "--url="):
o.url = strings.TrimPrefix(a, "--url=")
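case a == "--port":
// A trailing --port with no value is an error rather than silently ignored.
if i+1 >= len(args) {
return o, fmt.Errorf("--port expects a value")
}
n, err := atoi(args[i+1], "--port")
if err != nil {
return o, err
}
o.port = n
i++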
case strings.HasPrefix(a, "--port="):
n, err := atoi(strings.TrimPrefix(a, "--port="), "--port")
if err != nil {
return o, err
}
o.port = n
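case a == "--timeout":
// A trailing --timeout with no value is an error rather than silently ignored.
if i+1 >= len(args) {
return o, fmt.Errorf("--timeout expects a value")
}
n, err := atoi(args[i+1], "--timeout")
if err != nil {
return o, err
}
o.timeout = n
i++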
case strings.HasPrefix(a, "--timeout="):
n, err := atoi(strings.TrimPrefix(a, "--timeout="), "--timeout")
if err != nil {
return o, err
}
o.timeout = n
case strings.HasPrefix(a, "-"):
return o, fmt.Errorf("unknown flag %q (try: homelab browser --help)", a)
default:
positionals = append(positionals, a)
}
}
if o.help {
return o, nil
}
switch mode {
case "run":
if len(positionals) == 0 {
return o, fmt.Errorf("usage: homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]")
}
o.script = positionals[0]
case "open":
if len(positionals) == 0 {
return o, fmt.Errorf("usage: homelab browser open <url> [--shared-context] [--timeout S]")
}
o.url = positionals[0]
}
return o, nil
}
// cdpHealthy parses a CDP /json/version body and reports whether the endpoint is
// a real (non-headless) Chrome — the entire reason chrome-service exists.
func cdpHealthy(jsonBody []byte) (browser string, healthy bool, err error) {
var v struct {
Browser string `json:"Browser"`
UserAgent string `json:"User-Agent"`
}
if e := json.Unmarshal(jsonBody, &v); e != nil {
return "", false, fmt.Errorf("parse /json/version: %w", e)
}
if v.Browser == "" {
return "", false, fmt.Errorf("/json/version had no Browser field")
}
healthy = strings.HasPrefix(v.Browser, "Chrome/") &&
!strings.Contains(v.Browser, "Headless") &&
!strings.Contains(v.UserAgent, "Headless")
return v.Browser, healthy, nil
}
// buildPortForwardArgs is the kubectl invocation that exposes chrome-service's
// CDP locally. port-forward tunnels API-server→pod, so it bypasses the :9222
// NetworkPolicy that gates in-cluster callers.
func buildPortForwardArgs(localPort int) []string {
return []string{"-n", chromeServiceNamespace, "port-forward",
"svc/" + chromeServiceName, fmt.Sprintf("%d:%d", localPort, chromeServiceCDPPort)}
}
// browserClientPackageJSON is the auto-managed manifest for the pinned node CDP
// client kept under the user cache dir.
func browserClientPackageJSON() string {
return fmt.Sprintf(`{
"name": "homelab-browser-client",
"private": true,
"description": "Pinned CDP client for 'homelab browser' — auto-managed, do not edit.",
"dependencies": {
"playwright-core": "%s"
}
}
`, playwrightVersion)
}
// freePort asks the kernel for an unused ephemeral TCP port.
func freePort() (int, error) {
l, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
return 0, err
}
defer l.Close()
return l.Addr().(*net.TCPAddr).Port, nil
}
// browserClientDir is where the pinned node client + managed runner files live.
func browserClientDir() (string, error) {
cache, err := os.UserCacheDir()
if err != nil || cache == "" {
home, herr := os.UserHomeDir()
if herr != nil {
return "", fmt.Errorf("locate cache dir: %v / %v", err, herr)
}
cache = filepath.Join(home, ".cache")
}
return filepath.Join(cache, "homelab", "browser-client"), nil
}
// installedPlaywrightVersion reads the version of the playwright-core already
// installed in dir, or "" if absent/unreadable.
func installedPlaywrightVersion(dir string) string {
b, err := os.ReadFile(filepath.Join(dir, "node_modules", "playwright-core", "package.json"))
if err != nil {
return ""
}
var v struct {
Version string `json:"version"`
}
if json.Unmarshal(b, &v) != nil {
return ""
}
return v.Version
}
// ensureBrowserClient writes the managed runner/stealth/package files into dir
// and lazily installs the pinned playwright-core (only when missing/mismatched),
// so no per-user setup is needed and the client tracks the binary version.
func ensureBrowserClient(dir string) error {
if err := os.MkdirAll(dir, 0o755); err != nil {
return err
}
files := map[string]string{
"package.json": browserClientPackageJSON(),
"browser_runner.js": runnerJS,
"stealth.js": stealthJS,
}
for name, content := range files {
if err := os.WriteFile(filepath.Join(dir, name), []byte(content), 0o644); err != nil {
return err
}
}
if installedPlaywrightVersion(dir) == playwrightVersion {
return nil
}
fmt.Fprintf(os.Stderr, "homelab browser: installing pinned playwright-core@%s (one-time, takes a few seconds)…\n", playwrightVersion)
cmd := exec.Command("npm", "install", "--no-audit", "--no-fund", "--silent")
cmd.Dir = dir
cmd.Stdout = os.Stderr
cmd.Stderr = os.Stderr
if err := cmd.Run(); err != nil {
return fmt.Errorf("npm install playwright-core@%s in %s: %w (is node/npm installed?)", playwrightVersion, dir, err)
}
if got := installedPlaywrightVersion(dir); got != playwrightVersion {
return fmt.Errorf("playwright-core install mismatch in %s: want %s, got %q", dir, playwrightVersion, got)
}
return nil
}
// waitForCDP polls the local CDP endpoint until it answers as a healthy
// (non-headless) Chrome, or the timeout elapses.
func waitForCDP(cdpURL string, timeout time.Duration) (string, error) {
deadline := time.Now().Add(timeout)
client := &http.Client{Timeout: 3 * time.Second}
var lastErr error
for time.Now().Before(deadline) {
resp, err := client.Get(cdpURL + "/json/version")
if err != nil {
lastErr = err
time.Sleep(300 * time.Millisecond)
continue
}
body, _ := io.ReadAll(resp.Body)
resp.Body.Close()
browser, healthy, herr := cdpHealthy(body)
if herr != nil {
lastErr = herr
time.Sleep(300 * time.Millisecond)
continue
}
if !healthy {
return browser, fmt.Errorf("CDP reports %q — expected a non-headless Chrome (wrong target?)", browser)
}
return browser, nil
}
if lastErr == nil {
lastErr = fmt.Errorf("timed out after %s", timeout)
}
return "", lastErr
}
// runBrowser is the orchestration: pick a port, ensure the pinned client, start
// (and ALWAYS tear down) a CDP port-forward, wait for readiness, then run node.
func runBrowser(o browserOpts) error {
port := o.port
if port == 0 {
p, err := freePort()
if err != nil {
return fmt.Errorf("pick local port: %w", err)
}
port = p
}
dir, err := browserClientDir()
if err != nil {
return err
}
if err := ensureBrowserClient(dir); err != nil {
return err
}
// Start the forward in its own process group so the whole tree dies on cleanup.
pf := exec.Command("kubectl", buildPortForwardArgs(port)...)
pf.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
var pfLog strings.Builder
pf.Stdout = &pfLog
pf.Stderr = &pfLog
if err := pf.Start(); err != nil {
return fmt.Errorf("start kubectl port-forward (kubeconfig set?): %w", err)
}
var once sync.Once
teardown := func() {
once.Do(func() {
if pf.Process != nil {
_ = syscall.Kill(-pf.Process.Pid, syscall.SIGKILL)
}
_ = pf.Wait()
})
}
defer teardown()
// Tear down on Ctrl-C / SIGTERM too, then exit non-zero.
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
defer signal.Stop(sigCh)
go func() {
if _, ok := <-sigCh; ok {
teardown()
os.Exit(130)
}
}()
cdpURL := fmt.Sprintf("http://127.0.0.1:%d", port)
browser, err := waitForCDP(cdpURL, time.Duration(o.timeout)*time.Second)
if err != nil {
return fmt.Errorf("chrome-service CDP not ready on %s: %w\n--- port-forward log ---\n%s", cdpURL, err, pfLog.String())
}
fmt.Fprintf(os.Stderr, "homelab browser: connected to %s via %s\n", browser, cdpURL)
return runBrowserNode(dir, cdpURL, o)
}
// runBrowserNode invokes the managed node runner with inputs passed via env.
func runBrowserNode(dir, cdpURL string, o browserOpts) error {
env := append(os.Environ(),
"HOMELAB_CDP_URL="+cdpURL,
"HOMELAB_BROWSER_MODE="+o.mode,
"HOMELAB_STEALTH_PATH="+filepath.Join(dir, "stealth.js"),
"NODE_PATH="+filepath.Join(dir, "node_modules"),
)
if o.url != "" {
env = append(env, "HOMELAB_BROWSER_URL="+o.url)
}
if o.script != "" {
abs, err := filepath.Abs(o.script)
if err != nil {
return err
}
if _, err := os.Stat(abs); err != nil {
return fmt.Errorf("script %s: %w", o.script, err)
}
env = append(env, "HOMELAB_BROWSER_SCRIPT="+abs)
}
if o.sharedCtx {
env = append(env, "HOMELAB_BROWSER_SHARED=1")
}
if o.keepOpen {
env = append(env, "HOMELAB_BROWSER_KEEP_OPEN=1")
}
if o.mode == "open" {
shot := filepath.Join(os.TempDir(), fmt.Sprintf("homelab-browser-%d.png", os.Getpid()))
env = append(env, "HOMELAB_BROWSER_SCREENSHOT="+shot)
}
cmd := exec.Command("node", filepath.Join(dir, "browser_runner.js"))
cmd.Env = env
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
cmd.Stdin = os.Stdin
return cmd.Run()
}

106
cli/browser_runner.js Normal file

@ -0,0 +1,106 @@
// homelab browser — node CDP runner (auto-managed; regenerated each run from the
// homelab binary — DO NOT EDIT here). Connects to the port-forwarded
// chrome-service CDP endpoint, installs the stealth init script, then runs the
// user's Playwright script (run mode) or opens a URL (open mode). All inputs
// arrive via HOMELAB_* env vars set by the Go CLI.
'use strict';
const fs = require('fs');
const { chromium } = require('playwright-core');
async function main() {
const cdpURL = process.env.HOMELAB_CDP_URL;
if (!cdpURL) throw new Error('HOMELAB_CDP_URL not set');
const mode = process.env.HOMELAB_BROWSER_MODE || 'run';
const stealthPath = process.env.HOMELAB_STEALTH_PATH || '';
const initURL = process.env.HOMELAB_BROWSER_URL || '';
const scriptPath = process.env.HOMELAB_BROWSER_SCRIPT || '';
const shared = process.env.HOMELAB_BROWSER_SHARED === '1';
const keepOpen = process.env.HOMELAB_BROWSER_KEEP_OPEN === '1';
const screenshotPath = process.env.HOMELAB_BROWSER_SCREENSHOT || '';
const browser = await chromium.connectOverCDP(cdpURL);
// Fresh isolated context by default (safe for the shared browser + concurrent
// callers); --shared-context reuses the warmed persistent profile.
let context;
let createdContext = false;
if (shared) {
const existing = browser.contexts();
if (existing.length) {
context = existing[0];
} else {
context = await browser.newContext();
createdContext = true;
}
} else {
context = await browser.newContext();
createdContext = true;
}
if (stealthPath) {
const stealth = fs.readFileSync(stealthPath, 'utf8');
if (stealth.trim()) await context.addInitScript(stealth);
}
const page = await context.newPage();
const log = (...a) => console.error('[browser]', ...a);
let exitCode = 0;
try {
if (initURL) {
await page.goto(initURL, { waitUntil: 'domcontentloaded' });
}
if (mode === 'open') {
console.log('url: ' + page.url());
console.log('title: ' + (await page.title()));
const text = (await page.evaluate(() => (document.body ? document.body.innerText : ''))).trim();
console.log('--- visible text (truncated to 4000 chars) ---');
console.log(text.slice(0, 4000));
if (screenshotPath) {
await page.screenshot({ path: screenshotPath, fullPage: true });
console.log('screenshot: ' + screenshotPath);
}
} else {
if (!scriptPath) throw new Error('run mode requires HOMELAB_BROWSER_SCRIPT');
const src = fs.readFileSync(scriptPath, 'utf8');
// Run the user's source with page/context/browser/log in lexical scope.
// AsyncFunction body permits top-level await.
const AsyncFunction = Object.getPrototypeOf(async () => {}).constructor;
const fn = new AsyncFunction('page', 'context', 'browser', 'log', src);
const result = await fn(page, context, browser, log);
if (result !== undefined) {
let out;
try {
out = typeof result === 'string' ? result : JSON.stringify(result, null, 2);
} catch (_) {
out = String(result);
}
console.log(out);
}
}
} catch (e) {
console.error('homelab browser: script error:', e && e.stack ? e.stack : e);
exitCode = 1;
} finally {
if (!keepOpen) {
try {
// Close only what we created; never tear down the shared persistent context.
if (createdContext) {
await context.close();
} else {
await page.close();
}
} catch (_) { /* ignore */ }
}
// Disconnect from the CDP endpoint; this does NOT kill the remote browser.
try {
await browser.close();
} catch (_) { /* ignore */ }
}
process.exit(exitCode);
}
main().catch((e) => {
console.error('homelab browser: fatal:', e && e.stack ? e.stack : e);
process.exit(1);
});
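
The runner above takes all of its inputs from `HOMELAB_*` environment variables, and it treats the boolean ones as set only when they equal the string `1`. The Go-side half of that contract might be sketched as follows; `runnerOpts` and `runnerEnv` are hypothetical names for illustration, and the stealth path, `NODE_PATH` and screenshot variables that the real CLI also sets are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// runnerOpts is a hypothetical, trimmed-down stand-in for the CLI's parsed options.
type runnerOpts struct {
	mode     string // "run" or "open"
	url      string // optional initial URL
	script   string // absolute script path (run mode)
	shared   bool   // reuse the persistent context
	keepOpen bool   // leave the context/page open on exit
}

// runnerEnv builds the variables the node runner reads. Optional values are
// omitted entirely when unset, and booleans are encoded as the literal "1"
// because the runner compares them with === '1'.
func runnerEnv(cdpURL string, o runnerOpts) map[string]string {
	env := map[string]string{
		"HOMELAB_CDP_URL":      cdpURL,
		"HOMELAB_BROWSER_MODE": o.mode,
	}
	if o.url != "" {
		env["HOMELAB_BROWSER_URL"] = o.url
	}
	if o.script != "" {
		env["HOMELAB_BROWSER_SCRIPT"] = o.script
	}
	if o.shared {
		env["HOMELAB_BROWSER_SHARED"] = "1"
	}
	if o.keepOpen {
		env["HOMELAB_BROWSER_KEEP_OPEN"] = "1"
	}
	return env
}

func main() {
	env := runnerEnv("http://127.0.0.1:19999", runnerOpts{
		mode:   "run",
		script: "/tmp/flow.js",
		shared: true,
	})
	// Print in a stable order so the output is easy to compare.
	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, env[k])
	}
}
```

Passing inputs through the environment keeps the managed runner file identical on every run, so it can be overwritten from the binary without templating.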

54
cli/browser_stealth.js Normal file

@ -0,0 +1,54 @@
// Minimal stealth init script for Playwright-driven Chromium.
// Vendored from puppeteer-extra-plugin-stealth/evasions/* (MIT) — covers:
// webdriver, chrome.runtime, navigator.plugins, navigator.languages,
// Permissions.query, WebGL getParameter (vendor + renderer spoof).
// Run via context.add_init_script() so it executes before any page script.
(() => {
// navigator.webdriver — most common detection, removed entirely.
Object.defineProperty(Navigator.prototype, 'webdriver', { get: () => undefined });
// window.chrome.runtime — many sites check that real Chrome exposes this.
if (!window.chrome) window.chrome = {};
window.chrome.runtime = window.chrome.runtime || {};
// navigator.plugins — headless reports zero; spoof a plausible PDF viewer.
Object.defineProperty(navigator, 'plugins', {
get: () => [{ name: 'Chrome PDF Plugin' }, { name: 'Chrome PDF Viewer' }, { name: 'Native Client' }],
});
// navigator.languages — headless returns empty array.
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
// Permissions.query — headless returns 'denied' for notifications instead of 'default'.
const origQuery = window.navigator.permissions && window.navigator.permissions.query;
if (origQuery) {
window.navigator.permissions.query = (parameters) =>
parameters && parameters.name === 'notifications'
? Promise.resolve({ state: Notification.permission })
: origQuery(parameters);
}
// WebGL getParameter — spoof vendor + renderer strings to a real GPU.
const spoofGl = (proto) => {
if (!proto) return;
const orig = proto.getParameter;
proto.getParameter = function (parameter) {
if (parameter === 37445) return 'Intel Inc.'; // UNMASKED_VENDOR_WEBGL
if (parameter === 37446) return 'Intel Iris OpenGL Engine'; // UNMASKED_RENDERER_WEBGL
return orig.apply(this, arguments);
};
};
spoofGl(window.WebGLRenderingContext && window.WebGLRenderingContext.prototype);
spoofGl(window.WebGL2RenderingContext && window.WebGL2RenderingContext.prototype);
// disable-devtool.js (theajack/disable-devtool) auto-inits via a script
// tag with `disable-devtool-auto`. Its Performance detector trips under
// Playwright (CDP adds console.log latency vs console.table) and the
// redirect URL is hard-coded — for hmembeds that's google.com.
// Hide the auto-init marker so the library's IIFE exits early.
const origQS = Document.prototype.querySelector;
Document.prototype.querySelector = function (sel) {
if (typeof sel === 'string' && sel.indexOf('disable-devtool-auto') !== -1) return null;
return origQS.apply(this, arguments);
};
})();

117
cli/cmd_browser.go Normal file

@ -0,0 +1,117 @@
package main
import "fmt"
// browser verbs drive the cluster's HEADFUL Chrome (ns chrome-service) over CDP
// from outside the cluster, for sites that detect/block headless automation.
// The headless @playwright/mcp browser can load such sites but their gated
// actions (submit/login) silently fail; this path submits first try. Mechanics
// only — the agent supplies the Playwright script. See docs/adr/0013.
func browserCommands() []Command {
return []Command{
{Path: []string{"browser"}, Tier: TierRead,
Summary: "headful cluster-Chrome automation for anti-bot sites (run `browser --help`)", Run: browserTopHelp},
{Path: []string{"browser", "run"}, Tier: TierWrite,
Summary: "run a Playwright script against headful cluster Chrome: browser run <script.js> [--url U] [--shared-context]", Run: browserRun},
{Path: []string{"browser", "open"}, Tier: TierWrite,
Summary: "open a URL in headful cluster Chrome; print title + text + screenshot: browser open <url>", Run: browserOpen},
}
}
func browserTopHelp([]string) error {
fmt.Print(browserHelp())
return nil
}
func browserRun(args []string) error {
o, err := parseBrowserArgs("run", args)
if err != nil {
return err
}
if o.help {
fmt.Print(browserHelp())
return nil
}
return runBrowser(o)
}
func browserOpen(args []string) error {
o, err := parseBrowserArgs("open", args)
if err != nil {
return err
}
if o.help {
fmt.Print(browserHelp())
return nil
}
return runBrowser(o)
}
// browserHelp carries the discoverability payload: WHEN to reach for this, and
// the diagnostic cheat-sheet that lets the agent self-correct instead of
// retrying a deterministic form blind (the failure mode that motivated this).
func browserHelp() string {
return `homelab browser: drive the cluster's HEADFUL Chrome (anti-bot) over CDP
The shared chrome-service (ns chrome-service) runs a REAL, headed Chrome under
Xvfb. This connects to it via a port-forward + Playwright connect_over_cdp,
injects the same stealth.js the in-cluster callers use, and runs your script.
USAGE
homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]
homelab browser open <url> [--shared-context] [--timeout S]
WHEN TO USE THIS (escalation only; DEFAULT to the headless/MCP browser)
Default to the Playwright MCP / headless browser for ALL routine browsing and
automation: it's interactive (snapshot per step), fast to start, isolated.
Reach for THIS command ONLY when headless is demonstrably blocked: a site
LOADS fine but a gated action FAILS or HANGS (a submit/login/checkout spins
forever, or ONE request errors while its siblings 200). That is the signature
of headless / anti-bot detection (navigator.webdriver, UA "HeadlessChrome",
disable-devtool traps). It presents as a real Chrome and usually succeeds
first try, but it's the shared cluster browser (slower startup, one batch
run, no per-step feedback), so it's the escalation path, never the default.
ERROR-CODE CHEAT-SHEET (diagnose BEFORE retrying)
ERR_FILE_NOT_FOUND (-6)     request intercepted/resolved locally by the
                            automation layer, NOT a network/egress problem.
                            (This is what silently broke the headless submit.)
ERR_CONNECTION_REFUSED /    real egress failure (DNS/route/firewall). These also
ERR_TIMED_OUT /             break the initial page load; if the page loaded,
ERR_NAME_NOT_RESOLVED       egress is fine and the cause is elsewhere.
one endpoint 500s while     server-side bot rejection of the automation, not
its siblings 200            your payload.
HABITS
- Inspect the network panel BEFORE retrying a deterministic form; a blind
retry just repeats the same silent failure.
- Don't park a half-filled multi-step form across a user pause; the session
can expire. Re-run the whole flow from this command in one shot.
- Uploads stream over CDP via setInputFiles from THIS host, so no chmod/staging
of $HOME is needed; just point setInputFiles at a local path.
CONTEXT
Default: a FRESH incognito context, closed on exit, so it is safe for the shared
browser and concurrent callers (e.g. tripit). Your script does its own login.
--shared-context: reuse the warmed PERSISTENT profile (cookies from a manual
noVNC login at chrome.viktorbarzin.me) when you need a pre-logged-in session.
SCRIPT CONTRACT (run mode)
Your file's body runs with page, context, browser and log() already in scope
(top-level await allowed). Return a value to print it. Example flow.js:
await page.goto('https://portal.example.com/login');
await page.fill('#user', 'me'); await page.fill('#pass', process.env.PW);
await page.click('button[type=submit]');
await page.waitForURL('**/dashboard');
return 'logged in: ' + page.url();
Run it: homelab browser run flow.js
NOTES
- The Playwright client is pinned to playwright-core@` + playwrightVersion + ` to match the
chrome-service image (Chrome 130); installed once into ~/.cache/homelab/.
- The port-forward is always torn down, on success and on error.
`
}
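
`browserCommands` registers a parent verb (`browser`) alongside two longer paths (`browser run`, `browser open`), so whatever dispatches these entries has to prefer the longest matching path. The dispatcher itself is not part of this diff; the sketch below is only one plausible way to do it, with a reduced hypothetical `Command` type (no `Tier`) and hypothetical `resolve` and `demoCommands` helpers.

```go
package main

import "fmt"

// Command is a hypothetical, reduced version of the registry entry used above;
// the real type also carries a Tier and is defined outside this diff.
type Command struct {
	Path    []string
	Summary string
}

// hasPrefix reports whether args starts with every element of path.
func hasPrefix(args, path []string) bool {
	if len(path) > len(args) {
		return false
	}
	for i := range path {
		if args[i] != path[i] {
			return false
		}
	}
	return true
}

// resolve picks the command whose Path is the longest prefix of args and
// returns it together with the remaining arguments. It returns nil when no
// registered path matches.
func resolve(cmds []Command, args []string) (*Command, []string) {
	var best *Command
	for i := range cmds {
		if !hasPrefix(args, cmds[i].Path) {
			continue
		}
		if best == nil || len(cmds[i].Path) > len(best.Path) {
			best = &cmds[i]
		}
	}
	if best == nil {
		return nil, args
	}
	return best, args[len(best.Path):]
}

// demoCommands mirrors the three paths registered by browserCommands.
func demoCommands() []Command {
	return []Command{
		{Path: []string{"browser"}, Summary: "top-level help"},
		{Path: []string{"browser", "run"}, Summary: "run a script"},
		{Path: []string{"browser", "open"}, Summary: "open a URL"},
	}
}

func main() {
	cmd, rest := resolve(demoCommands(), []string{"browser", "run", "flow.js", "--keep-open"})
	fmt.Println(cmd.Summary, rest)

	cmd, rest = resolve(demoCommands(), []string{"browser", "--help"})
	fmt.Println(cmd.Summary, rest)
}
```

With longest-prefix matching, `browser --help` still lands on the parent entry, which is what lets the top-level verb carry the help text.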

172
cli/cmd_browser_test.go Normal file

@ -0,0 +1,172 @@
package main
import (
"os"
"reflect"
"strings"
"testing"
)
func TestParseBrowserArgsRun(t *testing.T) {
got, err := parseBrowserArgs("run", []string{
"flow.js", "--url", "https://example.com", "--shared-context",
"--port", "19999", "--timeout", "45", "--keep-open",
})
if err != nil {
t.Fatalf("parseBrowserArgs run: unexpected err: %v", err)
}
want := browserOpts{
mode: "run", script: "flow.js", url: "https://example.com",
sharedCtx: true, keepOpen: true, port: 19999, timeout: 45,
}
if !reflect.DeepEqual(got, want) {
t.Fatalf("parseBrowserArgs run =\n %+v\nwant\n %+v", got, want)
}
}
func TestParseBrowserArgsRunDefaults(t *testing.T) {
got, err := parseBrowserArgs("run", []string{"flow.js"})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if got.script != "flow.js" || got.sharedCtx || got.keepOpen || got.port != 0 {
t.Fatalf("defaults wrong: %+v", got)
}
if got.timeout != defaultBrowserTimeout {
t.Fatalf("timeout default = %d, want %d", got.timeout, defaultBrowserTimeout)
}
}
func TestParseBrowserArgsRunRequiresScript(t *testing.T) {
if _, err := parseBrowserArgs("run", []string{"--url", "https://x"}); err == nil {
t.Fatalf("run without a script path should error")
}
}
func TestParseBrowserArgsOpenRequiresURL(t *testing.T) {
got, err := parseBrowserArgs("open", []string{"https://example.com"})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if got.url != "https://example.com" || got.mode != "open" {
t.Fatalf("open parse wrong: %+v", got)
}
if _, err := parseBrowserArgs("open", []string{}); err == nil {
t.Fatalf("open without a URL should error")
}
}
func TestParseBrowserArgsHelp(t *testing.T) {
for _, a := range [][]string{{"--help"}, {"-h"}, {"flow.js", "--help"}} {
got, err := parseBrowserArgs("run", a)
if err != nil {
t.Fatalf("help parse %v: %v", a, err)
}
if !got.help {
t.Fatalf("args %v should set help", a)
}
}
}
func TestParseBrowserArgsEqualsForm(t *testing.T) {
got, err := parseBrowserArgs("run", []string{"flow.js", "--url=https://x", "--port=8123", "--timeout=10"})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if got.url != "https://x" || got.port != 8123 || got.timeout != 10 {
t.Fatalf("--flag=value form not parsed: %+v", got)
}
}
func TestCDPHealthy(t *testing.T) {
real := []byte(`{"Browser":"Chrome/130.0.6723.31","User-Agent":"Mozilla/5.0 (X11; Linux x86_64) Chrome/130.0.0.0 Safari/537.36","webSocketDebuggerUrl":"ws://127.0.0.1/devtools/browser/x"}`)
browser, ok, err := cdpHealthy(real)
if err != nil || !ok {
t.Fatalf("real Chrome should be healthy: ok=%v err=%v", ok, err)
}
if !strings.HasPrefix(browser, "Chrome/") {
t.Fatalf("browser = %q, want Chrome/ prefix", browser)
}
headless := []byte(`{"Browser":"HeadlessChrome/130.0.6723.31","User-Agent":"Mozilla/5.0 HeadlessChrome/130.0.0.0"}`)
if _, ok, _ := cdpHealthy(headless); ok {
t.Fatalf("HeadlessChrome must be reported unhealthy (the whole point of chrome-service)")
}
if _, _, err := cdpHealthy([]byte("not json")); err == nil {
t.Fatalf("malformed /json/version body should error")
}
}
func TestBuildPortForwardArgs(t *testing.T) {
got := buildPortForwardArgs(18080)
want := []string{"-n", "chrome-service", "port-forward", "svc/chrome-service", "18080:9222"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("buildPortForwardArgs =\n %v\nwant\n %v", got, want)
}
}
func TestBrowserClientPackageJSONPinsVersion(t *testing.T) {
pj := browserClientPackageJSON()
if !strings.Contains(pj, `"playwright-core": "`+playwrightVersion+`"`) {
t.Fatalf("package.json must pin playwright-core to %s; got:\n%s", playwrightVersion, pj)
}
}
func TestPlaywrightVersionPinnedToServerMinor(t *testing.T) {
// chrome-service runs mcr.microsoft.com/playwright:v1.48.0-noble; the CDP
// client minor MUST match (protocol changes between minors).
if !strings.HasPrefix(playwrightVersion, "1.48.") {
t.Fatalf("playwrightVersion = %q, must be 1.48.x to match the chrome-service image", playwrightVersion)
}
}
func TestBrowserHelpHasDiagnosticCheatSheet(t *testing.T) {
h := browserHelp()
for _, want := range []string{
"homelab browser run",
"ERR_FILE_NOT_FOUND",
"ERR_CONNECTION_REFUSED",
"network panel",
"headless",
"--shared-context",
} {
if !strings.Contains(h, want) {
t.Errorf("browser --help is missing %q (the discoverability/self-correction payload)", want)
}
}
}
func TestBrowserHelpIsTiered(t *testing.T) {
// --help must frame this as the ESCALATION path (default to headless first),
// matching ~/code/CLAUDE.md and chrome-service.md — non-conflicting agent
// instructions. Guard against a regression to "co-equal choice" wording.
h := browserHelp()
for _, want := range []string{"Default to the", "escalation"} {
if !strings.Contains(h, want) {
t.Errorf("browser --help must carry the tiered/default-headless framing; missing %q", want)
}
}
}
func TestStealthJSEmbeddedMatchesCanonical(t *testing.T) {
// The embedded copy must never drift from the source of truth that the
// in-cluster callers use, else the CLI's stealth and the cluster's diverge.
canonical, err := os.ReadFile("../stacks/chrome-service/files/stealth.js")
if err != nil {
t.Fatalf("read canonical stealth.js: %v", err)
}
if stealthJS != string(canonical) {
t.Fatalf("cli/browser_stealth.js has drifted from stacks/chrome-service/files/stealth.js — re-copy it")
}
}
func TestFreePortReturnsUsablePort(t *testing.T) {
p, err := freePort()
if err != nil {
t.Fatalf("freePort: %v", err)
}
if p <= 1024 || p > 65535 {
t.Fatalf("freePort returned %d, want an ephemeral port", p)
}
}

99
cli/cmd_ci.go Normal file

@ -0,0 +1,99 @@
package main
import (
"fmt"
"os"
"strings"
"time"
)
func ciCommands() []Command {
return []Command{
{Path: []string{"ci", "status"}, Tier: TierRead,
Summary: "pipeline status for HEAD/a commit: ci status [commit]", Run: ciStatus},
{Path: []string{"ci", "watch"}, Tier: TierRead,
Summary: "poll the pipeline for HEAD (or a commit) to terminal; non-zero on failure", Run: ciWatch},
}
}
func short(s string) string {
if len(s) > 8 {
return s[:8]
}
return s
}
func firstLine(s string) string { return strings.SplitN(s, "\n", 2)[0] }
// currentHEAD returns the full HEAD sha of the cwd repo (empty if not a repo).
func currentHEAD() string {
cwd, _ := os.Getwd()
root, err := gitRepoRoot(cwd)
if err != nil {
return ""
}
sha, _ := gitOutput(root, "rev-parse", "HEAD")
return sha
}
func ciStatus(args []string) error {
commit, _ := firstPositional(args)
c, err := newWPClient()
if err != nil {
return err
}
id, err := c.repoID()
if err != nil {
return err
}
p, err := c.findPipeline(id, commit)
if err != nil {
return err
}
fmt.Printf("#%d %s event=%s %s %s\n", p.Number, p.Status, p.Event, short(p.Commit), firstLine(p.Message))
return nil
}
func ciWatch(args []string) error {
commit, _ := firstPositional(args)
if commit == "" {
commit = currentHEAD()
}
if commit == "" {
return fmt.Errorf("no commit given and not in a git repo")
}
c, err := newWPClient()
if err != nil {
return err
}
id, err := c.repoID()
if err != nil {
return err
}
timeout := 20 * time.Minute
deadline := time.Now().Add(timeout)
last := ""
for time.Now().Before(deadline) {
p, err := c.findPipeline(id, commit)
if err != nil {
if last != "waiting" {
fmt.Fprintf(os.Stderr, "homelab: waiting for pipeline (%s)...\n", short(commit))
last = "waiting"
}
} else {
if p.Status != last {
fmt.Fprintf(os.Stderr, "homelab: #%d %s\n", p.Number, p.Status)
last = p.Status
}
if isTerminalStatus(p.Status) {
fmt.Printf("#%d %s %s\n", p.Number, p.Status, short(commit))
if isFailureStatus(p.Status) {
return fmt.Errorf("pipeline #%d %s (woodpecker repo, see UI/DB for the failing step)", p.Number, p.Status)
}
return nil
}
}
time.Sleep(15 * time.Second)
}
return fmt.Errorf("timed out after %s waiting for CI on %s", timeout, short(commit))
}

56
cli/cmd_claim.go Normal file

@ -0,0 +1,56 @@
package main
import (
"fmt"
"strings"
)
func claimCommands() []Command {
return []Command{
{Path: []string{"claim"}, Tier: TierWrite,
Summary: "claim a shared infra resource on the presence board",
Run: runClaim},
{Path: []string{"release"}, Tier: TierWrite,
Summary: "release a presence claim",
Run: runRelease},
}
}
// runClaim parses `<kind>:<name> --purpose "..."` in either order (the presence
// script takes the label first, so we can't rely on Go's flag package which
// stops at the first positional).
func runClaim(args []string) error {
var label, purpose string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--purpose" || a == "-purpose":
if i+1 < len(args) {
purpose = args[i+1]
i++
}
case strings.HasPrefix(a, "--purpose="):
purpose = strings.TrimPrefix(a, "--purpose=")
case !strings.HasPrefix(a, "-") && label == "":
label = a
}
}
if label == "" {
return fmt.Errorf(`usage: homelab claim <kind>:<name> --purpose "what + why"`)
}
return presenceClaim(label, purpose)
}
func runRelease(args []string) error {
var label string
for _, a := range args {
if !strings.HasPrefix(a, "-") {
label = a
break
}
}
if label == "" {
return fmt.Errorf("usage: homelab release <kind>:<name>")
}
return presenceRelease(label)
}
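
The comment on `runClaim` notes that Go's `flag` package stops at the first positional argument, which is why the label-first form needs a hand-rolled scan. The difference can be seen side by side in the sketch below; `withFlagPackage` and `manualScan` are hypothetical names, and `manualScan` only approximates the loop in `runClaim`.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"strings"
)

// withFlagPackage parses args with the standard flag package, which stops at
// the first non-flag argument and leaves everything after it in Args().
func withFlagPackage(args []string) (label, purpose string) {
	fs := flag.NewFlagSet("claim", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage noise out of the demo output
	p := fs.String("purpose", "", "what + why")
	_ = fs.Parse(args)
	if fs.NArg() > 0 {
		label = fs.Arg(0)
	}
	return label, *p
}

// manualScan walks every argument, so the label and --purpose may appear in
// either order.
func manualScan(args []string) (label, purpose string) {
	for i := 0; i < len(args); i++ {
		a := args[i]
		switch {
		case a == "--purpose" || a == "-purpose":
			if i+1 < len(args) {
				purpose = args[i+1]
				i++
			}
		case strings.HasPrefix(a, "--purpose="):
			purpose = strings.TrimPrefix(a, "--purpose=")
		case !strings.HasPrefix(a, "-") && label == "":
			label = a
		}
	}
	return label, purpose
}

func main() {
	labelFirst := []string{"node:k8s-master", "--purpose", "upgrade"}

	l, p := withFlagPackage(labelFirst)
	fmt.Printf("flag package: label=%q purpose=%q\n", l, p)

	l, p = manualScan(labelFirst)
	fmt.Printf("manual scan:  label=%q purpose=%q\n", l, p)
}
```

With the label first, the `flag` package never sees `--purpose`, so the purpose comes back empty, while the manual scan recovers both values.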

51
cli/cmd_deploy.go Normal file

@ -0,0 +1,51 @@
package main
import (
"fmt"
"os"
"strings"
"time"
)
func deployCommands() []Command {
return []Command{
{Path: []string{"deploy", "wait"}, Tier: TierRead,
Summary: "wait for <ns>/<deploy> to roll out the current (or --sha) image: deploy wait <ns>/<deploy> [--sha SHA]", Run: deployWait},
}
}
// deployWait closes the "did the NEW code land" gap: rollout status alone returns
// success on the OLD ReplicaSet, so we first wait for the deployment image to
// reference the expected sha, THEN block on rollout status.
func deployWait(args []string) error {
target, _ := firstPositional(args)
if target == "" || !strings.Contains(target, "/") {
return fmt.Errorf("usage: homelab deploy wait <ns>/<deploy> [--sha SHA]")
}
parts := strings.SplitN(target, "/", 2)
ns, deploy := parts[0], parts[1]
sha := flagValue(args, "--sha")
if sha == "" {
sha = short(currentHEAD())
}
deadline := time.Now().Add(10 * time.Minute)
if sha != "" {
fmt.Fprintf(os.Stderr, "homelab: waiting for %s/%s image to match %s...\n", ns, deploy, sha)
matched := false
for time.Now().Before(deadline) {
img, _ := kubectlCapture(ns, "get", "deploy", deploy, "-o", "jsonpath={.spec.template.spec.containers[*].image}")
if strings.Contains(img, sha) {
matched = true
break
}
time.Sleep(10 * time.Second)
}
if !matched {
return fmt.Errorf("timed out: %s/%s image never matched %q", ns, deploy, sha)
}
}
fmt.Fprintf(os.Stderr, "homelab: rollout status %s/%s...\n", ns, deploy)
return kubectlStream(ns, "rollout", "status", "deploy/"+deploy, "--timeout=180s")
}
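
The first phase of `deployWait` depends on two small steps: splitting `<ns>/<deploy>` and checking whether the image string returned by the `containers[*].image` jsonpath mentions the expected sha. A minimal sketch of both follows, with hypothetical helper names (`splitTarget`, `anyImageHasSHA`). It deliberately departs from the code above in two places, both assumptions of this sketch: it rejects empty halves such as `ns/`, and it checks each whitespace-separated image separately instead of the whole string.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTarget splits "<ns>/<deploy>" at the first slash and rejects inputs
// where either half is empty (stricter than the check in deployWait).
func splitTarget(target string) (string, string, error) {
	ns, deploy, ok := strings.Cut(target, "/")
	if !ok || ns == "" || deploy == "" {
		return "", "", fmt.Errorf("want <ns>/<deploy>, got %q", target)
	}
	return ns, deploy, nil
}

// anyImageHasSHA reports whether any image in a whitespace-separated list
// contains sha. An empty sha never matches: strings.Contains(x, "") is always
// true, so without this guard the wait would pass immediately.
func anyImageHasSHA(images, sha string) bool {
	if sha == "" {
		return false
	}
	for _, img := range strings.Fields(images) {
		if strings.Contains(img, sha) {
			return true
		}
	}
	return false
}

func main() {
	ns, deploy, err := splitTarget("chrome-service/chrome-service")
	fmt.Println(ns, deploy, err)

	// Hypothetical image references, in the space-separated form the jsonpath yields.
	images := "registry.example/app:9c68d147 registry.example/sidecar:1.2.3"
	fmt.Println(anyImageHasSHA(images, "9c68d147"))
	fmt.Println(anyImageHasSHA(images, "60a1cb9a"))
}
```

The empty-sha guard mirrors the `if sha != ""` branch in `deployWait`, which skips the image phase entirely outside a git repo.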

172
cli/cmd_ha.go Normal file

@ -0,0 +1,172 @@
package main
import (
"encoding/base64"
"fmt"
"os"
"path/filepath"
"strings"
)
// Home Assistant verbs cover the two things the `ha` MCP server can't: resolving
// the long-lived API token out of the cluster, and SSH to the HA host for
// host-level work (config files, docker, add-ons). Entity state/control stays
// with the MCP — see docs/adr/0012.
//
// The token lives in the dedicated k8s Secret openclaw/ha-tokens (one key per
// instance), split out of openclaw-secrets so non-admin operators (emo / "Home
// Server Admins") can read JUST the HA token, not the full skill_secrets blob.
// `ha token` resolves it on demand via the ambient kubeconfig, so it never
// depends on a pre-set env var (the gap that made agents re-derive the
// kubectl|base64|jq pipeline every session).
type haInstance struct {
name string // sofia | london
sshUser string // SSH login on the HA host
sshHost string // host reachable from the devvm (Sofia LAN)
secretKey string // key inside the openclaw/ha-tokens Secret holding this token
}
const (
haDefaultInstance = "sofia"
haSecretNamespace = "openclaw"
haSecretName = "ha-tokens" // dedicated, least-privilege; see stacks/openclaw/ha_tokens.tf
)
// haInstances maps instance name → connection/secret facts. sofia is the default
// because the devvm is on the Sofia LAN; london is documented but its host
// (192.168.8.x) is only reachable remotely, so `ha ssh --instance london`
// generally won't connect from here (token resolution still works).
var haInstances = map[string]haInstance{
"sofia": {name: "sofia", sshUser: "vbarzin", sshHost: "192.168.1.8", secretKey: "sofia"},
"london": {name: "london", sshUser: "hassio", sshHost: "192.168.8.103", secretKey: "london"},
}
func haCommands() []Command {
return []Command{
{Path: []string{"ha", "token"}, Tier: TierRead,
Summary: "reveal the HA long-lived API token from the cluster: ha token [--instance sofia|london]", Run: haToken},
{Path: []string{"ha", "ssh"}, Tier: TierWrite,
Summary: "run a command on the HA host over ssh: ha ssh [--instance sofia|london] [-i KEY] -- <cmd>", Run: haSSH},
}
}
// resolveHAInstance looks up an instance by name; "" yields the default (sofia).
func resolveHAInstance(name string) (haInstance, error) {
if name == "" {
name = haDefaultInstance
}
inst, ok := haInstances[name]
if !ok {
return haInstance{}, fmt.Errorf("unknown HA instance %q (want sofia or london)", name)
}
return inst, nil
}
// decodeSecretValue base64-decodes a k8s Secret `.data.<key>` value as returned
// by kubectl jsonpath (trailing whitespace tolerated).
func decodeSecretValue(b64 string) (string, error) {
raw, err := base64.StdEncoding.DecodeString(strings.TrimSpace(b64))
if err != nil {
return "", fmt.Errorf("base64-decode secret value: %w", err)
}
return string(raw), nil
}
func haToken(args []string) error {
name, _ := firstPositional(args) // accept `ha token sofia` as well as `--instance sofia`
for i := 0; i < len(args); i++ {
if args[i] == "--instance" && i+1 < len(args) {
name = args[i+1]
} else if strings.HasPrefix(args[i], "--instance=") {
name = strings.TrimPrefix(args[i], "--instance=")
}
}
inst, err := resolveHAInstance(name)
if err != nil {
return err
}
b64, err := kubectlCapture(haSecretNamespace, "get", "secret", haSecretName,
"-o", "jsonpath={.data."+inst.secretKey+"}")
if err != nil {
return fmt.Errorf("read secret %s/%s (kubeconfig set? RBAC?): %w", haSecretNamespace, haSecretName, err)
}
if b64 == "" {
return fmt.Errorf("secret %s/%s has no %q key", haSecretNamespace, haSecretName, inst.secretKey)
}
tok, err := decodeSecretValue(b64)
if err != nil {
return err
}
fmt.Println(tok)
return nil
}
// defaultHAKeyPath is the invoking user's ed25519 key, so the verb is per-user
// rather than tied to whoever first wrote the workflow.
func defaultHAKeyPath() string {
if home, err := os.UserHomeDir(); err == nil && home != "" {
return filepath.Join(home, ".ssh", "id_ed25519")
}
return filepath.Join("~", ".ssh", "id_ed25519")
}
// parseHASSH reads `[--instance X] [-i|--key PATH] [-- ] <cmd...>`. Tokens after
// `--` are taken verbatim; bare tokens before it are also the remote command.
func parseHASSH(args []string) (inst haInstance, keyPath string, remote []string, err error) {
name := haDefaultInstance
keyPath = defaultHAKeyPath()
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--":
remote = append(remote, args[i+1:]...)
i = len(args)
case a == "--instance":
if i+1 < len(args) {
name = args[i+1]
i++
}
case strings.HasPrefix(a, "--instance="):
name = strings.TrimPrefix(a, "--instance=")
case a == "--key" || a == "-i":
if i+1 < len(args) {
keyPath = args[i+1]
i++
}
case strings.HasPrefix(a, "--key="):
keyPath = strings.TrimPrefix(a, "--key=")
default:
remote = append(remote, a)
}
}
inst, err = resolveHAInstance(name)
return inst, keyPath, remote, err
}
// buildHASSHArgs assembles deterministic, non-interactive ssh args: an explicit
// key, no user ssh config, and no known_hosts prompt/record — so it runs
// unattended in an agent session without hanging on a host-key prompt.
func buildHASSHArgs(inst haInstance, keyPath string, remote []string) []string {
args := []string{
"-F", "/dev/null",
"-o", "IdentityFile=" + keyPath,
"-o", "StrictHostKeyChecking=no",
"-o", "UserKnownHostsFile=/dev/null",
"-o", "ConnectTimeout=10",
"-o", "BatchMode=yes",
inst.sshUser + "@" + inst.sshHost,
}
return append(args, remote...)
}
func haSSH(args []string) error {
inst, keyPath, remote, err := parseHASSH(args)
if err != nil {
return err
}
if len(remote) == 0 {
return fmt.Errorf(`usage: homelab ha ssh [--instance sofia|london] [-i KEY] -- <command>`)
}
return runStreaming("ssh", buildHASSHArgs(inst, keyPath, remote)...)
}
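
For context on what the printed token is for: Home Assistant's REST API accepts a long-lived token as an `Authorization: Bearer` header. The sketch below shows one way a caller might consume the output of `ha token`. `newHARequest` is a hypothetical helper, the token value is made up, and a local test server stands in for a real HA instance.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newHARequest builds a GET for baseURL+path that carries token as a Bearer
// credential. The token is trimmed because `ha token` prints a trailing newline.
func newHARequest(baseURL, path, token string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, strings.TrimRight(baseURL, "/")+path, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+strings.TrimSpace(token))
	return req, nil
}

func main() {
	// Stand-in server (assumption): accepts only the made-up token "tok-sofia".
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "Bearer tok-sofia" {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		io.WriteString(w, `{"message": "API running."}`)
	}))
	defer srv.Close()

	req, err := newHARequest(srv.URL, "/api/", "tok-sofia\n")
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.StatusCode, string(body))
}
```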

92
cli/cmd_ha_test.go Normal file

@ -0,0 +1,92 @@
package main
import (
"encoding/base64"
"reflect"
"strings"
"testing"
)
func TestResolveHAInstance(t *testing.T) {
// empty defaults to sofia (the devvm sits on the Sofia LAN)
if got, err := resolveHAInstance(""); err != nil || got.name != "sofia" {
t.Fatalf(`resolveHAInstance("") = %+v, %v; want sofia`, got, err)
}
if got, err := resolveHAInstance("sofia"); err != nil || got.secretKey != "sofia" {
t.Fatalf("sofia secretKey = %q, %v", got.secretKey, err)
}
if got, err := resolveHAInstance("london"); err != nil || got.secretKey != "london" || got.sshUser != "hassio" {
t.Fatalf("london = %+v, %v", got, err)
}
if _, err := resolveHAInstance("paris"); err == nil {
t.Fatalf("resolveHAInstance(paris) should error on unknown instance")
}
}
func TestDecodeSecretValue(t *testing.T) {
// k8s stores Secret values base64-encoded; `kubectl -o jsonpath={.data.<k>}`
// returns that base64, which decodeSecretValue turns back into the raw token.
enc := base64.StdEncoding.EncodeToString([]byte("tok-sofia"))
if got, err := decodeSecretValue(enc); err != nil || got != "tok-sofia" {
t.Fatalf("decodeSecretValue = %q, %v; want tok-sofia", got, err)
}
// trailing whitespace/newline from jsonpath output must be tolerated
if got, err := decodeSecretValue(enc + "\n"); err != nil || got != "tok-sofia" {
t.Fatalf("decodeSecretValue (trailing ws) = %q, %v; want tok-sofia", got, err)
}
if _, err := decodeSecretValue("not-base64!!"); err == nil {
t.Fatalf("decodeSecretValue should error on undecodable base64")
}
}
func TestBuildHASSHArgs(t *testing.T) {
inst, _ := resolveHAInstance("sofia")
got := buildHASSHArgs(inst, "/home/u/.ssh/id_ed25519", []string{"cat", "/config/configuration.yaml"})
want := []string{
"-F", "/dev/null",
"-o", "IdentityFile=/home/u/.ssh/id_ed25519",
"-o", "StrictHostKeyChecking=no",
"-o", "UserKnownHostsFile=/dev/null",
"-o", "ConnectTimeout=10",
"-o", "BatchMode=yes",
"vbarzin@192.168.1.8",
"cat", "/config/configuration.yaml",
}
if !reflect.DeepEqual(got, want) {
t.Fatalf("buildHASSHArgs =\n %v\nwant\n %v", got, want)
}
}
func TestParseHASSH(t *testing.T) {
// instance flag + everything after `--` is the verbatim remote command
inst, key, remote, err := parseHASSH([]string{"--instance", "sofia", "--", "docker", "ps", "-a"})
if err != nil {
t.Fatalf("parseHASSH err: %v", err)
}
if inst.name != "sofia" {
t.Errorf("instance = %q, want sofia", inst.name)
}
if !strings.HasSuffix(key, "/.ssh/id_ed25519") {
t.Errorf("default key = %q, want it to end in /.ssh/id_ed25519", key)
}
if !reflect.DeepEqual(remote, []string{"docker", "ps", "-a"}) {
t.Errorf("remote = %v, want [docker ps -a]", remote)
}
// bare args (no `--`) are also taken as the remote command; -i overrides the key
_, key2, remote2, err := parseHASSH([]string{"-i", "/tmp/k", "uptime"})
if err != nil {
t.Fatalf("parseHASSH err: %v", err)
}
if key2 != "/tmp/k" {
t.Errorf("key = %q, want /tmp/k", key2)
}
if !reflect.DeepEqual(remote2, []string{"uptime"}) {
t.Errorf("remote = %v, want [uptime]", remote2)
}
// unknown instance surfaces as an error
if _, _, _, err := parseHASSH([]string{"--instance", "paris", "--", "ls"}); err == nil {
t.Errorf("parseHASSH should error on unknown instance")
}
}

288
cli/cmd_k8s.go Normal file

@ -0,0 +1,288 @@
package main
import (
"fmt"
"os"
"strings"
)
func k8sCommands() []Command {
return []Command{
{Path: []string{"k8s", "status"}, Tier: TierRead,
Summary: "pods (wide) + recent non-Normal events for a namespace (or -A)", Run: k8sStatus},
{Path: []string{"k8s", "get"}, Tier: TierRead,
Summary: "kubectl get in a namespace: k8s get <ns> <resource> [args]", Run: k8sGet},
{Path: []string{"k8s", "logs"}, Tier: TierRead,
Summary: "logs for <app> (deploy/<app>; --tail/-c/--previous/--since/-l)", Run: k8sLogs},
{Path: []string{"k8s", "describe"}, Tier: TierRead,
Summary: "describe <app>'s deployment (or an explicit resource)", Run: k8sDescribe},
{Path: []string{"k8s", "debug"}, Tier: TierRead,
Summary: "one-shot triage for <app>: pods+deploy+describe+logs+events", Run: k8sDebug},
{Path: []string{"k8s", "pf"}, Tier: TierRead,
Summary: "port-forward: k8s pf <app> <local:remote> [svc/pod target]", Run: k8sPortForward},
{Path: []string{"k8s", "db"}, Tier: TierWrite,
Summary: `query a dbaas DB: k8s db <app> [--mysql] [--db N] -- "<SQL>"`, Run: k8sDB},
{Path: []string{"k8s", "exec"}, Tier: TierWrite,
Summary: "exec in <app>'s pod: k8s exec <app> [--tty] -- <cmd>", Run: k8sExec},
{Path: []string{"k8s", "rm-pod"}, Tier: TierWrite,
Summary: "delete a stuck pod/job ONLY: k8s rm-pod <name> -n <ns> [--job] [--force]", Run: k8sRmPod},
{Path: []string{"k8s", "rollout-status"}, Tier: TierRead,
Summary: "rollout status of deploy/<app>", Run: k8sRolloutStatus},
{Path: []string{"k8s", "restart"}, Tier: TierWrite,
Summary: "rollout restart deploy/<app> then wait for status", Run: k8sRestart},
{Path: []string{"k8s", "probe"}, Tier: TierRead,
Summary: "in-cluster reachability: ephemeral curl pod to <app>.<ns>.svc", Run: k8sProbe},
}
}
func k8sStatus(args []string) error {
t := parseK8sTarget(args)
ns := t.namespace() // "" when no app/ns given → cluster-wide
get := []string{"get", "pods", "-o", "wide"}
ev := []string{"get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp"}
if ns == "" {
get = append(get, "-A")
ev = append(ev, "-A")
}
if err := kubectlStream(ns, get...); err != nil {
return err
}
fmt.Fprintln(os.Stderr, "\n--- recent events (type!=Normal) ---")
_ = kubectlStream(ns, ev...) // best-effort
return nil
}
func k8sGet(args []string) error {
t := parseK8sTarget(args)
if t.app == "" || len(t.rest) == 0 {
return fmt.Errorf("usage: homelab k8s get <ns> <resource> [args]")
}
return kubectlStream(t.app, append([]string{"get"}, t.rest...)...)
}
func k8sLogs(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s logs <app> [--tail N] [-c ctr] [--previous] [--since 1h] [-l sel]")
}
a := []string{"logs"}
if t.selector != "" {
a = append(a, "-l", t.selector)
} else {
a = append(a, t.objectRef())
}
if t.container != "" {
a = append(a, "-c", t.container)
}
if !containsPrefix(t.rest, "--tail") {
a = append(a, "--tail=200")
}
a = append(a, t.rest...)
return kubectlStream(t.namespace(), a...)
}
func k8sDescribe(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s describe <app> [resource]")
}
if len(t.rest) > 0 {
return kubectlStream(t.namespace(), append([]string{"describe"}, t.rest...)...)
}
return kubectlStream(t.namespace(), "describe", t.objectRef())
}
func k8sDebug(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s debug <app>")
}
ns := t.namespace()
sec := func(title string) { fmt.Fprintf(os.Stderr, "\n=== %s ===\n", title) }
sec("pods")
_ = kubectlStream(ns, "get", "pods", "-o", "wide")
sec("workloads")
_ = kubectlStream(ns, "get", "deploy,sts,ds", "-o", "wide")
sec("describe "+t.objectRef())
_ = kubectlStream(ns, "describe", t.objectRef())
sec("recent logs (--tail=50)")
_ = kubectlStream(ns, "logs", t.objectRef(), "--tail=50")
sec("events (type!=Normal)")
_ = kubectlStream(ns, "get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp")
return nil
}
func k8sPortForward(args []string) error {
t := parseK8sTarget(args)
if t.app == "" || len(t.rest) == 0 {
return fmt.Errorf("usage: homelab k8s pf <app> <local:remote> [svc/pod target]")
}
ports := t.rest[0]
target := "svc/" + t.app
if len(t.rest) > 1 {
target = t.rest[1]
}
return kubectlStream(t.namespace(), "port-forward", target, ports)
}
func k8sDB(args []string) error {
var app, dbName, sql string
mysql := false
for i := 0; i < len(args); i++ {
a := args[i]
if a == "--" {
sql = strings.Join(args[i+1:], " ")
break
}
switch {
case a == "--mysql":
mysql = true
case a == "--db":
if i+1 < len(args) {
dbName = args[i+1]
i++
}
case strings.HasPrefix(a, "--db="):
dbName = strings.TrimPrefix(a, "--db=")
case !strings.HasPrefix(a, "-") && app == "":
app = a
}
}
if app == "" {
return fmt.Errorf(`usage: homelab k8s db <app> [--mysql] [--db NAME] -- "<SQL>"`)
}
p := planDBExec(app, dbName, sql, mysql)
pod := p.pod
if pod == "" && p.selector != "" {
resolved, err := kubectlCapture(p.ns, "get", "pod", "-l", p.selector, "-o", "jsonpath={.items[0].metadata.name}")
if err != nil || resolved == "" {
return fmt.Errorf("could not resolve db pod in %s (selector %q): %v", p.ns, p.selector, err)
}
pod = resolved
}
exec := []string{"exec"}
if sql == "" {
exec = append(exec, "-it") // interactive client when no SQL given
}
exec = append(exec, pod)
if p.container != "" {
exec = append(exec, "-c", p.container)
}
exec = append(exec, "--")
exec = append(exec, p.argv...)
return kubectlStream(p.ns, exec...)
}
func k8sExec(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s exec <app> [--pod p] [-c ctr] [--tty] -- <cmd>")
}
if len(t.rest) == 0 {
return fmt.Errorf("provide a command after --, e.g. homelab k8s exec %s -- env", t.app)
}
a := []string{"exec"}
if t.tty {
a = append(a, "-it")
}
a = append(a, t.objectRef())
if t.container != "" {
a = append(a, "-c", t.container)
}
a = append(a, "--")
a = append(a, t.rest...)
return kubectlStream(t.namespace(), a...)
}
func k8sRmPod(args []string) error {
var pod, ns, grace string
force, job := false, false
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "-n" || a == "--namespace":
if i+1 < len(args) {
ns = args[i+1]
i++
}
case a == "--force":
force = true
case a == "--job":
job = true
case a == "--grace":
if i+1 < len(args) {
grace = args[i+1]
i++
}
case !strings.HasPrefix(a, "-") && pod == "":
pod = a
}
}
if pod == "" || ns == "" {
return fmt.Errorf("usage: homelab k8s rm-pod <name> -n <ns> [--job] [--force] [--grace N] (pods/jobs only)")
}
kind := "pod"
if job {
kind = "job"
}
a := []string{"delete", kind, pod}
if grace != "" {
a = append(a, "--grace-period="+grace)
}
if force {
a = append(a, "--force")
}
return kubectlStream(ns, a...)
}
func k8sRolloutStatus(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s rollout-status <app>")
}
return kubectlStream(t.namespace(), "rollout", "status", "deploy/"+t.app)
}
func k8sRestart(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s restart <app>")
}
ns := t.namespace()
if err := kubectlStream(ns, "rollout", "restart", "deploy/"+t.app); err != nil {
return err
}
return kubectlStream(ns, "rollout", "status", "deploy/"+t.app)
}
func k8sProbe(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s probe <app> [path] [--port N]")
}
ns := t.namespace()
url := "http://" + t.app + "." + ns + ".svc.cluster.local"
if port := flagValue(args, "--port"); port != "" {
url += ":" + port
}
if len(t.rest) > 0 {
p := t.rest[0]
if !strings.HasPrefix(p, "/") {
p = "/" + p
}
url += p
}
return kubectlStream(ns, "run", "homelab-probe", "--rm", "-i", "--restart=Never",
"--image=curlimages/curl:latest", "--",
"curl", "-sS", "--max-time", "10", "-w", "\n[%{http_code}] %{time_total}s\n", url)
}
// containsPrefix reports whether any arg starts with prefix.
func containsPrefix(args []string, prefix string) bool {
for _, a := range args {
if strings.HasPrefix(a, prefix) {
return true
}
}
return false
}
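
The URL assembled by k8sProbe follows the in-cluster service DNS pattern `<app>.<ns>.svc.cluster.local`. A small sketch of that assembly as a pure function, which makes the port and path handling easy to check; `serviceURL` is a hypothetical name and the app and namespace values are examples only.

```go
package main

import (
	"fmt"
	"strings"
)

// serviceURL builds the plain-HTTP in-cluster URL for a Service.
// port and path are optional; a path without a leading slash gets one.
func serviceURL(app, ns, port, path string) string {
	u := "http://" + app + "." + ns + ".svc.cluster.local"
	if port != "" {
		u += ":" + port
	}
	if path != "" {
		if !strings.HasPrefix(path, "/") {
			path = "/" + path
		}
		u += path
	}
	return u
}

func main() {
	fmt.Println(serviceURL("grafana", "monitoring", "", ""))
	fmt.Println(serviceURL("grafana", "monitoring", "3000", "api/health"))
}
```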

302
cli/cmd_memory.go Normal file

@ -0,0 +1,302 @@
package main
import (
"encoding/json"
"fmt"
"net/url"
"strings"
)
func memoryCommands() []Command {
return []Command{
{Path: []string{"memory", "recall"}, Tier: TierRead,
Summary: `semantic search of memory: memory recall "<context>" [--query …] [--category] [--sort] [--limit]`, Run: memoryRecall},
{Path: []string{"memory", "list"}, Tier: TierRead,
Summary: "list recent memories [--category C] [--tag T] [--limit N]", Run: memoryList},
{Path: []string{"memory", "categories"}, Tier: TierRead,
Summary: "list memory categories", Run: memorySimpleGet("/api/categories")},
{Path: []string{"memory", "tags"}, Tier: TierRead,
Summary: "list memory tags", Run: memorySimpleGet("/api/tags")},
{Path: []string{"memory", "stats"}, Tier: TierRead,
Summary: "memory store stats", Run: memorySimpleGet("/api/stats")},
{Path: []string{"memory", "secret"}, Tier: TierRead,
Summary: "reveal a sensitive memory's content: memory secret <id>", Run: memorySecret},
{Path: []string{"memory", "store"}, Tier: TierWrite,
Summary: `store a memory: memory store "<content>" [--category --tags --keywords --importance --sensitive]`, Run: memoryStore},
{Path: []string{"memory", "update"}, Tier: TierWrite,
Summary: "update a memory: memory update <id> [--content --tags --importance --keywords]", Run: memoryUpdate},
{Path: []string{"memory", "delete"}, Tier: TierWrite,
Summary: "delete a memory: memory delete <id>", Run: memoryDelete},
}
}
// printMemories renders a {memories:[…]} response as compact lines, or raw JSON.
func printMemories(raw []byte, jsonOut bool) error {
	if jsonOut {
		fmt.Println(string(raw))
		return nil
	}
	var r struct {
		Memories []struct {
			ID         int     `json:"id"`
			Content    string  `json:"content"`
			Category   string  `json:"category"`
			Tags       string  `json:"tags"`
			Importance float64 `json:"importance"`
		} `json:"memories"`
	}
	if err := json.Unmarshal(raw, &r); err != nil {
		// Not the expected shape: fall back to printing the body verbatim.
		fmt.Println(string(raw))
		return nil
	}
	if len(r.Memories) == 0 {
		fmt.Println("(no memories)")
		return nil
	}
	for _, m := range r.Memories {
		c := strings.ReplaceAll(m.Content, "\n", " ")
		// Truncate by runes, not bytes, so a multi-byte character is never split.
		if rs := []rune(c); len(rs) > 240 {
			c = string(rs[:240]) + "…"
		}
		fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
		if m.Tags != "" {
			fmt.Printf("   tags: %s\n", m.Tags)
		}
	}
	return nil
}
func memoryRecall(args []string) error {
req := memRecallReq{}
jsonOut := false
var pos []string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--query":
if i+1 < len(args) {
req.ExpandedQuery = args[i+1]
i++
}
case a == "--category":
if i+1 < len(args) {
req.Category = args[i+1]
i++
}
case a == "--sort":
if i+1 < len(args) {
req.SortBy = args[i+1]
i++
}
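case a == "--limit":
if i+1 < len(args) {
// Reject a non-numeric limit instead of silently sending the zero value.
if _, err := fmt.Sscanf(args[i+1], "%d", &req.Limit); err != nil {
return fmt.Errorf("bad --limit %q: %w", args[i+1], err)
}
i++
}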
case a == "--json":
jsonOut = true
case !strings.HasPrefix(a, "-"):
pos = append(pos, a)
}
}
req.Context = strings.Join(pos, " ")
if req.Context == "" {
return fmt.Errorf(`usage: homelab memory recall "<context>" [--query …] [--category C] [--sort importance|relevance|recency] [--limit N]`)
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("POST", "/api/memories/recall", req)
if err != nil {
return err
}
return printMemories(raw, jsonOut)
}
func memoryList(args []string) error {
q := url.Values{}
jsonOut := false
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--category":
if i+1 < len(args) {
q.Set("category", args[i+1])
i++
}
case a == "--tag":
if i+1 < len(args) {
q.Set("tag", args[i+1])
i++
}
case a == "--limit":
if i+1 < len(args) {
q.Set("limit", args[i+1])
i++
}
case a == "--json":
jsonOut = true
}
}
c, err := newMemoryClient()
if err != nil {
return err
}
path := "/api/memories"
if len(q) > 0 {
path += "?" + q.Encode()
}
raw, err := c.do("GET", path, nil)
if err != nil {
return err
}
return printMemories(raw, jsonOut)
}
func memorySimpleGet(path string) func([]string) error {
return func(args []string) error {
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("GET", path, nil)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
}
func memorySecret(args []string) error {
id, _ := firstPositional(args)
if id == "" {
return fmt.Errorf("usage: homelab memory secret <id>")
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("POST", "/api/memories/"+url.PathEscape(id)+"/secret", nil)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
func memoryStore(args []string) error {
req := memStoreReq{Category: "facts", Importance: 0.5}
var pos []string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--category":
if i+1 < len(args) {
req.Category = args[i+1]
i++
}
case a == "--tags":
if i+1 < len(args) {
req.Tags = args[i+1]
i++
}
case a == "--keywords":
if i+1 < len(args) {
req.ExpandedKeywords = args[i+1]
i++
}
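case a == "--importance":
if i+1 < len(args) {
// Reject a non-numeric importance instead of silently keeping the default.
if _, err := fmt.Sscanf(args[i+1], "%f", &req.Importance); err != nil {
return fmt.Errorf("bad --importance %q: %w", args[i+1], err)
}
i++
}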
case a == "--sensitive":
req.ForceSensitive = true
case !strings.HasPrefix(a, "-"):
pos = append(pos, a)
}
}
req.Content = strings.Join(pos, " ")
if req.Content == "" {
return fmt.Errorf(`usage: homelab memory store "<content>" [--category C] [--tags ...] [--keywords ...] [--importance 0.5] [--sensitive]`)
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("POST", "/api/memories", req)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
func memoryUpdate(args []string) error {
var id string
req := memUpdateReq{}
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--content":
if i+1 < len(args) {
v := args[i+1]
req.Content = &v
i++
}
case a == "--tags":
if i+1 < len(args) {
v := args[i+1]
req.Tags = &v
i++
}
case a == "--keywords":
if i+1 < len(args) {
v := args[i+1]
req.ExpandedKeywords = &v
i++
}
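case a == "--importance":
if i+1 < len(args) {
var f float64
// A parse failure must not overwrite the stored importance with 0.
if _, err := fmt.Sscanf(args[i+1], "%f", &f); err != nil {
return fmt.Errorf("bad --importance %q: %w", args[i+1], err)
}
req.Importance = &f
i++
}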
case !strings.HasPrefix(a, "-") && id == "":
id = a
}
}
if id == "" {
return fmt.Errorf("usage: homelab memory update <id> [--content ...] [--tags ...] [--importance N] [--keywords ...]")
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("PUT", "/api/memories/"+url.PathEscape(id), req)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
func memoryDelete(args []string) error {
id, _ := firstPositional(args)
if id == "" {
return fmt.Errorf("usage: homelab memory delete <id>")
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("DELETE", "/api/memories/"+url.PathEscape(id), nil)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}

83
cli/cmd_net.go Normal file

@ -0,0 +1,83 @@
package main
import (
"fmt"
"strings"
"time"
)
func netCommands() []Command {
return []Command{
{Path: []string{"net", "check"}, Tier: TierRead,
Summary: "reachability of <host>[/path]: external (public DNS→CF) vs internal (Traefik LB)", Run: netCheck},
{Path: []string{"dns", "lookup"}, Tier: TierRead,
Summary: "resolve <name> via Technitium (10.0.20.201) and public (1.1.1.1), diffed", Run: dnsLookup},
}
}
func fmtProbe(code int, d time.Duration, err error) string {
if err != nil {
return "ERR " + err.Error()
}
return fmt.Sprintf("HTTP %d %dms", code, d.Milliseconds())
}
func netCheck(args []string) error {
host, rest := firstPositional(args)
if host == "" {
return fmt.Errorf("usage: homelab net check <host> [path]")
}
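path := "/"
// Accept the combined <host>/<path> form advertised in the summary, so that
// "example.com/foo" probes /foo rather than /foo/.
if i := strings.Index(host, "/"); i >= 0 {
host, path = host[:i], host[i:]
}
if len(rest) > 0 && !strings.HasPrefix(rest[0], "-") {
path = rest[0]
if !strings.HasPrefix(path, "/") {
path = "/" + path
}
}
u := "https://" + host + path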
fmt.Printf("%s\n", u)
// external leg: resolve via public DNS, dial the public IP (tests the real CF path)
pubOut, _ := dig(hostOnly(host), "1.1.1.1", "")
if pubIP := firstLine(pubOut); pubIP != "" {
c, d, e := probeURL(clientDialingIP(pubIP, 10*time.Second), u)
fmt.Printf(" external (public %-15s) %s\n", pubIP, fmtProbe(c, d, e))
} else {
fmt.Println(" external (public) no public A record")
}
// internal leg: dial the Traefik LB directly
c, d, e := probeURL(clientDialingIP(internalLBIP, 10*time.Second), u)
fmt.Printf(" internal (LB %-15s) %s\n", internalLBIP, fmtProbe(c, d, e))
return nil
}
func dnsLookup(args []string) error {
name, rest := firstPositional(args)
if name == "" {
return fmt.Errorf("usage: homelab dns lookup <name> [A|AAAA|TXT|MX|PTR]")
}
rr := ""
if len(rest) > 0 {
rr = rest[0]
}
tech, _ := dig(name, "10.0.20.201", rr)
pub, _ := dig(name, "1.1.1.1", rr)
fmt.Printf("technitium (10.0.20.201): %s\n", oneLineList(tech))
fmt.Printf("public (1.1.1.1) : %s\n", oneLineList(pub))
if strings.TrimSpace(tech) != strings.TrimSpace(pub) {
fmt.Println("⚠ mismatch — split-horizon (expected for internal-only apps) or a propagation gap")
}
return nil
}
func hostOnly(h string) string { // strip any path accidentally included
return strings.SplitN(h, "/", 2)[0]
}
func oneLineList(s string) string {
s = strings.TrimSpace(s)
if s == "" {
return "(none)"
}
return strings.ReplaceAll(s, "\n", ", ")
}

197
cli/cmd_obs.go Normal file

@ -0,0 +1,197 @@
package main
import (
"encoding/json"
"fmt"
"net/url"
"sort"
"strconv"
"strings"
"time"
)
const (
promHost = "prometheus-query.viktorbarzin.lan"
lokiHost = "loki.viktorbarzin.lan"
)
func obsCommands() []Command {
return []Command{
{Path: []string{"metrics", "query"}, Tier: TierRead,
Summary: `Prometheus instant query: metrics query "<promql>" [--json]`, Run: metricsQuery},
{Path: []string{"metrics", "alerts"}, Tier: TierRead,
Summary: "list currently firing Prometheus alerts", Run: metricsAlerts},
{Path: []string{"logs", "query"}, Tier: TierRead,
Summary: `Loki query (last --since, default 1h): logs query "<logql>" [--since 1h] [--limit N] [--json]`, Run: logsQuery},
}
}
// queryArg joins non-flag args into the query (PromQL/LogQL should normally be
// passed as a single quoted argument; this also tolerates unquoted multi-token).
func queryArg(args []string, valueFlags map[string]bool) string {
var parts []string
for i := 0; i < len(args); i++ {
a := args[i]
if valueFlags[a] {
i++
continue
}
if strings.HasPrefix(a, "-") {
continue
}
parts = append(parts, a)
}
return strings.Join(parts, " ")
}
func labelStr(m map[string]string) string {
name := m["__name__"]
var kv []string
for k, v := range m {
if k != "__name__" {
kv = append(kv, k+"="+v)
}
}
sort.Strings(kv)
return name + "{" + strings.Join(kv, ",") + "}"
}
func metricsQuery(args []string) error {
q := queryArg(args, nil)
if q == "" {
return fmt.Errorf(`usage: homelab metrics query "<promql>" [--json]`)
}
v := url.Values{}
v.Set("query", q)
body, err := lbGetBody(promHost, "/api/v1/query", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Metric map[string]string `json:"metric"`
Value []interface{} `json:"value"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
if len(r.Data.Result) == 0 {
fmt.Println("(no series)")
return nil
}
for _, s := range r.Data.Result {
val := ""
if len(s.Value) == 2 {
val = fmt.Sprint(s.Value[1])
}
fmt.Printf("%-14s %s\n", val, labelStr(s.Metric))
}
return nil
}
func metricsAlerts(args []string) error {
// prometheus-query is a query-only frontend (no /api/v1/alerts); the firing
// set is exposed as the synthetic ALERTS series, queryable the normal way.
v := url.Values{}
v.Set("query", `ALERTS{alertstate="firing"}`)
body, err := lbGetBody(promHost, "/api/v1/query", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Metric map[string]string `json:"metric"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
if len(r.Data.Result) == 0 {
fmt.Println("(no firing alerts)")
return nil
}
for _, a := range r.Data.Result {
m := a.Metric
scope := ""
for _, k := range []string{"namespace", "deployment", "instance", "job", "node"} {
if v := m[k]; v != "" {
scope = k + "=" + v
break
}
}
fmt.Printf("%-9s %-34s %s\n", m["severity"], m["alertname"], scope)
}
return nil
}
func logsQuery(args []string) error {
q := queryArg(args, map[string]bool{"--since": true, "--limit": true})
if q == "" {
return fmt.Errorf(`usage: homelab logs query "<logql>" [--since 1h] [--limit N] [--json]`)
}
since := flagValue(args, "--since")
if since == "" {
since = "1h"
}
dur, err := time.ParseDuration(since)
if err != nil {
return fmt.Errorf("bad --since %q: %w", since, err)
}
limit := flagValue(args, "--limit")
if limit == "" {
limit = "100"
}
end := time.Now()
v := url.Values{}
v.Set("query", q)
v.Set("limit", limit)
v.Set("start", strconv.FormatInt(end.Add(-dur).UnixNano(), 10))
v.Set("end", strconv.FormatInt(end.UnixNano(), 10))
body, err := lbGetBody(lokiHost, "/loki/api/v1/query_range", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Values [][]string `json:"values"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
n := 0
for _, s := range r.Data.Result {
for _, val := range s.Values {
if len(val) == 2 {
fmt.Println(val[1])
n++
}
}
}
if n == 0 {
fmt.Println("(no log lines)")
}
return nil
}
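
The anonymous structs above mirror the shape of a Prometheus instant-query response: `data.result` is a list of series, each with a `metric` label map and a `value` pair of `[timestamp, "value-as-string"]`. The sketch below decodes a made-up sample payload. `formatVector` is a hypothetical helper, and unlike the commands above it also checks the `status` field.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// promVector models the subset of an instant-query response used here.
type promVector struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"` // [unix timestamp, "value"]
		} `json:"result"`
	} `json:"data"`
}

// formatVector renders each series as `<value> <name>{k=v,...}` with sorted labels.
func formatVector(body []byte) ([]string, error) {
	var r promVector
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, err
	}
	if r.Status != "success" {
		return nil, fmt.Errorf("prometheus status %q", r.Status)
	}
	var out []string
	for _, s := range r.Data.Result {
		val, _ := s.Value[1].(string)
		var kv []string
		for k, v := range s.Metric {
			if k != "__name__" {
				kv = append(kv, k+"="+v)
			}
		}
		sort.Strings(kv)
		out = append(out, val+" "+s.Metric["__name__"]+"{"+strings.Join(kv, ",")+"}")
	}
	return out, nil
}

func main() {
	// Made-up sample payload in the instant-query response format.
	sample := []byte(`{"status":"success","data":{"resultType":"vector","result":[
	  {"metric":{"__name__":"up","job":"node","instance":"n1:9100"},"value":[1700000000.5,"1"]}]}}`)
	lines, err := formatVector(sample)
	fmt.Println(lines, err)
}
```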

122
cli/cmd_tf.go Normal file

@ -0,0 +1,122 @@
package main
import (
"fmt"
"os"
"os/signal"
"path/filepath"
"strings"
"sync"
"syscall"
)
func tfCommands() []Command {
return []Command{
{Path: []string{"tf", "plan"}, Tier: TierRead,
Summary: "terragrunt plan a stack (via scripts/tg)", Run: tfPassthrough("plan")},
{Path: []string{"tf", "validate"}, Tier: TierRead,
Summary: "terragrunt validate a stack", Run: tfPassthrough("validate")},
{Path: []string{"tf", "fmt"}, Tier: TierRead,
Summary: "terraform fmt a stack's files", Run: tfFmt},
{Path: []string{"tf", "force-unlock"}, Tier: TierWrite,
Summary: "release a stuck terraform state lock (needs <stack> <lock-id>)", Run: tfForceUnlock},
{Path: []string{"tf", "apply"}, Tier: TierWrite,
Summary: "terragrunt apply a stack — presence-coupled, out-of-band", Run: tfApply},
}
}
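// firstPositional returns the first arg that does not start with "-" and the
// remaining args with it removed. It does not know which flags take values, so
// in `--sha abc ns/app` the flag's value "abc" counts as the first positional.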
func firstPositional(args []string) (string, []string) {
for i, a := range args {
if !strings.HasPrefix(a, "-") {
rest := append(append([]string{}, args[:i]...), args[i+1:]...)
return a, rest
}
}
return "", args
}
// resolveTfStack finds the infra root (from cwd) and the stack directory named
// by the first positional arg, returning the remaining args.
func resolveTfStack(args []string) (infraRoot, stackName, stackDir string, rest []string, err error) {
stackName, rest = firstPositional(args)
if stackName == "" {
err = fmt.Errorf("missing <stack> argument")
return
}
cwd, e := os.Getwd()
if e != nil {
err = e
return
}
infraRoot, err = findInfraRoot(cwd)
if err != nil {
return
}
stackDir, err = resolveStack(infraRoot, stackName)
return
}
func tgPath(infraRoot string) string { return filepath.Join(infraRoot, "scripts", "tg") }
// tfPassthrough runs `scripts/tg <verb> [extra]` in the stack directory.
func tfPassthrough(verb string) func([]string) error {
return func(args []string) error {
infraRoot, _, stackDir, rest, err := resolveTfStack(args)
if err != nil {
return err
}
return runStreamingIn(stackDir, tgPath(infraRoot), append([]string{verb}, rest...)...)
}
}
func tfFmt(args []string) error {
_, _, stackDir, _, err := resolveTfStack(args)
if err != nil {
return err
}
return runStreamingIn(stackDir, "terraform", "fmt", "-recursive", ".")
}
func tfForceUnlock(args []string) error {
infraRoot, _, stackDir, rest, err := resolveTfStack(args)
if err != nil {
return err
}
if len(rest) < 1 {
return fmt.Errorf("usage: homelab tf force-unlock <stack> <lock-id>")
}
return runStreamingIn(stackDir, tgPath(infraRoot), "force-unlock", "-force", rest[0])
}
// tfApply applies a stack out-of-band: claim the stack on the presence board,
// ALWAYS release on exit (normal, error, or signal — fixing the claim leak),
// and warn that CI applies canonically on push.
func tfApply(args []string) error {
infraRoot, stackName, stackDir, _, err := resolveTfStack(args)
if err != nil {
return err
}
label := "stack:" + stackName
fmt.Fprintf(os.Stderr,
"homelab: out-of-band apply of %q — CI applies canonically on push to master.\n", stackName)
if err := presenceClaim(label, "homelab tf apply "+stackName); err != nil {
return fmt.Errorf("presence claim failed (run `vault login -method=oidc`?): %w", err)
}
// Release exactly once, whether we exit normally, on error, or on signal —
// sync.Once makes the defer and the signal goroutine safe to both call it.
var once sync.Once
release := func() { once.Do(func() { _ = presenceRelease(label) }) }
defer release()
sig := make(chan os.Signal, 1)
signal.Notify(sig, os.Interrupt, syscall.SIGTERM)
go func() {
<-sig
release()
os.Exit(130)
}()
return runStreamingIn(stackDir, tgPath(infraRoot), "apply", "--non-interactive")
}
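
The release-exactly-once idea in `tfApply` can be shown in isolation. This is a minimal sketch, not the CLI's code: `releaser`, `runGuarded`, and `errBoom` are hypothetical names, and the explicit `Release()` on the error path stands in for what the signal goroutine does before it exits.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errBoom is a made-up error used only to drive the failure path.
var errBoom = errors.New("boom")

// releaser wraps a cleanup function so it runs at most once, no matter how
// many code paths (defer, signal handler, error branch) call Release.
type releaser struct {
	once sync.Once
	fn   func()
}

func (r *releaser) Release() { r.once.Do(r.fn) }

// runGuarded models "claim, do work, always release": the deferred call and
// the explicit early call both go through the same sync.Once.
func runGuarded(work func() error, release func()) error {
	r := &releaser{fn: release}
	defer r.Release()
	if err := work(); err != nil {
		r.Release() // second caller; the deferred Release becomes a no-op
		return err
	}
	return nil
}

func main() {
	count := 0
	err := runGuarded(func() error { return errBoom }, func() { count++ })
	fmt.Println("err:", err, "releases:", count)
}
```

Even though `Release` is reached twice on the failure path, the counter ends at 1, which is the property the defer-plus-signal arrangement relies on.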

27
cli/cmd_tf_test.go Normal file

@ -0,0 +1,27 @@
package main
import (
"reflect"
"testing"
)
func TestFirstPositional(t *testing.T) {
cases := []struct {
args []string
wantName string
wantRest []string
}{
{[]string{"vault"}, "vault", []string{}},
{[]string{"--json", "vault"}, "vault", []string{"--json"}},
{[]string{"vault", "abc-123"}, "vault", []string{"abc-123"}},
{[]string{"--foo", "monitoring", "extra"}, "monitoring", []string{"--foo", "extra"}},
{[]string{"--only-flags"}, "", []string{"--only-flags"}},
}
for _, c := range cases {
gotName, gotRest := firstPositional(c.args)
if gotName != c.wantName || !reflect.DeepEqual(gotRest, c.wantRest) {
t.Errorf("firstPositional(%v) = (%q, %v), want (%q, %v)",
c.args, gotName, gotRest, c.wantName, c.wantRest)
}
}
}

77
cli/cmd_usage.go Normal file

@ -0,0 +1,77 @@
package main
import (
"encoding/json"
"fmt"
"net/url"
"sort"
"strconv"
)
func usageCommands() []Command {
return []Command{
{Path: []string{"usage", "top"}, Tier: TierRead,
Summary: "rank homelab verb usage across users (from Loki): usage top [--since 30d] [--user U] [--json]", Run: usageTop},
}
}
// usageQuery builds the LogQL metric query that counts invocations per verb.
func usageQuery(since, user string) string {
sel := `job="` + usageJob + `"`
if user != "" {
// Quote the value so a stray `"` in --user cannot break out of the label matcher.
sel += `, user=` + strconv.Quote(user)
}
return fmt.Sprintf(`sum by (verb) (count_over_time({%s}[%s]))`, sel, since)
}
func usageTop(args []string) error {
since := flagValue(args, "--since")
if since == "" {
since = "30d"
}
v := url.Values{}
v.Set("query", usageQuery(since, flagValue(args, "--user")))
body, err := lbGetBody(lokiHost, "/loki/api/v1/query", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Metric map[string]string `json:"metric"`
Value []interface{} `json:"value"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
type row struct {
verb string
n int
}
var rows []row
for _, s := range r.Data.Result {
n := 0
if len(s.Value) == 2 {
if f, e := strconv.ParseFloat(fmt.Sprint(s.Value[1]), 64); e == nil {
n = int(f)
}
}
rows = append(rows, row{s.Metric["verb"], n})
}
if len(rows) == 0 {
fmt.Println("(no usage recorded yet)")
return nil
}
sort.Slice(rows, func(i, j int) bool { return rows[i].n > rows[j].n })
for _, r := range rows {
fmt.Printf("%6d %s\n", r.n, r.verb)
}
return nil
}
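
The response handling in `usageTop` can be exercised offline. The sketch below assumes the Loki instant-query vector shape (`data.result[].metric` plus a two-element `value` of timestamp and string sample); `parseVerbCounts`, `verbCount`, and `sampleBody` are hypothetical names, and the payload is invented sample data. Unlike `usageTop`, it skips malformed samples rather than counting them as 0.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strconv"
)

// verbCount is one ranked row: a verb and its invocation count.
type verbCount struct {
	Verb string
	N    int
}

// sampleBody is invented data in the assumed vector-response shape.
const sampleBody = `{"status":"success","data":{"resultType":"vector","result":[
 {"metric":{"verb":"tf plan"},"value":[1750000000,"3"]},
 {"metric":{"verb":"vault get"},"value":[1750000000,"12"]},
 {"metric":{"verb":"broken"},"value":[1750000000]}
]}}`

// parseVerbCounts decodes the response and returns rows sorted by count,
// highest first. Samples without a parseable string value are dropped.
func parseVerbCounts(body []byte) ([]verbCount, error) {
	var r struct {
		Data struct {
			Result []struct {
				Metric map[string]string `json:"metric"`
				Value  []interface{}     `json:"value"`
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, err
	}
	out := make([]verbCount, 0, len(r.Data.Result))
	for _, s := range r.Data.Result {
		if len(s.Value) != 2 {
			continue
		}
		str, ok := s.Value[1].(string)
		if !ok {
			continue
		}
		f, err := strconv.ParseFloat(str, 64)
		if err != nil {
			continue
		}
		out = append(out, verbCount{Verb: s.Metric["verb"], N: int(f)})
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].N > out[j].N })
	return out, nil
}

func main() {
	rows, err := parseVerbCounts([]byte(sampleBody))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, r := range rows {
		fmt.Printf("%6d %s\n", r.N, r.Verb)
	}
}
```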

663
cli/cmd_vault.go Normal file

@ -0,0 +1,663 @@
package main
import (
"bufio"
"encoding/base64"
"encoding/json"
"fmt"
"os"
"os/exec"
"strings"
"syscall"
)
// vault verbs give each unix user no-HITL access to THEIR OWN Vaultwarden vault.
// Identity is the kernel UID; per-user creds live in that user's isolated Vault
// path (secret/workstation/claude-users/<user>) read via their scoped token, and
// decryption is done by the official `bw` CLI. See
// docs/superpowers/specs/2026-06-24-homelab-vault-design.md.
func vaultCommands() []Command {
return []Command{
{Path: []string{"vault", "setup"}, Tier: TierWrite,
Summary: "one-time: store your Vaultwarden master password + API key in your Vault path", Run: vaultSetup},
{Path: []string{"vault", "status"}, Tier: TierRead,
Summary: "show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
{Path: []string{"vault", "list"}, Tier: TierRead,
Summary: "list your item names: vault list [--search Q]", Run: vaultList},
{Path: []string{"vault", "get"}, Tier: TierRead,
Summary: "fetch one item: vault get <name> [--field password|username|uri|notes|totp] [--json]", Run: vaultGet},
{Path: []string{"vault", "search"}, Tier: TierRead,
Summary: "search your item names: vault search <query>", Run: vaultSearch},
{Path: []string{"vault", "code"}, Tier: TierRead,
Summary: "current TOTP code for an item: vault code <name>", Run: vaultCode},
{Path: []string{"vault", "lock"}, Tier: TierWrite,
Summary: "lock/log out the local bw session", Run: vaultLock},
{Path: []string{"vault"}, Tier: TierRead,
Summary: "Vaultwarden access for your own vault (run `homelab vault` for help)",
Run: func([]string) error { fmt.Print(vaultHelp()); return nil }},
}
}
// vaultHelp is shown for bare `homelab vault`.
func vaultHelp() string {
return `homelab vault: read YOUR OWN Vaultwarden logins (no-HITL after one-time setup)
homelab vault setup  one-time: store your master password + API key in your Vault path
homelab vault status  configured / unlocked / reachable (no secrets)
homelab vault list [--search Q]  list your item names (no secrets)
homelab vault search <query>  search your item names (no secrets)
homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
TTY: clipboard (auto-clears); piped: stdout
homelab vault code <name>  current TOTP code
homelab vault lock  lock / log out the local bw session
Creds live only in your own Vault path; the admin never sees them. Identity is
your unix UID. Security model: docs/superpowers/specs/2026-06-24-homelab-vault-design.md
(note: anything running as your user can decrypt your vault; this is the accepted no-HITL trade).
`
}
const vwUserPathPrefix = "secret/workstation/claude-users/"
// vwCreds is one user's Vaultwarden auth material, read from their Vault path.
type vwCreds struct {
Email string
MasterPassword string
ClientID string
ClientSecret string
}
// cmdRunner shells out to an external command with an explicit environment and
// returns trimmed stdout. Secrets are passed via envv, NEVER argv. Tests inject
// a fake; realRunner is the production implementation.
type cmdRunner func(name string, argv, envv []string) (string, error)
func realRunner(name string, argv, envv []string) (string, error) {
cmd := exec.Command(name, argv...)
if envv != nil {
cmd.Env = envv
}
out, err := cmd.Output()
// Trim only the trailing newline the tool appends — NOT all whitespace, so a
// fetched secret with significant leading/trailing spaces is preserved.
return strings.TrimRight(string(out), "\r\n"), err
}
// realRunnerStdin runs a command feeding `stdin` to it, for secret values that
// must NOT appear in argv (visible via ps / /proc/<pid>/cmdline to same-UID
// processes). Used by setup to write the master password / client_secret.
func realRunnerStdin(name string, argv, envv []string, stdin string) (string, error) {
cmd := exec.Command(name, argv...)
if envv != nil {
cmd.Env = envv
}
cmd.Stdin = strings.NewReader(stdin)
out, err := cmd.Output()
return strings.TrimRight(string(out), "\r\n"), err
}
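
A small sketch of the stdin path described above, assuming a POSIX `cat` is on `PATH` to act as the child process; `runWithStdin` is a hypothetical stand-in for `realRunnerStdin` without the environment argument. It shows that the value travels on stdin (so it is absent from the child's argv) and that only the trailing newline is trimmed.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// runWithStdin runs a command, feeds it stdin, and returns stdout with only
// the trailing line terminator removed, so significant spaces survive.
func runWithStdin(name string, argv []string, stdin string) (string, error) {
	cmd := exec.Command(name, argv...)
	cmd.Stdin = strings.NewReader(stdin)
	out, err := cmd.Output()
	return strings.TrimRight(string(out), "\r\n"), err
}

func main() {
	// `cat` simply echoes stdin back; the secret never appears in its argv.
	out, err := runWithStdin("cat", nil, "  spaced secret  \n")
	fmt.Printf("%q %v\n", out, err)
}
```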
func vwCredsPath(user string) string { return vwUserPathPrefix + user }
func bwAppDataDir(uid string) string { return "/run/user/" + uid + "/homelab-bw" }
// readVaultField returns one field from a KV-v2 path, "" if absent/error.
func readVaultField(run cmdRunner, field, path string) string {
out, err := run("vault", []string{"kv", "get", "-field=" + field, path}, nil)
if err != nil {
return ""
}
return out
}
// loadCreds reads the four vaultwarden_* keys from the user's isolated path.
// A missing master password means the user hasn't onboarded.
func loadCreds(run cmdRunner, user string) (vwCreds, error) {
p := vwCredsPath(user)
c := vwCreds{
Email: readVaultField(run, "vaultwarden_email", p),
MasterPassword: readVaultField(run, "vaultwarden_master_password", p),
ClientID: readVaultField(run, "vaultwarden_client_id", p),
ClientSecret: readVaultField(run, "vaultwarden_client_secret", p),
}
if c.MasterPassword == "" {
return vwCreds{}, fmt.Errorf("vault not configured for this user — run `homelab vault setup`")
}
return c, nil
}
// vaultCurrentUser/vaultCurrentUID are seams for tests (avoid conflict with repo.go's currentUser func).
var vaultCurrentUser = func() string { return os.Getenv("USER") }
var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) }
// bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately
// do NOT inherit the full parent env (keeps stray secrets out of the child).
func bwBaseEnv(appdata string) []string {
path := os.Getenv("PATH")
if path == "" {
path = "/usr/local/bin:/usr/bin:/bin"
}
return []string{
"PATH=" + path,
"HOME=" + os.Getenv("HOME"),
"BITWARDENCLI_APPDATA_DIR=" + appdata,
"BW_NOINTERACTION=true",
}
}
// bwSecretEnv adds the secret-bearing vars. session may be "" (pre-unlock).
func bwSecretEnv(appdata string, c vwCreds, session string) []string {
env := bwBaseEnv(appdata)
env = append(env,
"BW_CLIENTID="+c.ClientID,
"BW_CLIENTSECRET="+c.ClientSecret,
"BW_PASSWORD="+c.MasterPassword,
)
if session != "" {
env = append(env, "BW_SESSION="+session)
}
return env
}
func bwLoginArgs() []string { return []string{"login", "--apikey"} }
func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} }
func bwGetArgs(field, name string) []string { return []string{"get", field, name} }
func bwStatusArgs() []string { return []string{"status"} }
// bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is
// required. Unparseable/empty output → true (safer to attempt login).
func bwNeedsLogin(statusJSON string) bool {
var s struct {
Status string `json:"status"`
}
if err := json.Unmarshal([]byte(statusJSON), &s); err != nil {
return true
}
return s.Status == "unauthenticated" || s.Status == ""
}
func bwListArgs(search string) []string {
a := []string{"list", "items"}
if search != "" {
a = append(a, "--search", search)
}
return a
}
// bwUnlock runs `bw unlock` and returns the raw session key.
func bwUnlock(run cmdRunner, env []string) (string, error) {
out, err := run("bw", bwUnlockArgs(), env)
if err != nil {
return "", fmt.Errorf("bw unlock failed (wrong master password? run `homelab vault setup`): %w", err)
}
return out, nil
}
// bwGet fetches one field of one item; session must be present in env.
func bwGet(run cmdRunner, env []string, field, name string) (string, error) {
return run("bw", bwGetArgs(field, name), env)
}
func returnMode(isTTY bool) string {
if isTTY {
return "clipboard"
}
return "stdout"
}
// stdoutIsTTY reports whether stdout is a character device (a terminal).
func stdoutIsTTY() bool {
fi, err := os.Stdout.Stat()
if err != nil {
return false
}
return fi.Mode()&os.ModeCharDevice != 0
}
// stderrIsTTY reports whether stderr is a terminal (the OSC52 escape is written
// to stderr, so the clipboard path is only viable when stderr is a terminal).
func stderrIsTTY() bool {
fi, err := os.Stderr.Stat()
if err != nil {
return false
}
return fi.Mode()&os.ModeCharDevice != 0
}
// osc52 returns the OSC 52 escape that makes the local terminal copy payload to
// the system clipboard (works over SSH; no X11). osc52clear copies empty.
func osc52(payload string) string {
return "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte(payload)) + "\a"
}
func osc52clear() string { return "\x1b]52;c;\a" }
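
To make the escape format concrete, here is a hedged round-trip sketch. `osc52Copy` mirrors the encoding above; `decodeOSC52` is a hypothetical helper that approximates what a receiving terminal does and exists only for illustration.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

const osc52Prefix = "\x1b]52;c;"

// osc52Copy builds ESC ] 52 ; c ; <base64 payload> BEL.
func osc52Copy(payload string) string {
	return osc52Prefix + base64.StdEncoding.EncodeToString([]byte(payload)) + "\a"
}

// decodeOSC52 reverses osc52Copy. It reports false when the framing or the
// base64 body is malformed. An empty body decodes to "", i.e. a clear.
func decodeOSC52(seq string) (string, bool) {
	if !strings.HasPrefix(seq, osc52Prefix) || !strings.HasSuffix(seq, "\a") {
		return "", false
	}
	body := strings.TrimSuffix(strings.TrimPrefix(seq, osc52Prefix), "\a")
	raw, err := base64.StdEncoding.DecodeString(body)
	if err != nil {
		return "", false
	}
	return string(raw), true
}

func main() {
	seq := osc52Copy("hi")
	got, ok := decodeOSC52(seq)
	fmt.Printf("%q -> %q (ok=%v)\n", seq, got, ok)
}
```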
// terminalAllowed gates OSC 52: only terminals known to honor clipboard writes,
// else we'd dump the secret's base64 into scrollback on unsupported terminals.
func terminalAllowed(term, termProgram string) bool {
t := strings.ToLower(term)
p := strings.ToLower(termProgram)
for _, ok := range []string{"kitty", "alacritty", "foot", "wezterm", "ghostty", "tmux", "screen"} {
if strings.Contains(t, ok) || strings.Contains(p, ok) {
return true
}
}
// A bare xterm-* TERM is not enough on its own: it passes only when
// TERM_PROGRAM named a known-good emulator in the loop above.
return false
}
// opRecord is one CLI operation. ItemName is accepted for the caller's
// convenience but is INTENTIONALLY never rendered into the log line — auditing
// which of your own logins you opened is itself sensitive, and per-item reads
// are invisible server-side anyway (spec §9a).
type opRecord struct {
User string
Verb string
PID int
PPID int
ParentComm string
ItemName string // never logged
}
func opLogLine(r opRecord) string {
return fmt.Sprintf("user=%s verb=%s pid=%d ppid=%d parent=%s",
r.User, r.Verb, r.PID, r.PPID, r.ParentComm)
}
// parentComm reads /proc/<ppid>/comm (best-effort; "" on failure).
func parentComm(ppid int) string {
b, err := os.ReadFile(fmt.Sprintf("/proc/%d/comm", ppid))
if err != nil {
return ""
}
return strings.TrimSpace(string(b))
}
// writeOpLog appends one privacy-aware line to the user's op-log (best-effort;
// never blocks or fails the command). Goes to syslog so it ships to Loki.
func writeOpLog(r opRecord) {
exec.Command("logger", "-t", "homelab-vault", opLogLine(r)).Run() // best-effort
}
func vaultLockPath(uid string) string { return "/run/user/" + uid + "/homelab-vault.lock" }
// hardenProcess disables core dumps so a bw/homelab crash can't spill the master
// password to a core file. Best-effort.
func hardenProcess() {
_ = syscall.Setrlimit(syscall.RLIMIT_CORE, &syscall.Rlimit{Cur: 0, Max: 0})
}
// withUserLock serializes bw mutations for this user (concurrent Claude sessions
// as the same user otherwise race bw's appdata). Returns an unlock func.
func withUserLock(uid string) (func(), error) {
f, err := os.OpenFile(vaultLockPath(uid), os.O_CREATE|os.O_RDWR, 0600)
if err != nil {
return nil, err
}
if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
f.Close()
return nil, err
}
return func() { syscall.Flock(int(f.Fd()), syscall.LOCK_UN); f.Close() }, nil
}
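
The locking behavior can be observed with a second open of the same file. This sketch assumes a Linux or other Unix host where `syscall.Flock` is available; `lockFile`, `tryLock`, and `demoLockPath` are hypothetical names, and the non-blocking probe exists only to make the lock state visible.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// demoLockPath returns a lock path inside a fresh temp directory.
func demoLockPath() string {
	dir, err := os.MkdirTemp("", "lockdemo")
	if err != nil {
		panic(err)
	}
	return filepath.Join(dir, "demo.lock")
}

// lockFile takes an exclusive advisory lock and returns the release func.
func lockFile(path string) (func(), error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		f.Close()
		return nil, err
	}
	return func() {
		_ = syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
		_ = f.Close()
	}, nil
}

// tryLock probes the lock without blocking, via a separate open file
// description. It reports true if the lock was free at that moment.
func tryLock(path string) (bool, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return false, err
	}
	defer f.Close()
	err = syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB)
	if errors.Is(err, syscall.EWOULDBLOCK) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	_ = syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	return true, nil
}

func main() {
	p := demoLockPath()
	unlock, err := lockFile(p)
	if err != nil {
		fmt.Println("lock failed:", err)
		return
	}
	free, _ := tryLock(p)
	fmt.Println("free while held:", free)
	unlock()
	free, _ = tryLock(p)
	fmt.Println("free after release:", free)
}
```

Because flock locks belong to the open file description, the probe conflicts with the holder even inside one process, which is the same mechanism that serializes two concurrent CLI invocations.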
// session is one usable bw context: the env (with BW_SESSION) ready for `bw get`.
type session struct {
env []string
}
// openSession resolves creds, ensures login, unlocks, and returns a ready env.
// Caller must hold the user lock. appdata is created on tmpfs (0700).
func openSession(run cmdRunner, user, uid string) (session, error) {
creds, err := loadCreds(run, user)
if err != nil {
return session{}, err
}
appdata := bwAppDataDir(uid)
if err := os.MkdirAll(appdata, 0700); err != nil {
return session{}, fmt.Errorf("create bw appdata %s: %w", appdata, err)
}
loginEnv := bwSecretEnv(appdata, creds, "")
// Ensure server is set and we're logged in (idempotent; ignore "already").
_, _ = run("bw", []string{"config", "server", "https://vaultwarden.viktorbarzin.me"}, loginEnv)
st, _ := run("bw", bwStatusArgs(), loginEnv)
if bwNeedsLogin(st) {
if _, err := run("bw", bwLoginArgs(), loginEnv); err != nil {
return session{}, fmt.Errorf("bw login --apikey failed (API key valid? run `homelab vault setup`): %w", err)
}
}
sess, err := bwUnlock(run, loginEnv)
if err != nil {
return session{}, err
}
return session{env: bwSecretEnv(appdata, creds, sess)}, nil
}
type getOpts struct {
name string
field string
json bool
}
var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true}
func parseGetArgs(args []string) (getOpts, error) {
o := getOpts{field: "password"}
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--json":
o.json = true
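case a == "--field" && i+1 < len(args):
o.field = args[i+1]
i++
case a == "--field":
// Reject a trailing --field instead of silently defaulting to password.
return o, fmt.Errorf("--field needs a value (password|username|uri|notes|totp)")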
case strings.HasPrefix(a, "--field="):
o.field = strings.TrimPrefix(a, "--field=")
case !strings.HasPrefix(a, "-") && o.name == "":
o.name = a
}
}
if o.name == "" {
return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json]")
}
if !validGetFields[o.field] {
return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field)
}
return o, nil
}
// getValue opens a session and fetches one field. Its only side effects are the
// runner calls and openSession creating the appdata directory, so it can be
// unit-tested with a fake runner.
func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) {
s, err := openSession(run, user, uid)
if err != nil {
return "", err
}
return bwGet(run, s.env, o.field, o.name)
}
// clipboardDecision picks how to return a secret value. "stdout" prints it (a
// pipe/agent — the intended machine path); "clipboard" copies via OSC52;
// "refuse" emits nothing sensitive (would otherwise risk dumping the secret's
// base64 into scrollback, or silently fail because the OSC52 escape goes to a
// non-terminal stderr).
func clipboardDecision(stdoutTTY, stderrTTY bool, term, termProgram string) string {
if !stdoutTTY {
return "stdout"
}
if terminalAllowed(term, termProgram) && stderrTTY {
return "clipboard"
}
return "refuse"
}
// jsonToStdoutOK reports whether `--json` may print the secret to stdout — only
// when stdout is NOT a terminal (i.e. piped to a machine consumer).
func jsonToStdoutOK(stdoutTTY bool) bool { return !stdoutTTY }
// emitSecret returns a value TTY-aware (see clipboardDecision). Never prints the
// secret to a terminal's stdout/scrollback.
func emitSecret(value string) {
switch clipboardDecision(stdoutIsTTY(), stderrIsTTY(), os.Getenv("TERM"), os.Getenv("TERM_PROGRAM")) {
case "stdout":
fmt.Println(value)
case "clipboard":
fmt.Fprint(os.Stderr, osc52(value))
fmt.Fprintln(os.Stderr, "copied to clipboard; clearing in 30s")
clearClipboardAfter(30)
default: // refuse
fmt.Fprintln(os.Stderr, "refusing to print secret: this terminal can't do OSC52 clipboard safely; pipe the command (e.g. | cat) or use a supported terminal")
}
}
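// clearClipboardAfter spawns a detached background clear so the secret doesn't
// linger in the clipboard. Best-effort.
func clearClipboardAfter(seconds int) {
cmd := exec.Command("sh", "-c", fmt.Sprintf("sleep %d; printf '%s'", seconds, osc52clear()))
// A nil Stdout means /dev/null, so the clear escape would never reach the
// terminal; send it to stderr, where the copy escape was written.
cmd.Stdout = os.Stderr
_ = cmd.Start()
}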
// listNames extracts "name (id)" from `bw list items` JSON; never values.
func listNames(jsonOut string) []string {
var items []struct {
ID string `json:"id"`
Name string `json:"name"`
}
if err := json.Unmarshal([]byte(jsonOut), &items); err != nil {
return nil
}
out := make([]string, 0, len(items))
for _, it := range items {
out = append(out, fmt.Sprintf("%s (%s)", it.Name, it.ID))
}
return out
}
func runList(run cmdRunner, user, uid, search string) ([]string, error) {
s, err := openSession(run, user, uid)
if err != nil {
return nil, err
}
out, err := run("bw", bwListArgs(search), s.env)
if err != nil {
return nil, err
}
return listNames(out), nil
}
func vaultList(args []string) error {
hardenProcess()
search := ""
for i := 0; i < len(args); i++ {
if args[i] == "--search" && i+1 < len(args) {
search = args[i+1]
i++
} else if strings.HasPrefix(args[i], "--search=") {
search = strings.TrimPrefix(args[i], "--search=")
}
}
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
names, err := runList(realRunner, vaultCurrentUser(), uid, search)
if err != nil {
return err
}
for _, n := range names {
fmt.Println(n)
}
return nil
}
func vaultSearch(args []string) error {
if len(args) == 0 {
return fmt.Errorf("usage: homelab vault search <query>")
}
return vaultList([]string{"--search", strings.Join(args, " ")})
}
func vaultCode(args []string) error {
hardenProcess()
if len(args) == 0 {
return fmt.Errorf("usage: homelab vault code <name>")
}
name := args[0]
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
user := vaultCurrentUser()
val, err := getValue(realRunner, user, uid, getOpts{name: name, field: "totp"})
if err != nil {
return err
}
// TOTP is the most sensitive op: log AND emit an ntfy-bound marker (spec §9a-d).
writeOpLog(opRecord{User: user, Verb: "code", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
exec.Command("logger", "-t", "homelab-vault-totp", "user="+user+" totp-fetch parent="+parentComm(os.Getppid())).Run()
emitSecret(val)
return nil
}
// statusSummary reports config/reachability without revealing secrets.
func statusSummary(run cmdRunner, user, uid string) string {
if _, err := loadCreds(run, user); err != nil {
return "vault: not configured — run `homelab vault setup`"
}
s, err := openSession(run, user, uid)
if err != nil {
return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error()
}
if _, err := run("bw", []string{"sync"}, s.env); err != nil {
return "vault: configured + unlocked, but sync/reachability failed: " + err.Error()
}
return "vault: configured, unlocked, reachable ✓"
}
func vaultStatus(args []string) error {
hardenProcess()
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
fmt.Println(statusSummary(realRunner, vaultCurrentUser(), uid))
return nil
}
func vaultLock(args []string) error {
uid := vaultCurrentUID()
unlock, err := withUserLock(uid) // logout mutates bw state — serialize with get/list
if err != nil {
return err
}
defer unlock()
appdata := bwAppDataDir(uid)
_, _ = realRunner("bw", []string{"lock"}, bwBaseEnv(appdata))
_, logoutErr := realRunner("bw", []string{"logout"}, bwBaseEnv(appdata))
if logoutErr == nil {
fmt.Println("locked")
}
return nil // lock/logout best-effort; never error the caller
}
// vaultPatchPublicArgs writes the non-secret identifiers via argv. Neither the
// email nor the API client_id is a usable credential on its own.
func vaultPatchPublicArgs(user, email, clientID string) []string {
return []string{"kv", "patch", vwCredsPath(user),
"vaultwarden_email=" + email,
"vaultwarden_client_id=" + clientID,
}
}
// vaultPatchSecretArgs writes ONE secret value via the `key=-` stdin form, so
// the value never appears in argv (ps / /proc/<pid>/cmdline). The value is fed
// on stdin by realRunnerStdin.
func vaultPatchSecretArgs(user, key string) []string {
return []string{"kv", "patch", vwCredsPath(user), key + "=-"}
}
// writeCreds stores all four fields in the user's Vault path. The two real
// secrets (master password, API client_secret) go via stdin — never argv.
func writeCreds(user string, c vwCreds) error {
if _, err := realRunner("vault", vaultPatchPublicArgs(user, c.Email, c.ClientID), nil); err != nil {
return err
}
if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
return err
}
if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
return err
}
return nil
}
// promptNoEcho reads one line without terminal echo (for the master password).
func promptNoEcho(prompt string) (string, error) {
fmt.Fprint(os.Stderr, prompt)
// stty acts on its own stdin. exec's default (nil) Stdin is /dev/null, so the
// terminal must be passed explicitly or echo is never actually disabled.
stty := func(arg string) {
c := exec.Command("stty", arg)
c.Stdin = os.Stdin
_ = c.Run()
}
stty("-echo")
defer func() { stty("echo"); fmt.Fprintln(os.Stderr) }()
r := bufio.NewReader(os.Stdin)
line, err := r.ReadString('\n')
// Trim only the line terminator — a master password / API secret may
// legitimately contain leading/trailing spaces.
return strings.TrimRight(line, "\r\n"), err
}
func promptLine(prompt string) (string, error) {
fmt.Fprint(os.Stderr, prompt)
line, err := bufio.NewReader(os.Stdin).ReadString('\n')
return strings.TrimSpace(line), err
}
func vaultSetup(args []string) error {
hardenProcess()
fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.")
fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.")
email, err := promptLine("Vaultwarden email: ")
if err != nil {
return err
}
clientID, err := promptLine("API key client_id (user.xxxx): ")
if err != nil {
return err
}
clientSecret, err := promptNoEcho("API key client_secret: ")
if err != nil {
return err
}
master, err := promptNoEcho("Master password: ")
if err != nil {
return err
}
if email == "" || master == "" || clientID == "" || clientSecret == "" {
return fmt.Errorf("all fields are required")
}
c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret}
if err := writeCreds(vaultCurrentUser(), c); err != nil {
return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err)
}
fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…")
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
if _, err := openSession(realRunner, vaultCurrentUser(), uid); err != nil {
return fmt.Errorf("stored, but verification failed — double-check master password / API key: %w", err)
}
fmt.Fprintln(os.Stderr, "✓ Verified. Fetches are now AFK.")
return nil
}
func vaultGet(args []string) error {
hardenProcess()
o, err := parseGetArgs(args)
if err != nil {
return err
}
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
user := vaultCurrentUser()
val, err := getValue(realRunner, user, uid, o)
if err != nil {
return err
}
writeOpLog(opRecord{User: user, Verb: "get", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: o.name})
if o.json {
if !jsonToStdoutOK(stdoutIsTTY()) {
return fmt.Errorf("refusing to print a secret as JSON to a terminal; pipe it (e.g. | cat) or drop --json")
}
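// Marshal via encoding/json: %q emits Go escapes (e.g. \x1b) that are not valid JSON.
b, err := json.Marshal(map[string]string{o.field: val})
if err != nil {
return err
}
fmt.Println(string(b))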
return nil
}
emitSecret(val)
return nil
}
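
For the `--json` output, a sketch of emitting a single-field object with `encoding/json` and reading it back, so that quotes and control characters in a secret survive the trip. `fieldJSON` and `decodeField` are hypothetical helpers, not functions from this change.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fieldJSON renders {"<field>": "<value>"} with JSON-correct escaping.
func fieldJSON(field, value string) (string, error) {
	b, err := json.Marshal(map[string]string{field: value})
	return string(b), err
}

// decodeField parses such an object and returns the named field.
func decodeField(doc, field string) (string, error) {
	var m map[string]string
	if err := json.Unmarshal([]byte(doc), &m); err != nil {
		return "", err
	}
	return m[field], nil
}

func main() {
	// The value contains a quote and an ESC byte, both of which need escaping.
	doc, err := fieldJSON("password", "p\"w\x1b d")
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	back, err := decodeField(doc, "password")
	fmt.Printf("%s -> %q (err=%v)\n", doc, back, err)
}
```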

368
cli/cmd_vault_test.go Normal file

@ -0,0 +1,368 @@
package main
import (
"encoding/base64"
"fmt"
"os"
"reflect"
"strings"
"testing"
)
func TestVaultCommandsRegistered(t *testing.T) {
want := map[string]Tier{
"vault setup": TierWrite,
"vault status": TierRead,
"vault list": TierRead,
"vault get": TierRead,
"vault search": TierRead,
"vault code": TierRead,
"vault lock": TierWrite,
}
got := map[string]Tier{}
for _, c := range vaultCommands() {
got[c.name()] = c.Tier
}
for name, tier := range want {
if got[name] != tier {
t.Errorf("command %q: tier=%q, want %q (registered=%v)", name, got[name], tier, got[name] != "")
}
}
}
func TestVaultGroupInRegistry(t *testing.T) {
if !isCommandGroup(buildRegistry(), "vault") {
t.Fatal("`vault` group not wired into buildRegistry()")
}
}
func TestVaultCredsPath(t *testing.T) {
if got := vwCredsPath("emo"); got != "secret/workstation/claude-users/emo" {
t.Fatalf("vwCredsPath = %q", got)
}
}
func TestBwAppDataDir(t *testing.T) {
if got := bwAppDataDir("1001"); got != "/run/user/1001/homelab-bw" {
t.Fatalf("bwAppDataDir = %q", got)
}
}
// fakeRunner records calls and returns canned stdout/err, keyed by a prefix of
// the full command line (name plus the space-joined argv).
type fakeRunner struct {
calls [][]string
out map[string]string // key: name+" "+strings.Join(argv," ") prefix-matched
err map[string]error
lastEnv []string
}
func (f *fakeRunner) run(name string, argv, envv []string) (string, error) {
f.calls = append(f.calls, append([]string{name}, argv...))
f.lastEnv = envv
key := name + " " + strings.Join(argv, " ")
for k, v := range f.out {
if strings.HasPrefix(key, k) {
return v, f.err[k]
}
}
return "", f.err[key]
}
func TestLoadCredsReadsFourFields(t *testing.T) {
f := &fakeRunner{out: map[string]string{
"vault kv get -field=vaultwarden_email secret/workstation/claude-users/emo": "emo@x.me",
"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "hunter2",
"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.abc",
"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "sek",
}}
c, err := loadCreds(f.run, "emo")
if err != nil {
t.Fatalf("loadCreds: %v", err)
}
want := vwCreds{Email: "emo@x.me", MasterPassword: "hunter2", ClientID: "user.abc", ClientSecret: "sek"}
if !reflect.DeepEqual(c, want) {
t.Fatalf("loadCreds = %+v want %+v", c, want)
}
}
func TestLoadCredsUnconfigured(t *testing.T) {
f := &fakeRunner{out: map[string]string{}} // every field empty
if _, err := loadCreds(f.run, "emo"); err == nil || !strings.Contains(err.Error(), "not configured") {
t.Fatalf("want 'not configured' error, got %v", err)
}
}
func TestBwEnvCarriesSecretsNotArgv(t *testing.T) {
c := vwCreds{ClientID: "user.abc", ClientSecret: "sek", MasterPassword: "hunter2"}
env := bwSecretEnv("/run/user/1001/homelab-bw", c, "SESSIONKEY")
joined := strings.Join(env, "\n")
for _, want := range []string{
"BW_CLIENTID=user.abc", "BW_CLIENTSECRET=sek", "BW_PASSWORD=hunter2",
"BW_SESSION=SESSIONKEY", "BITWARDENCLI_APPDATA_DIR=/run/user/1001/homelab-bw",
} {
if !strings.Contains(joined, want) {
t.Errorf("bwSecretEnv missing %q", want)
}
}
if !strings.Contains(joined, "PATH=") {
t.Error("bwSecretEnv must keep a PATH so node/bw resolve")
}
}
func TestBwGetArgsHasNoSessionInArgv(t *testing.T) {
argv := bwGetArgs("password", "github")
for _, a := range argv {
if strings.Contains(a, "SESSION") || a == "--session" {
t.Fatalf("session must travel via env, not argv: %v", argv)
}
}
if !reflect.DeepEqual(argv, []string{"get", "password", "github"}) {
t.Fatalf("bwGetArgs = %v", argv)
}
}
func TestBwListArgs(t *testing.T) {
if got := bwListArgs(""); !reflect.DeepEqual(got, []string{"list", "items"}) {
t.Fatalf("bwListArgs('') = %v", got)
}
if got := bwListArgs("git"); !reflect.DeepEqual(got, []string{"list", "items", "--search", "git"}) {
t.Fatalf("bwListArgs('git') = %v", got)
}
}
func TestBwUnlockReturnsSession(t *testing.T) {
f := &fakeRunner{out: map[string]string{"bw unlock": "THE-SESSION-KEY"}}
env := bwSecretEnv("/run/user/1001/homelab-bw", vwCreds{MasterPassword: "pw"}, "")
sess, err := bwUnlock(f.run, env)
if err != nil || sess != "THE-SESSION-KEY" {
t.Fatalf("bwUnlock = %q, %v", sess, err)
}
// argv must use --passwordenv + --raw, never the password literal
last := f.calls[len(f.calls)-1]
if strings.Join(last, " ") != "bw unlock --passwordenv BW_PASSWORD --raw" {
t.Fatalf("unlock argv = %v", last)
}
}
func TestReturnMode(t *testing.T) {
if returnMode(true) != "clipboard" || returnMode(false) != "stdout" {
t.Fatal("returnMode wrong")
}
}
func TestOSC52Encode(t *testing.T) {
got := osc52("secret")
want := "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte("secret")) + "\a"
if got != want {
t.Fatalf("osc52 = %q want %q", got, want)
}
if osc52clear() != "\x1b]52;c;\a" {
t.Fatalf("osc52clear wrong: %q", osc52clear())
}
}
func TestTerminalAllowed(t *testing.T) {
allow := []struct{ term, prog string }{
{"xterm-kitty", ""}, {"alacritty", ""}, {"foot", ""}, {"tmux-256color", ""},
{"screen-256color", ""}, {"xterm-256color", "WezTerm"}, {"xterm-256color", "ghostty"},
}
for _, c := range allow {
if !terminalAllowed(c.term, c.prog) {
t.Errorf("terminalAllowed(%q,%q) = false, want true", c.term, c.prog)
}
}
deny := []struct{ term, prog string }{{"dumb", ""}, {"", ""}, {"vt100", ""}}
for _, c := range deny {
if terminalAllowed(c.term, c.prog) {
t.Errorf("terminalAllowed(%q,%q) = true, want false", c.term, c.prog)
}
}
}
func TestOpLogLineHasNoSecretOrItem(t *testing.T) {
line := opLogLine(opRecord{User: "emo", Verb: "get", PID: 10, PPID: 9, ParentComm: "claude", ItemName: "Chase Bank"})
for _, must := range []string{"user=emo", "verb=get", "ppid=9", "parent=claude"} {
if !strings.Contains(line, must) {
t.Errorf("op-log missing %q: %s", must, line)
}
}
for _, mustNot := range []string{"Chase", "password", "secret"} {
if strings.Contains(line, mustNot) {
t.Fatalf("op-log LEAKS %q (privacy violation): %s", mustNot, line)
}
}
}
func TestLockPath(t *testing.T) {
if got := vaultLockPath("1001"); got != "/run/user/1001/homelab-vault.lock" {
t.Fatalf("vaultLockPath = %q", got)
}
}
func TestParseGetArgs(t *testing.T) {
o, err := parseGetArgs([]string{"github", "--field", "username", "--json"})
if err != nil || o.name != "github" || o.field != "username" || !o.json {
t.Fatalf("parseGetArgs = %+v err=%v", o, err)
}
d, _ := parseGetArgs([]string{"github"})
if d.field != "password" || d.json {
t.Fatalf("defaults wrong: %+v", d)
}
if _, err := parseGetArgs([]string{}); err == nil {
t.Fatal("get with no name must error")
}
if _, err := parseGetArgs([]string{"x", "--field", "evil"}); err == nil {
t.Fatal("invalid --field must error")
}
}
func TestListNamesParsing(t *testing.T) {
// bw list items returns JSON; listNames extracts name + id only.
js := `[{"id":"1","name":"GitHub","login":{"username":"u"}},{"id":"2","name":"AWS"}]`
names := listNames(js)
if len(names) != 2 || names[0] != "GitHub (1)" || names[1] != "AWS (2)" {
t.Fatalf("listNames = %v", names)
}
}
func TestStatusSummaryUnconfigured(t *testing.T) {
f := &fakeRunner{out: map[string]string{}} // no creds
s := statusSummary(f.run, "emo", "1001")
if !strings.Contains(s, "not configured") {
t.Fatalf("status = %q", s)
}
}
func TestVaultPatchPublicArgs(t *testing.T) {
got := vaultPatchPublicArgs("emo", "e@x.me", "user.ci")
want := []string{"kv", "patch", "secret/workstation/claude-users/emo",
"vaultwarden_email=e@x.me", "vaultwarden_client_id=user.ci"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("vaultPatchPublicArgs = %v", got)
}
for _, a := range got {
if strings.Contains(a, "master_password") || strings.Contains(a, "client_secret") {
t.Fatalf("secret key leaked into public argv: %v", got)
}
}
}
func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) {
for _, key := range []string{"vaultwarden_master_password", "vaultwarden_client_secret"} {
got := vaultPatchSecretArgs("emo", key)
want := []string{"kv", "patch", "secret/workstation/claude-users/emo", key + "=-"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("vaultPatchSecretArgs(%q) = %v", key, got)
}
if got[len(got)-1] != key+"=-" {
t.Fatalf("secret value must be read from stdin (`%s=-`), got %v", key, got)
}
}
}
// TestNoSecretInArgvAcrossFlow is the load-bearing security test: across the
// whole get flow (vault reads, bw config/status/login/unlock/get) NO secret
// value may appear in any command's argv — secrets travel via env/stdin only.
func TestNoSecretInArgvAcrossFlow(t *testing.T) {
uid := fmt.Sprintf("%d", os.Getuid())
f := &fakeRunner{out: map[string]string{
"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "SUPERSECRETPW",
"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x",
"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "CLIENTSEKRET",
"bw status": `{"status":"locked"}`,
"bw unlock": "SESSIONXYZ",
"bw get password github": "p@ss",
}}
if _, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"}); err != nil {
t.Fatalf("getValue: %v", err)
}
for _, call := range f.calls {
for _, arg := range call {
for _, s := range []string{"SUPERSECRETPW", "CLIENTSEKRET", "SESSIONXYZ"} {
if strings.Contains(arg, s) {
t.Errorf("secret %q leaked into argv: %v", s, call)
}
}
}
}
if !strings.Contains(strings.Join(f.lastEnv, "\n"), "BW_SESSION=SESSIONXYZ") {
t.Error("expected BW_SESSION in the bw get env (test would be vacuous otherwise)")
}
}
func TestClipboardDecision(t *testing.T) {
cases := []struct {
stdoutTTY, stderrTTY bool
term, prog, want string
}{
{false, true, "xterm-kitty", "", "stdout"},
{true, true, "xterm-kitty", "", "clipboard"},
{true, true, "dumb", "", "refuse"},
{true, false, "xterm-kitty", "", "refuse"},
}
for _, c := range cases {
if got := clipboardDecision(c.stdoutTTY, c.stderrTTY, c.term, c.prog); got != c.want {
t.Errorf("clipboardDecision(%v,%v,%q) = %q, want %q", c.stdoutTTY, c.stderrTTY, c.term, got, c.want)
}
}
}
func TestJSONToStdoutOK(t *testing.T) {
if jsonToStdoutOK(true) {
t.Error("must refuse JSON secret on a terminal")
}
if !jsonToStdoutOK(false) {
t.Error("must allow JSON when piped")
}
}
func TestBwNeedsLogin(t *testing.T) {
if !bwNeedsLogin(`{"status":"unauthenticated"}`) {
t.Error("unauthenticated → needs login")
}
if bwNeedsLogin(`{"status":"locked"}`) {
t.Error("locked → no login (just unlock)")
}
if bwNeedsLogin(`{"status":"unlocked"}`) {
t.Error("unlocked → no login")
}
if !bwNeedsLogin(`not json`) {
t.Error("unparseable → attempt login")
}
}
func TestVaultHelpMentionsSecurity(t *testing.T) {
h := vaultHelp()
for _, want := range []string{"homelab vault get", "no-HITL", "your own", "setup"} {
if !strings.Contains(h, want) {
t.Errorf("vault help missing %q", want)
}
}
}
func TestVaultBareGroupRegistered(t *testing.T) {
for _, c := range vaultCommands() {
if len(c.Path) == 1 && c.Path[0] == "vault" {
return
}
}
t.Fatal("bare `vault` help command not registered")
}
// getValue is the testable core: given a runner + opts, returns the secret value.
func TestGetValueFlow(t *testing.T) {
f := &fakeRunner{out: map[string]string{
"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x",
"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs",
"bw status": `{"status":"locked"}`,
"bw unlock": "SESS",
"bw get password github": "p@ss",
}}
// Use real UID so os.MkdirAll(/run/user/<uid>/homelab-bw) succeeds.
uid := fmt.Sprintf("%d", os.Getuid())
val, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"})
if err != nil || val != "p@ss" {
t.Fatalf("getValue = %q, %v", val, err)
}
}
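Reviewer note: TestOSC52Encode above pins the exact byte layout of the clipboard escape sequence, but the implementation is not part of this hunk. A minimal sketch of what would satisfy that test, assuming nothing beyond the layout the test spells out (the function bodies here are a hypothetical re-implementation, not the code under review):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// osc52 wraps payload in an OSC 52 "set clipboard" sequence:
// ESC ] 52 ; c ; <base64 payload> BEL.
// Hypothetical body, written only to match the shape TestOSC52Encode expects.
func osc52(payload string) string {
	return "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte(payload)) + "\a"
}

// osc52clear sends an empty payload. Many terminals treat that as a request
// to clear the clipboard selection (assumption: this varies by terminal).
func osc52clear() string {
	return "\x1b]52;c;\a"
}

func main() {
	// %q keeps the control bytes visible instead of writing them to the terminal.
	fmt.Printf("%q\n", osc52("secret"))
	fmt.Printf("%q\n", osc52clear())
}
```

Printing with %q rather than writing the raw bytes keeps the example safe to run in any terminal, since nothing is actually copied.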

212
cli/cmd_work.go Normal file

@ -0,0 +1,212 @@
package main
import (
"fmt"
"os"
"path/filepath"
"strings"
)
func workCommands() []Command {
return []Command{
{Path: []string{"work", "start"}, Tier: TierWrite,
Summary: "create a worktree + branch for a task (enter it with EnterWorktree)", Run: workStart},
{Path: []string{"work", "land"}, Tier: TierWrite,
Summary: "merge master in, verify, push HEAD:master (run from the worktree)", Run: workLand},
{Path: []string{"work", "clean"}, Tier: TierWrite,
Summary: "remove a task's worktree + branch (run from the main checkout)", Run: workClean},
}
}
// flagValue extracts `--name value` or `--name=value` from args.
func flagValue(args []string, name string) string {
for i, a := range args {
if a == name && i+1 < len(args) {
return args[i+1]
}
if strings.HasPrefix(a, name+"=") {
return strings.TrimPrefix(a, name+"=")
}
}
return ""
}
func remotesOrEmpty(repoRoot string) []string {
r, _ := gitRemotes(repoRoot)
return r
}
// workStart creates .worktrees/<topic> on branch <user>/<topic> off <remote>/master.
func workStart(args []string) error {
topic, _ := firstPositional(args)
if topic == "" {
return fmt.Errorf("usage: homelab work start <topic>")
}
cwd, _ := os.Getwd()
repoRoot, err := gitRepoRoot(cwd)
if err != nil {
return fmt.Errorf("not in a git repository: %w", err)
}
remote := preferRemote(remotesOrEmpty(repoRoot))
if remote == "" {
return fmt.Errorf("no git remote configured in %s", repoRoot)
}
flags := cryptFlagsFor(repoRoot)
branch := currentUser() + "/" + topic
wtRel := filepath.Join(".worktrees", topic)
ensureWorktreesIgnored(repoRoot)
if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
return fmt.Errorf("fetch %s failed: %w", remote, err)
}
if err := gitStream(repoRoot, flags, "worktree", "add", wtRel, "-b", branch, remote+"/master"); err != nil {
return fmt.Errorf("worktree add failed: %w", err)
}
wtPath := filepath.Join(repoRoot, wtRel)
fmt.Printf("homelab: created worktree %s (branch %s off %s/master)\n", wtPath, branch, remote)
fmt.Printf("homelab: enter it with the native tool: EnterWorktree(path=%q)\n", wtPath)
return nil
}
// workLand integrates the current branch into master: fetch, merge master in,
// verify, push HEAD:master (retrying on non-fast-forward), with a feature-branch
// fallback when the direct push is rejected (e.g. branch protection).
func workLand(args []string) error {
verifyCmd := flagValue(args, "--verify-cmd")
cwd, _ := os.Getwd()
repoRoot, err := gitRepoRoot(cwd)
if err != nil {
return fmt.Errorf("not in a git repository: %w", err)
}
branch, err := gitOutput(repoRoot, "rev-parse", "--abbrev-ref", "HEAD")
if err != nil {
return err
}
if branch == "master" || branch == "main" {
return fmt.Errorf("refusing to land: already on %s", branch)
}
remote := preferRemote(remotesOrEmpty(repoRoot))
if remote == "" {
return fmt.Errorf("no git remote configured in %s", repoRoot)
}
flags := cryptFlagsFor(repoRoot)
if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
return fmt.Errorf("fetch failed: %w", err)
}
if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil {
return fmt.Errorf("merging %s/master failed — resolve conflicts then re-run `homelab work land`: %w", remote, err)
}
if err := runVerify(repoRoot, verifyCmd, containsArg(args, "--no-verify")); err != nil {
return fmt.Errorf("not landing: %w", err)
}
if err := pushWithRetry(repoRoot, flags, remote, 3); err != nil {
return landFallback(repoRoot, flags, remote, branch, err)
}
fmt.Printf("homelab: landed %s -> %s/master.\n", branch, remote)
if containsArg(args, "--no-ci-watch") {
fmt.Println("homelab: --no-ci-watch set; not waiting for CI.")
return nil
}
landed, _ := gitOutput(repoRoot, "rev-parse", "HEAD")
fmt.Fprintln(os.Stderr, "homelab: watching CI for the landed commit...")
if err := ciWatch([]string{landed}); err != nil {
return fmt.Errorf("landed, but CI did not go green: %w", err)
}
return nil
}
// runVerify runs the explicit --verify-cmd, else auto-detects (go test). If
// neither is available it REFUSES (returns an error) unless allowSkip is set —
// landing to master unverified must be a deliberate choice (--no-verify).
func runVerify(repoRoot, verifyCmd string, allowSkip bool) error {
if verifyCmd != "" {
fmt.Fprintf(os.Stderr, "homelab: verify: %s\n", verifyCmd)
return runStreamingIn(repoRoot, "sh", "-c", verifyCmd)
}
if isFile(filepath.Join(repoRoot, "go.mod")) {
fmt.Fprintln(os.Stderr, "homelab: verify: go test ./...")
return runStreamingIn(repoRoot, "go", "test", "./...")
}
if allowSkip {
fmt.Fprintln(os.Stderr, "homelab: WARNING: --no-verify set — landing without verification")
return nil
}
return fmt.Errorf("no verification configured for this repo — pass --verify-cmd \"...\" or --no-verify to land without verifying")
}
// pushWithRetry pushes HEAD:master, recovering from non-fast-forward rejections
// by fetching + merging master and retrying.
func pushWithRetry(repoRoot string, flags []string, remote string, attempts int) error {
var lastErr error
for i := 0; i < attempts; i++ {
err := gitStream(repoRoot, flags, "push", remote, "HEAD:master")
if err == nil {
return nil
}
lastErr = err
if i < attempts-1 {
fmt.Fprintln(os.Stderr, "homelab: push rejected — fetching + merging master, then retrying")
if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
return err
}
if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil {
return err
}
}
}
return fmt.Errorf("push to %s/master failed after %d attempts: %w", remote, attempts, lastErr)
}
// landFallback pushes the feature branch when the direct master push is rejected
// (e.g. branch protection), so the work isn't lost and a PR can be opened.
func landFallback(repoRoot string, flags []string, remote, branch string, pushErr error) error {
fmt.Fprintf(os.Stderr, "homelab: direct push to master failed (%v)\n", pushErr)
fmt.Fprintf(os.Stderr, "homelab: falling back to pushing the feature branch %q for a PR\n", branch)
if err := gitStream(repoRoot, flags, "push", "-u", remote, branch); err != nil {
return fmt.Errorf("fallback branch push also failed: %w", err)
}
fmt.Printf("homelab: pushed %s to %s. Open a PR to land it (branch protection blocked the direct push).\n", branch, remote)
return nil
}
// workClean removes a task's worktree and branch. Run from the main checkout.
func workClean(args []string) error {
topic, _ := firstPositional(args)
if topic == "" {
return fmt.Errorf("usage: homelab work clean <topic> (run from the main checkout)")
}
cwd, _ := os.Getwd()
repoRoot, err := gitRepoRoot(cwd)
if err != nil {
return fmt.Errorf("not in a git repository: %w", err)
}
flags := cryptFlagsFor(repoRoot)
wtRel := filepath.Join(".worktrees", topic)
branch := currentUser() + "/" + topic
if err := gitStream(repoRoot, flags, "worktree", "remove", wtRel); err != nil {
return fmt.Errorf("worktree remove failed (uncommitted changes? run from the main checkout, not the worktree): %w", err)
}
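if err := gitStream(repoRoot, flags, "branch", "-d", branch); err != nil {
fmt.Fprintf(os.Stderr, "homelab: note: could not delete branch %s (unmerged — use `git branch -D` if intended): %v\n", branch, err)
// Do not claim the branch was removed when the delete failed.
fmt.Printf("homelab: removed worktree %s (branch %s kept)\n", wtRel, branch)
return nil
}
fmt.Printf("homelab: removed worktree %s and branch %s\n", wtRel, branch)
return nil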
}
// ensureWorktreesIgnored appends .worktrees/ to .gitignore if not already ignored.
func ensureWorktreesIgnored(repoRoot string) {
if _, err := gitOutput(repoRoot, "check-ignore", ".worktrees"); err == nil {
return
}
gi := filepath.Join(repoRoot, ".gitignore")
f, err := os.OpenFile(gi, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
if err != nil {
return
}
defer f.Close()
if _, err := f.WriteString("\n.worktrees/\n"); err == nil {
fmt.Fprintln(os.Stderr, "homelab: added .worktrees/ to .gitignore")
}
}
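Reviewer note: the retry behaviour of pushWithRetry is easier to reason about with git taken out of the picture. The sketch below is an illustration only: `pushRetry`, `errRejected`, and the injected `push`/`resync` callbacks are hypothetical names standing in for the `gitStream` calls, and it assumes the same policy as the code above (resync between attempts, abort on a resync failure, wrap the last push error at the end).

```go
package main

import (
	"errors"
	"fmt"
)

// errRejected stands in for a non-fast-forward rejection (hypothetical).
var errRejected = errors.New("non-fast-forward")

// pushRetry calls push up to attempts times. After each failed attempt except
// the last it calls resync (standing in for fetch + merge); a resync failure
// aborts immediately, because retrying on top of a broken merge is pointless.
func pushRetry(attempts int, push, resync func() error) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		err := push()
		if err == nil {
			return nil
		}
		lastErr = err
		if i < attempts-1 {
			if err := resync(); err != nil {
				return err
			}
		}
	}
	return fmt.Errorf("push failed after %d attempts: %w", attempts, lastErr)
}

func main() {
	calls := 0
	push := func() error {
		calls++
		if calls < 3 {
			return errRejected // first two pushes lose the race
		}
		return nil
	}
	resync := func() error { return nil }
	fmt.Println(pushRetry(3, push, resync), calls)
}
```

One thing the sketch makes visible: verification is not re-run after a resync, so the commit that finally lands may include merged changes that were never verified. Whether that matters is a judgment call for this repo.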

32
cli/cmd_work_test.go Normal file

@ -0,0 +1,32 @@
package main
import "testing"
func TestRunVerifyRefusesWhenNothingToVerify(t *testing.T) {
dir := t.TempDir() // no go.mod, no verify cmd
if err := runVerify(dir, "", false); err == nil {
t.Fatal("runVerify must refuse (error) when nothing to verify and --no-verify absent")
}
if err := runVerify(dir, "", true); err != nil {
t.Fatalf("runVerify must skip when --no-verify set, got: %v", err)
}
}
func TestFlagValue(t *testing.T) {
cases := []struct {
args []string
name string
want string
}{
{[]string{"--verify-cmd", "go test ./..."}, "--verify-cmd", "go test ./..."},
{[]string{"--verify-cmd=make test"}, "--verify-cmd", "make test"},
{[]string{"topic", "--verify-cmd", "x"}, "--verify-cmd", "x"},
{[]string{"topic"}, "--verify-cmd", ""},
{[]string{"--verify-cmd"}, "--verify-cmd", ""}, // no value
}
for _, c := range cases {
if got := flagValue(c.args, c.name); got != c.want {
t.Errorf("flagValue(%v, %q) = %q, want %q", c.args, c.name, got, c.want)
}
}
}

104
cli/command.go Normal file

@ -0,0 +1,104 @@
package main
import (
"encoding/json"
"fmt"
"sort"
"strings"
)
// Tier classifies whether a command observes (read) or mutates (write) state.
// v0.1 allows everything; the tier is recorded so a classifier hook can gate
// writes later without restructuring (see docs/adr/0005).
type Tier string
const (
TierRead Tier = "read"
TierWrite Tier = "write"
)
// Command is one homelab verb. Path is the token sequence that selects it,
// e.g. ["claim"] or ["tf", "plan"]. Run receives the args after the path.
type Command struct {
Path []string
Tier Tier
Summary string
Run func(args []string) error
}
// dispatch routes args to the command whose Path is the longest matching prefix
// of args, passing the remaining args to its Run.
func dispatch(reg []Command, args []string) error {
best := -1
bestLen := 0
for i, c := range reg {
if len(c.Path) > len(args) {
continue
}
match := true
for j, p := range c.Path {
if args[j] != p {
match = false
break
}
}
if match && len(c.Path) >= bestLen {
best = i
bestLen = len(c.Path)
}
}
if best < 0 {
return fmt.Errorf("unknown command: %q", strings.Join(args, " "))
}
matched := reg[best]
runErr := matched.Run(args[bestLen:])
emitUsage(matched.name(), runErr) // best-effort usage telemetry; never affects the command
return runErr
}
// name is the space-joined verb path, e.g. "tf plan".
func (c Command) name() string { return strings.Join(c.Path, " ") }
// sortedByName returns a copy of reg ordered by verb path for stable output.
func sortedByName(reg []Command) []Command {
out := make([]Command, len(reg))
copy(out, reg)
sort.Slice(out, func(i, j int) bool { return out[i].name() < out[j].name() })
return out
}
// manifestText renders one aligned line per command: "<path> <tier> <summary>".
// This is the cheap progressive-discovery entrypoint (see docs/adr/0004).
func manifestText(reg []Command) string {
cmds := sortedByName(reg)
width := 0
for _, c := range cmds {
if n := len(c.name()); n > width {
width = n
}
}
var b strings.Builder
for _, c := range cmds {
fmt.Fprintf(&b, "%-*s %-5s %s\n", width, c.name(), c.Tier, c.Summary)
}
return b.String()
}
// manifestJSON renders the registry as a JSON array of {command, tier, summary}
// so agents can parse the full surface in one call.
func manifestJSON(reg []Command) (string, error) {
type entry struct {
Command string `json:"command"`
Tier string `json:"tier"`
Summary string `json:"summary"`
}
entries := make([]entry, 0, len(reg))
for _, c := range sortedByName(reg) {
entries = append(entries, entry{Command: c.name(), Tier: string(c.Tier), Summary: c.Summary})
}
b, err := json.MarshalIndent(entries, "", " ")
if err != nil {
return "", err
}
return string(b), nil
}
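Reviewer note: the longest-prefix rule in dispatch can be shown in isolation. The sketch below reduces it to a function over plain token paths; `route` and the sample paths are hypothetical, and it mirrors the `>=` comparison above, so on a tie the later registration wins.

```go
package main

import (
	"fmt"
	"strings"
)

// route returns the index of the path that is the longest prefix of args,
// or -1 when no path matches. Ties go to the later entry, as in dispatch.
func route(paths [][]string, args []string) int {
	best, bestLen := -1, 0
	for i, p := range paths {
		if len(p) > len(args) {
			continue // a path longer than the input can never be a prefix
		}
		match := true
		for j, tok := range p {
			if args[j] != tok {
				match = false
				break
			}
		}
		if match && len(p) >= bestLen {
			best, bestLen = i, len(p)
		}
	}
	return best
}

func main() {
	paths := [][]string{{"tf"}, {"tf", "plan"}, {"claim"}}
	args := []string{"tf", "plan", "vault", "--json"}
	i := route(paths, args)
	// The matched command receives everything after its own path.
	fmt.Println(strings.Join(paths[i], " "), args[len(paths[i]):])
}
```

With both `tf` and `tf plan` registered, the two-token path wins for `tf plan vault --json`, and the remaining tokens are what the command's Run would receive.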

73
cli/command_test.go Normal file

@ -0,0 +1,73 @@
package main
import (
"encoding/json"
"reflect"
"strings"
"testing"
)
// Tracer bullet: the dispatcher must route `homelab <path...> <args...>` to the
// command whose Path is the longest matching prefix of the input tokens, and
// hand the command the remaining args.
func TestDispatchRoutesToLongestPrefixMatch(t *testing.T) {
var gotArgs []string
ran := ""
reg := []Command{
{Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource",
Run: func(a []string) error { ran = "claim"; gotArgs = a; return nil }},
{Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack",
Run: func(a []string) error { ran = "tf plan"; gotArgs = a; return nil }},
}
if err := dispatch(reg, []string{"tf", "plan", "vault", "--json"}); err != nil {
t.Fatalf("dispatch returned error: %v", err)
}
if ran != "tf plan" {
t.Fatalf("routed to %q, want %q", ran, "tf plan")
}
if want := []string{"vault", "--json"}; !reflect.DeepEqual(gotArgs, want) {
t.Fatalf("command got args %v, want %v", gotArgs, want)
}
}
func TestDispatchUnknownCommandErrors(t *testing.T) {
reg := []Command{{Path: []string{"claim"}, Run: func(a []string) error { return nil }}}
if err := dispatch(reg, []string{"bogus"}); err == nil {
t.Fatal("expected error for unknown command, got nil")
}
}
// The manifest is the progressive-discovery entrypoint: one line per command
// showing the full verb path, its tier, and summary, sorted for stable output.
func TestManifestTextListsEveryCommandWithTier(t *testing.T) {
reg := []Command{
{Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack"},
{Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource"},
}
out := manifestText(reg)
for _, want := range []string{"claim", "tf plan", "read", "write", "plan a stack", "claim a resource"} {
if !strings.Contains(out, want) {
t.Errorf("manifest text missing %q\n---\n%s", want, out)
}
}
// sorted: claim (c) must appear before tf plan (t)
if strings.Index(out, "claim") > strings.Index(out, "tf plan") {
t.Errorf("manifest not sorted by path:\n%s", out)
}
}
func TestManifestJSONIsParsableAndTagged(t *testing.T) {
reg := []Command{{Path: []string{"tf", "apply"}, Tier: TierWrite, Summary: "apply a stack"}}
out, err := manifestJSON(reg)
if err != nil {
t.Fatalf("manifestJSON error: %v", err)
}
var got []map[string]string
if err := json.Unmarshal([]byte(out), &got); err != nil {
t.Fatalf("manifest JSON not parsable: %v\n%s", err, out)
}
if len(got) != 1 || got[0]["command"] != "tf apply" || got[0]["tier"] != "write" {
t.Fatalf("unexpected manifest JSON: %v", got)
}
}

98
cli/homelab.go Normal file

@ -0,0 +1,98 @@
package main
import (
"fmt"
"strings"
)
// version is stamped at build time via -ldflags "-X main.version=vX.Y.Z".
var version = "dev"
// buildRegistry returns every homelab verb. New verb-groups append here.
func buildRegistry() []Command {
var reg []Command
reg = append(reg, claimCommands()...)
reg = append(reg, tfCommands()...)
reg = append(reg, workCommands()...)
reg = append(reg, k8sCommands()...)
reg = append(reg, memoryCommands()...)
reg = append(reg, ciCommands()...)
reg = append(reg, deployCommands()...)
reg = append(reg, netCommands()...)
reg = append(reg, obsCommands()...)
reg = append(reg, usageCommands()...)
reg = append(reg, haCommands()...)
reg = append(reg, browserCommands()...)
reg = append(reg, vaultCommands()...)
return reg
}
// dispatchTop handles the homelab verb surface. handled=false means the args are
// not a homelab verb, so main() falls back to the legacy -use-case path.
func dispatchTop(args []string) (handled bool, err error) {
if len(args) == 0 {
fmt.Print(usage())
return true, nil
}
switch args[0] {
case "help", "-h", "--help":
fmt.Print(usage())
return true, nil
case "version", "--version":
fmt.Println("homelab " + version)
return true, nil
case "manifest":
reg := buildRegistry()
if containsArg(args[1:], "--json") {
out, err := manifestJSON(reg)
if err != nil {
return true, err
}
fmt.Println(out)
return true, nil
}
fmt.Print(manifestText(reg))
return true, nil
}
if strings.HasPrefix(args[0], "-") {
return false, nil
}
reg := buildRegistry()
if !isCommandGroup(reg, args[0]) {
return false, nil
}
return true, dispatch(reg, args)
}
func isCommandGroup(reg []Command, group string) bool {
for _, c := range reg {
if len(c.Path) > 0 && c.Path[0] == group {
return true
}
}
return false
}
func containsArg(args []string, want string) bool {
for _, a := range args {
if a == want {
return true
}
}
return false
}
func usage() string {
var b strings.Builder
fmt.Fprintf(&b, "homelab %s — unified homelab operations CLI\n\n", version)
b.WriteString("Usage:\n homelab <command> [args]\n\nCommands:\n")
for _, line := range strings.Split(strings.TrimRight(manifestText(buildRegistry()), "\n"), "\n") {
if line != "" {
b.WriteString(" " + line + "\n")
}
}
b.WriteString("\n manifest [--json] list all commands (machine-readable with --json)\n")
b.WriteString(" version print version\n")
b.WriteString("\nLegacy webhook use-cases remain available via -use-case=<name>.\n")
return b.String()
}

138
cli/k8s.go Normal file

@ -0,0 +1,138 @@
package main
import (
"fmt"
"os/exec"
"strings"
)
// kubectl helpers use the ambient kubeconfig (no per-call auth flags).
func kubectlBase(ns string, args ...string) []string {
var full []string
if ns != "" {
full = append(full, "-n", ns)
}
return append(full, args...)
}
func kubectlStream(ns string, args ...string) error {
return runStreamingIn("", "kubectl", kubectlBase(ns, args...)...)
}
// kubectlCapture runs kubectl and returns trimmed stdout (for resolving pods).
func kubectlCapture(ns string, args ...string) (string, error) {
out, err := exec.Command("kubectl", kubectlBase(ns, args...)...).Output()
return strings.TrimSpace(string(out)), err
}
// k8sTarget is the parsed `<app>` + selectors shared by the k8s verbs.
type k8sTarget struct {
app string
ns string
pod string
container string
selector string
tty bool
rest []string // passthrough flags and, after `--`, the exec command
}
// parseK8sTarget reads `<app> [-n ns] [--pod p] [-c ctr] [-l sel] [flags] [-- cmd]`.
// The first bare token is the app; unknown flags pass through in rest.
func parseK8sTarget(args []string) k8sTarget {
t := k8sTarget{}
i := 0
take := func() string {
if i+1 < len(args) {
i++
return args[i]
}
return ""
}
for i = 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--":
t.rest = append(t.rest, args[i+1:]...)
return t
case a == "-n" || a == "--namespace":
t.ns = take()
case strings.HasPrefix(a, "--namespace="):
t.ns = strings.TrimPrefix(a, "--namespace=")
case a == "--pod":
t.pod = take()
case strings.HasPrefix(a, "--pod="):
t.pod = strings.TrimPrefix(a, "--pod=")
case a == "-c" || a == "--container":
t.container = take()
case strings.HasPrefix(a, "--container="):
t.container = strings.TrimPrefix(a, "--container=")
case a == "-l" || a == "--selector":
t.selector = take()
case strings.HasPrefix(a, "--selector="):
t.selector = strings.TrimPrefix(a, "--selector=")
case a == "--tty" || a == "-it" || a == "-ti":
t.tty = true
case !strings.HasPrefix(a, "-") && t.app == "":
t.app = a
default:
t.rest = append(t.rest, a)
}
}
return t
}
// namespace defaults to the app name (most namespaces hold exactly one app).
func (t k8sTarget) namespace() string {
if t.ns != "" {
return t.ns
}
return t.app
}
// objectRef is the kubectl object for logs/exec: an explicit pod, else
// deploy/<app> (kubectl resolves a pod from the Deployment).
func (t k8sTarget) objectRef() string {
if t.pod != "" {
return "pod/" + t.pod
}
return "deploy/" + t.app
}
// --- database access (the dbaas exec pattern) ---
type dbPlan struct {
ns string
pod string // explicit pod (e.g. mysql-standalone-0)
selector string // resolve the pod by this label when pod == "" (CNPG primary)
container string // "" = default container
argv []string // command + args to run inside the pod
}
// planDBExec builds the in-pod command to run sql against app's database.
// PG (default): CNPG primary POD (resolved by label — pg-cluster-rw is a
// Service, not an exec target), psql -U postgres -d <db>.
// MySQL: mysql-standalone-0; bash inside the pod expands the password from the
// pod's env, so it never appears in the local kubectl argv.
// dbName defaults to app. sql empty => interactive client.
func planDBExec(app, dbName, sql string, mysql bool) dbPlan {
if dbName == "" {
dbName = app
}
if mysql {
inner := fmt.Sprintf(`mysql -u root -p"$MYSQL_ROOT_PASSWORD" %s`, shellQuote(dbName))
if sql != "" {
inner += " -e " + shellQuote(sql)
}
return dbPlan{ns: "dbaas", pod: "mysql-standalone-0", argv: []string{"bash", "-c", inner}}
}
argv := []string{"psql", "-U", "postgres", "-d", dbName}
if sql != "" {
argv = append(argv, "-tAc", sql)
}
return dbPlan{ns: "dbaas", selector: "cnpg.io/instanceRole=primary", container: "postgres", argv: argv}
}
// shellQuote single-quotes s for safe embedding in a bash -c string.
func shellQuote(s string) string {
return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}
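Reviewer note: the MySQL branch of planDBExec depends on shellQuote to keep user-supplied SQL from breaking out of the `bash -c` string. The sketch below rebuilds just that inner command; `mysqlInner` is a hypothetical helper, and `shellQuote` is copied from the hunk above so the example runs on its own.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote single-quotes s for bash: every embedded ' becomes '\''
// (close the quote, emit an escaped quote, reopen the quote).
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// mysqlInner builds the string handed to `bash -c` inside the pod.
// $MYSQL_ROOT_PASSWORD is left unexpanded here on purpose: only the in-pod
// bash resolves it, so the value never reaches the local argv.
func mysqlInner(dbName, sql string) string {
	inner := fmt.Sprintf(`mysql -u root -p"$MYSQL_ROOT_PASSWORD" %s`, shellQuote(dbName))
	if sql != "" {
		inner += " -e " + shellQuote(sql)
	}
	return inner
}

func main() {
	fmt.Println(mysqlInner("wrongmove", "SHOW TABLES"))
	// A quote inside the SQL stays inside the single-quoted argument.
	fmt.Println(mysqlInner("wrongmove", "SELECT 'x'"))
}
```

Quoting the database name as well as the SQL matters because both reach bash as text, not as separate argv entries.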

65
cli/k8s_test.go Normal file

@ -0,0 +1,65 @@
package main
import (
"reflect"
"strings"
"testing"
)
func TestParseK8sTarget(t *testing.T) {
got := parseK8sTarget([]string{"tripit", "-n", "prod", "--pod", "x-123", "-c", "app", "-l", "k=v", "--tail=50", "--", "ls", "-la"})
want := k8sTarget{app: "tripit", ns: "prod", pod: "x-123", container: "app", selector: "k=v", rest: []string{"--tail=50", "ls", "-la"}}
if !reflect.DeepEqual(got, want) {
t.Fatalf("parseK8sTarget =\n %+v\nwant\n %+v", got, want)
}
}
func TestK8sTargetNamespaceDefaultsToApp(t *testing.T) {
if ns := parseK8sTarget([]string{"immich"}).namespace(); ns != "immich" {
t.Errorf("namespace() = %q, want immich", ns)
}
if ns := parseK8sTarget([]string{"immich", "-n", "dbaas"}).namespace(); ns != "dbaas" {
t.Errorf("namespace() = %q, want dbaas", ns)
}
}
func TestK8sTargetObjectRef(t *testing.T) {
if r := parseK8sTarget([]string{"tripit"}).objectRef(); r != "deploy/tripit" {
t.Errorf("objectRef() = %q, want deploy/tripit", r)
}
if r := parseK8sTarget([]string{"tripit", "--pod", "tripit-abc"}).objectRef(); r != "pod/tripit-abc" {
t.Errorf("objectRef() = %q, want pod/tripit-abc", r)
}
}
func TestPlanDBExecPostgresDefault(t *testing.T) {
p := planDBExec("fire-planner", "", "SELECT 1", false)
// pg-cluster-rw is a Service, so the PG plan resolves the primary POD by
// label rather than naming an (un-exec-able) Service.
if p.ns != "dbaas" || p.pod != "" || p.selector != "cnpg.io/instanceRole=primary" || p.container != "postgres" {
t.Fatalf("unexpected pg target: %+v", p)
}
// db name defaults to the app; SQL passed via -tAc
joined := strings.Join(p.argv, " ")
if !strings.Contains(joined, "-d fire-planner") || !strings.Contains(joined, "-tAc") {
t.Fatalf("pg argv missing db/sql: %v", p.argv)
}
}
func TestPlanDBExecMysqlEnvPassword(t *testing.T) {
p := planDBExec("wrongmove", "wrongmove", "SHOW TABLES", true)
if p.pod != "mysql-standalone-0" {
t.Fatalf("unexpected mysql pod: %+v", p)
}
inner := strings.Join(p.argv, " ")
// password must come from the env var, never inline
if !strings.Contains(inner, `-p"$MYSQL_ROOT_PASSWORD"`) {
t.Fatalf("mysql must use env password wrapper: %v", p.argv)
}
}
func TestShellQuoteEscapes(t *testing.T) {
if got := shellQuote("a'b"); got != `'a'\''b'` {
t.Fatalf("shellQuote = %q", got)
}
}


@ -26,8 +26,16 @@ var (
)
func main() {
-err := run()
-if err != nil {
+// homelab verb surface (work/tf/claim/...) is tried first; if the args are
+// not a homelab verb, fall through to the legacy webhook -use-case path.
+if handled, err := dispatchTop(os.Args[1:]); handled {
+if err != nil {
+fmt.Fprintln(os.Stderr, "homelab: "+err.Error())
+os.Exit(1)
+}
+return
+}
+if err := run(); err != nil {
glog.Errorf("run failed: %s", err.Error())
os.Exit(255)
}

103
cli/memory.go Normal file

@ -0,0 +1,103 @@
package main
import (
"bytes"
"encoding/json"
"fmt"
"io"
"net/http"
"os"
"strings"
"time"
)
// defaultMemoryURL is used when no env override is present (agents normally have
// CLAUDE_MEMORY_API_URL set by the memory hooks).
const defaultMemoryURL = "https://claude-memory.viktorbarzin.me"
type memoryClient struct {
base string
key string
http *http.Client
}
func firstEnv(keys ...string) string {
for _, k := range keys {
if v := os.Getenv(k); v != "" {
return v
}
}
return ""
}
func resolveMemoryBase() string {
if b := firstEnv("CLAUDE_MEMORY_API_URL", "MEMORY_API_URL"); b != "" {
return strings.TrimRight(b, "/")
}
return defaultMemoryURL
}
// newMemoryClient talks straight to the claude-memory HTTP API (the same backend
// the MCP wraps), so it works even when the MCP frontend is down.
func newMemoryClient() (*memoryClient, error) {
key := firstEnv("CLAUDE_MEMORY_API_KEY", "MEMORY_API_KEY")
if key == "" {
return nil, fmt.Errorf("no memory API key — set CLAUDE_MEMORY_API_KEY (or MEMORY_API_KEY)")
}
return &memoryClient{base: resolveMemoryBase(), key: key, http: &http.Client{Timeout: 30 * time.Second}}, nil
}
func (c *memoryClient) do(method, path string, body interface{}) ([]byte, error) {
var r io.Reader
if body != nil {
b, err := json.Marshal(body)
if err != nil {
return nil, err
}
r = bytes.NewReader(b)
}
req, err := http.NewRequest(method, c.base+path, r)
if err != nil {
return nil, err
}
req.Header.Set("Authorization", "Bearer "+c.key)
if body != nil {
req.Header.Set("Content-Type", "application/json")
}
resp, err := c.http.Do(req)
if err != nil {
return nil, err
}
defer resp.Body.Close()
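out, err := io.ReadAll(resp.Body)
if err != nil {
// A truncated body must not be returned as if it were a complete response.
return nil, fmt.Errorf("memory API %s %s -> %d: reading response: %w", method, path, resp.StatusCode, err)
}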
if resp.StatusCode >= 300 {
return nil, fmt.Errorf("memory API %s %s -> %d: %s", method, path, resp.StatusCode, strings.TrimSpace(string(out)))
}
return out, nil
}
// Request bodies mirror src/claude_memory/api/models.py.
type memRecallReq struct {
Context string `json:"context"`
ExpandedQuery string `json:"expanded_query,omitempty"`
Category string `json:"category,omitempty"`
SortBy string `json:"sort_by,omitempty"`
Limit int `json:"limit,omitempty"`
}
type memStoreReq struct {
Content string `json:"content"`
Category string `json:"category,omitempty"`
Tags string `json:"tags,omitempty"`
ExpandedKeywords string `json:"expanded_keywords,omitempty"`
Importance float64 `json:"importance"`
ForceSensitive bool `json:"force_sensitive,omitempty"`
}
type memUpdateReq struct {
Content *string `json:"content,omitempty"`
Tags *string `json:"tags,omitempty"`
Importance *float64 `json:"importance,omitempty"`
ExpandedKeywords *string `json:"expanded_keywords,omitempty"`
}
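Reviewer note: the request structs above mix two encoding styles, and the difference is easy to miss. A plain field without `omitempty` is always sent, a pointer field with `omitempty` is sent only when non-nil, which lets an update distinguish "not provided" from "set to zero". The sketch below shows this with trimmed-down copies; `storeReq`, `updateReq`, and `encode` are hypothetical names, not the types under review.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// storeReq: Importance has no omitempty, so even 0 is always serialized.
type storeReq struct {
	Content    string  `json:"content"`
	Importance float64 `json:"importance"`
}

// updateReq: pointer + omitempty, so nil means "leave this field unchanged".
type updateReq struct {
	Content    *string  `json:"content,omitempty"`
	Importance *float64 `json:"importance,omitempty"`
}

// encode marshals v and returns the JSON text, or "" on error.
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	zero := 0.0
	fmt.Println(encode(storeReq{Content: "x"}))        // importance sent as 0
	fmt.Println(encode(updateReq{}))                   // nothing sent
	fmt.Println(encode(updateReq{Importance: &zero})) // explicit zero is sent
}
```

A non-pointer `float64` with `omitempty` could not express "set importance to 0", since the zero value would be dropped. That is presumably why the update struct uses pointers.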

51
cli/memory_test.go Normal file

@ -0,0 +1,51 @@
package main
import (
"encoding/json"
"os"
"strings"
"testing"
)
func TestResolveMemoryBase(t *testing.T) {
old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL")
defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }()
os.Unsetenv("CLAUDE_MEMORY_API_URL")
os.Unsetenv("MEMORY_API_URL")
if got := resolveMemoryBase(); got != defaultMemoryURL {
t.Errorf("resolveMemoryBase() = %q, want default %q", got, defaultMemoryURL)
}
os.Setenv("CLAUDE_MEMORY_API_URL", "https://m.example/") // trailing slash trimmed
if got := resolveMemoryBase(); got != "https://m.example" {
t.Errorf("resolveMemoryBase() = %q, want https://m.example", got)
}
}
func TestMemStoreReqAlwaysSendsImportance(t *testing.T) {
b, _ := json.Marshal(memStoreReq{Content: "x", Category: "facts", Importance: 0.5})
s := string(b)
if !strings.Contains(s, `"content":"x"`) || !strings.Contains(s, `"importance":0.5`) {
t.Fatalf("memStoreReq JSON missing fields: %s", s)
}
}
func TestMemUpdateReqOmitsUnsetFields(t *testing.T) {
tags := "a,b"
b, _ := json.Marshal(memUpdateReq{Tags: &tags})
s := string(b)
if strings.Contains(s, "content") || strings.Contains(s, "importance") {
t.Fatalf("unset update fields must be omitted: %s", s)
}
if !strings.Contains(s, `"tags":"a,b"`) {
t.Fatalf("set field missing: %s", s)
}
}
func TestMemRecallReqOmitsEmptyOptionals(t *testing.T) {
b, _ := json.Marshal(memRecallReq{Context: "hi"})
s := string(b)
if strings.Contains(s, "expanded_query") || strings.Contains(s, "category") || strings.Contains(s, "limit") {
t.Fatalf("empty optionals must be omitted: %s", s)
}
}

58
cli/presence.go Normal file

@ -0,0 +1,58 @@
package main
import (
"fmt"
"os"
"path/filepath"
"strings"
)
// validPresenceKinds is the fixed label taxonomy accepted by the presence board.
var validPresenceKinds = []string{"node", "host", "stack", "service", "db", "pvc", "infra"}
// presenceScript locates the presence CLI — homelab WRAPS it, it does not
// reimplement it. Override with HOMELAB_PRESENCE; defaults to ~/code/scripts/presence.
func presenceScript() string {
if p := os.Getenv("HOMELAB_PRESENCE"); p != "" {
return p
}
home, err := os.UserHomeDir()
if err != nil {
return "presence"
}
return filepath.Join(home, "code", "scripts", "presence")
}
// validateLabel checks a presence label is <kind>:<name> with a known kind.
func validateLabel(label string) error {
parts := strings.SplitN(label, ":", 2)
if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
return fmt.Errorf("label must be <kind>:<name> (e.g. stack:vault), got %q", label)
}
for _, k := range validPresenceKinds {
if parts[0] == k {
return nil
}
}
return fmt.Errorf("invalid label kind %q; valid kinds: %s", parts[0], strings.Join(validPresenceKinds, ", "))
}
// presenceClaim claims label on the board with a purpose note.
func presenceClaim(label, purpose string) error {
if err := validateLabel(label); err != nil {
return err
}
args := []string{"claim", label}
if purpose != "" {
args = append(args, "--purpose", purpose)
}
return runStreaming(presenceScript(), args...)
}
// presenceRelease releases a prior claim on label.
func presenceRelease(label string) error {
if err := validateLabel(label); err != nil {
return err
}
return runStreaming(presenceScript(), "release", label)
}
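
One way the claim/release pair above might be combined is a wrapper that releases even when the work fails. This is only a sketch: `claimer` and `withClaim` are hypothetical names that do not appear in the diff, and the function fields stand in for `presenceClaim` and `presenceRelease` so the pattern runs without the real presence script.

```go
package main

import (
	"errors"
	"fmt"
)

// claimer is a hypothetical seam over presenceClaim/presenceRelease so the
// pattern can be exercised without the real presence script.
type claimer struct {
	claim   func(label, purpose string) error
	release func(label string) error
}

// withClaim claims label, runs work, and always releases afterwards.
// The work error takes precedence; a release failure is reported only
// when the work itself succeeded.
func (c claimer) withClaim(label, purpose string, work func() error) (err error) {
	if claimErr := c.claim(label, purpose); claimErr != nil {
		return fmt.Errorf("claim %s: %w", label, claimErr)
	}
	defer func() {
		if relErr := c.release(label); relErr != nil && err == nil {
			err = fmt.Errorf("release %s: %w", label, relErr)
		}
	}()
	return work()
}

// demo records the call order for a failing unit of work.
func demo() ([]string, error) {
	var calls []string
	c := claimer{
		claim:   func(l, _ string) error { calls = append(calls, "claim "+l); return nil },
		release: func(l string) error { calls = append(calls, "release "+l); return nil },
	}
	err := c.withClaim("service:android-emulator", "demo", func() error {
		calls = append(calls, "work")
		return errors.New("boom")
	})
	return calls, err
}

func main() {
	calls, err := demo()
	fmt.Println(calls, err)
}
```

The deferred release keeps a failed or panicking-free error path from leaving a stale claim on the board.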

24
cli/presence_test.go Normal file

@ -0,0 +1,24 @@
package main
import "testing"
func TestValidateLabelAcceptsTaxonomy(t *testing.T) {
good := []string{
"stack:vault", "service:health", "node:k8s-node1", "db:pg-cluster",
"infra:gpu-operator", "host:proxmox-1", "pvc:dbaas/data",
}
for _, l := range good {
if err := validateLabel(l); err != nil {
t.Errorf("validateLabel(%q) = %v, want nil", l, err)
}
}
}
func TestValidateLabelRejectsBadLabels(t *testing.T) {
bad := []string{"vault", "stack:", "bogus:x", ":x", "stack", ""}
for _, l := range bad {
if err := validateLabel(l); err == nil {
t.Errorf("validateLabel(%q) = nil, want error", l)
}
}
}

76
cli/probe.go Normal file

@ -0,0 +1,76 @@
package main
import (
"context"
"crypto/tls"
"fmt"
"io"
"net"
"net/http"
"net/url"
"os/exec"
"strings"
"time"
)
// internalLBIP is the dedicated Traefik LB; every internal ingress routes through it.
const internalLBIP = "10.0.20.203"
// clientDialingIP returns an http.Client that dials ip for ANY host while keeping
// the URL host as SNI and Host header (so the LB selects the right vhost). This is
// the Go form of `curl --resolve host:443:ip`. TLS verification is skipped (these
// are reachability/observability probes, not security checks; internal .lan vhosts
// may serve a non-matching cert).
func clientDialingIP(ip string, timeout time.Duration) *http.Client {
d := &net.Dialer{Timeout: 8 * time.Second}
tr := &http.Transport{
DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
if i := strings.LastIndex(addr, ":"); i >= 0 {
addr = ip + addr[i:]
}
return d.DialContext(ctx, network, addr)
},
TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
}
return &http.Client{Timeout: timeout, Transport: tr}
}
// probeURL issues a GET and returns status code + elapsed time.
func probeURL(c *http.Client, rawurl string) (int, time.Duration, error) {
start := time.Now()
resp, err := c.Get(rawurl)
dur := time.Since(start)
if err != nil {
return 0, dur, err
}
resp.Body.Close()
return resp.StatusCode, dur, nil
}
// lbGetBody GETs https://<host><path>?<q> through the internal LB and returns the body.
func lbGetBody(host, path string, q url.Values) ([]byte, error) {
u := "https://" + host + path
if len(q) > 0 {
u += "?" + q.Encode()
}
resp, err := clientDialingIP(internalLBIP, 20*time.Second).Get(u)
if err != nil {
return nil, err
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
return nil, err
}
if resp.StatusCode >= 300 {
return nil, fmt.Errorf("%s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
}
return body, nil
}
// dig runs `dig +short` against a resolver, optionally for a record type.
func dig(name, server, rrtype string) (string, error) {
args := []string{"+short", "+time=3", "+tries=1"}
if rrtype != "" {
args = append(args, rrtype)
}
args = append(args, name, "@"+server)
out, err := exec.Command("dig", args...).Output()
return strings.TrimSpace(string(out)), err
}
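
The dial-override technique used by `clientDialingIP` can be exercised without the real load balancer. The sketch below is an illustration under stated assumptions: `pinnedClient` and `hostSeenBy` are hypothetical names, plain HTTP against a local `httptest` server replaces the TLS path, and the hostname is a placeholder. It shows that the connection goes to the pinned address while the request still carries the URL's host.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// pinnedClient returns a client that connects to target (host:port) no
// matter which host the request URL names.
func pinnedClient(target string, timeout time.Duration) *http.Client {
	d := &net.Dialer{Timeout: 2 * time.Second}
	return &http.Client{
		Timeout: timeout,
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, network, _ string) (net.Conn, error) {
				return d.DialContext(ctx, network, target)
			},
		},
	}
}

// hostSeenBy starts a local server that echoes the Host header, requests
// http://<urlHost>/ through a pinned client, and returns what the server saw.
func hostSeenBy(urlHost string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Host)
	}))
	defer srv.Close()

	c := pinnedClient(srv.Listener.Addr().String(), 5*time.Second)
	resp, err := c.Get("http://" + urlHost + "/")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// grafana.example.lan is a placeholder; it is never resolved via DNS.
	fmt.Println(hostSeenBy("grafana.example.lan"))
}
```

Because the dialer ignores the address it is handed, no DNS lookup happens for the URL host, which is what makes this behave like `curl --resolve`.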

49
cli/probe_test.go Normal file

@ -0,0 +1,49 @@
package main
import "testing"
func TestQueryArg(t *testing.T) {
if got := queryArg([]string{"up"}, nil); got != "up" {
t.Errorf(`queryArg(["up"]) = %q, want "up"`, got)
}
if got := queryArg([]string{"up", "--json"}, nil); got != "up" {
t.Errorf(`--json should be dropped, got %q`, got)
}
// single quoted PromQL arrives as one token
if got := queryArg([]string{"count by (node) (up)", "--json"}, nil); got != "count by (node) (up)" {
t.Errorf(`quoted query mangled: %q`, got)
}
// value-flags and their values are skipped, query survives
vf := map[string]bool{"--since": true, "--limit": true}
if got := queryArg([]string{`{app="x"}`, "--since", "1h", "--limit", "50"}, vf); got != `{app="x"}` {
t.Errorf(`value-flag skipping failed: %q`, got)
}
}
func TestLabelStr(t *testing.T) {
got := labelStr(map[string]string{"__name__": "up", "job": "x", "instance": "y"})
if got != "up{instance=y,job=x}" { // __name__ extracted, rest sorted
t.Errorf("labelStr = %q", got)
}
if got := labelStr(map[string]string{"alertname": "Foo"}); got != "{alertname=Foo}" {
t.Errorf("labelStr (no __name__) = %q", got)
}
}
func TestOneLineList(t *testing.T) {
if got := oneLineList(" "); got != "(none)" {
t.Errorf("empty = %q, want (none)", got)
}
if got := oneLineList("a\nb"); got != "a, b" {
t.Errorf("multi = %q, want 'a, b'", got)
}
}
func TestHostOnly(t *testing.T) {
if got := hostOnly("foo.me/path"); got != "foo.me" {
t.Errorf("hostOnly = %q", got)
}
if got := hostOnly("foo.me"); got != "foo.me" {
t.Errorf("hostOnly = %q", got)
}
}

101
cli/repo.go Normal file

@ -0,0 +1,101 @@
package main
import (
"os"
"os/exec"
"os/user"
"path/filepath"
"strings"
)
// preferRemote picks the canonical remote: forgejo if present, else origin,
// else the first listed. (For infra, origin and forgejo both point at Forgejo.)
func preferRemote(remotes []string) string {
has := map[string]bool{}
for _, r := range remotes {
has[r] = true
}
switch {
case has["forgejo"]:
return "forgejo"
case has["origin"]:
return "origin"
case len(remotes) > 0:
return remotes[0]
default:
return ""
}
}
// hasGitCryptAttr reports whether .gitattributes content enables git-crypt.
func hasGitCryptAttr(gitattributes string) bool {
return strings.Contains(gitattributes, "filter=git-crypt")
}
// gitCryptFlags are the per-command flags that disable smudge/clean so git
// operations in a git-crypt repo don't try to decrypt (NEVER persisted to config).
func gitCryptFlags() []string {
return []string{
"-c", "filter.git-crypt.smudge=cat",
"-c", "filter.git-crypt.clean=cat",
"-c", "filter.git-crypt.required=false",
}
}
// gitOutput runs `git -C dir <args>` and returns trimmed stdout.
func gitOutput(dir string, args ...string) (string, error) {
cmd := exec.Command("git", append([]string{"-C", dir}, args...)...)
out, err := cmd.Output()
return strings.TrimSpace(string(out)), err
}
func gitRepoRoot(dir string) (string, error) {
return gitOutput(dir, "rev-parse", "--show-toplevel")
}
// gitRemotes lists configured remote names for the repo at dir.
func gitRemotes(dir string) ([]string, error) {
out, err := gitOutput(dir, "remote")
if err != nil {
return nil, err
}
if out == "" {
return nil, nil
}
return strings.Split(out, "\n"), nil
}
// isGitCryptRepo reports whether the repo at repoRoot uses git-crypt.
func isGitCryptRepo(repoRoot string) bool {
b, err := os.ReadFile(filepath.Join(repoRoot, ".gitattributes"))
if err != nil {
return false
}
return hasGitCryptAttr(string(b))
}
// cryptFlagsFor returns the git-crypt filter flags when repoRoot is encrypted,
// else nil. These are injected per-command and never persisted.
func cryptFlagsFor(repoRoot string) []string {
if isGitCryptRepo(repoRoot) {
return gitCryptFlags()
}
return nil
}
// gitStream runs `git [cryptFlags] -C repoRoot <args>` with live output.
func gitStream(repoRoot string, cryptFlags []string, args ...string) error {
full := append(append([]string{}, cryptFlags...), append([]string{"-C", repoRoot}, args...)...)
return runStreamingIn("", "git", full...)
}
// currentUser returns the OS username for branch naming (<user>/<topic>).
func currentUser() string {
if u := os.Getenv("USER"); u != "" {
return u
}
if u, err := user.Current(); err == nil && u.Username != "" {
return u.Username
}
return "user"
}

37
cli/repo_test.go Normal file

@ -0,0 +1,37 @@
package main
import "testing"
func TestPreferRemote(t *testing.T) {
cases := []struct {
in []string
want string
}{
{[]string{"origin", "forgejo"}, "forgejo"},
{[]string{"forgejo"}, "forgejo"},
{[]string{"origin"}, "origin"},
{[]string{"upstream"}, "upstream"},
{nil, ""},
}
for _, c := range cases {
if got := preferRemote(c.in); got != c.want {
t.Errorf("preferRemote(%v) = %q, want %q", c.in, got, c.want)
}
}
}
func TestHasGitCryptAttr(t *testing.T) {
if !hasGitCryptAttr("*.tfvars filter=git-crypt diff=git-crypt") {
t.Error("expected git-crypt detected")
}
if hasGitCryptAttr("*.md text\n*.png binary") {
t.Error("expected no git-crypt")
}
}
func TestGitCryptFlagsShape(t *testing.T) {
f := gitCryptFlags()
if len(f) != 6 || f[0] != "-c" || f[1] != "filter.git-crypt.smudge=cat" {
t.Fatalf("unexpected git-crypt flags: %v", f)
}
}

23
cli/run.go Normal file

@ -0,0 +1,23 @@
package main
import (
"os"
"os/exec"
)
// runStreaming executes name with args, wiring std streams to this process so
// the caller sees live output, and returns the command's error (non-nil on
// non-zero exit — preserved so homelab's own exit code reflects the child's).
func runStreaming(name string, args ...string) error {
return runStreamingIn("", name, args...)
}
// runStreamingIn is runStreaming with a working directory (empty = inherit).
func runStreamingIn(dir, name string, args ...string) error {
cmd := exec.Command(name, args...)
cmd.Dir = dir
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
cmd.Stdin = os.Stdin
return cmd.Run()
}

54
cli/stack.go Normal file

@ -0,0 +1,54 @@
package main
import (
"fmt"
"os"
"path/filepath"
"sort"
"strings"
)
// findInfraRoot walks up from start to the infra repo root — the directory
// holding both terragrunt.hcl and a stacks/ directory.
func findInfraRoot(start string) (string, error) {
dir := start
for {
if isFile(filepath.Join(dir, "terragrunt.hcl")) && isDir(filepath.Join(dir, "stacks")) {
return dir, nil
}
parent := filepath.Dir(dir)
if parent == dir {
return "", fmt.Errorf("not inside an infra checkout (no terragrunt.hcl + stacks/ found above %s)", start)
}
dir = parent
}
}
// resolveStack maps a bare stack name to its directory under <infraRoot>/stacks.
func resolveStack(infraRoot, name string) (string, error) {
dir := filepath.Join(infraRoot, "stacks", name)
if isDir(dir) {
return dir, nil
}
avail := listStacks(infraRoot)
return "", fmt.Errorf("stack %q not found under stacks/; available: %s", name, strings.Join(avail, ", "))
}
// listStacks returns the sorted names of every directory under <infraRoot>/stacks.
func listStacks(infraRoot string) []string {
entries, err := os.ReadDir(filepath.Join(infraRoot, "stacks"))
if err != nil {
return nil
}
var out []string
for _, e := range entries {
if e.IsDir() {
out = append(out, e.Name())
}
}
sort.Strings(out)
return out
}
func isFile(p string) bool { fi, err := os.Stat(p); return err == nil && !fi.IsDir() }
func isDir(p string) bool { fi, err := os.Stat(p); return err == nil && fi.IsDir() }

52
cli/stack_test.go Normal file

@ -0,0 +1,52 @@
package main
import (
"os"
"path/filepath"
"testing"
)
func newInfraTree(t *testing.T, stacks ...string) string {
t.Helper()
root := t.TempDir()
if err := os.WriteFile(filepath.Join(root, "terragrunt.hcl"), []byte("# root"), 0o644); err != nil {
t.Fatal(err)
}
for _, s := range stacks {
if err := os.MkdirAll(filepath.Join(root, "stacks", s), 0o755); err != nil {
t.Fatal(err)
}
}
return root
}
func TestFindInfraRootWalksUp(t *testing.T) {
root := newInfraTree(t, "vault")
got, err := findInfraRoot(filepath.Join(root, "stacks", "vault"))
if err != nil {
t.Fatalf("findInfraRoot error: %v", err)
}
if got != root {
t.Fatalf("findInfraRoot = %q, want %q", got, root)
}
}
func TestFindInfraRootErrorsOutsideInfra(t *testing.T) {
if _, err := findInfraRoot(t.TempDir()); err == nil {
t.Fatal("expected error outside an infra checkout")
}
}
func TestResolveStack(t *testing.T) {
root := newInfraTree(t, "vault", "monitoring")
dir, err := resolveStack(root, "vault")
if err != nil {
t.Fatalf("resolveStack error: %v", err)
}
if want := filepath.Join(root, "stacks", "vault"); dir != want {
t.Fatalf("resolveStack = %q, want %q", dir, want)
}
if _, err := resolveStack(root, "nonesuch"); err == nil {
t.Fatal("expected error for unknown stack")
}
}

62
cli/telemetry.go Normal file

@ -0,0 +1,62 @@
package main
import (
"bytes"
"encoding/json"
"net/http"
"os"
"strconv"
"strings"
"time"
)
// usageJob is the Loki stream job label for homelab usage telemetry.
const usageJob = "homelab-usage"
// emitUsage best-effort records one verb invocation to Loki for cross-user
// usage analytics. Labels are low-cardinality (job/user/verb); the line carries
// only exit code + CLI version. NEVER args, paths, flags, or secrets. It must
// never affect the command: all errors are swallowed and a tight timeout bounds
// the cost. Opt out with HOMELAB_TELEMETRY=0.
func emitUsage(verb string, runErr error) {
switch os.Getenv("HOMELAB_TELEMETRY") {
case "0", "off", "false", "no":
return
}
if verb == "" || strings.HasPrefix(verb, "usage") {
return // don't self-record the analytics reader
}
exit := 0
if runErr != nil {
exit = 1
}
body, err := json.Marshal(lokiPush{Streams: []lokiStream{{
Stream: map[string]string{"job": usageJob, "user": currentUser(), "verb": verb},
Values: [][2]string{{
strconv.FormatInt(time.Now().UnixNano(), 10),
"exit=" + strconv.Itoa(exit) + " ver=" + version,
}},
}}})
if err != nil {
return
}
req, err := http.NewRequest("POST", "https://"+lokiHost+"/loki/api/v1/push", bytes.NewReader(body))
if err != nil {
return
}
req.Header.Set("Content-Type", "application/json")
resp, err := clientDialingIP(internalLBIP, 800*time.Millisecond).Do(req)
if err != nil {
return
}
resp.Body.Close()
}
type lokiPush struct {
Streams []lokiStream `json:"streams"`
}
type lokiStream struct {
Stream map[string]string `json:"stream"`
Values [][2]string `json:"values"`
}


@ -103,6 +103,6 @@ func notifyForIPChange(oldIP, newIP net.IP) error {
if err != nil {
return errors.Wrapf(err, "Error reading response")
}
-glog.Infof("Response:", string(responseBody))
+glog.Infof("Response: %s", string(responseBody))
return nil
}

18
cli/usage_test.go Normal file

@ -0,0 +1,18 @@
package main
import (
"strings"
"testing"
)
func TestUsageQuery(t *testing.T) {
got := usageQuery("30d", "")
want := `sum by (verb) (count_over_time({job="homelab-usage"}[30d]))`
if got != want {
t.Errorf("usageQuery(30d,\"\") = %q, want %q", got, want)
}
withUser := usageQuery("7d", "emo")
if !strings.Contains(withUser, `user="emo"`) || !strings.Contains(withUser, "[7d]") {
t.Errorf("usageQuery with user missing filter/range: %q", withUser)
}
}

191
cli/woodpecker.go Normal file

@ -0,0 +1,191 @@
package main
import (
"context"
"encoding/json"
"fmt"
"io"
"net"
"net/http"
"os"
"os/exec"
"strings"
"time"
)
// Woodpecker is reached at ci.viktorbarzin.me but routed via the internal Traefik
// LB (mirrors the proven `curl --resolve ci.viktorbarzin.me:443:10.0.20.203`):
// we dial the LB IP while keeping SNI/Host = the hostname so the cert verifies.
const (
wpHost = "ci.viktorbarzin.me"
wpLBIP = "10.0.20.203"
)
type wpClient struct {
base string
token string
http *http.Client
}
// wpToken reads WOODPECKER_TOKEN, else the canonical Vault path.
func wpToken() string {
if t := firstEnv("WOODPECKER_TOKEN", "WP_TOKEN"); t != "" {
return t
}
out, err := exec.Command("vault", "kv", "get", "-field=woodpecker_api_token", "secret/ci/global").Output()
if err != nil {
return ""
}
return strings.TrimSpace(string(out))
}
func newWPClient() (*wpClient, error) {
tok := wpToken()
if tok == "" {
return nil, fmt.Errorf("no woodpecker token — set WOODPECKER_TOKEN or `vault login` (reads secret/ci/global)")
}
ip := firstEnv("HOMELAB_WP_IP")
if ip == "" {
ip = wpLBIP
}
dialer := &net.Dialer{Timeout: 8 * time.Second}
tr := &http.Transport{
DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
if strings.HasPrefix(addr, wpHost+":") {
addr = ip + addr[strings.LastIndex(addr, ":"):]
}
return dialer.DialContext(ctx, network, addr)
},
}
return &wpClient{base: "https://" + wpHost, token: tok, http: &http.Client{Timeout: 20 * time.Second, Transport: tr}}, nil
}
// getJSON GETs path into v, retrying the transient empty/5xx responses the
// Woodpecker API intermittently returns under load.
func (c *wpClient) getJSON(path string, v interface{}) error {
var lastErr error
for attempt := 0; attempt < 5; attempt++ {
if attempt > 0 {
time.Sleep(2 * time.Second)
}
req, err := http.NewRequest("GET", c.base+path, nil)
if err != nil {
return err // a malformed request will not succeed on retry
}
req.Header.Set("Authorization", "Bearer "+c.token)
resp, err := c.http.Do(req)
if err != nil {
lastErr = err
continue
}
body, err := io.ReadAll(resp.Body)
resp.Body.Close()
if err != nil {
lastErr = err
continue
}
if resp.StatusCode >= 500 || len(strings.TrimSpace(string(body))) == 0 {
lastErr = fmt.Errorf("woodpecker GET %s -> %d (empty/5xx, retrying)", path, resp.StatusCode)
continue
}
if resp.StatusCode >= 300 {
return fmt.Errorf("woodpecker GET %s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
}
return json.Unmarshal(body, v)
}
return lastErr
}
type wpPipeline struct {
Number int `json:"number"`
Status string `json:"status"`
Event string `json:"event"`
Commit string `json:"commit"`
Message string `json:"message"`
}
func (c *wpClient) recentPipelines(repoID, n int) ([]wpPipeline, error) {
var ps []wpPipeline
err := c.getJSON(fmt.Sprintf("/api/repos/%d/pipelines?per_page=%d", repoID, n), &ps)
return ps, err
}
// findPipeline returns the pipeline for commit (prefix match), or the latest when
// commit is empty.
func (c *wpClient) findPipeline(repoID int, commit string) (wpPipeline, error) {
ps, err := c.recentPipelines(repoID, 25)
if err != nil {
return wpPipeline{}, err
}
if len(ps) == 0 {
return wpPipeline{}, fmt.Errorf("no pipelines for repo %d", repoID)
}
if commit == "" {
return ps[0], nil
}
for _, p := range ps {
if strings.HasPrefix(p.Commit, commit) {
return p, nil
}
}
return wpPipeline{}, fmt.Errorf("no pipeline for commit %s in the last %d", commit[:min(8, len(commit))], len(ps))
}
func (c *wpClient) repoID() (int, error) {
owner, repo, err := repoOwnerName()
if err != nil {
return 0, err
}
var r struct {
ID int `json:"id"`
}
if err := c.getJSON("/api/repos/lookup/"+owner+"/"+repo, &r); err != nil {
return 0, err
}
if r.ID == 0 {
return 0, fmt.Errorf("repo %s/%s not registered in woodpecker", owner, repo)
}
return r.ID, nil
}
// repoOwnerName derives <owner>/<repo> from the cwd git remote.
func repoOwnerName() (string, string, error) {
cwd, _ := os.Getwd()
root, err := gitRepoRoot(cwd)
if err != nil {
return "", "", fmt.Errorf("not in a git repository: %w", err)
}
remote := preferRemote(remotesOrEmpty(root))
url, err := gitOutput(root, "remote", "get-url", remote)
if err != nil {
return "", "", err
}
return parseOwnerRepo(url)
}
// parseOwnerRepo extracts owner/repo from an https or ssh git remote URL.
func parseOwnerRepo(url string) (string, string, error) {
u := strings.TrimSuffix(strings.TrimSpace(url), ".git")
u = strings.TrimSuffix(u, "/")
if i := strings.Index(u, "://"); i >= 0 {
u = u[i+3:]
}
u = strings.ReplaceAll(u, ":", "/") // git@host:owner/repo -> git@host/owner/repo
parts := strings.Split(u, "/")
if len(parts) < 2 || parts[len(parts)-1] == "" || parts[len(parts)-2] == "" {
return "", "", fmt.Errorf("cannot parse owner/repo from remote %q", url)
}
return parts[len(parts)-2], parts[len(parts)-1], nil
}
func isTerminalStatus(s string) bool {
switch s {
case "success", "failure", "error", "killed", "declined", "blocked":
return true
}
return false
}
func isFailureStatus(s string) bool {
return s == "failure" || s == "error" || s == "killed" || s == "declined"
}
func min(a, b int) int {
if a < b {
return a
}
return b
}

40
cli/woodpecker_test.go Normal file

@ -0,0 +1,40 @@
package main
import "testing"
func TestParseOwnerRepo(t *testing.T) {
cases := []struct{ in, owner, repo string }{
{"https://forgejo.viktorbarzin.me/viktor/infra.git", "viktor", "infra"},
{"https://forgejo.viktorbarzin.me/viktor/infra", "viktor", "infra"},
{"git@github.com:ViktorBarzin/infra.git", "ViktorBarzin", "infra"},
{"https://github.com/ViktorBarzin/tripit/", "ViktorBarzin", "tripit"},
}
for _, c := range cases {
o, r, err := parseOwnerRepo(c.in)
if err != nil || o != c.owner || r != c.repo {
t.Errorf("parseOwnerRepo(%q) = (%q, %q, %v), want (%q, %q)", c.in, o, r, err, c.owner, c.repo)
}
}
if _, _, err := parseOwnerRepo("nonsense"); err == nil {
t.Error("expected error for unparseable remote")
}
}
func TestStatusClassification(t *testing.T) {
for _, s := range []string{"success", "failure", "error", "killed"} {
if !isTerminalStatus(s) {
t.Errorf("%q should be terminal", s)
}
}
for _, s := range []string{"running", "pending"} {
if isTerminalStatus(s) {
t.Errorf("%q should not be terminal", s)
}
}
if !isFailureStatus("failure") || !isFailureStatus("error") {
t.Error("failure/error should classify as failure")
}
if isFailureStatus("success") {
t.Error("success must not classify as failure")
}
}


@ -0,0 +1,42 @@
---
status: accepted
---
# The Android testing environment is a privileged KVM emulator pod in-cluster
Viktor's apps are growing Android clients (first: tripit's Capacitor shell —
see tripit ADR-0013/0014), and agents need a native Android instance to test
changes against before shipping. All K8s nodes already run with CPU type
`host`, so `/dev/kvm` works inside the cluster.
Decision (2026-06-11): one shared **Android 16 (API 36) Google-emulator
instance** runs as a privileged pod in namespace `android-emulator`
(stack `stacks/android-emulator`), with `/dev/kvm` via hostPath, adb exposed
LAN-only on the shared MetalLB IP (10.0.20.200:5555), and a noVNC screen view
at android-emulator.viktorbarzin.lan. The SDK/system-image/AVD live on a PVC;
the image is a slim manually-built shell.
## Considered options
- **devvm-local docker emulator** — rejected as the durable home: shared
24GB workstation, ~13GB free disk, per-machine, not shared across agents.
- **Dedicated Proxmox VM** — rejected: burns scarce PVE host headroom 24/7
and adds a whole VM lifecycle for one emulator.
- **redroid (container-native Android)** — rejected: requires binder kernel
modules on every node (documented binderfs incompatibilities), max
Android 15; most invasive for the least version coverage.
- **budtmo/docker-android** — rejected: turnkey but capped at Android 14;
the native features driving the Android work (Live Updates, background
GPS) are Android 16 behaviors, matching the real target device.
- **/dev/kvm device plugin instead of privileged** — deferred: a new
cluster component to avoid one namespace-scoped exclude-list entry; the
exclude pattern (kured/woodpecker/frigate/changedetection) already exists.
## Consequences
- `android-emulator` joins the Kyverno `security_policy_exclude_namespaces`
list (privileged allowed; registry policy also bypassed in-namespace).
- adb is unauthenticated by design — the LB IP must remain LAN-only.
- Single shared instance: concurrent agent sessions share Android state;
long destructive work should presence-claim `service:android-emulator`.
- Rendering is swiftshader (CPU) — the contended T4 stays out of the path.
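
Since adb is unauthenticated, the LAN-only requirement is worth checking mechanically. A minimal sketch, assuming a hypothetical helper `isLANOnly` that treats an address as acceptable only when its IP falls in a private range (`net.IP.IsPrivate`, available since Go 1.17):

```go
package main

import (
	"fmt"
	"net"
)

// isLANOnly reports whether addr (host:port) uses a literal IP in a
// private range (RFC 1918 for IPv4, RFC 4193 for IPv6). Hostnames and
// malformed addresses are rejected rather than resolved.
func isLANOnly(addr string) bool {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsPrivate()
}

func main() {
	fmt.Println(isLANOnly("10.0.20.200:5555")) // the adb endpoint named above
	fmt.Println(isLANOnly("8.8.8.8:5555"))     // a public address
}
```

This only validates the address itself; whether the MetalLB IP is actually unreachable from outside the LAN still depends on the network's routing and firewall rules.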


@ -0,0 +1,24 @@
---
status: accepted
date: 2026-06-12
---
# All owned images build off-infra on GitHub Actions and live on ghcr.io
In-cluster Woodpecker buildkit builds repeatedly hurt the homelab: registry-push load OOMKilled Forgejo (2026-06-09), buildkit→Forgejo pushes ride a flaky hairpin, build IO lands on the shared sdc HDD, and the Forgejo registry PVC sat at its 50Gi ceiling with retention stuck in DRY_RUN. We decided every owned image is built by GitHub Actions and hosted on ghcr.io, extending the tripit pilot (2026-06-09) to the whole fleet: Forgejo stays the canonical git host, a one-way push-mirror feeds a GitHub mirror, and the mirror's workflow builds, pushes, then POSTs Woodpecker's API to deploy. The Forgejo container registry is decommissioned as a build target — one manual cleanup pass keeps a last-known-good tag per Service, after which nothing pushes to it.
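
The final "POST to Woodpecker's API" step could look roughly like the sketch below. It rests on assumptions that the ADR does not state: that the deploy is triggered through Woodpecker's pipeline-creation endpoint (`POST /api/repos/{repo_id}/pipelines` with a JSON `branch` field), and the repo ID, branch, and token are placeholders. `triggerDeploy` and `demo` are hypothetical names, and a local `httptest` server stands in for Woodpecker.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// triggerDeploy asks Woodpecker to start a pipeline on branch. The endpoint
// path and body shape are assumptions about the Woodpecker API.
func triggerDeploy(c *http.Client, base, token string, repoID int, branch string) error {
	body, err := json.Marshal(map[string]string{"branch": branch})
	if err != nil {
		return err
	}
	url := fmt.Sprintf("%s/api/repos/%d/pipelines", base, repoID)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		msg, _ := io.ReadAll(io.LimitReader(resp.Body, 512))
		return fmt.Errorf("deploy trigger -> %d: %s", resp.StatusCode, msg)
	}
	return nil
}

// demo runs triggerDeploy against a local stand-in server and returns a
// summary of the request the server received.
func demo() (string, error) {
	var seen string
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen = r.Method + " " + r.URL.Path + " " + r.Header.Get("Authorization")
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()
	err := triggerDeploy(srv.Client(), srv.URL, "example-token", 42, "master")
	return seen, err
}

func main() {
	fmt.Println(demo())
}
```

In the workflow itself this call would run only after the image push succeeds, so a failed build never triggers a deploy.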
## Considered options
- **GHA builds pushing back into the Forgejo registry** — keeps images home and the pull path unchanged, but keeps the exact failure mode that motivated the move (Forgejo OOM under blob-push load), keeps the PVC growth, and keeps the circular dependency where the images needed to repair the cluster live inside the cluster. Rejected.
- **Per-repo in-cluster fallback builds** (the old `build-fallback.yml` pattern) — rejected in favour of a clean cut: a GitHub outage pauses image builds (running workloads are unaffected), and existing fallback files are deleted. The hedge against ghcr's "currently free" private storage ever being enforced is the visibility split (public images are permanently free) plus re-creating fallbacks if that day comes.
- **Paid builders (Docker Build Cloud, Depot)** — solve a multi-arch/persistent-cache problem this fleet doesn't have (everything is linux/amd64). Rejected.
## Consequences
- DR improves: images survive homelab loss, so a dead cluster can pull everything it needs to come back — the same doctrine that keeps the monorepo on GitHub ("Forgejo dies with the cluster").
- Private ghcr pulls bypass the registry VM's pull-through cache (it can't authenticate), so cold-node pulls of private images depend on GitHub availability; public images cache normally.
- Visibility is decided per repo: public = generic tooling that passes a gitleaks/PII history scan; private = personal, financial, or legally-gray domains. A failed scan means the repo stays private — canonical history is never rewritten for publication. For interpreted languages repo visibility ≈ image visibility (the image ships the source).
- Only private-repo builds consume GitHub free-plan minutes (~12 builders, well under the 2,000/mo free tier; usage is reviewed after rollout wave 2 before considering Pro).
- Woodpecker becomes deploy-only; its agents never build. The Kyverno-synced `registry-credentials` stays (Forgejo git + frozen last-known-good images); a cluster-wide Kyverno-synced `ghcr-credentials` joins it.
- Builders with no live consumer (terminal-lobby, webhook-handler, hmrc-sync, trading-bot, travel-agent, trip-planner) are decommissioned rather than migrated; travel_blog is decommissioned outright (service + CI). Any revival adopts this ADR's pattern.
- Workflows build single-manifest images (`provenance: false`, linux/amd64 only) so registry retention never faces the orphaned-index-children failure class that broke Forgejo's cleanup.


@ -0,0 +1,30 @@
# Keep Forgejo as the canonical forge; complete the one-way GitHub mirror instead of swapping to GitHub
Status: accepted (extends ADR-0002)
## Context
Repo trees kept diverging between the Forgejo **Canonical repo** (`viktor/<name>`) and its **GitHub mirror**. A 2026-06-15 audit found the cause: an *incomplete rollout* of the Forgejo→GitHub push-mirror, not anything inherent to Forgejo. 14 repos carry **both** remotes and are hand-pushed to each (`push_mirrors = 0` on Forgejo — e.g. `infra`, `finance`, `Website`), so a human forgets one side and the trees drift; the ADR-0002-onboarded repos have a working one-way mirror (`push_mirrors = 1` — e.g. `tripit`, `recruiter-responder`) and never diverge. `infra/CONTEXT.md` already says Forgejo is the only place commits land and the GitHub mirror must never be a second writable remote — practice had simply drifted from the documented model.
The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling reframed it: the pain (divergence) is a "two writable remotes" problem, and the stated preference is self-hosted-primary with the remote as backup.
## Decision
Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`:
- Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.**
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.
- `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge.
- Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror."
## Considered options
- **Swap to GitHub (retire Forgejo).** Rejected: takes on a hard WAN dependency for *all* git ops — including `infra`, the repo you use to *recover* from outages — plus git-crypt secrets on GitHub as primary, a Woodpecker forge migration (WP authenticates against and watches Forgejo), and GitHub private-repo CI-minute/size limits. All to fix a problem that is actually an incomplete mirror, not Forgejo's existence. Contradicts the self-hosted-primary preference.
- **GitHub canonical, Forgejo demoted to a DR pull-mirror.** Rejected for the same WAN-dependency and forge-migration cost; unnecessary once the real cause is understood.
## Consequences
- Divergence becomes structurally impossible — one push target per repo.
- Forgejo stays load-bearing (canonical git + the Woodpecker forge), so every cost of the swap is avoided.
- The GitHub-limits worry is neutralized: private code lives on Forgejo (unlimited, self-hosted); GitHub holds mirrors for CI + backup. (GitHub Free has unlimited private repos anyway; the real limits are GHA minutes and ~1 GB repo size — `travel_blog` at 1.4 GB is why it never went to GHA.)
- One-time remediation is required and carries a data-loss footgun: the Forgejo→GitHub mirror **force-overwrites** GitHub, so for each currently-diverged repo, any GitHub-only commits must be merged into Forgejo **before** the mirror is enabled, or they are lost. Scope: the 14 dual-push repos + the `infra` reconciliation; all other repos are already single-remote and non-diverging.

@ -0,0 +1,30 @@
# homelab: a unified infra-ops CLI grown in place from infra/cli
Agents re-derive the same operational command boilerplate every session — mining
51,116 bash commands across 2,225 past sessions showed dense, repeated patterns
(the infra inner-loop alone is ~29%). We are building `homelab`, one CLI encoding
the deterministic, repeated **actions** (not judgment) agents run — composable in
bash, JSON-capable, and discovered progressively via `homelab manifest`. It is
grown **in place** in `cli/` (the existing `infra-cli`), absorbing new verb-groups
alongside the preserved legacy webhook use-cases. Versioned with a `cli/VERSION`
file (the infra repo deploys continuously and does not cut semver tags).
## Considered options
- **Its own top-level repo** (the original plan) — rejected in favour of keeping
it where the Terraform/Terragrunt and `scripts/tg` it drives already live; the
Go source isn't git-crypt-encrypted and a provision-time build is unaffected by
GitOps continuous-deploy.
- **A fresh CLI ignoring infra-cli** — rejected: strands the VPN/DNS/email
webhook use-cases.
- **Raw kubectl/tg/ssh + skills + MCP only** — kept for everything outside the
recurring action surface (methodology skills; third-party/owned MCP such as
phpIPAM, which homelab does NOT duplicate).
## Consequences
- The binary is dual-purpose: the agent-facing `homelab` verb surface AND the
in-cluster `infra-cli` webhook image. `main()` front-dispatches homelab verbs
and falls through to the legacy `-use-case` path verbatim.
- Distribution: built from source to `/usr/local/bin/homelab` during devvm
provisioning (`t3-dispatch` precedent), refreshed by `t3-autoupdate`.

@ -0,0 +1,23 @@
# homelab v0.1 scope: the infra inner-loop; everything allowed, tiers recorded
v0.1 ships only the highest-volume surface — the infra inner-loop: `work`
(worktree lifecycle), `tf` (terragrunt via `scripts/tg` + fmt/validate/
force-unlock), and `claim`/`release` (presence) — because it is ~29% of all mined
commands and where agents lose the most time and leak the most presence claims.
v0.1 enforces **no** homelab-level permission gating: everything is allowed,
relying on existing gates (harness permission mode, presence claims, plan
approval). But every verb records a `read|write` tier (visible in `manifest`), so
a PreToolUse classifier hook (auto-allow reads / prompt writes) can be added
later with zero restructuring.
## Considered options
- **Reads-first vertical slice** (top read verb per domain) — lower risk, broad
value, but defers the toil that motivated the project.
- **One domain deep (k8s)** — cleanest template, narrow day-one value.
We chose the highest-volume-but-write-heavy infra loop deliberately, accepting
the extra complexity (worktree lifecycle, git-crypt flag injection, presence
coupling, branch-protection PR fallback) for the biggest immediate toil
reduction. k8s/node/secret/net/ci verb-groups are deferred to later versions.

@ -0,0 +1,29 @@
# homelab work/tf behaviour: native worktree entry, gated auto-land, presence-coupled apply
Four behaviours of the infra-loop verbs are surprising enough to record:
1. **`work` owns worktree create/land/clean, but session *entry* delegates to the
native harness worktree tool.** A CLI is a child process and cannot change the
agent's working directory; `EnterWorktree` can. So `homelab work start <topic>`
creates the worktree + branch off `<remote>/master` (git-crypt-aware) and
prints the path — the agent enters it with native `EnterWorktree({path})`.
2. **`work land` is auto-land, but gated on verification.** It merges master in →
runs verification → pushes `HEAD:master` (fetch+merge+retry on
non-fast-forward) → falls back to pushing the feature branch for a PR when the
direct push is rejected (branch protection). It **refuses to push when it
cannot verify** (no `--verify-cmd` and no auto-detected suite) unless
`--no-verify` is passed — added after an accidental smoke-test land pushed
unverified WIP to master (benign: the infra CI applied 0 stacks because the
diff was `cli/`-only, but an unverified land must be deliberate, not default).
3. **`tf apply` is first-class despite GitOps, and mandatorily presence-coupled.**
Local applies are out-of-band (CI applies canonically on push) but happen
constantly (~763× in the corpus). `tf apply <stack>` auto-claims `stack:<name>`,
delegates to `scripts/tg apply --non-interactive`, and **always releases on
exit** (normal, error, or signal via `sync.Once` + handler) — fixing the
documented ~200-claim leak — and prints an out-of-band reminder.
4. **Known v0.1 limitation:** `work land` does not yet block until CI is green; that
arrives with the ci/deploy watch verb-group. It prints a reminder to follow
the pipeline manually.

@ -0,0 +1,30 @@
# homelab k8s verb-group: app→pod resolver, read/write split, config-mutation stays raw
v0.2 adds the Kubernetes verb-group — the biggest remaining surface by far
(mining the post-v0.1 corpus: 11,291 `kubectl` commands across 243 sessions, more
than every other domain combined).
It is built on an **app→namespace→pod resolver**: most namespaces hold exactly
one app, so `<app>` defaults to the namespace, and the target defaults to
`deploy/<app>` (kubectl resolves a pod from the Deployment). `-n`/`--pod`/`-c`/
`-l`/`--tty` override; multi-pod namespaces (`dbaas`, `monitoring`) need
specificity. The CLI uses the ambient kubeconfig — no per-call auth flags.
Verbs: read — `status`, `get`, `logs`, `describe`, `debug` (one-shot triage),
`pf`, `rollout-status`; write/operational — `db`, `exec`, `restart`, `rm-pod`.
## Decisions worth recording
- **Config-mutation verbs are deliberately NOT exposed** (`apply`/`edit`/`patch`/
`scale`/`create`). They stay raw `kubectl`, by design, per the repo's
Terraform-only policy — the corpus confirms they're low-frequency, and a
friendly verb would normalise a policy violation.
- **`rm-pod` is restricted to pods/jobs only** — deleting Deployments/STS/PVCs is
config mutation and forbidden; the verb cannot target them.
- **`db` encodes the dbaas exec pattern** (the single highest-value k8s
sub-pattern, ~886 dbaas ops): PG via `pg-cluster-rw -c postgres`,
`psql -U postgres -d <app>`; MySQL via `mysql-standalone-0` with a
`bash -c 'mysql -p"$MYSQL_ROOT_PASSWORD" …'` wrapper so the password comes from
the pod env and never appears on the command line.
- Read verbs were smoke-tested against the live cluster; write verbs are
unit-tested (resolver, db-plan, shell-quoting) but not fired at live state.

@ -0,0 +1,30 @@
# homelab memory verb-group: direct HTTP client to claude-memory; MCP deprecation path
v0.3 adds the memory verb-group so agents can search and navigate memory from the
CLI. `claude-memory` is a FastAPI service (Postgres-backed, `Bearer`-auth,
ingress `auth = "none"` so programmatic clients work) — the **MCP is just one
frontend over it**. `homelab memory` is a thin HTTP client over the same API,
using the env the hooks already set (`CLAUDE_MEMORY_API_URL` +
`CLAUDE_MEMORY_API_KEY`; defaults to the ingress). Because it talks to the HTTP
API directly, it **works even when the MCP frontend is down** — the recurring
MCP-disconnect problem that motivated claude-memory HA (and that took the MCP
offline for the entire session this was built in).
Verbs: `recall` (server-side semantic ranking), `list`, `categories`, `tags`,
`stats`, `secret` (read); `store`, `update`, `delete` (write). Validated against
the live API including a store→recall→delete round-trip — full data-plane parity
with the MCP.
## Deprecation path (deliberate follow-up — NOT done in v0.3)
The MCP is more than tools: the **per-prompt auto-recall hook** and the
**auto-learn hook** run on every prompt for every agent. Deprecating it safely is
a separate, sequenced change:
1. Rewire the auto-recall hook to `homelab memory recall` and the auto-learn hook
to `homelab memory store`.
2. Update the CLAUDE.md memory policy to point at the CLI.
3. Uninstall the MCP.
Done CLI-first (verbs proven before touching the every-prompt path) so a
regression can't silently break auto-recall/auto-learn fleet-wide.

@ -0,0 +1,29 @@
# homelab ci/deploy verbs: API-based watch, internal-LB dialer, work-land integration
v0.4 adds `ci`/`deploy` — the biggest *reasoning* sink in agent sessions (watching
a build/deploy to completion), proven during the session that built it (hours
spent hand-rolling Woodpecker API polling, DB-schema reverse-engineering, and
retrigger logic for a single CI incident).
## Decisions
- **API, not DB.** The verbs query the Woodpecker REST API (version-stable),
not its Postgres schema (which drifts across upgrades — column renames bit us
mid-incident). Reached via the internal Traefik LB by dialing `10.0.20.203`
while keeping SNI/Host = `ci.viktorbarzin.me` so the cert verifies (the Go
equivalent of the house `curl --resolve` pattern). Token from
`WOODPECKER_TOKEN` or Vault `secret/ci/global`; repo id resolved from the cwd
git remote via `/api/repos/lookup/<owner>/<repo>`.
- **Retries are mandatory.** The Woodpecker API intermittently returns empty/5xx
under load (it flapped through the whole build session); `getJSON` retries
empties with backoff so `ci watch` is reliable exactly when it's needed.
- **`work land` now waits for CI.** After pushing, `work land` calls `ci watch`
on the landed commit and fails if the pipeline does — closing the gap ADR-0005
deferred. `--no-ci-watch` opts out.
- **`deploy wait` encodes the "rollout status lies" rule:** it first waits for
the deployment image to reference the expected sha, *then* blocks on rollout
status (kubectl-based; reuses the k8s helpers).
- **`ci logs` deferred to v0.4.1.** Woodpecker's per-pipeline detail/log
endpoints were the least reliable this session (often empty); `status`/`watch`
rely on the list endpoint that works. A DB-backed `ci logs` is a possible
follow-up if the API path stays flaky.

@ -0,0 +1,37 @@
# homelab net/dns/metrics/logs verbs: endpoint resolution as the unit of value
v0.5 adds `net`/`dns`/`metrics`/`logs`. These were chosen against an explicit
test the user posed mid-build: *does the verb save reasoning, or only typing?* A
wrapper over a command already known fluently (plain `ssh`, `vault kv get`) saves
keystrokes but not thought. These four save thought — the reasoning they encode
is **which endpoint, reached how, with what auth/URL shape** — re-derived every
time otherwise. (That same test deprioritized `node ssh` aliasing and `secret
get`, which are thin wrappers; see the session discussion.)
## Decisions
- **Internal ingresses, reached via the LB.** Everything routes through the
Traefik LB by dialing `10.0.20.203` with the URL host preserved as SNI — the
Go form of the house `curl --resolve host:443:10.0.20.203` pattern
(`probe.go: clientDialingIP`). Verified live before building: Prometheus
(`prometheus-query.viktorbarzin.lan`) and Loki (`loki.viktorbarzin.lan`) both
answer JSON over the LB with **no auth gate and no port-forward** — so these
stay clean HTTP clients, not kubectl wrappers.
- **`net check` is two-legged on purpose.** It resolves the host via public DNS
(→ Cloudflare) AND dials the internal LB, reporting both — because the useful
question is *where* a break is (CF edge vs the app vs the LB path), which a
single curl can't answer. The external leg forces public resolution (the devvm
resolver is split-horizon and would otherwise hit the LB for both).
- **`metrics alerts` uses the `ALERTS` series, not `/api/v1/alerts`.**
`prometheus-query.*` is a query-only frontend (404 on `/api/v1/alerts`), and
Alertmanager has no LB ingress (the alert-digest reads it in-cluster). Firing
alerts are exposed as the synthetic `ALERTS{alertstate="firing"}` time series,
queryable through the working endpoint — so no new dependency.
- **Deliberately NOT built:** in-cluster-only endpoints (Alertmanager v2,
raw `*.svc` services) that would force port-forward/`kubectl run`. The
reasoning-savings there don't beat the added moving parts; kept out of scope.
- **No `node`/`secret` group.** Same test: their high-volume parts are
command-wrappers (low savings); only compound node ops (serial console, VM
wait, fan-out) would qualify, and those are lower-frequency. Left unbuilt
unless a concrete pain surfaces — the high-value deterministic surface
(tf/work/ci/k8s/memory + these probes) is now covered.
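
The `ALERTS`-series approach can be sketched as a query builder plus a response parser. This is a minimal illustration: `alertsQueryURL` and `firingAlerts` are hypothetical names, the `https` scheme is an assumption, and the sample response and the `NodeDown` alert name are made up for the demo. The response shape follows the standard Prometheus instant-query format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// alertsQueryURL builds an instant query for firing alerts against a
// query-only Prometheus frontend.
func alertsQueryURL(base string) string {
	q := url.Values{}
	q.Set("query", `ALERTS{alertstate="firing"}`)
	return base + "/api/v1/query?" + q.Encode()
}

// promResponse covers only the fields this sketch reads.
type promResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
		} `json:"result"`
	} `json:"data"`
}

// firingAlerts extracts the alertname label from each returned series.
func firingAlerts(body []byte) ([]string, error) {
	var r promResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, err
	}
	if r.Status != "success" {
		return nil, fmt.Errorf("prometheus status %q", r.Status)
	}
	names := make([]string, 0, len(r.Data.Result))
	for _, series := range r.Data.Result {
		names = append(names, series.Metric["alertname"])
	}
	return names, nil
}

func main() {
	fmt.Println(alertsQueryURL("https://prometheus-query.viktorbarzin.lan"))

	// Made-up sample body in the instant-query response shape.
	sample := []byte(`{"status":"success","data":{"resultType":"vector","result":[
		{"metric":{"__name__":"ALERTS","alertname":"NodeDown","alertstate":"firing"},"value":[0,"1"]}]}}`)
	names, err := firingAlerts(sample)
	fmt.Println(names, err)
}
```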

@ -0,0 +1,34 @@
# homelab usage telemetry: evidence-driven verb prioritization, privacy by construction
v0.6 adds `usage top` plus a fire-and-forget emit on every dispatched verb. It
exists to answer the question that drove the whole CLI — *which verbs are worth
adding next* — with data instead of one maintainer's habits (the earlier mining
covered a single user's ~51k commands, so the surface is shaped to that user).
## Decisions
- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows
the verb path; after `Run` returns we emit `{verb, exit}`. Discovery verbs
don't go through `dispatch()` (`manifest`/`version`/`help` are handled in
`dispatchTop`), so they don't self-record; `usage *` is skipped explicitly so
the analytics reader doesn't pollute its own data.
- **Payload is deliberately minimal: verb path + exit code only.** Labels
`{job=homelab-usage, user, verb}` (all low-cardinality) + line `exit=N ver=X`.
**No args, paths, flags, hostnames, or secrets** ever leave the process — the
emit sees only the matched verb name, not the arguments. This is what makes
cross-user aggregation safe.
- **Shared Loki sink → cross-user analytics WITHOUT reading homes.** Each user's
CLI writes its own invocations (attributed to its OS user) to the shared Loki
push API via the Traefik LB (verified: HTTP 204, no auth). `usage top` reads
back with a LogQL metric query. This is the privacy-preserving resolution to
"what does everyone (e.g. another user) use" — it never touches anyone's
`~/.claude`, which the org per-user policy bars (see the per-user red-line in
managed-settings; reading another user's home is off-limits even for an owner
in-session — a fresh session under changed MDM policy is the only legitimate
path, and even then this telemetry is the better answer).
- **Best-effort, never affects the command.** All errors swallowed; an 800ms
client timeout bounds the cost; opt-out via `HOMELAB_TELEMETRY=0`. Telemetry
must never slow or break the tool it measures.
- **Loki, not a new datastore.** Zero new infra, and it dogfoods the v0.5 `logs`
path (same host, same LB dial). Presence MySQL was the alternative (queryable
SQL) but would add a write dependency and creds; Loki needs neither.

@ -0,0 +1,54 @@
# homelab Home Assistant verbs: token resolution + host SSH, not entity control
v0.7 adds `ha token` and `ha ssh`. They were chosen by mining a heavy HA
operator's sessions: across ~1,900 shell commands the single most-repeated line
(420×) was a hand-rolled `kubectl … | base64 -d | python -c '…token'` pipeline,
and a bespoke `ssh -o StrictHostKeyChecking=no -o …` invocation was redefined as
a shell function ~30× — both re-derived from scratch every session. The existing
`home-assistant-sofia.py` already covers the *API*, but it goes unused from an
arbitrary cwd (it needs `HOME_ASSISTANT_SOFIA_TOKEN` set and is referenced by a
cwd-relative path), so agents bypassed it. A global verb on `$PATH` closes that
gap for every user in every directory.
## Decisions
- **Only the two gaps the `ha` MCP can't fill.** The `ha` MCP server already
does entity state and control (`get_state`, `call_service`, history, logs).
Per the CLI's founding rule — *MCP-encoded actions are out of scope* (ADR-0004)
— we do **not** reimplement `on`/`off`/`list`/`state`. We add only token
*resolution* and host *SSH*, neither of which an API-only MCP can provide. The
value is endpoint/secret/host resolution, exactly like `net`/`dns` (ADR-0010).
- **`ha token` resolves live from the cluster, not from an env var.** It reads
the dedicated k8s Secret `openclaw/ha-tokens` (one key per instance: `sofia` /
`london`) via the ambient kubeconfig. This is robust to env drift — the precise
failure that made agents re-derive the pipeline. Read-tier, prints the bare
token to stdout so it composes in `$(…)`, mirroring `memory secret`.
- **The token is split into its own least-privilege secret** (`stacks/openclaw/ha_tokens.tf`).
It was originally read from the `skill_secrets` key of `openclaw-secrets` (a JSON blob
also holding `slack_webhook` + `uptime_kuma_password`), which only cluster
admins can read — so the verb hung/failed for the non-admin operator it was
built for (emo = `emil.barzin@gmail.com`, group `Home Server Admins`, whose
OIDC identity is barred from secrets in `openclaw`). `ha-tokens` carries only
the HA tokens, with a Role+RoleBinding granting `get` on *just that secret* to
the `Home Server Admins` group (k8s RBAC can't scope to a JSON sub-key, hence
the separate object). openclaw's own deployment keeps reading `openclaw-secrets`
— this is purely additive.
- **`ha ssh` is deterministic and per-user.** Flags are fixed for unattended
use: `-F /dev/null` (ignore user ssh-config), `StrictHostKeyChecking=no` +
`UserKnownHostsFile=/dev/null` (no host-key prompt/record — agents have no
TTY), `BatchMode=yes` + `ConnectTimeout=10` (fail fast, never hang). The key
is the **invoking user's** `~/.ssh/id_ed25519`, so the verb isn't tied to
whoever first wrote the workflow; that user's key must be enrolled on the HA
host. Write-tier (runs an arbitrary remote command).
- **sofia is the default; london is structural.** The devvm sits on the Sofia
LAN, so `vbarzin@192.168.1.8` is reachable and is the default instance. london
(`hassio@192.168.8.103`) is in the instance map so `ha token --instance london`
works (a pure secret read), but `ha ssh --instance london` generally won't
connect from here — london is remote. We model it correctly rather than
pretend it's reachable.
- **Scope held at two verbs.** `ha api` (an authenticated curl passthrough for
the endpoints the MCP/script don't cover — `/api/template`, `/reload`,
`check_config`, `/error_log`) was deferred: once `ha token` exists, raw curl is
already unblocked, and a generic passthrough overlaps the MCP. Re-measure via
`usage top` (ADR-0011); add targeted sugar verbs only if those endpoints are
still hand-rolled often.

@ -0,0 +1,75 @@
# homelab browser verbs: headful (anti-bot) web automation via cluster Chrome
v0.8 adds `browser run`, `browser open`, and `browser --help`. They package a
capability that already existed but was undiscoverable: driving the cluster's
**headful** Chrome (`chrome-service` — real Chrome under Xvfb, CDP on
`svc/chrome-service:9222`) from the devvm, for sites that detect and block
headless automation.
## Motivating incident (2026-06-22)
Logging a washing-machine repair on the Stirling Ackroyd **Fixflo** tenant
portal: the headless `@playwright/mcp` browser loaded the site and filled the
entire multi-step form, but the **final submit silently failed** — Fixflo's
pre-submit `POST /IssuePreCreationCheck` returned `net::ERR_FILE_NOT_FOUND`, the
spinner hung, no issue was created. Root cause = headless-Chrome detection. The
fix was to drive the headful `chrome-service` over `connect_over_cdp` — it
submitted first try (Fixflo ref IS22657587). That capability was documented
(`docs/architecture/chrome-service.md`) but **not packaged or discoverable**, so
it took ~40 min, three redundant full form re-runs, and a user hint. The agent
also misread `ERR_FILE_NOT_FOUND` as "network egress" and retried blind instead
of inspecting the network panel.
## Decisions
- **Mechanics in `homelab`, not a `~/.claude` skill.** A standalone skill was
rejected: the CLI is run every session (so the verb is *discoverable*), is
versioned, multi-user, and test-covered. A private, untested skill is none of
those. The command owns only the deterministic *mechanics* (port-forward,
stealth injection, lifecycle) — the agent supplies the Playwright script, so
*judgment* stays out of the CLI (the founding rule, ADR-0004/0005).
- **The failure was judgment, not setup friction**, so the CLI is paired with a
one-line pointer in always-in-context `~/code/CLAUDE.md` and a diagnostic
payload in `browser --help`: the *when-to-use* signature (a site loads but a
gated action fails/hangs, or one request 500s/aborts while siblings 200 →
suspect headless detection) and an error-code cheat-sheet (`ERR_FILE_NOT_FOUND`
= request resolved/intercepted by the automation layer, **not** egress;
egress failures are `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED`
and would break the page load too). A command the agent doesn't think to run is
useless; the cheat-sheet is the actual fix for the misdiagnosis.
- **Reach the pod via `kubectl port-forward`, then `connect_over_cdp` to
localhost.** port-forward tunnels API-server→pod, so it **bypasses the `:9222`
NetworkPolicy** that gates in-cluster callers — the devvm needs no namespace
label. Readiness is asserted against `/json/version`: the endpoint must report
a real `Chrome/…`, never `HeadlessChrome` (the whole point). The forward is
**always** torn down (process-group kill + signal handler), on success and on
error — an acceptance requirement.
- **Default to a fresh incognito context; `--shared-context` opts into the warmed
profile.** chrome-service is a single shared browser with a persistent profile.
A fresh, always-closed context is safe for concurrent callers (tripit's fare
scrape connects per-quote) and is what production already does. The warmed
persistent profile (cookies from a manual noVNC login) is opt-in for flows that
need a pre-logged-in session.
- **Pin the node CDP client to `playwright-core@1.48.2`** to match the
chrome-service image minor (`mcr.microsoft.com/playwright:v1.48.0-noble`,
Chromium 130). `connect_over_cdp` speaks the browser's CDP, and the protocol
changes between Playwright minors; the devvm's ambient Python Playwright was
1.58, a 10-minor skew. The pin makes behaviour deterministic across the fleet
regardless of local drift. `playwright-core` (not `playwright`) because no
browser binary is needed — we connect to the remote one.
- **Self-provision the client lazily, no per-user setup.** The pinned client is
installed once into `~/.cache/homelab/browser-client/` (idempotent, version-
guarded) on first use, alongside the embedded runner + stealth files. node is
already fleet-wide; this avoids coupling the feature to a provisioner change
and keeps it self-contained and self-healing. The client runs on the devvm, so
`setInputFiles` streams local files to the remote browser over CDP — no
`chmod`/staging-dir workaround on the CDP path.
- **Vendor `stealth.js`, guard against drift.** The CLI embeds a byte-for-byte
copy of `stacks/chrome-service/files/stealth.js` (the source of truth the
in-cluster callers use) via `go:embed`; a unit test fails if the copy drifts.
`go:embed` can't reach outside the package dir, hence the vendored copy rather
than a path reference.
- **Scope held at two action verbs + help.** `run` (arbitrary script — the
workhorse) and `open` (navigate + title/text/screenshot — a quick check) cover
the surface. Both are write-tier; the bare `browser`/`--help` is read. Re-measure
via `usage top` (ADR-0011) before adding more.
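
The `/json/version` readiness assertion can be sketched as a check on the `Browser` field that Chrome's DevTools endpoint returns. This is a minimal illustration: `checkHeadful` is a hypothetical name and the version strings in the demo are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// checkHeadful inspects a /json/version body and accepts only a real
// "Chrome/..." browser, rejecting "HeadlessChrome/..." and anything else.
func checkHeadful(body []byte) error {
	var v struct {
		Browser string `json:"Browser"`
	}
	if err := json.Unmarshal(body, &v); err != nil {
		return fmt.Errorf("parse /json/version: %w", err)
	}
	switch {
	case strings.HasPrefix(v.Browser, "HeadlessChrome/"):
		return fmt.Errorf("endpoint is headless (%s)", v.Browser)
	case strings.HasPrefix(v.Browser, "Chrome/"):
		return nil
	default:
		return fmt.Errorf("unexpected Browser value %q", v.Browser)
	}
}

func main() {
	fmt.Println(checkHeadful([]byte(`{"Browser":"Chrome/130.0.0.0"}`)))
	fmt.Println(checkHeadful([]byte(`{"Browser":"HeadlessChrome/130.0.0.0"}`)))
}
```

Failing closed on an unrecognised value keeps a misrouted port-forward from being mistaken for a ready browser.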

@ -0,0 +1,29 @@
---
status: accepted
date: 2026-06-24
---
# Service identity is namespace + label; east-west observability via Calico Goldmane; no service mesh
As the Service count grows we want an audit-grade record of which Service talks to which — the "service mesh evaluation" `docs/plans/2026-04-20-infra-audit-design.md` flagged as never done ("worth a design doc even if the answer is no, too much complexity for the gain"). We evaluated the full design space against two constraints: the trust model is a single-tenant cluster needing **attribution-grade** forensics (reconstruct events in a cluster we trust), not cryptographic non-repudiation against a hostile pod; and we are acutely **etcd-constrained** (we removed VPA/Goldilocks for exactly this, and carry open beads `code-oflt`/`code-at4f` on etcd starvation). Decision: **service identity = the workload's namespace** (primary; Goldmane stamps it natively and "one Service ≈ one namespace" holds for ~87 of our namespaces), refined by an explicit `service-identity` label only in the few genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). **East-west observability = Calico 3.30 Goldmane + Whisker** (already in our Calico v3.30.7, currently `enabled = false` in `stacks/calico/main.tf`), with Goldmane's emitter shipping flows to **Loki** for a durable trail. **Enforcement reuses the existing Wave 1 observe-then-enforce egress track**, now selecting on namespace/label and fed by Goldmane's allow/deny + policy-trace flows. We explicitly **reject** a service mesh, mTLS, SPIFFE/SPIRE, and dedicated per-Service ServiceAccounts for now.
## Considered options
- **Dedicated per-Service ServiceAccount as the identity primitive** — initially chosen, then reversed. 56% of pods (257/458) run as `default`, so it is a ~116-stack rollout; and Goldmane (the chosen flow source) carries pod/namespace/workload **labels but no ServiceAccount field**, so SAs would not even reach the audit trail without a custom pod→SA mapping. The cheaper, etcd-inert path (namespace is free; a handful of static labels) delivers the same attribution. Deferred until identity-aware NetworkPolicy needs a principal finer than namespace/label, or mTLS is adopted.
- **Service mesh (Istio / Linkerd / Cilium-mesh) + mTLS + SPIFFE/SPIRE** — the only thing that makes the trail cryptographically non-repudiable against a hostile pod. The trust model does not justify it, east-west stays single-tenant plaintext, and it is precisely the "too much complexity for the gain" the audit doc predicted. Rejected.
- **Microsoft Retina (CNI-agnostic eBPF)** — more capable (DNS, drops, Hubble UI) and GA, runs on Calico without a CNI change. But identity-rich mode writes **one `RetinaEndpoint` CRD per pod to etcd** (continuous, pod-proportional churn — the exact axis we guard), and it is metrics-first, not log-first (no per-flow Loki records without custom glue). Rejected for this use case; noted as the fallback if DNS/drop-level detail is ever needed.
- **Cilium Hubble** — reads Cilium's eBPF datapath maps; unusable on Calico without migrating the CNI. A CNI migration is not justified. Rejected.
- **Kiali** — builds its graph entirely from an Istio mesh's Prometheus telemetry; no mesh, no graph. Rejected.
- **Custom Grafana Alloy enrichment exporter over raw iptables-`LOG` flow lines** — Alloy has no IP→identity dictionary-lookup primitive (`loki.process` lacks a lookup stage; `k8sattributes` can't do per-line/dual-IP association), so this is a multi-day custom build that also has to beat pod-IP churn. Goldmane delivers identity-stamped flows natively and obviates it. Rejected.
- **Kyverno generate+mutate to provision/assign identity** — rejected on etcd grounds: background scans + PolicyReports + UpdateRequests are continuous writes, the VPA-class cost we shed. Identity stays static.
## Consequences
- **No etcd cost from the flow plane.** Goldmane streams flows from Felix (the existing `calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer — nothing written to etcd or the K8s API. Steady-state cost is two Deployments (`goldmane`, `whisker`) + RAM/CPU on the goldmane pod.
- **The ring buffer is not a trail.** Durable, queryable history depends on the emitter→Loki path (reuse the 90-day security-stream retention); on a Goldmane restart the in-memory window is lost.
- **Goldmane is tech-preview** in OSS Calico 3.30 — the main risk. Enabling it is a reversible toggle in `stacks/calico/main.tf`, but the toggle interacts with the operator-managed Installation CR (only namespaces are TF-adopted today; verify how Goldmane/Whisker are enabled before applying).
- **Attribution is namespace-grained for free** across ~87 single-Service namespaces. Multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`) need a `service-identity` label to disambiguate; most are platform/infra and already on the Wave 1 enforcement exclude list.
- **The trail is attribution-grade, not cryptographic.** It reliably reconstructs events in a trusted cluster but cannot prove identity against a pod that spoofs its source — an accepted limit of the trust model. This ADR does not change the east-west encryption posture (still plaintext, no mTLS).
- **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
- **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
- **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.

@ -40,10 +40,10 @@ graph TB
| Component | Version | Location | Purpose |
|-----------|---------|----------|---------|
| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (2 replicas) |
| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (3 replicas) |
| Authentik Worker | 2026.2.2 | `stacks/authentik/` | Background task processors (2 replicas) |
| PgBouncer | Latest | `stacks/authentik/` | PostgreSQL connection pooler (3 replicas) |
| Embedded Outpost | - | Built into Authentik | Forward auth endpoint for Traefik |
| Embedded Outpost | - | Standalone deployment, managed by Authentik | Forward auth endpoint for Traefik (2 replicas, PG-backed sessions) |
| Traefik ForwardAuth | - | `modules/kubernetes/ingress_factory/` | Middleware attached when `auth = "required"` or `"public"` |
| Vault OIDC Method | - | `stacks/vault/` | Human SSO authentication to Vault |
| Vault K8s Auth | - | `stacks/vault/` | Service account JWT authentication |
@ -64,15 +64,36 @@ Services pick an auth tier via the `auth` enum on the `ingress_factory` module (
When `auth = "required"`, an unauthenticated request flows:
1. Request hits Traefik ingress
2. ForwardAuth middleware calls Authentik embedded outpost
3. Authentik checks for valid session cookie
2. ForwardAuth middleware calls the `auth-proxy` nginx (basicAuth fallback when Authentik is down), which proxies to the Authentik embedded outpost over a keepalive connection pool
3. Authentik checks for valid session cookie (domain-level `authentik_proxy_*` cookie on `.viktorbarzin.me`, 4-week validity — one cookie covers all forward-auth apps)
4. If missing/invalid, redirects to Authentik login page (authentik.viktorbarzin.me)
5. User authenticates via social provider (Google/GitHub/Facebook)
5. User authenticates on a **single screen**: username + password together (the identification stage embeds the password stage), or a social provider button (Google/GitHub/Facebook), then MFA validation
6. Authentik creates session, sets cookie, redirects back to original URL
7. Subsequent requests include session cookie, pass auth check, reach backend
Authentik adds authentication headers (user, email, groups) to forwarded requests. These headers are stripped before reaching the backend to prevent confusion.
### First-time signin performance (2026-06-10)
Signin latency is dominated by screen count and round trips, not server time
(DB avg 1.6ms). Standing decisions:
- **Single-screen login**: the identification stage carries `password_stage`,
so username+password is one round trip. The separate password-stage binding
was removed from `default-authentication-flow` (required by authentik when
embedding). Pinned in TF: `authentik_stage_identification.default_identification`.
- **Implicit consent everywhere**: all OIDC providers are first-party, so none
use the explicit-consent flow (it re-prompted every 4 weeks per app).
- **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
15m policy cache, 60s persistent DB connections.
- **Static assets cached immutable**: `/static` ingress carve-out adds
`Cache-Control: public, max-age=31536000, immutable` (assets are
version-fingerprinted; authentik itself sends no max-age).
- **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
- **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
TCP setup on the forward-auth subrequest path.
**Anti-exposure guard**: every `auth = "app"` or `auth = "none"` line MUST have a preceding `# auth = "<tier>": <reason>` comment documenting what gates the backend (for `"app"`) or why the endpoint is intentionally public (for `"none"`). The convention is enforced by `scripts/check-ingress-auth-comments.py`, which `scripts/tg` runs on every `plan/apply/destroy/refresh` and blocks the terragrunt invocation if violated. Stack-scoped — each stack documents itself.
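
The comment convention above can be illustrated with a minimal sketch. This is not the contents of `scripts/check-ingress-auth-comments.py`; the function name `find_violations` and both regexes are assumptions made for illustration, and the sketch only checks the immediately preceding line.

```python
import re

# Hypothetical sketch of the convention; NOT the real check-ingress-auth-comments.py.
# Matches an `auth = "app"` or `auth = "none"` assignment line.
AUTH_LINE = re.compile(r'^\s*auth\s*=\s*"(app|none)"')
# Matches a `# auth = "<tier>": <reason>` comment with a non-empty reason.
COMMENT_LINE = re.compile(r'^\s*#\s*auth\s*=\s*"(app|none)"\s*:\s*\S')


def find_violations(text):
    """Return 1-based line numbers of app/none auth lines lacking a reason comment."""
    lines = text.splitlines()
    violations = []
    for index, line in enumerate(lines):
        match = AUTH_LINE.match(line)
        if not match:
            continue
        previous = lines[index - 1] if index > 0 else ""
        comment = COMMENT_LINE.match(previous)
        # The comment must name the same tier as the assignment it documents.
        if not comment or comment.group(1) != match.group(1):
            violations.append(index + 1)
    return violations
```

A wrapper such as `scripts/tg` could refuse to run terragrunt whenever a check like this returns a non-empty list.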
### Social Login & Invitation Flow


@ -4,7 +4,7 @@ This doc covers three independent automation paths:
1. **Service-level upgrades** — Container image bumps for OSS apps (DIUN → n8n → claude-agent → Terraform). Most of this doc.
2. **OS-level upgrades on K8s nodes**: `unattended-upgrades` + `kured` with sentinel-gate + Prometheus halt-on-alert. See "K8s Node OS Upgrades" section and the runbook at `docs/runbooks/k8s-node-auto-upgrades.md`.
3. **K8s component version upgrades** (kubeadm/kubelet/kubectl) — weekly detection CronJob → chain of phase Jobs (preflight → master → worker × 4 → postflight). See "K8s Version Upgrades" section and the runbook at `docs/runbooks/k8s-version-upgrade.md`.
3. **K8s component version upgrades** (kubeadm/kubelet/kubectl) — daily detection CronJob → chain of phase Jobs (preflight → master → one worker Job per worker, enumerated live → postflight). See "K8s Version Upgrades" section and the runbook at `docs/runbooks/k8s-version-upgrade.md`.
## Overview
@ -252,7 +252,7 @@ kubeadm/kubelet/kubectl bumps (patch + minor) on all 5 K8s VMs.
### Architecture
```
k8s-version-check CronJob (Sun 12:00 UTC, k8s-upgrade ns)
k8s-version-check CronJob (23:00 UTC nightly, k8s-upgrade ns)
│ probe apt-cache madison kubeadm (master) → latest available patch
│ probe HEAD https://pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor?
│ push k8s_upgrade_available metric to Pushgateway
@ -262,20 +262,26 @@ envsubst on /template/job-template.yaml | kubectl apply -f -
│ spawns Job 0 = k8s-upgrade-preflight-<target_version>
Job 0 — preflight (pinned: k8s-node1)
Job 1 — master upgrade (pinned: k8s-node1) drains k8s-master
Job 2 — worker (pinned: k8s-node1) drains k8s-node4
Job 3 — worker (pinned: k8s-node1) drains k8s-node3
Job 4 — worker (pinned: k8s-node1) drains k8s-node2
Job 5 — worker (pinned: k8s-master) drains k8s-node1 ← control-plane toleration
Job 6 — postflight (no pinning)
Job 0 — preflight (pinned: first worker)
Job 1 — master upgrade (pinned: first worker) drains k8s-master
Job 2..N — worker (pinned: k8s-master) drains each worker still off-target
← control-plane toleration; one Job
per worker, enumerated live from
`kubectl get nodes` (covers node5/6
+ any future node automatically)
Job N+1 — postflight (no pinning)
```
Each Job runs `scripts/upgrade-step.sh`, which dispatches on `$PHASE` and ends
by spawning the next Job (`envsubst < /template/job-template.yaml | kubectl
apply -f -`). Job names are deterministic (`k8s-upgrade-<phase>-<target_version>[-<node>]`)
so `apply` reconciles to a single Job per run — re-running a failed Job
won't duplicate downstream Jobs.
so `apply` reconciles to a single Job per run — re-running won't duplicate
downstream Jobs. The detection CronJob and `spawn_next` additionally delete +
re-spawn a terminally-**Failed** Job of the same name (rather than skipping it
on existence), so a transient preflight gate self-heals on the next cycle
instead of wedging the pipeline until the dead Job's 7d TTL expires
(retry-on-failure, added 2026-06-17 after a spurious critical alert stalled
1.34.9 for 5 days).
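
A minimal sketch of the naming and retry-on-failure rule described above. The real logic lives in `scripts/upgrade-step.sh` and the detection CronJob; the function names and the status strings here are assumptions for illustration only.

```python
def job_name(phase, target_version, node=None):
    """Build the deterministic name k8s-upgrade-<phase>-<target_version>[-<node>]."""
    parts = ["k8s-upgrade", phase, target_version]
    if node:
        parts.append(node)
    return "-".join(parts)


def spawn_action(existing_status):
    """Decide what to do with a Job name, given the state of any existing Job.

    existing_status is None when no Job of that name exists; otherwise one of
    the hypothetical labels "Running", "Complete", or "Failed".
    """
    if existing_status is None:
        return "create"
    if existing_status == "Failed":
        # Retry-on-failure: replace the dead Job instead of waiting for its TTL.
        return "delete-and-recreate"
    # A running or completed Job of the same name is left alone.
    return "skip"
```

Because the name depends only on phase, target version, and node, re-running the same phase always resolves to the same Job object.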
### Self-preemption history (the reason for the Job-chain rewrite)
@ -304,11 +310,16 @@ each Job's pod and its drain target are always different nodes.
ConfigMap, and a `template` ConfigMap into each Job pod.
- **Per-node script**: `infra/scripts/update_k8s.sh`. Caller passes
`--role master|worker --release X.Y.Z`. Piped via SSH into each node by
upgrade-step.sh.
- **Three Upgrade Gates alerts**:
upgrade-step.sh. The master path runs `kubeadm upgrade apply` with
`--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins
--skip-phases=addon/coredns` so kubeadm never touches CoreDNS (custom Corefile
+ separately-tracked image; CoreDNS is pinned off Keel via `keel.sh/policy=never`).
See the runbook's "CoreDNS is NOT upgraded by kubeadm here".
- **Four Upgrade Gates alerts**:
- `K8sVersionSkew`: kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout.
- `EtcdPreUpgradeSnapshotMissing`: `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently.
- `K8sUpgradeStalled`: `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor.
- `K8sUpgradeChainJobFailed`: `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric), which is invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured). The `unless k8s_upgrade_blocked == 1` clause (2026-06-21) excludes a deliberate compat-gate refusal (owned by `K8sUpgradeBlocked`) so a block doesn't double-fire as a wedge.
- **Pushgateway metrics**:
- `k8s_upgrade_in_flight` (set in preflight, cleared in postflight)
- `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB)
@ -334,7 +345,7 @@ The cluster has a single control plane (no HA). A failed `kubeadm upgrade apply`
- **Mandatory etcd snapshot before every run** (even patch). Recovery point if master breaks.
- **Halt-on-alert before every drain**. Reuses the same Prometheus ignore-list regex kured uses, so any unrelated cluster-health alert blocks. Four gate alerts catch upgrade-specific half-states (version skew, missing snapshot, stalled chain, terminally failed chain Job).
- **Job pinning eliminates self-preemption**. Each Job's pod runs on a node that is NOT its drain target. k8s-node1 hosts every Job except the one that drains it (which runs on k8s-master with a control-plane toleration).
- **Job pinning eliminates self-preemption**. Each Job's pod runs on a node that is NOT its drain target: the master-drain Job runs on the first worker; every worker-drain Job runs on k8s-master (already upgraded, control-plane toleration). The worker set is enumerated live from `kubectl get nodes`, so new nodes are covered with no script change; SSH targets are node InternalIPs (no DNS dependency).
- **Sequential workers with 10-min inter-node soak**. Same risk-bounding as the 24h OS-reboot soak, but tightened because kubelet failures surface within minutes — not hours.
- **Master upgrade goes first, workers last**. If master breaks, the cluster is already degraded so further worker upgrades would just delay recovery. By upgrading master first, we either succeed (workers can roll afterward) or fail loud (operator triages before any worker is touched).
- **No auto-rollback**. kubeadm doesn't support clean downgrade; the snapshot + manual apt rollback in the runbook is the recovery path.
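
The pinning rule above (a Job's pod never runs on the node it drains) can be sketched as follows. This is an illustrative model, not the logic in `scripts/upgrade-step.sh`; `plan_chain` and `self_preempting` are hypothetical names, and the sketch does not filter out workers that are already on the target version.

```python
def plan_chain(master, workers):
    """Return (phase, pod_node, drain_target) tuples for one upgrade run."""
    if not workers:
        raise ValueError("at least one worker is needed to host the master-drain Job")
    first_worker = workers[0]
    chain = [
        ("preflight", first_worker, None),
        # The master is drained from a pod running on the first worker.
        ("master", first_worker, master),
    ]
    # Every worker is drained from a pod running on the already-upgraded master.
    chain.extend(("worker", master, worker) for worker in workers)
    chain.append(("postflight", None, None))
    return chain


def self_preempting(chain):
    """Return the steps whose pod would sit on its own drain target."""
    return [step for step in chain if step[2] is not None and step[1] == step[2]]
```

For any worker list that does not contain the master, `self_preempting(plan_chain(...))` is empty, which is the property the pinning is meant to guarantee.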


@ -77,6 +77,8 @@ The **bypass list** (leg 2) is just `/srv/nfs/immich/` — too big for sda (1.5
- `Synology/Backup/Viki/nfs/` — immich only (post-2026-05-26)
- `Synology/Backup/Viki/nfs-ssd/`: **immich-ML only (2026-06-01)**; ollama/llamacpp dropped (re-pullable models, live-only on the SSD)
**VM image backups (added 2026-06-09)**: the hand-managed Linux VMs (those NOT in Terraform — see `compute.md`) were historically **not imaged at all** — only their *contents* reached backup if they happened to host a PVC/NFS path. `vzdump-vms` now takes a daily live `vzdump --mode snapshot` of each configured VMID → `/mnt/backup/vzdump/` (Copy 2), carried offsite by the monthly offsite-sync full pass (Copy 3). **Currently enabled for VMID 102 (devvm)** — the shared workstation, whose per-user home dirs + local-only git repos are otherwise irreplaceable. Extend via `VZDUMP_VMIDS` in the unit. See "VM Image Backups (vzdump)" under How It Works.
## Architecture Diagram
### Data Routing — where each path goes (post-2026-05-26)
@ -208,13 +210,14 @@ graph LR
T0000["00:00 LVM thin snapshots<br/>(lvm-pvc-snapshot)<br/>sdc PVCs CoW"]
T0015["00:15 PostgreSQL per-DB dumps<br/>(CronJob)"]
T0045["00:45 MySQL per-DB dumps<br/>(CronJob)"]
T0100["01:00 vzdump-vms<br/>live image of hand-managed VMs<br/>(devvm) → sda /mnt/backup/vzdump/"]
T0200["02:00 nfs-mirror (daily)<br/>sdc /srv/nfs/* → sda /mnt/backup/<svc>/<br/>~10-20 min steady state"]
T0500["05:00 daily-backup<br/>mount LVM snapshots ro<br/>rsync PVC files → /mnt/backup/pvc-data/<br/>+ sqlite + pfsense + pve-config"]
T0600["06:00 offsite-sync-backup<br/>Step 1: sda → Synology /Viki/pve-backup/<br/>Step 2: sdc/immich + nfs-ssd → /Viki/nfs[-ssd]/"]
T1200["12:00 LVM thin snapshots (midday)<br/>second daily snapshot"]
end
T0000 --> T0015 --> T0045 --> T0200 --> T0500 --> T0600 --> T1200
T0000 --> T0015 --> T0045 --> T0100 --> T0200 --> T0500 --> T0600 --> T1200
INO -.->|change events feed Step 2| T0600
style Nightly fill:#ffe0b2
@ -322,6 +325,7 @@ graph LR
| NFS Change Tracker | Continuous (inotifywait) | PVE host: `nfs-change-tracker.service` | Logs changed NFS file paths to `/mnt/backup/.nfs-changes.log` |
| pfSense Backup | Daily 05:00 + daily-backup | PVE host: SSH + API | config.xml + full filesystem tar |
| Offsite Sync | Daily 06:00 (after daily-backup) | PVE host: `offsite-sync-backup` | Two-step: sda→pve-backup + NFS→nfs/nfs-ssd via inotify |
| VM Image Backup (vzdump) | Daily 01:00, keep 3 | PVE host: `vzdump-vms` | Live `vzdump` of hand-managed VMs (devvm) → `/mnt/backup/vzdump/` |
| PostgreSQL Backup (full) | Daily 00:00, 14d retention | CronJob in `dbaas` namespace | pg_dumpall for all databases |
| PostgreSQL Backup (per-db) | Daily 00:15, 14d retention | CronJob in `dbaas` namespace | pg_dump -Fc per database → `/backup/per-db/<db>/` |
| MySQL Backup (full) | Daily 00:30, 14d retention | CronJob in `dbaas` namespace | mysqldump --all-databases |
@ -352,6 +356,20 @@ Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62
**Restore**: `lvm-pvc-snapshot restore <pvc-lv> <snapshot-lv>` — auto-discovers K8s workload, scales down, swaps LVs, scales back up. See `docs/runbooks/restore-lvm-snapshot.md`.
### VM Image Backups (vzdump)
The hand-managed Linux VMs are **intentionally not in Terraform** (telmate/bpg provider bugs — see `compute.md`) and were historically **not imaged at all**: nothing took a whole-disk backup of the VM itself. For most that is acceptable — k8s nodes are reprovisioned from cloud-init and their data lives in PVCs covered above. But **devvm** (the shared multi-user Claude Code workstation, VMID 102) holds irreplaceable state that lives nowhere else: per-user home dirs (`~/.claude`, `~/.t3`, shell history), manually-installed tooling, and **local-only git repos** — the monorepo root at `/home/wizard/code` has no git remote. A lost devvm disk = unrecoverable.
**Script**: `/usr/local/bin/vzdump-vms` on PVE host (source: `infra/scripts/vzdump-vms.sh`). Deploy: `scp infra/scripts/vzdump-vms.sh root@192.168.1.127:/usr/local/bin/vzdump-vms` + `scp infra/scripts/vzdump-vms.{service,timer} root@192.168.1.127:/etc/systemd/system/`, then `systemctl daemon-reload && systemctl enable --now vzdump-vms.timer`.
**Schedule**: Daily 01:00 via systemd timer, ahead of `nfs-mirror` (02:00), `daily-backup` (05:00), and `offsite-sync-backup` (06:00), so the fresh image is on sda before offsite-sync runs.
**Mode**: `vzdump --mode snapshot` — live, no downtime. devvm has the qemu guest agent enabled (`agent: 1`), so the snapshot is **filesystem-consistent** (fs-freeze) rather than merely crash-consistent. Runs `Nice=10` + `IOSchedulingClass=idle` + `--ionice 7` so it never starves etcd on the contended sdc IO domain.
**Scope**: VMIDs in `VZDUMP_VMIDS` (default `102` = devvm). Add VMIDs there to image other hand-managed VMs.
**Retention**: `KEEP=3` newest dumps per VMID on sda (`/mnt/backup/vzdump/`); each devvm image is ~35-50 GB zstd.
**Critical dependency**: `nfs-mirror` MUST keep `--exclude='/vzdump/'`. Its nightly `rsync -rlt --delete /srv/nfs/ → /mnt/backup/` treats any `/mnt/backup` dir with no `/srv/nfs` counterpart as an orphan and deletes it — this silently reaped the first two vzdump images at 02:00 on 2026-06-10 before the exclude was added (same reason `pvc-data`/`pfsense`/`pve-config`/`sqlite-backup` are excluded).
**Offsite**: deliberately **NOT** appended to the incremental offsite manifest: the incremental pass never deletes, so daily multi-GB images would accumulate unbounded on Synology. Instead the **monthly offsite-sync full pass (days 1-7)** mirrors all of `/mnt/backup` (including `vzdump/`) to Synology with `--delete`, bounded to local retention. So Copy 2 (sda) refreshes **daily**; Copy 3 (Synology) refreshes **monthly**.
**Monitoring**: pushes `vzdump_last_run_timestamp` / `vzdump_last_status` / `vzdump_last_success_timestamp` to Pushgateway job `vzdump-backup`. Alerts `VzdumpBackupStale` (>~50h since last success), `VzdumpBackupNeverRun`, `VzdumpBackupFailing` (status≠0) are defined in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (the 3-2-1 group) — **effective on the next `monitoring` stack apply** (metrics already flow, so the alerts arm immediately once applied).
**Restore**: on the PVE host, `qmrestore /mnt/backup/vzdump/vzdump-qemu-<vmid>-<ts>.vma.zst <vmid>` — restore to a spare VMID first if the original still exists, then swap disks; or use the PVE UI (add `/mnt/backup` as a dir storage with content=backup → Restore).
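
The `KEEP=3` retention rule can be sketched as below. This is not the code in `infra/scripts/vzdump-vms.sh`; the function name `dumps_to_prune` is hypothetical, the filename pattern follows the `vzdump-qemu-<vmid>-<ts>.vma.zst` form used in the restore command, and the sketch assumes the timestamp portion sorts lexicographically in chronological order.

```python
import re
from collections import defaultdict

# Assumed filename shape: vzdump-qemu-<vmid>-<ts>.vma.zst
DUMP_NAME = re.compile(r"^vzdump-qemu-(\d+)-(.+)\.vma\.zst$")


def dumps_to_prune(filenames, keep=3):
    """Return the dump files to delete so only the `keep` newest remain per VMID."""
    by_vmid = defaultdict(list)
    for name in filenames:
        match = DUMP_NAME.match(name)
        if match:
            by_vmid[match.group(1)].append((match.group(2), name))
    prune = []
    for dumps in by_vmid.values():
        # Newest first; relies on the timestamp string sorting chronologically.
        dumps.sort(reverse=True)
        prune.extend(name for _, name in dumps[keep:])
    return sorted(prune)
```

Files that do not match the pattern are ignored rather than deleted, which keeps a sketch like this from touching anything else under `/mnt/backup/vzdump/`.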
### Layer 2: Weekly File-Level Backup (sda Backup Disk)
**Backup disk**: sda (1.1TB RAID1 SAS) → VG `backup` → LV `data` → ext4 → mounted at `/mnt/backup` on PVE host. Dedicated backup disk, independent of live storage.
@ -527,12 +545,16 @@ The btrfs cleaner thread reclaims async — `df` may lag the snapshot-delete by
| `/usr/local/bin/lvm-pvc-snapshot` | PVE host: LVM snapshot creation + restore |
| `/usr/local/bin/daily-backup` | PVE host: PVC file copy + auto SQLite backup + pfSense |
| `/usr/local/bin/offsite-sync-backup` | PVE host: two-step rsync to Synology (sda + NFS via inotify) |
| `/usr/local/bin/vzdump-vms` | PVE host: daily live `vzdump` image of hand-managed VMs (devvm) → `/mnt/backup/vzdump/` |
| `/mnt/backup/` | PVE host: sda mount point (1.1TB backup disk) |
| `/mnt/backup/vzdump/` | PVE host: vzdump VM images (keep 3 per VMID), mirrored offsite monthly |
| `/mnt/backup/.nfs-changes.log` | NFS change log from inotifywait, consumed by offsite-sync |
| `/etc/systemd/system/nfs-change-tracker.service` | inotifywait watcher for `/srv/nfs` + `/srv/nfs-ssd` |
| `/etc/systemd/system/lvm-pvc-snapshot.timer` | Daily 03:00 (LVM snapshots) |
| `/etc/systemd/system/daily-backup.timer` | Daily 05:00 (file backup) |
| `/etc/systemd/system/offsite-sync-backup.timer` | Daily 06:00 (offsite sync) |
| `/etc/systemd/system/vzdump-vms.timer` | Daily 01:00 (VM image backup) |
| `/etc/systemd/system/vzdump-vms.service` | oneshot: `vzdump-vms` (source `infra/scripts/vzdump-vms.{sh,service,timer}`) |
| `/usr/local/bin/nfs-mirror` | PVE host: daily 02:00 mirror of /srv/nfs/* → sda /mnt/backup/<svc>/ (Layer 3a) |
| `/etc/systemd/system/nfs-mirror.timer` | Daily 02:00 (NFS local mirror to sda) |
| `stacks/dbaas/` | Terraform: PostgreSQL/MySQL backup CronJobs |
@ -911,6 +933,9 @@ the 2026-04-22 backup_offsite_sync FAIL (node3 kubelet hiccup at
| Uptime Kuma | ✓ | ✓ | — | ✓ | proxmox-lvm |
| **Other apps not enumerated above** | ✓¹ | ✓¹ | varies | ✓ | proxmox-lvm / proxmox-lvm-encrypted |
| **Postiz** (bundled bitnami PG on local-path) | — | — | ✓ daily pg_dump → NFS | ✓ | local-path + NFS |
| **Hand-managed VMs (not in Terraform)** |
| devvm (workstation, VMID 102) | — | — | ✓ daily vzdump image | ✓ monthly | local-lvm (sdc) |
| Other hand-managed VMs (HA 103, registry 220, k8s nodes) | — | — | — gap² | — | local-lvm — see note² |
| **Media (NFS)** |
| Immich (~800GB) | — | — | — | ✓ | NFS |
| Audiobookshelf | — | — | — | ✓ | NFS |
@ -924,6 +949,8 @@ the 2026-04-22 backup_offsite_sync FAIL (node3 kubelet hiccup at
**Note**: All proxmox-lvm and proxmox-lvm-encrypted PVCs get LVM snapshots (except `dbaas` and `monitoring` namespaces, excluded for write-amplification reasons) + file-level backup. NFS-backed media syncs directly to Synology `nfs/` and `nfs-ssd/` via inotify change tracking.
² **Hand-managed VMs** — only **devvm (102)** is imaged today (`vzdump-vms`, `VZDUMP_VMIDS=102`). The k8s nodes are deliberately uncovered (reprovisioned from cloud-init; their data lives in the PVCs already backed up above). **home-assistant (103) and docker-registry (220) are a documented gap** — add their VMIDs to `VZDUMP_VMIDS` to image them (registry content is also re-pullable from upstreams; HA has its own add-on backups). pfSense (101) is covered separately by `daily-backup` (config.xml + weekly tar).
¹ **"Other apps not enumerated above"** — the table only enumerates services worth calling out. The default backup posture for any service using `proxmox-lvm` or `proxmox-lvm-encrypted` (outside `dbaas`/`monitoring`) is **automatic** Layer 1 (LVM thin snapshots, 7d retention) + Layer 2 (file backup, 4 weekly versions on sda) + Layer 3 (offsite to Synology). Auto-discovery is by LV name pattern (`vm-*-pvc-*`), so adding a new service to the cluster gets it covered without any explicit registration. Run `ssh root@192.168.1.127 lvs --noheadings -o lv_name pve | grep '^vm-.*-pvc-' | grep -v _snap_ | wc -l` to see the live count.
**Known gaps** — services with PVCs not on the proxmox-lvm path lose Layer 1+2:


@ -10,9 +10,14 @@ serves two distinct populations:
`chromium.connect_over_cdp("http://chrome-service.chrome-service.svc:9222")`
to drive a real browser when upstream anti-bot trips a headless one
(`disable-devtool.js` redirect-to-google trap, `navigator.webdriver`
checks, console-clear timing tricks). The only currently-active
in-cluster caller is the `chrome-service-snapshot-harvester` CronJob;
the `stacks/f1-stream/files/backend/playback_verifier.py` +
checks, console-clear timing tricks). Currently-active in-cluster
callers: the `chrome-service-snapshot-harvester` CronJob, and
**tripit's `PlaywrightFareProvider`** (since 2026-06-11, tripit issue
#18 / ADR-0007) — the flight-fare scrape connects per quote, opens a
fresh incognito context, scrapes Google Flights, and closes the
context; rate-limited to one attempt per 30s with a 6h fare cache, so
browser load is negligible. The
`stacks/f1-stream/files/backend/playback_verifier.py` +
`chrome_browser.py` tree is a vestigial design — the deployed
f1-stream image (built from `github.com/ViktorBarzin/f1-stream`)
does not use this code path.
@ -107,17 +112,32 @@ External caller (dev box):
@playwright/mcp --isolated --storage-state ~/.cache/...storage-state.json
```
## Browser binary — real Google Chrome (for proprietary codecs)
The chrome-service container runs **real Google Chrome**, not the bundled
Chromium, via the infra-owned image `ghcr.io/viktorbarzin/chrome-service-browser`
(`files/chrome/Dockerfile` = `mcr.microsoft.com/playwright:v1.48.0-noble` +
`google-chrome-stable`, built by `.github/workflows/build-chrome-service-browser.yml`).
The launch resolves `CHROMIUM=/opt/google/chrome/chrome`.
**Why:** the Playwright-bundled Chromium has proprietary codecs **compiled out**,
so H.264/AAC video (Instagram Reels, X, most `.mp4`) fails in the noVNC view with
`MEDIA_ERR_SRC_NOT_SUPPORTED` (the bytes download `200 video/mp4` but there's no
decoder — NOT a GPU issue). Royalty-free codecs (VP9/VP8/AV1 → YouTube) always
worked. Swapping `libffmpeg.so` does NOT help (codecs are compiled out, not just
the lib stripped) and Chrome-for-Testing is also codec-less — only
`google-chrome-stable` carries them.
## Image pin
Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
`stacks/chrome-service/main.tf`) and the Python client
(`playwright==1.48.0` in callers' `requirements.txt`) **must match
minor-versions**. Bump in lockstep — Playwright protocol changes between
minors and the client cannot connect to a mismatched server.
The harvester + snapshot-server sidecar use
`mcr.microsoft.com/playwright/python:v1.48.0-noble` — same playwright
minor, with Python-side bindings pre-installed.
The Playwright base + the Python client (`playwright==1.48.0` in callers'
`requirements.txt`) and the snapshot sidecars
(`mcr.microsoft.com/playwright/python:v1.48.0-noble`) historically had to match
minor-versions. The chrome-service browser is now real Google Chrome (a newer
milestone than the 1.48 Chromium), but the `connect_over_cdp` callers (tripit
fare scrape, `homelab browser`, snapshot-harvester) attach over raw CDP, which is
version-tolerant — verified working against this Chrome. If a future Chrome
milestone breaks a caller, pin Chrome in the Dockerfile or bump the clients.
## Storage
@ -162,7 +182,29 @@ minor, with Python-side bindings pre-installed.
`x11vnc` (connected to Xvfb on `localhost:6099`) bridged to
`websockify` on port 6080. Service `chrome` maps :80 → :6080 and is
exposed via `ingress_factory` at `chrome.viktorbarzin.me`,
Authentik-gated.
Authentik-gated. The bare host serves `vnc.html` (image symlinks
`index.html → vnc.html`); add `?autoconnect=true&resize=scale&path=websockify`
to skip the Connect button. The view is **black when no browser window is
open** (idle) — that is normal, not a failed connection. Chrome is launched
with `--window-size=1280,720 --window-position=0,0` to fill the Xvfb screen
(no window manager runs, so without it Chrome opens at its profile-persisted
size and the rest of the framebuffer shows as a black cut-off).
### noVNC fd-sweep gotcha (stuck "Connecting")
If the noVNC client hangs on **"Connecting" forever then times out**, the cause
is almost always x11vnc's fd-table sweep: containerd grants pods
`RLIMIT_NOFILE = 2^31`, and x11vnc `fcntl`-sweeps the **entire** fd table on
every client connection, so the RFB handshake never completes (websockify
accepts the WS and logs `connecting to: localhost:5900`, but x11vnc never sends
the `RFB 003.008` banner). Diagnose: `grep "open files" /proc/$(pgrep -n
x11vnc)/limits` (huge = bad) and time the handshake from a sibling container
(`python3 -c "import socket;s=socket.socket();s.connect(('127.0.0.1',5900));print(s.recv(12))"`:
healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts**,
done both in `files/novnc/entrypoint.sh` (root) and via the container `command`
wrapper in `main.tf` (so it applies deterministically even though the image is
`:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix
as the android-emulator stack.
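
The handshake timing check described above can be wrapped in a small helper that fails fast instead of hanging. This is a sketch under assumptions: the function name `rfb_banner` and the 3-second default timeout are illustrative choices, not part of the stack.

```python
import socket
import time


def rfb_banner(host, port, timeout=3.0):
    """Return (banner, seconds_elapsed); banner is None if nothing arrived in time."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            # A healthy x11vnc sends a 12-byte ProtocolVersion banner first.
            banner = conn.recv(12) or None
    except OSError:
        # Covers refused connections and timeouts (socket.timeout is an OSError).
        banner = None
    return banner, time.monotonic() - start
```

Run from a sibling container as `rfb_banner("127.0.0.1", 5900)`: a banner such as `b"RFB 003.008\n"` returned quickly suggests the fd limit is sane, while `None` after the full timeout matches the stuck "Connecting" symptom.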
- **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088
@ -175,6 +217,45 @@ minor, with Python-side bindings pre-installed.
See `stacks/chrome-service/README.md` for the recipe (label namespace,
inject `CHROME_CDP_URL`, vendor `stealth.js`).
## Driving from OUTSIDE the cluster (`homelab browser`)
Agents on the devvm reach this browser through the **`homelab browser`** CLI
(`cli/`, ADR-0013) — the packaged, discoverable form of the ad-hoc
`connect_over_cdp` recipe. It is the **escalation path, not the default**:
agents default to the Playwright MCP / headless browser for all routine
automation, and reach for `homelab browser` ONLY when headless is blocked — a
site loads but a gated action (submit/login) silently fails or hangs, the
signature of headless / anti-bot detection. (Same tiered rule lives in
`~/code/CLAUDE.md` and `homelab browser --help`.)
```text
devvm: homelab browser run flow.js
│ kubectl port-forward svc/chrome-service :9222 (random local port)
http://127.0.0.1:<port> ──► chrome-service pod :9222 (CDP)
│ assert /json/version Browser is "Chrome/…", not "HeadlessChrome"
│ node + playwright-core@1.48.2 → connectOverCDP
│ context.addInitScript(stealth.js) ← same vendored file as in-cluster
│ run the user's Playwright script with page/context/browser in scope
└─ port-forward always torn down (success or error)
```
Key facts:
- **port-forward bypasses the `:9222` NetworkPolicy.** It tunnels
API-server→pod, so the devvm needs no `chrome-service.viktorbarzin.me/client`
label — unlike in-cluster callers.
- **Client pinned to the image minor.** The node client is
`playwright-core@1.48.2` (matches `v1.48.0-noble` / Chromium 130), installed
lazily into `~/.cache/homelab/browser-client/`. Bump it in lockstep when the
server image bumps (same rule as the in-cluster Python clients — see "Image
pin" above).
- **Default context is a fresh incognito one** (closed on exit), safe for the
shared browser; `--shared-context` reuses the warmed persistent profile.
- **`stealth.js` is vendored** into the CLI (`cli/browser_stealth.js`) as a
byte-identical copy of `files/stealth.js`, guarded by a drift test — so the
CLI's stealth never diverges from the in-cluster callers'.
## Limits + risks
- **Anti-bot vs stealth arms race** — when an upstream beats us (DRM


@ -2,306 +2,378 @@
## Overview
The CI/CD pipeline uses a hybrid approach: GitHub Actions for building Docker images (providing free compute for public repos) and Woodpecker CI for deployments (leveraging cluster-internal access). Git pushes trigger GHA builds that produce Docker images with 8-character SHA tags, push to DockerHub, then POST to Woodpecker's API to trigger deployments that update Kubernetes workloads via `kubectl set image`.
**Doctrine (ADR-0002): all image builds and CI compute run OFF-infra.** Every
owned image is built, tested, and linted on **GitHub Actions** (free on public
repos; 2000 free min/mo on private) and pushed to **`ghcr.io/viktorbarzin/<name>`**.
Woodpecker is **deploy-only** — a GHA job POSTs its API with the freshly-built
image tag and Woodpecker runs `kubectl set image` from inside the cluster.
There are **no in-cluster image builds or CI test runs anywhere** — the
in-cluster Woodpecker buildkit and the fallback-build pattern were removed as a
clean cut (ADR-0002, 2026-06-13). The Forgejo container registry is **frozen
and emptied** — break-glass only.
This breaks the old circular dependency (images needed to repair the cluster
used to be built and stored *inside* it) and keeps build IO + registry pushes
off the homelab spindle.
## Architecture Diagram
```mermaid
graph LR
A[Git Push] --> B[GitHub Actions]
B --> C[Build Docker Image<br/>linux/amd64, 8-char SHA tag]
C --> D[Push to DockerHub]
D --> E[POST Woodpecker API]
E --> F[Woodpecker Pipeline]
F --> G[Vault K8s Auth<br/>SA JWT]
G --> H[kubectl set image]
H --> I[K8s Deployment]
I --> J[Pull from DockerHub<br/>or Pull-Through Cache]
A[git push Forgejo<br/>viktor/&lt;repo&gt; canonical] --> B[push-mirror sync_on_commit]
B --> C[GitHub mirror<br/>ViktorBarzin/&lt;repo&gt;]
C --> D[GitHub Actions<br/>.github/workflows/build.yml]
D --> E[lint / test]
E --> F[buildx linux/amd64<br/>provenance:false]
F --> G[push ghcr.io/viktorbarzin/&lt;name&gt;<br/>:sha8 + :latest]
G --> H[svu tag -> Forgejo canonical]
G --> I[POST Woodpecker deploy repo]
I --> J[.woodpecker/deploy.yml<br/>event: manual]
J --> K[kubectl set image<br/>in-cluster SA cluster-admin]
K --> L[K8s Deployment<br/>pulls from ghcr]
K[Pull-Through Cache<br/>10.0.20.10] -.-> J
L[forgejo.viktorbarzin.me<br/>Private Registry on Forgejo] -.-> J
style B fill:#2088ff
style F fill:#4c9e47
style K fill:#f39c12
style D fill:#2088ff
style J fill:#4c9e47
style G fill:#f39c12
```
## Components
| Component | Version | Location | Purpose |
|-----------|---------|----------|---------|
| GitHub Actions | Cloud | `.github/workflows/build-and-deploy.yml` | Build Docker images, push to DockerHub |
| Woodpecker CI | Self-hosted | `ci.viktorbarzin.me` | Deploy to Kubernetes cluster |
| DockerHub | Cloud | `viktorbarzin/*` | Public image registry |
| Private Registry | Forgejo Packages | `forgejo.viktorbarzin.me/viktor` | Private container images (PAT auth, retention CronJob) — migrated from registry.viktorbarzin.me 2026-05-07 |
| Pull-Through Cache | Custom | `10.0.20.10:5000` (docker.io)<br/>`10.0.20.10:5010` (ghcr.io) | LAN cache for remote registries |
| Kyverno | Cluster | `kyverno` namespace | Auto-sync registry credentials to all namespaces |
| Vault | Cluster | `vault.viktorbarzin.me` | K8s auth for Woodpecker pipelines |
| Component | Location | Purpose |
|-----------|----------|---------|
| GitHub Actions | `.github/workflows/build.yml` (per repo) | Build + lint + test + push image; trigger deploy; cut semver tag |
| ghcr.io | `ghcr.io/viktorbarzin/*` | Container registry for ALL owned images (public + private packages) |
| Woodpecker CI | `ci.viktorbarzin.me` | **Deploy-only**: `kubectl set image` in-cluster; plus infra applies + maintenance crons |
| Forgejo | `forgejo.viktorbarzin.me/viktor/<repo>` | **Canonical** git source (push-mirrors to GitHub). Container registry **FROZEN** (break-glass only) |
| Pull-Through Cache | `10.0.20.10:5000/5010/5020/5030/5040` | LAN cache for upstream registries (DockerHub, ghcr, Quay, k8s.gcr, Kyverno) |
| Kyverno | `kyverno` namespace | Syncs `ghcr-credentials` (private-ghcr allowlist) + `registry-credentials` to namespaces |
| Vault | `vault.viktorbarzin.me` | K8s auth for Woodpecker deploy pipelines; CI tokens in `secret/ci/global` + `secret/viktor` |
## How It Works
### Build Flow (GitHub Actions)
### The fleet pattern (every owned app)
1. **Trigger**: Git push to main/master branch
2. **Build**: GHA builds Docker image for `linux/amd64` platform only
3. **Tag**: Image tagged with 8-character commit SHA (e.g., `viktorbarzin/app:a1b2c3d4`)
- `:latest` tags are **never used** to prevent stale pull-through cache issues
4. **Push**: Image pushed to DockerHub public registry
5. **Trigger Deploy**: POST request to Woodpecker API with repo ID and commit SHA
1. **Canonical source = Forgejo** `viktor/<repo>`. A **push-mirror**
(`sync_on_commit`) pushes every commit to the GitHub mirror
`ViktorBarzin/<repo>`. The `.github/workflows/build.yml` is committed on
Forgejo and mirrors over.
2. **GHA `build` job** (triggers `on: push: branches: [master]` ONLY — feature
branches mirror but build/deploy nothing, the safety valve):
- lint + test
- `svu` computes the next `vX.Y.Z` from conventional commits and pushes the
tag back to **canonical Forgejo** (GHA secret `FORGEJO_GIT_TOKEN` =
write:repository PAT); `VERSION` is baked into the image
- `docker buildx` `linux/amd64`, **`provenance: false`** (single-manifest —
avoids the orphaned-index-children failure class), push
`ghcr.io/viktorbarzin/<name>:<sha8>` + `:latest`
- `delete-package-versions` keeps the newest ~10 ghcr versions
3. **GHA `deploy` job** POSTs `ci.viktorbarzin.me/api/repos/<id>/pipelines`
(the Woodpecker registration for the **GitHub mirror**, github-forge; GHA
secret `WOODPECKER_TOKEN`) with `IMAGE_TAG` + `IMAGE_NAME`.
4. **`.woodpecker/deploy.yml`** (event: **manual** only, so the raw
Forgejo→GitHub mirror pushes don't fire a tag-less deploy) runs `kubectl set
image deployment/<app> <container>=<image>` in-cluster. The `woodpecker-agent`
SA is `cluster-admin`, so the `bitnami/kubectl` step needs no
kubeconfig/RBAC. The Deployment image is in `lifecycle.ignore_changes`
(`KEEL_IGNORE_IMAGE`) so the SHA tag sticks and `terragrunt apply` doesn't
fight it. CronJobs in owned apps track `:latest` + `imagePullPolicy: Always`
instead of a deploy step.
### Deploy Flow (Woodpecker CI)
**Keel stays enrolled** as a redundant net (finds the deployed SHA already
running → no-op).
1. **Receive Webhook**: Woodpecker API receives deployment trigger from GHA
2. **Authenticate**: Pipeline uses Kubernetes ServiceAccount JWT to authenticate with Vault via K8s auth
3. **Deploy**: `kubectl set image deployment/<name> <container>=viktorbarzin/<app>:<sha>`
4. **Notify**: Slack notification on success/failure
**Tooling**: `infra/scripts/offinfra-onboard` + `infra/scripts/offinfra-templates/`
scaffold a repo onto this pattern (mirror, workflow, Woodpecker deploy repo,
old-pipeline removal, default-branch flip). Mirror + workflow commits go via
the Forgejo API over the internal Traefik LB
(`curl --resolve forgejo.viktorbarzin.me:443:10.0.20.203`) since the devvm
can't reach Forgejo's public hairpin.
### Project Migration Status
### ghcr package visibility
**Migrated to GHA (8 projects)**:
- Website
- k8s-portal
- claude-memory-mcp
- apple-health-data
- audiblez-web
- plotting-book
- insta2spotify
- book-search (audiobook-search)
| Visibility | Packages | Pull mechanism |
|------------|----------|----------------|
| **Public** | beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator | Anonymous |
| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci | `ghcr-credentials` dockerconfigjson |
**Woodpecker-native owned-app builds** (build + push to the Forgejo private
registry + `kubectl set image` rollout, all in one `.woodpecker.yml`; Keel
stays enrolled as a redundant net): `tuya_bridge`, `job-hunter`, `f1-stream`.
`f1-stream` was extracted from this monorepo to `viktor/f1-stream` on
2026-06-05 (Woodpecker repo id 166); the old github source is archived and its
GHA-era Woodpecker repo (id 10) is deactivated.
Private-image pulls use the `ghcr-credentials` dockerconfigjson, cloned by the
kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit
**ALLOWLIST** of private-ghcr namespaces only (NOT cluster-wide; source
`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`). The credential is Vault
`secret/viktor/ghcr_pull_token`, a dedicated classic PAT scoped to
`read:packages` (UI-minted 2026-06-15; no longer the admin `github_pat` alias).
GitHub has no token-mint API, so rotation is manual: re-mint the classic
`read:packages` PAT → `vault kv patch secret/viktor ghcr_pull_token=…` →
targeted apply of `module.kyverno.kubernetes_secret.ghcr_credentials` (reads
Vault; avoids the git-crypt `tls-secret-sync` landmine on a locked clone), after
which Kyverno re-syncs the Secret to the allowlisted namespaces.
**Woodpecker-only (infra + large apps)**:
- `travel_blog`: 5.7GB content directory exceeds GHA limits
- Infra pipelines: require cluster access (terragrunt apply, certbot, build-cli)
### Migrated apps (issues #13–#27)
### Woodpecker Pipeline Files
f1-stream, job-hunter, tuya_bridge, beadboard, nextcloud-todos,
claude-agent-service, claude-memory-mcp, kms-website, Freedify,
instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`),
fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original
pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website,
k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify,
audiobook-search) now also land on ghcr.
Each project contains:
- `.woodpecker/deploy.yml`: kubectl set image + Slack notification
- `.woodpecker/build-fallback.yml`: Legacy full build pipeline (event: deployment, never auto-fires)
### Infra-owned images (issues #29 / #30)
### Woodpecker Repository IDs
Images owned by the infra repo build on GHA workflows **in the infra repo's own
`.github/workflows/`** (the github↔forgejo divergence was deliberately NOT
reconciled — the workflows were added to the GitHub lineage via PR):
Woodpecker API uses numeric IDs (not owner/name):
| Image | Workflow | Destination |
|-------|----------|-------------|
| chrome-service-novnc | `build-chrome-service-novnc.yml` | public `ghcr.io/viktorbarzin/chrome-service-novnc` |
| android-emulator | `build-android-emulator.yml` | public `ghcr.io/viktorbarzin/android-emulator` |
| infra CLI | `build-cli.yml` | DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli` |
| infra-ci | `build-infra-ci.yml` | private `ghcr.io/viktorbarzin/infra-ci` |
| Repo | ID |
|------|------|
| infra | 1 |
| Website | 2 |
| finance | 3 |
| health | 4 |
| travel_blog | 5 |
| webhook-handler | 6 |
| audiblez-web | 9 |
| plotting-book | 43 |
| claude-memory-mcp | 78 |
| infra-onboarding | 79 |
**`infra-ci`** is the image the `.woodpecker/default.yml` apply step and
`drift-detection.yml` run in (proven by pipelines 165/166). `chatterbox-tts` is
already built by tripit's GHA → ghcr.
### Image Registry Flow
The Woodpecker `build-ci-image.yml` and `build-cli.yml` pipelines were
**REMOVED**. Break-glass for infra-ci is now a manual
`.woodpecker/breakglass-infra-ci.yml` (ghcr pull-and-save to the registry VM).
1. **Containerd hosts.toml** redirects pulls from docker.io and ghcr.io to pull-through cache at `10.0.20.10`
2. **Pull-through cache** serves cached images from LAN, fetches from upstream on cache miss
3. **Kyverno ClusterPolicy** auto-syncs `registry-credentials` Secret to all namespaces for private registry access
4. **Private registry** has been Forgejo's built-in OCI registry at `forgejo.viktorbarzin.me/viktor/<image>` since 2026-05-07. Auth via PAT (Vault `secret/ci/global/forgejo_push_token` for push, `secret/viktor/forgejo_pull_token` for pull). The pre-migration `registry:2.8.3`-based private registry on `registry.viktorbarzin.me:5050` was the root cause of three orphan-index incidents in three weeks (2026-04-13, 2026-04-19, 2026-05-04 — see `docs/post-mortems/2026-04-19-registry-orphan-index.md` and the full migration writeup at `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md`). The five pull-through caches on `10.0.20.10` (ports 5000/5010/5020/5030/5040) stay in place for upstream registries.
5. **Integrity probe** (`registry-integrity-probe` CronJob in `monitoring` ns, every 15m) walks `/v2/_catalog` → tags → indexes → child manifests via HEAD and pushes `registry_manifest_integrity_failures` to Pushgateway; alerts `RegistryManifestIntegrityFailure` / `RegistryIntegrityProbeStale` / `RegistryCatalogInaccessible` page on broken state. Authoritative check (HTTP API, not filesystem).
### Forgejo container registry — FROZEN
### Infra Pipelines (Woodpecker-only)
Issue #32 wiped all `viktor/*` container packages (~19G reclaimed, `/data`
58%→20%). The registry is **break-glass-only** now; nothing pushes to it. The
`forgejo-cleanup` CronJob stays in `DRY_RUN` (nothing to clean). Pull-through
caches on the registry VM (`10.0.20.10`) are unchanged. See
`docs/runbooks/forgejo-registry-breakglass.md`.
### Image registry / pull path
1. **Containerd `hosts.toml`** redirects pulls from docker.io and ghcr.io to the
pull-through cache at `10.0.20.10` (5000 = docker.io, 5010 = ghcr.io).
2. **Pull-through cache** serves cached images from the LAN, fetches upstream on
a miss.
3. **Kyverno ClusterPolicies** sync `ghcr-credentials` (private-ghcr allowlist)
and `registry-credentials` to namespaces.
## Woodpecker — what it still runs
Woodpecker is **deploy + cluster-touching steps only**:
| Pipeline | File | Purpose |
|----------|------|---------|
| default | `.woodpecker/default.yml` | Terragrunt apply on push |
| renew-tls | `.woodpecker/renew-tls.yml` | Certbot renewal cron |
| build-cli | `.woodpecker/build-cli.yml` | Build and push to dual registries |
| build-ci-image | `.woodpecker/build-ci-image.yml` | Build `infra-ci` tooling image (triggered by `ci/Dockerfile` change or manual); post-push HEADs every blob via `verify-integrity` step to catch orphan-index pushes |
| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered build for k8s-portal subdirectory |
| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` to `/opt/registry/` on `10.0.20.10` when any managed file changes; bounces containers + nginx per `docs/runbooks/registry-vm.md` |
| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE host |
| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new `docs/post-mortems/*.md` via headless Claude agent |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift detection |
| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues |
| per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) |
| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) |
| certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) |
| provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change |
| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE |
| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues |
| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new post-mortems |
| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered deploy for the portal |
| breakglass-infra-ci | `.woodpecker/breakglass-infra-ci.yml` | **Manual** ghcr pull-and-save of infra-ci to the registry VM |
**No build/test pipeline exists on any repo.** Do not (re)introduce one.
### Woodpecker API
Uses **numeric repo IDs** (`/api/repos/<id>/pipelines`), NOT owner/name paths
(those return HTML). The deploy registration for each app is the **GitHub
mirror** repo (registered github-forge). IDs are stable across renames and must
be looked up from the Woodpecker UI/DB.
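
As a minimal sketch of the numeric-ID rule, the shell helper below builds the pipeline endpoint and refuses anything that is not a numeric ID. The function name `woodpecker_pipeline_url` is hypothetical and not part of the repo; only the host and the `/api/repos/<id>/pipelines` path come from this document.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not repo code):
# build the Woodpecker pipeline endpoint from a numeric repo ID and
# reject owner/name style paths, which the API does not accept.
woodpecker_pipeline_url() {
  case "$1" in
    ''|*[!0-9]*)
      echo "repo ID must be numeric, got: '$1'" >&2
      return 1
      ;;
  esac
  printf 'https://ci.viktorbarzin.me/api/repos/%s/pipelines\n' "$1"
}

# Example: the infra repo on the legacy GitHub forge is ID 1.
woodpecker_pipeline_url 1
```

Validating the ID before calling `curl` turns an accidental `owner/name` argument into an immediate error rather than an HTML response to debug.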
### Woodpecker YAML gotchas
- Commands with `${VAR}:${VAR}` must be **quoted** — an unquoted `:` triggers
YAML map parsing when the vars are empty.
- Use `bitnami/kubectl:latest` (not pinned versions — entrypoint compatibility).
- Global secrets must include `manual` in their events list for API-triggered
pipelines.
### GitHub repo secrets
Per repo: `WOODPECKER_TOKEN` (POST the deploy pipeline), `FORGEJO_GIT_TOKEN`
(write:repository PAT for the `svu` tag push). ghcr push uses the workflow's
built-in `GITHUB_TOKEN` (`packages: write`).
## Infra repo CI topology
The infra repo runs on Woodpecker via **two** forge registrations: the Forgejo
forge (repo id 82, registered 2026-06-08) and the legacy GitHub forge (repo id
1). Pushes to **Forgejo** `master` fire `.woodpecker/default.yml`
(changed-stacks terragrunt apply, in `infra-ci`) plus the `notify-nonadmin-push`
Slack audit step. Operational facts (2026-06-10):
- **Webhook URL is the IN-CLUSTER service**:
`http://woodpecker-server.woodpecker.svc.cluster.local/api/hook?...` (PATCHed
via the Forgejo API). The Woodpecker default (`https://ci.viktorbarzin.me/...`)
resolves to the non-proxied public A record from pods → NAT hairpin →
intermittent `context deadline exceeded`, silently dropping push events. If
Woodpecker "repairs" the repo it rewrites the hook back to `ci.viktorbarzin.me`
— re-apply the in-cluster URL.
- **Repo-scoped secrets must exist on BOTH repos**: pipelines reference
repo-level secrets (`registry_ssh_key`, `pve_ssh_key`, `CLOUDFLARE_TOKEN`, …).
When registering a new forge repo for infra, clone the secret set too.
- **Empty commits defeat path filters**: a commit with no changed files makes
Woodpecker include ALL workflow files (path conditions can't exclude), so every
repo secret must resolve. Normal commits with real files only compile the
matching workflows.
The Forgejo trigger is not fully dependable — land infra changes by pushing
Forgejo master (as viktor), use `[ci skip]` for docs/no-op commits, and verify
deploys via `scripts/tg` + live cluster state rather than trusting the CI
checkmark. The two remotes have **diverged** (parallel histories under
different SHAs); expect github pushes to reject non-fast-forward and leave them
— never force-push.
## Configuration
### GitHub Actions
**File**: `.github/workflows/build-and-deploy.yml`
### GitHub Actions (per-app `.github/workflows/build.yml`)
```yaml
name: Build and Deploy
name: build
on:
push:
branches: [main, master]
branches: [master]
jobs:
build:
runs-on: ubuntu-latest
permissions:
contents: write # svu tag push
packages: write # ghcr push
steps:
- name: Build Docker image
run: docker build --platform linux/amd64 -t viktorbarzin/app:${SHORT_SHA} .
- name: Push to DockerHub
run: docker push viktorbarzin/app:${SHORT_SHA}
- name: Trigger Woodpecker Deploy
- uses: actions/checkout@v4
- name: lint + test
run: make lint test
- name: svu tag -> Forgejo
run: |
curl -X POST https://ci.viktorbarzin.me/api/repos/<REPO_ID>/pipelines \
-H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}"
VERSION=$(svu next)
# ... push tag to canonical Forgejo with FORGEJO_GIT_TOKEN
- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v6
with:
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/<name>:${{ github.sha }}
ghcr.io/viktorbarzin/<name>:latest
deploy:
needs: build
runs-on: ubuntu-latest
steps:
- name: Trigger Woodpecker deploy
run: |
curl -X POST https://ci.viktorbarzin.me/api/repos/<DEPLOY_REPO_ID>/pipelines \
-H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}" \
-d '{"branch":"master","variables":{"IMAGE_TAG":"...","IMAGE_NAME":"..."}}'
```
**Required GitHub Secrets**:
- `DOCKERHUB_USERNAME`
- `DOCKERHUB_TOKEN`
- `WOODPECKER_TOKEN`
### Woodpecker Deploy Pipeline
**File**: `.woodpecker/deploy.yml`
### Woodpecker deploy pipeline (per-app `.woodpecker/deploy.yml`)
```yaml
when:
event: [deployment]
event: manual
steps:
deploy:
image: bitnami/kubectl:latest
image: bitnami/kubectl:latest # uses the in-cluster woodpecker-agent SA (cluster-admin)
commands:
- kubectl set image deployment/app app=viktorbarzin/app:${CI_COMMIT_SHA:0:8}
secrets: [k8s_token]
- "kubectl set image deployment/app app=${IMAGE_NAME}:${IMAGE_TAG} -n <ns>"
- "kubectl rollout status deployment/app -n <ns> --timeout=300s"
notify:
image: plugins/slack
settings:
webhook: ${SLACK_WEBHOOK}
when:
status: [success, failure]
```
**YAML Gotchas**:
- Commands with `${VAR}:${VAR}` syntax must be quoted to prevent YAML map parsing when vars are empty
- Use `bitnami/kubectl:latest` (not pinned versions)
- Global secrets must be manually added to `secrets:` list in pipeline
### CI/CD secrets sync
### Vault Configuration
**K8s Auth for Woodpecker**:
- Woodpecker pipelines authenticate using ServiceAccount JWT
- Vault K8s auth mount validates JWT and issues token
- Policies grant access to secrets and dynamic credentials
### CI/CD Secrets Sync
**CronJob**: Pushes `secret/ci/global` from Vault → Woodpecker API every 6 hours
- Keeps Woodpecker global secrets in sync with Vault
- Runs in `woodpecker` namespace
A CronJob in the `woodpecker` namespace pushes `secret/ci/global` from Vault →
the Woodpecker API every 6h, keeping global secrets in sync. Woodpecker deploy
pipelines authenticate to the cluster via the in-cluster `woodpecker-agent` SA
(cluster-admin); Vault K8s auth backs any secret reads.
## Decisions & Rationale
### Why GitHub Actions + Woodpecker?
### Why all builds off-infra (ADR-0002)?
**Alternatives considered**:
1. **Woodpecker-only**: Simple, but wastes cluster resources on builds
2. **GHA-only**: No cluster access, requires kubectl from outside (security risk)
3. **Hybrid (chosen)**: GHA for compute-heavy builds (free), Woodpecker for privileged deployments (secure cluster access)
- **Breaks the circular dependency** — the images needed to repair the cluster
no longer live inside it (they're on ghcr, an external registry).
- **Removes build IO + registry push load** from the contended homelab spindle.
- GHA is free on public repos and generous on private; buildx provenance:false
sidesteps the orphaned-index-children failure class that plagued the
in-cluster registry.
- **Clean cut** — no in-cluster fallback builds anywhere; one pattern,
fleet-wide.
**Benefits**:
- Free compute for builds on public repos
- Cluster access stays internal (Woodpecker has direct K8s access)
- Separation of concerns: build vs deploy
### Why ghcr (not push back to Forgejo)?
### Why 8-Character SHA Tags (Not :latest)?
Forgejo's container registry repeatedly orphaned OCI index children
(2026-04-13/19, 2026-05-04, 2026-06-10) and its retention is not container-aware.
ghcr is external (DR-safe), free for this scale, and has native multi-arch
handling. The Forgejo registry was frozen + emptied (issue #32).
- Pull-through cache serves stale `:latest` tags indefinitely
- SHA tags ensure every deployment pulls the correct image
- 8 characters provide sufficient collision resistance (16^8 = 4.3 billion combinations)
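
The tag arithmetic can be checked with a short shell sketch. The helper name `sha8` and the sample commit SHA are assumptions for illustration; only the 8-character tag form and the 16^8 figure come from the text.

```shell
#!/bin/sh
# Hypothetical helper: truncate a full commit SHA to the 8-character tag form.
sha8() {
  printf '%.8s\n' "$1"
}

# 8 hex characters carry 32 bits, so the tag space is 16^8 = 2^32 values.
tag_space=$((1 << 32))

sha8 a1b2c3d4e5f60718293a4b5c6d7e8f9012345678   # prints a1b2c3d4
echo "$tag_space"                               # prints 4294967296
```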
### Why Woodpecker stays for deploy?
### Why Numeric Repo IDs for Woodpecker API?
`kubectl set image` needs in-cluster privileged access; doing it from GHA would
mean exposing kube-apiserver or a long-lived kubeconfig. Woodpecker's
`woodpecker-agent` SA is already cluster-admin in-cluster — the deploy step
needs no credentials.
- Woodpecker API requires numeric IDs (not owner/name slugs)
- IDs are stable across repo renames
- Must be manually looked up from Woodpecker UI or database
### Why `event: manual` on deploy.yml?
### Why linux/amd64 Only?
The Forgejo→GitHub push-mirror sends raw, tag-less pushes to the GitHub mirror.
If `deploy.yml` fired on `push`, every mirror sync would trigger a deploy with no
image tag. `manual` means only the GHA `deploy` job's explicit API POST (with
`IMAGE_TAG`) deploys.
- Cluster runs on x86_64 nodes only
- ARM builds would waste time and storage
- Multi-arch images add complexity without benefit
### Why linux/amd64 only?
The cluster runs on x86_64 nodes only; ARM builds waste time and storage.
## Troubleshooting
### GHA Build Fails: "denied: requested access to the resource is denied"
### GHA build fails: ghcr push "denied"
**Cause**: DockerHub credentials expired or incorrect
The workflow `GITHUB_TOKEN` needs `packages: write` permission and the package
must allow the repo to push. Check the workflow `permissions:` block and the
package's "Manage Actions access" settings.
### Image pull fails: "ErrImagePull" / "ImagePullBackOff"
**Fix**:
```bash
# Regenerate DockerHub token
# Update GitHub repo secrets: DOCKERHUB_USERNAME, DOCKERHUB_TOKEN
# Public image — check the pull-through cache is up
curl http://10.0.20.10:5010/v2/_catalog
# Private image — verify the ghcr-credentials Secret exists in the namespace
kubectl get secret ghcr-credentials -n <namespace>
# It's Kyverno-synced to an allowlist; if missing, the namespace isn't on the
# allowlist in stacks/kyverno/modules/kyverno/ghcr-credentials.tf
```
### Woodpecker Deploy Fails: "Unauthorized"
If the cause is the internal-DNS hairpin (fresh pulls timing out on the public
Forgejo path), see the CoreDNS `viktorbarzin.me` carve-out in
`docs/architecture/networking.md` and `docs/runbooks/registry-vm.md`.
**Cause**: Vault K8s auth token expired or invalid
### Deploy didn't happen after a push
**Fix**:
```bash
# Restart Woodpecker pipeline (token auto-renewed)
# Check Vault K8s auth role exists: vault read auth/kubernetes/role/woodpecker-deployer
```
Confirm the push was to **master** (feature branches build/deploy nothing).
Check the GHA run completed the `deploy` job, then check Woodpecker received the
manual pipeline (`ci.viktorbarzin.me`, the GitHub-mirror deploy repo). Verify
live with `kubectl rollout status` — not the CI checkmark.
### Image Pull Fails: "ErrImagePull"
### Woodpecker deploy fails: "YAML: did not find expected key"
**Cause**: Pull-through cache or registry credentials issue
**Fix**:
```bash
# Check pull-through cache is running
curl http://10.0.20.10:5000/v2/_catalog
# Verify registry-credentials Secret exists in namespace
kubectl get secret registry-credentials -n <namespace>
# Manually sync credentials if missing
kubectl get secret registry-credentials -n default -o yaml | \
sed 's/namespace: default/namespace: <namespace>/' | kubectl apply -f -
```
### Woodpecker Pipeline: "YAML: did not find expected key"
**Cause**: Unquoted command with `${VAR}:${VAR}` syntax when VAR is empty
**Fix**: Quote the command:
```yaml
commands:
- "kubectl set image deployment/app app=viktorbarzin/app:${SHORT_SHA}"
```
### travel_blog Build Times Out on GHA
**Cause**: 5.7GB content directory exceeds GHA disk/time limits
**Fix**: Keep on Woodpecker (no migration). Build uses cluster storage and resources.
### CI/CD Secrets Out of Sync
**Cause**: CronJob failed to sync Vault → Woodpecker
**Fix**:
```bash
# Check CronJob status
kubectl get cronjob -n woodpecker
# Manually trigger sync
kubectl create job --from=cronjob/sync-secrets manual-sync -n woodpecker
```
Unquoted command with `${VAR}:${VAR}` syntax when a VAR is empty. Quote the
command (see the deploy.yml example above).
## Related
- [Databases Architecture](./databases.md) — Database credentials via Vault
- [Multi-Tenancy](./multi-tenancy.md) — Per-user Woodpecker access
- Runbook: `../runbooks/deploy-new-app.md` — How to set up CI/CD for a new app
- Runbook: `../runbooks/troubleshoot-image-pull.md` — Debug image pull issues
- Vault documentation: K8s auth configuration
- Woodpecker documentation: API reference
- ADR: `../adr/0002-all-image-builds-off-infra-gha-ghcr.md` — the decision
- [Databases Architecture](./databases.md) — database credentials via Vault
- [Multi-Tenancy](./multi-tenancy.md) — per-user Woodpecker access
- Runbook: `../runbooks/forgejo-registry-breakglass.md` — using the frozen registry
- Runbook: `../runbooks/registry-vm.md` — pull-through cache VM + image-pull debugging
- Onboarding tool: `../../scripts/offinfra-onboard` + `../../scripts/offinfra-templates/`

@ -22,9 +22,11 @@ graph TB
NODE2["VM 202: k8s-node2<br/>8c / 32GB"]
NODE3["VM 203: k8s-node3<br/>8c / 32GB"]
NODE4["VM 204: k8s-node4<br/>8c / 32GB"]
NODE5["VM 205: k8s-node5<br/>8c / 32GB"]
NODE6["VM 206: k8s-node6<br/>8c / 32GB"]
end
subgraph K8s["Kubernetes Cluster v1.34.2"]
subgraph K8s["Kubernetes Cluster v1.34.8"]
direction TB
subgraph VPA["VPA (Goldilocks - Initial Mode)"]
@ -62,7 +64,7 @@ graph TB
| Model | Dell PowerEdge R730 |
| CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) |
| Total Cores/Threads | 22 cores / 44 threads |
| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). VMs use ~176GB total (k8s-node1 48GB + 4 K8s VMs x 32GB) |
| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). K8s VMs use ~240GB total (k8s-node1 48GB + 6 K8s VMs x 32GB) |
| GPU | NVIDIA Tesla T4 (16GB GDDR6, PCIe 0000:06:00.0) |
| Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD |
| Hypervisor | Proxmox VE |
@ -76,8 +78,10 @@ graph TB
| k8s-node2 | 202 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node3 | 203 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node4 | 204 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node5 | 205 | 8 | 32GB | vmbr1:vlan20 (10.0.20.105) | Worker (joined 2026-05-26) | None |
| k8s-node6 | 206 | 8 | 32GB | vmbr1:vlan20 (10.0.20.106) | Worker (joined 2026-05-26) | None |
**Total Cluster Resources**: 48 vCPUs, ~176GB RAM (k8s-node1 48GB + 4 nodes x 32GB)
**Total Cluster Resources**: 64 vCPUs, ~240GB RAM (k8s-node1 16c/48GB + master and 5 workers at 8c/32GB each)
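
The updated totals follow from the per-VM sizes in the table; a quick shell check of the arithmetic, using illustrative variable names that are not taken from any script in the repo:

```shell
#!/bin/sh
# Per-VM sizes as stated in this document:
#   k8s-node1: 16 vCPU / 48 GB
#   k8s-master and k8s-node2..6 (6 VMs): 8 vCPU / 32 GB each
big_vcpu=16; big_ram_gb=48
std_count=6; std_vcpu=8; std_ram_gb=32

total_vcpu=$((big_vcpu + std_count * std_vcpu))
total_ram_gb=$((big_ram_gb + std_count * std_ram_gb))

echo "vCPUs: $total_vcpu"      # prints vCPUs: 64
echo "RAM GB: $total_ram_gb"   # prints RAM GB: 240
```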
> **All Linux VMs are hand-managed in Proxmox, NOT in Terraform**
> (decided 2026-05-26, commit 44c3770a). The telmate/proxmox v3.0.2
@ -97,7 +101,12 @@ graph TB
> PVE host (sources in `infra/scripts/`, install pattern per
> `architecture/backup-dr.md`). Timer fires `OnBootSec=5min` +
> `OnCalendar=hourly`, so any drift (config restore, manual `qm
> set`, fresh clone) self-heals within the hour. Current caps:
> set`, fresh clone) self-heals within the hour. The script compares
> *normalized option sets*, so an unchanged config is a true no-op —
> until 2026-06-11 a raw string compare (defeated by `qm config`'s
> canonical key order) re-issued `qm set` hourly against running VMs,
> live-rewriting QEMU throttle state via QMP (implicated in the devvm
> I/O stall; see `post-mortems/2026-06-11-devvm-qemu-io-stall.md`). Current caps:
> 102 devvm 60/60, 103 home-assistant 40/40, 200 k8s-master 100/60,
> 201 k8s-node1 150/120, 202 k8s-node2 150/120, 203 k8s-node3 150/120,
> 204 k8s-node4 150/120, 220 docker-registry 40/40.

@ -258,19 +258,27 @@ The TP-Link AP (dumb AP on 192.168.1.x) does not support hairpin NAT. LAN client
Technitium's **Split Horizon AddressTranslation** app post-processes DNS responses for 192.168.1.0/24 clients, translating the public IP to the internal Traefik LB IP:
```
176.12.22.76 → 10.0.20.200
176.12.22.76 → 10.0.20.203
```
(Was `10.0.20.200` until Traefik's 2026-05-30 move to its dedicated `.203` LB IP.)
**DNS Rebinding Protection** has `viktorbarzin.me` in `privateDomains` to allow the translated private IP without being stripped as a rebinding attack.
### Scope
- **Affected**: Non-proxied domains (ha-sofia, immich, headscale, calibre, vaultwarden, etc.) for 192.168.1.x clients
- **Not affected**: Cloudflare-proxied domains (resolve to Cloudflare edge IPs, no translation needed)
- **Not affected**: 10.0.x.x and K8s clients (reach public IP via pfSense outbound NAT normally)
- **10.0.x.x clients (k8s nodes, devvm, other VMs)** — handled at the resolver since 2026-06-10: **pfSense Unbound carries a domain override forwarding the whole `viktorbarzin.me` zone to Technitium** (`10.0.20.201`). Technitium's split-horizon zone answers with the zone apex A record, which auto-tracks the live Traefik LB IP (`technitium-ingress-dns-sync` CNAMEs every ingress host hourly; `viktorbarzin-apex-probe` is the drift canary). Every client of pfSense Unbound — all VLANs, k8s nodes included — therefore gets internal answers with **zero per-host configuration** (no `/etc/hosts` pins, no resolved drop-ins; both earlier same-day approaches were removed, nodes are stock). Names not behind Traefik keep distinct records in the zone (e.g. `mail.viktorbarzin.me → 10.0.20.1`, verified working on :993/:25; since 2026-06-10 its :443 also works internally — pfSense carries an SNI-routed HAProxy frontend on 443 that sends hostname traffic to Traefik and bare-IP/no-SNI traffic to the webGUI, which moved to :8443; see `docs/runbooks/mailserver-pfsense-haproxy.md`). See `docs/runbooks/pfsense-unbound.md` for the override config + rollback, and `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md` for the incident that motivated this (kubelet forgejo pulls riding the broken hairpin; the containerd hosts.toml mirror cannot fix it — Traefik 404s bare-IP requests and the registry auth realm is an absolute public URL).
- **devvm**: also covered by a `~viktorbarzin.me → 10.0.20.201` resolved routing domain (predates the pfSense override, provisioned by `setup-devvm.sh`) — redundant-but-harmless belt-and-suspenders.
- **in-cluster PODS are ordinary internal clients too** (since 2026-06-10 evening): CoreDNS's dedicated `viktorbarzin.me:53` block (in `stacks/technitium`, TF-managed) forwards to the Technitium ClusterIP (`10.96.0.53`, same as the `.lan` block), so pods get the same split-horizon answers as everyone else. This works because on k8s 1.34 **pods CAN reach the ETP=Local Traefik LB IP** — kube-proxy short-circuits in-cluster traffic to LB IPs via the cluster path (verified from pods on three non-Traefik nodes; re-verify after major k8s upgrades — the canary is the uptime-kuma `[External]` fleet going red). forgejo stays pinned to Traefik's **ClusterIP** in the same block so CI pushes survive a Technitium outage. History: the block briefly forwarded to `8.8.8.8/1.1.1.1` (morning of 2026-06-10), which kept pods on public IPs and the broken TP-Link NAT loopback — 27 non-proxied `[External]` uptime-kuma monitors dark (beads code-yh33). Note: in-cluster `[External]` monitors now test DNS+Traefik+service via the internal path for ALL names, including Cloudflare-proxied ones — genuine edge-path fidelity is the job of a true external vantage (ha-london), not in-cluster probes.
- **Trade-off**: `viktorbarzin.me` resolution via pfSense now depends on in-cluster Technitium (3 replicas). During a full cluster outage the zone SERVFAILs LAN-wide — acceptable, the services behind it are down anyway; node bootstrap images pull via the IP-addressed `10.0.20.10` mirrors, so cold-start self-unwinds.
- **Residual nondeterminism**: nodes keep `94.140.14.14` as a secondary resolver (netplan/qm `--nameserver`). If systemd-resolved fails over to it during a pfSense DNS blip, `.me` answers are public again until it switches back — a rare, self-healing window, accepted.
Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h).
**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too.
## NodeLocal DNSCache
A DaemonSet in `kube-system` (`node-local-dns`, image `registry.k8s.io/dns/k8s-dns-node-cache:1.23.1`) runs on every node including the control plane. Each pod uses `hostNetwork: true` + `NET_ADMIN` and installs iptables NOTRACK rules so it transparently serves DNS on both:
@ -456,13 +464,21 @@ The zone-sync CronJob (runs every 30min) pushes the following to the Prometheus
### Hairpin NAT Not Working (LAN → *.viktorbarzin.me Fails)
Since 2026-04-19 (Workstream D), pfSense Unbound answers LAN DNS queries
directly instead of forwarding to Technitium, so the Technitium Split Horizon
post-processing does NOT run for 192.168.1.x clients anymore. Non-proxied
services break hairpin on LAN clients again. Options:
**Since 2026-06-10 this is largely solved at the resolver**: pfSense Unbound
carries a domain override forwarding the entire `viktorbarzin.me` zone to
Technitium, so ANY client that queries pfSense (all VLANs + 192.168.1.x
clients pointed at `192.168.1.2`) gets the internal Traefik answer. If
hairpin still fails for a client, first check which resolver it actually
uses — clients on the TP-Link's own DHCP DNS (router/ISP) bypass pfSense
entirely. Options for those:
(Historical context: 2026-04-19 Workstream D made Unbound answer LAN
queries directly, which had removed the Technitium Split Horizon
post-processing from the LAN path until the 2026-06-10 domain override
restored internal answers at the zone level.)
1. **Switch service to proxied Cloudflare** (preferred) — set `dns_type = "proxied"` in the `ingress_factory` module call; DNS now resolves to Cloudflare edge, hairpin-independent.
2. **Add a local-data override on pfSense Unbound** — under `Services → DNS Resolver → Host Overrides`, set `<service>.viktorbarzin.me → 10.0.20.200` (Traefik LB IP). This is equivalent to what Split Horizon did, applied at the resolver.
2. **Add a local-data override on pfSense Unbound** — under `Services → DNS Resolver → Host Overrides`, set `<service>.viktorbarzin.me → 10.0.20.203` (Traefik LB IP). This is equivalent to what Split Horizon did, applied at the resolver.
3. **Revert to prior NAT rdr + Technitium Split Horizon** — documented in `docs/runbooks/pfsense-unbound.md` rollback section.
K8s-side Split Horizon is still configured and applies when `*.viktorbarzin.me` queries DO reach Technitium (e.g., from pods that query via CoreDNS → Technitium forwarding for `.viktorbarzin.me` via pfSense). Verify Technitium split-horizon app:
@ -470,7 +486,7 @@ K8s-side Split Horizon is still configured and applies when `*.viktorbarzin.me`
1. Verify Split Horizon app is installed on all instances
2. Check CronJob status: `kubectl get cronjob -n technitium technitium-split-horizon-sync`
3. Run the job manually: `kubectl create job --from=cronjob/technitium-split-horizon-sync test-sh -n technitium`
4. Test: `dig @10.0.20.201 immich.viktorbarzin.me` — should return 10.0.20.200 for 192.168.1.x source
4. Test: `dig @10.0.20.201 immich.viktorbarzin.me` — should return 10.0.20.203 for 192.168.1.x source
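The `dig` test in step 4 can be checked mechanically. The sketch below assumes the output of `dig +short` has already been captured as a string; `resolved_ips` and `split_horizon_ok` are illustrative helper names, and the expected address defaults to the Traefik LB IP given above.

```python
import ipaddress


def resolved_ips(dig_short_output):
    """Extract IP addresses from `dig +short` output, skipping CNAME targets."""
    ips = []
    for line in dig_short_output.splitlines():
        try:
            ips.append(ipaddress.ip_address(line.strip()))
        except ValueError:
            continue  # CNAME hostnames and blank lines are not addresses
    return ips


def split_horizon_ok(dig_short_output, expected="10.0.20.203"):
    """True only if every returned address is the internal Traefik LB IP."""
    ips = resolved_ips(dig_short_output)
    return bool(ips) and all(ip == ipaddress.ip_address(expected) for ip in ips)
```

A public Cloudflare or WAN address in the answer makes the check fail, which is the symptom of a client bypassing the internal resolver.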
### Zone Not Replicating to Secondary/Tertiary


@ -119,12 +119,18 @@ no `level` stream label.
cluster error/warn line counts (5-min window) → `sensor.cluster_log_errors_5m` /
`sensor.cluster_log_warnings_5m`, for a compact trend card on the Барзини status
view plus a Grafana-link button. Those sensors reach Loki via the Traefik LB IP
`10.0.20.203` + a `Host: loki.viktorbarzin.lan` header (`verify_ssl: false`)
because `loki.viktorbarzin.lan` has **no Technitium record yet** (the
`technitium-ingress-dns-sync` CronJob only creates `.me` CNAMEs + pins
`ingress.viktorbarzin.lan`). **Follow-up:** register `loki.viktorbarzin.lan` in
Technitium (or fix the `*.viktorbarzin.lan` wildcard) so both this sensor and the
Sofia-Pi promtail can resolve it by name instead of pinning the LB IP.
`10.0.20.203` + a `Host: loki.viktorbarzin.lan` header (`verify_ssl: false`).
**Update 2026-06-10:** `loki.viktorbarzin.lan` is now **registered in Technitium**
as a CNAME → `ingress.viktorbarzin.lan` (the anchor whose A record auto-tracks the
live Traefik LB IP), added via the Technitium API and AXFR-replicated to all 3
instances — so it resolves by name LAN-wide. The **PVE host** promtail (see
"External host: pve" below) uses the name directly, with **no `/etc/hosts` pin**.
This HA sensor and the rpi-sofia promtail still pin the LB IP in their own configs
and can drop to the name on next touch (`verify_ssl: false` / `insecure_skip_verify`
stays — the internal `.lan` cert isn't publicly trusted). Per-host `.lan` CNAMEs
are still added manually via the API; auto-managing them in
`technitium-ingress-dns-sync` (today `.me`-only + the `ingress.viktorbarzin.lan`
anchor) remains a follow-up.
### External host: rpi-sofia (Sofia Raspberry Pi)
@ -140,12 +146,29 @@ Query examples (Grafana → Loki): `{job="rpi-sofia-journal"}`, `{job="rpi-sofia
**Dashboard** — `dashboards/rpi-sofia.json` ("RPi Sofia", Hardware folder): status, undervoltage/throttle, SoC temp, load, memory, root-fs free + read-only, network.
**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`rpi_under_voltage_occurred==1`), `RpiSofiaHighTemp`.
**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`increase(rpi_under_voltage_occurred[1h])>0` — edge-triggered on the sticky bit; the live `rpi_under_voltage_now` bit is too transient to catch at 1-min sampling, so it fires on a *new* brown-out and auto-resolves ~1h later instead of latching until reboot), `RpiSofiaHighTemp`.
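To see why the edge-triggered `RpiSofiaUndervoltage` expression auto-resolves, the sketch below simulates it on 1-minute samples. It is a simplified model under stated assumptions: `increase` here sums positive deltas with counter-reset handling and does no extrapolation, unlike real PromQL, and the 60-sample window stands in for `[1h]`.

```python
def increase(samples, window):
    """Simplified PromQL-style increase over the last `window` samples."""
    recent = samples[-window:]
    total = 0
    for prev, cur in zip(recent, recent[1:]):
        # A drop is treated as a counter reset, so counting restarts from zero.
        total += cur - prev if cur >= prev else cur
    return total


def firing_minutes(series, window=60):
    """Minutes at which `increase(series[window]) > 0` would be true."""
    return [
        t for t in range(len(series))
        if increase(series[: t + 1], window) > 0
    ]
```

With a sticky bit that flips from 0 to 1 at minute 10 and stays set, the modelled alert fires from minute 10 and clears once the transition leaves the window, whereas a level check (`== 1`) would stay red until reboot.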
**Recovery** — a systemd hardware watchdog (`RuntimeWatchdogSec=14s`, bcm2835 max ~15s) auto-reboots the Pi on a hard hang instead of leaving it dead for hours.
> The cluster side (scrape job, alerts, Loki ingress, dashboard) is Terraform-managed in `stacks/monitoring/`. The **Pi-side** pieces (node_exporter, the textfile collector + timer, promtail, the watchdog config, and the `server=/viktorbarzin.lan/192.168.1.2` dnsmasq split-horizon forward needed to resolve the Loki ingress) are configured by hand on the Pi — it is not under Terraform — and are backed up off-box at `/home/wizard/rpi-sofia-backup/`. The real reliability fix (reflash/replace the SD card) needs on-site access.
### External host: pve (Proxmox hypervisor, 192.168.1.127)
`pve` is the Proxmox VE host — the hypervisor running **every** VM (pfSense, the 5 k8s nodes, the devvm, HA, Windows). It is not in the cluster. Since 2026-06-10 its **full systemd journal ships to cluster Loki**, closing a gap (the most critical host previously had no central logging) and giving the Wave-1 **S1** security rule its data source (`docs/architecture/security.md`).
**Why now:** emo's Claude agent was granted **root SSH** to the host (a dedicated shared-root key `emo-pve-agent@devvm`, fingerprint `SHA256:Wd+m0EABlm4RDDykDh85PIYSqe0Al8Hr9AZ+7Ksy4HQ`, reachable as `ssh pve` from the devvm) so he can manage the host (e.g. the R730 fan daemon) via his agent. To keep an audit trail, **snoopy** (enabled via `/etc/ld.so.preload` → `libsnoopy.so`; config `scripts/pve-snoopy.ini`) logs every `execve()` to journald under identifier `snoopy`, and promtail ships it to Loki.
**Logs** — `promtail` v3.5.1 (amd64) at `/usr/local/bin/promtail`, config `scripts/pve-promtail.yaml`, unit `scripts/pve-promtail.service`. Ships `/var/log/journal` to `https://loki.viktorbarzin.lan/loki/api/v1/push` (`insecure_skip_verify` — the internal `.lan` cert isn't publicly trusted; the name resolves via the Technitium CNAME above, no `/etc/hosts` pin). Relabels: `unit`, `level`, `identifier`; sshd lines (`identifier=~"sshd.*"`) are re-jobbed to `sshd-pve` so the S1 rule matches. Streams:
- `{job="pve-journal", host="pve"}` — full host journal (kernel, pvestatd, fan-control, NFS, etc.).
- `{job="pve-journal", identifier="snoopy"}`: **command audit** (every execve: `uid login tty sid cwd cmdline`).
- `{job="sshd-pve"}` — sshd auth; an `Accepted publickey ... SHA256:<fp>` line ties a session to a key (e.g. emo's fp above). Feeds S1.
**Attribution caveat:** all SSH is shared-root, so snoopy `uid`/`login` are always `root`; attribute a command to a person by correlating its `sid`/timestamp with the matching `{job="sshd-pve"}` Accepted-publickey line (key fingerprint). emo's agent arrives SNAT'd as `192.168.1.2`, which is in the S1 allowlist, so legitimate access does not alert.
Query examples (Grafana → Loki): `{host="pve"}`, `{job="pve-journal", identifier="snoopy"}` (command audit), `{job="sshd-pve"} |= "Accepted publickey"`.
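The attribution step can be sketched as follows. This is a timestamp-only heuristic under stated assumptions: log lines are assumed to be already fetched from Loki as `(unix_ts, message)` pairs, `parse_accepts` and `attribute` are hypothetical helpers, and overlapping sessions would still need the `sid` correlation described above to disambiguate.

```python
import re
from bisect import bisect_right

# Matches the standard OpenSSH "Accepted publickey" auth log message.
ACCEPT_RE = re.compile(
    r"Accepted publickey for (\S+) from (\S+) port \d+ ssh2: \S+ (SHA256:\S+)"
)


def parse_accepts(sshd_lines):
    """Return sorted (ts, source_ip, fingerprint) tuples from sshd log lines."""
    accepts = []
    for ts, message in sshd_lines:
        match = ACCEPT_RE.search(message)
        if match:
            accepts.append((ts, match.group(2), match.group(3)))
    return sorted(accepts)


def attribute(command_ts, accepts):
    """Fingerprint of the most recent login at or before the command, else None."""
    index = bisect_right([ts for ts, _, _ in accepts], command_ts)
    return accepts[index - 1][2] if index else None
```

The returned fingerprint can then be compared against the known key fingerprints to name the person behind a snoopy command line.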
> Hand-managed (not Terraform), like the rpi-sofia and fan-control pieces: the promtail binary/config/unit and the snoopy enable (`/etc/ld.so.preload`) live on the host (Loki resolves via the Technitium CNAME — no `/etc/hosts` pin). Source-of-truth files: `scripts/pve-promtail.{yaml,service}` + `scripts/pve-snoopy.ini`; deploy steps are in the `pve-promtail.yaml` header.
### Dell R730 iDRAC: SNMP-primary + Redfish remnant (migrated 2026-06-05)
The R730 iDRAC (`192.168.1.4` / `idrac.viktorbarzin.lan`) is monitored by **two** Prometheus jobs, both relabeled to the `r730_idrac_*` prefix (which historically hid which source served what). Design/plan: `docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md`.


@ -541,11 +541,33 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1
**RBAC tiers:** `admin` (Viktor — cluster-admin, unlocked tree, secrets) · `power-user` (cluster-wide read-only, NO Secrets, via a dedicated `oidc-power-user-readonly` ClusterRole) · `namespace-owner` (admin in own namespace only). Each session acts as the user's **own** OIDC identity (kubelogin), never the admin's.
**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked.
**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.
**Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo at `~/code` — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. Changes are ungated (push ≠ apply); the real boundary is apply-time (`scripts/tg apply` needs an admin Vault token + cluster RBAC).
**Memory — homelab CLI hooks (rolled out 2026-06-21, deploy-fixed 2026-06-22):** the per-user `claude_memory` MCP was retired for the **homelab-memory hooks** — the reconcile's `install_memory` (re)installs four scripts into `~/.claude/hooks/` each run (`homelab-memory-recall.py` UserPromptSubmit recall, `auto-learn.py` Stop-hook extraction, `pre-compact-backup.sh`/`post-compact-recovery.sh`), wires them into `settings.json` if-absent + additive, and removes the old `claude_memory` MCP. **The provisioner binary itself now self-deploys from the repo** (step 0: `bash -n`-gated `install` + re-exec when `scripts/t3-provision-users.sh` differs from `/usr/local/bin/t3-provision-users`, guarded against re-exec loops / DRY_RUN mutation) — added after this very rollout sat committed-but-undeployed for a day (only the manual `setup-devvm.sh` had ever deployed the binary), so the hourly reconcile kept running the pre-memory version and emo/anca silently lost memory (recall + auto-learn never wired). A latent `set -e` abort in `install_memory` (a bare `[[ -d plugin-dir ]] && …` returning non-zero) was also fixed; it had killed the reconcile after the first user the first time it actually ran. The hooks need a `MEMORY_API_KEY` (or `CLAUDE_MEMORY_API_KEY`) in the user's `settings.json` env — the `homelab` CLI defaults the API URL, so **the key is the only hard requirement**; `install_memory` reuses an existing key and only WARNs if absent (it does NOT mint one — that's an admin Vault step, see Remaining). wizard + emo carry a key from their original MCP setup; **ancamilea is keyless → her memory no-ops until a key is minted.** (`auto-learn.py`'s passive store calls the API directly, so it additionally needs `*_API_URL` in env to avoid its local-SQLite fallback; recall + manual `homelab memory store` go through the URL-defaulting CLI and need only the key.)
**Status (2026-06-08):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, **per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), and the Authentik `T3 Users` edge gate (applied + verified)**. **Remaining (held / future):** the emo cutover to his own locked clone (Phase 5), the offboarding apply-side (Phase 7), per-user MCP/auth injection, and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning.
**Agent skills — vendored own-copies for an allowlist (2026-06-23):** beyond the config-inheritance base (above, which symlinks the admin's `~/.claude/skills` into every user), the reconcile's `install_skills` gives users on the `SKILL_USERS` allowlist (currently `emo`) their OWN copies of a curated skill set vendored in-repo at `scripts/workstation/claude-skills/` (16: the admin's 15 `mattpocock/skills` + `find-skills` from `vercel-labs/skills`). It copies each into `~/.agents/skills/<name>` (owned by the user, parent `~/.agents` chowned too — `install -d` leaves intermediates root-owned) and points `~/.claude/skills/<name>` at it with a **relative** symlink (`../../.agents/skills/<name>` — the layout `skills add -g` produces; Claude Code reads `~/.claude/skills/`). **Vendored, NOT `npx skills add`:** upstream drifted off this exact set (`diagnose` → `diagnosing-bugs`, `write-a-skill` → `writing-great-skills` renamed; `caveman` + `zoom-out` unpublished), so npx can't reproduce it — and a per-reconcile GitHub clone + unpinned-CLI dependency has no place in the hourly root job; refresh by re-snapshotting (`claude-skills/README.md`). **if-absent keys on the user's OWN copy** (a real dir under `~/.agents/skills`), so a steady-state reconcile is a no-op AND a stale or cross-user `~/.claude/skills` symlink is healed to the own copy — emo had `grill-me`/`file-issue` symlinked into the admin's home; `grill-me` is now emo's own (`file-issue` is outside the set, left as-is). A real dir squatting a name is never clobbered. Best-effort tail (`return 0`, like `install_memory`). Extend coverage = edit `SKILL_USERS`.
**Onboarding state self-heals (2026-06-15):** `~/.claude.json` is a single file that ALL of a user's concurrent `claude` processes (the ttyd terminal + their `t3-serve` instance + agent/SDK sessions) read-modify-write, so a stale writer periodically drops top-level keys — including `hasCompletedOnboarding` — which bounces the next *interactive* session back to the first-run "Choose the text style" wizard even though the user is fully logged in (credentials live in the SEPARATE `~/.claude/.credentials.json`, untouched by the race; first observed for emo 2026-06-15). The launcher (`skel/start-claude.sh`) now idempotently re-asserts `hasCompletedOnboarding` (+ `lastOnboardingVersion`) in `~/.claude.json` right before it runs `claude` — merge-only, never clobbers other keys, no-op if jq is missing or the file is empty/corrupt. And since the launcher is a per-user copy that `/etc/skel` only seeds at account creation, the reconcile's new `deploy_user_launcher` step re-copies `skel/start-claude.sh` into every non-admin home (copy-if-changed) so launcher edits now reach EXISTING users within the hour — `.tmux.conf` is deliberately NOT re-copied (terminal-lobby appends its own managed section to it).
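The merge-only re-assert can be illustrated in Python. The real launcher does this in shell with jq, so the following is only a sketch of the same semantics: `reassert_onboarding` and its `version` parameter are hypothetical, and the atomic rename is an added safety choice rather than something the launcher is known to do.

```python
import json
from pathlib import Path


def reassert_onboarding(path, version=None):
    """Set hasCompletedOnboarding without touching other keys.

    Returns True if the file was rewritten, False on a no-op
    (missing, empty, corrupt, or already asserted).
    """
    target = Path(path)
    try:
        data = json.loads(target.read_text())
    except (OSError, ValueError):
        return False  # missing, empty, or corrupt: leave the file alone
    if not isinstance(data, dict):
        return False
    wanted = {"hasCompletedOnboarding": True}
    if version is not None:
        wanted["lastOnboardingVersion"] = version
    if all(data.get(key) == value for key, value in wanted.items()):
        return False  # nothing to do, so the call stays idempotent
    data.update(wanted)  # merge only: every other top-level key survives
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(data, indent=2))
    tmp.replace(target)  # atomic rename so readers never see a partial file
    return True
```

Because the function returns early when the keys are already present, running it before every launch costs one read and no write in the steady state.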
**Claude Code runtime — native, per-user (2026-06-15):** `claude` is the **native** install (`~/.local/bin/claude` → `~/.local/share/claude/versions/<v>`, self-updating; `installMethod: native`) — NOT npm-global or npx. It is the runtime for both the ttyd launcher and each `t3-serve` instance. `setup-devvm.sh` installs node ONLY for the `t3` CLI (not claude); per-user native claude is provisioned by the reconcile's `install_user_claude_native` (covers terminal + t3, idempotent, skip-if-present) and self-bootstrapped by `start-claude.sh` on first launch — both via the official `https://claude.ai/install.sh`. The legacy machine-wide `npm install -g @anthropic-ai/claude-code` bootstrap and the launcher's `npx` fallback were removed; existing users had already auto-migrated to native, and the npm-global dir was empty. **PATH (`~/.local/bin`, where the native binary lives):** ensured three ways — `/etc/profile.d/10-local-bin.sh` for login shells (machine-wide, fresh-user-safe), `start-claude.sh` itself (the launcher runs in tmux's non-login env that skips the user's shell rc), and `t3-serve@.service` (`Environment=PATH=…:/home/%i/.local/bin`).
**Claude authentication — per-user, self-renewing, Vault-recoverable (2026-06-20):** every roster user logs in with their OWN Enterprise identity; shared `CLAUDE_CODE_OAUTH_TOKEN` injection was removed because environment auth outranks local login and collapses identity/audit/quota. Claude owns access-token refresh in `~/.claude/.credentials.json`. A system template timer (`claude-auth-sync@<user>.timer`, every 6h) renews a dedicated 32-day periodic Vault token, validates Claude with real non-persistent Haiku inference (`auth status` can lie during a 401), backs up only `claudeAiOauth` to `secret/workstation/claude-users/<os_user>`, and performs one atomic Vault restore/retry on failure while preserving `mcpOAuth`. Vault policy `workstation-claude-<os_user>` isolates every path; the roster generates policies for present and future users. A hard refresh-token revocation still requires the affected person to complete SSO—there is no supported noninteractive bypass. Loki alert `WorkstationClaudeAuthInvalid` surfaces exhausted recovery. Runbook: `../runbooks/claude-auth-renew-workstation.md`.
**Per-user browser MCP — playwright, reproducible from git (2026-06-16):** every user (incl. the admin) gets their OWN isolated `@playwright/mcp` server so their concurrent Claude sessions don't fight over tabs (`--isolated` → a fresh browser context per MCP connection), wired into Claude in **every directory** via a user-scope `~/.claude.json` entry (`playwright → http://localhost:<PLAYWRIGHT_PORT>/mcp`). Mechanism: **system-level template units** `playwright-mcp@<user>.service` + `playwright-snapshot-refresh@<user>.{service,timer}` (`User=%i`, sourced from `scripts/workstation/playwright/`, installed by `setup-devvm.sh` §9e — system manager, so NO systemd --user / linger). `roster_engine.py` allocates a sticky per-user `PLAYWRIGHT_PORT` (`PLAYWRIGHT_BASE_PORT=8931`); the reconcile's `install_playwright()` writes it, seeds the chrome-service snapshot token if-absent (staged from Vault `secret/chrome-service` to `/etc/t3-serve/chrome-service-token` by `setup-devvm.sh` §8c, since the hourly root reconcile has no Vault token), wires `~/.claude.json` by running `claude mcp add --scope user` AS the user (clobber-proof + if-absent, so it fixes existing/new/admin without rewriting a populated config), and `enable --now`s the instances (idempotent — never restarts a running server). The `@playwright/mcp` version is **pinned** in the unit (the `@latest`-silently-rolls-the-fleet footgun — see `T3_PIN`). Replaced the earlier hand-made `~/.config/systemd/user/playwright-*` units (one-time idle-gated migration; pre-migration emo/anca had servers running but never wired into their `.claude.json`). Cookie-warming pipeline + ops: `../runbooks/chrome-service-snapshot.md`.
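A sticky per-user port allocation like the one attributed to `roster_engine.py` could look like the sketch below. Only the base port `8931` comes from the text; the `allocate_ports` function, its arguments, and the choice to keep ports of departed users reserved are assumptions made for illustration.

```python
PLAYWRIGHT_BASE_PORT = 8931


def allocate_ports(roster_users, existing):
    """Assign each user a port, never changing a previously assigned one.

    `existing` is the {user: port} mapping persisted from earlier runs.
    """
    ports = dict(existing)
    used = set(ports.values())
    candidate = PLAYWRIGHT_BASE_PORT
    for user in roster_users:
        if user in ports:
            continue  # sticky: an allocated port is kept across runs
        while candidate in used:
            candidate += 1
        ports[user] = candidate
        used.add(candidate)
    return ports
```

Stickiness matters because the port is baked into each user's `~/.claude.json` entry, so a reshuffle on roster reordering would silently break existing wiring.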
**Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. Its location depends on the per-user `code_layout` in `roster.yaml`: `single` (default) puts the clone AT `~/code`; `workspace` makes `~/code` a plain directory of per-project clones — the infra clone at `~/code/infra` plus each roster `repos` entry cloned from Forgejo `viktor/<name>` **as the user** (their PAT authenticates, so private repos work; clone failures WARN and retry next hour). Flipping a user to `workspace` auto-migrates their existing `~/code` clone to `~/code/infra` (local branches/dirty state survive; running processes follow the moved inode). ancamilea = workspace + `tripit` since 2026-06-10. The provisioner clones infra anonymously from the public GitHub mirror; **contribute access is wired per-user on top** (see below). The apply boundary still holds (`scripts/tg apply` needs an admin Vault token + cluster RBAC), but **pushing `master` is NOT inert** — the Forgejo→Woodpecker webhook fires `.woodpecker/default.yml` (`event: push, branch: master`, `require_approval: forks` only), which terragrunt-applies changed stacks. `master` is **branch-protected on Forgejo** (force-push disabled for everyone — history is append-only; push + merge whitelists = `viktor` + explicitly granted users, deploy keys allowed). **Allow-then-audit (Viktor, 2026-06-10):** `ebarzin` (emo) is on the whitelist and pushes straight to `master` — no PR gate. 
The tracking burden moves to: (a) **commit messages that record what + why** (the agent instructions in AGENTS.md and the managed claudeMd require the body to paraphrase the user's request), (b) the **`notify-nonadmin-push` Slack audit step** in `.woodpecker/default.yml` — every master push by a non-admin author is posted to Slack (admin pushes are not), and (c) non-admins **never use `[ci skip]`** so every change fires the pipeline (and thus the audit feed). Users NOT on the whitelist fall back to `<user>/<topic>` branches + PRs. **Clones stay fresh automatically** (2026-06-10): the hourly `t3-provision-users` reconcile runs `refresh_user_clone` over every managed clone — the infra clone and any workspace repos (fetch all remotes + fast-forward `master`, ONLY when on master with a clean tree and an upstream — dirty trees and local commits are left alone with a WARN) — and also `wire_forgejo_remote`, which idempotently adds the documented `forgejo` remote + `forgejo/master` upstream to infra clones that predate that contract. `start-claude.sh` does the same freshen at session launch (10s fetch cap per repo so an offline remote never stalls the session; workspace layouts freshen each repo under `~/code`).
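The guard that `refresh_user_clone` applies before fast-forwarding can be summarised as a small decision function. This is a sketch of the rule as stated above, not the provisioner's shell code: `CloneState` and `refresh_action` are hypothetical, and gathering the state (via `git status`, `git rev-list`, and so on) is left out.

```python
from dataclasses import dataclass


@dataclass
class CloneState:
    branch: str         # currently checked-out branch
    dirty: bool         # uncommitted changes in the working tree
    has_upstream: bool  # master tracks a remote branch
    ahead: int          # local commits not present on the upstream


def refresh_action(state):
    """Return 'fast-forward' or a 'leave-alone: <reason>' string.

    Fetching is assumed to happen in every case; only the merge is gated.
    """
    if state.branch != "master":
        return "leave-alone: not on master"
    if state.dirty:
        return "leave-alone: dirty tree"
    if not state.has_upstream:
        return "leave-alone: no upstream"
    if state.ahead > 0:
        return "leave-alone: local commits"
    return "fast-forward"
```

Every non-fast-forward outcome leaves the user's work untouched, which is what makes the hourly job safe to run unattended.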
**Contribute access (per non-admin, manual — the anca/tripit PAT precedent):**
1. Add their Forgejo user as a **write** collaborator on `viktor/infra` (`PUT /api/v1/repos/viktor/infra/collaborators/<login>`).
2. Mint a PAT — the admin REST endpoint 404s here, use the in-pod CLI: `kubectl -n forgejo exec deploy/forgejo -- su -s /bin/sh git -c "forgejo admin user generate-access-token --username <login> --token-name devvm-infra-git --scopes 'write:repository'"`.
3. Install it in their `~/.git-credentials` (`https://<login>:<token>@forgejo.viktorbarzin.me`, mode 600) + `git config --global credential.helper store`, set `user.name`/`user.email`.
4. The reconcile wires the clone side automatically (`wire_forgejo_remote`): `forgejo` remote + `master` tracking `forgejo/master` on every non-admin infra clone (origin stays the anonymous GitHub mirror). No manual step since 2026-06-10.
5. (Optional — Viktor's call per user) Grant direct master push: add their login to the `master` branch-protection push + merge whitelists (`PATCH /api/v1/repos/viktor/infra/branch_protections/master`). Done for `ebarzin` 2026-06-10.
6. Verify: branch push succeeds; a `master` push succeeds for whitelisted users and is rejected with `Not allowed to push to protected branch` otherwise.
**Web-terminal session persistence (2026-06-10):** the tmux-based web terminal's named sessions (each running one Claude conversation) survive devvm reboots — `tmux-persist-save.timer` (5-min) snapshots every terminal user's sessions (name, cwd, conversation uuid from argv or the cwd-slug transcript dir) to `/var/lib/tmux-persist/<user>.tsv`, and `tmux-persist-restore.service` recreates missing sessions at boot with `claude --resume <uuid>` (per-session idempotent; also handles partial loss). The web terminal also exposes an **on-demand "Restore sessions" button** (terminal-lobby: `tmux-api` `POST /restore` → the validated root `tmux-restore-user` wrapper → `tmux-persist restore <user>`, a single-user mode of the same script): the boot-only restore service never fires when an **OOM kills a user's tmux server *without* a reboot** (the common case under multi-user memory pressure), so the button covers that gap. This is a **tmux/terminal-surface** feature, deliberately outside the t3 namespace: the t3 chat surface persists its own threads (`~/.t3` state, plus the daily `t3-backup-state` dump), and Claude conversations themselves were always durable (`~/.claude/projects/`) — what this adds is the volatile tmux wiring.
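The per-session idempotent restore can be sketched as a diff between the snapshot and the live tmux server. The column order (name, cwd, conversation uuid) follows the description above, but the `restore_commands` helper and the exact `tmux new-session` invocation are assumptions; the real `tmux-persist` script may build its commands differently.

```python
import csv
import io
import shlex


def restore_commands(tsv_text, live_sessions):
    """Build tmux commands for snapshot sessions that are not currently running."""
    commands = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) != 3:
            continue  # skip malformed snapshot rows
        name, cwd, uuid = row
        if name in live_sessions:
            continue  # idempotent: never touch a session that already exists
        shell_command = f"claude --resume {shlex.quote(uuid)}"
        commands.append(
            ["tmux", "new-session", "-d", "-s", name, "-c", cwd, shell_command]
        )
    return commands
```

Because only missing sessions are recreated, the same logic serves both the boot-time restore and the on-demand button after a partial loss.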
**Status (2026-06-20):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), the Authentik `T3 Users` edge gate, **the emo Phase-5 cutover (own clone + launcher repoint + `code-shared` removal, completed 2026-06-10) and emo's contribute access (`ebarzin` write collaborator + PAT + protected `master`)**, **per-user `code_layout` with the ancamilea workspace cutover**, per-user playwright browser MCP, and per-user Claude OAuth renewal/Vault recovery. Per the live `/etc/skel` design, non-admin `~/.claude/{rules,skills}` symlinks into the admin base are **kept**. **Remaining (held / future):** the offboarding apply-side (Phase 7), per-user `ha`/`claude_memory`/beads credential injection, and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning.
## Related


@ -4,7 +4,7 @@ Last updated: 2026-04-19 (WS E — Kea DHCP pushes dual DNS per subnet; Kea DDNS
## Overview
The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a comprehensive middleware chain including CrowdSec bot protection, Authentik forward-auth, and rate limiting. All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs.
The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a middleware chain of anti-AI bot-blocking, Authentik forward-auth, rate limiting, and retry. CrowdSec IP-reputation enforcement is **out-of-band** (not a Traefik hop): banned IPs are dropped in-kernel via nftables on direct hosts and blocked at the Cloudflare edge on proxied hosts (see `docs/architecture/security.md`). All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs.
## Architecture Diagram
@ -16,12 +16,14 @@ graph TB
Traefik[Traefik Ingress<br/>3 replicas + PDB]
subgraph "Middleware Chain"
CS[CrowdSec Bouncer<br/>fail-open]
AntiAI[Anti-AI bot-block<br/>fail-open]
Auth[Authentik Forward-Auth<br/>3 replicas + PDB]
RL[Rate Limiter<br/>429 response]
Retry[Retry<br/>2 attempts, 100ms]
end
CSdrop[CrowdSec drop<br/>nftables / CF edge<br/>out-of-band, pre-Traefik]
subgraph "Proxmox Host (eno1)"
vmbr0[vmbr0 Bridge<br/>192.168.1.127/24]
vmbr1[vmbr1 Internal<br/>VLAN-aware]
@ -53,8 +55,9 @@ graph TB
Internet -->|DNS query| CF
CF -->|CNAME to tunnel| CFD
CFD --> Traefik
Traefik --> CS
CS --> Auth
CSdrop -.->|banned IPs dropped before Traefik| Traefik
Traefik --> AntiAI
AntiAI --> Auth
Auth --> RL
RL --> Retry
Retry --> Service
@ -82,7 +85,7 @@ graph TB
| Cloudflare DNS | SaaS | External | ~50 public domains under viktorbarzin.me |
| Cloudflared | Container | K8s (3 replicas) | Tunnel ingress, replaces port forwarding |
| Traefik | Helm chart | K8s (3 replicas + PDB) | Ingress controller, HTTP/3 enabled |
| CrowdSec | Helm chart | K8s (LAPI: 3 replicas) | Bot protection, fail-open bouncer |
| CrowdSec | Helm chart | K8s (LAPI: 3 replicas) | IP reputation. Out-of-band enforcement: `cs-firewall-bouncer` DaemonSet (in-kernel nftables drop, direct hosts) + Cloudflare edge WAF rule (proxied hosts). Fail-open |
| Authentik | Helm chart | K8s (3 replicas + PDB) | SSO, forward-auth middleware |
| MetalLB | v0.15.3 Helm chart | K8s | LoadBalancer IPs (10.0.20.200-10.0.20.220), all services on 10.0.20.200 |
| Registry Cache | Container | 10.0.20.10 | Pull-through for docker.io:5000, ghcr.io:5010 |
@ -208,24 +211,31 @@ VMs tag traffic on vmbr1 to isolate workloads. pfSense bridges VLAN 20 to the up
### Ingress Flow
CrowdSec is **not** a step in this chain — banned IPs are dropped before the
request ever reaches Traefik (Cloudflare edge WAF rule on proxied hosts; host
nftables on direct hosts). The flow below is for a request that survives that
out-of-band gate.
```mermaid
sequenceDiagram
participant Client
participant Cloudflare
participant CFedge as Cloudflare (edge WAF: crowdsec_ban block)
participant Cloudflared
participant Traefik
participant CrowdSec
participant AntiAI
participant Authentik
participant RateLimit
participant Retry
participant Service
participant Pod
Client->>Cloudflare: HTTPS request to blog.viktorbarzin.me
Cloudflare->>Cloudflared: Forward via tunnel (QUIC)
Client->>CFedge: HTTPS request to blog.viktorbarzin.me
Note over CFedge: banned IP → blocked here (proxied hosts)
CFedge->>Cloudflared: Forward via tunnel (QUIC)
Cloudflared->>Traefik: HTTP to LoadBalancer IP
Traefik->>CrowdSec: Apply bouncer middleware
CrowdSec->>Authentik: If allowed, check auth (protected=true)
Note over Traefik: on direct hosts, banned IPs already dropped in-kernel (nftables forward hook)
Traefik->>AntiAI: anti-AI bot-block (fail-open)
AntiAI->>Authentik: If allowed, check auth (protected=true)
Authentik->>RateLimit: If authenticated, check rate limit
RateLimit->>Retry: If within limit, continue
Retry->>Service: Forward to Service
@ -234,24 +244,27 @@ sequenceDiagram
Service-->>Retry: Response
Retry-->>RateLimit: Response
RateLimit-->>Authentik: Response (strip auth headers)
Authentik-->>CrowdSec: Response
CrowdSec-->>Traefik: Response
Authentik-->>AntiAI: Response
AntiAI-->>Traefik: Response
Traefik-->>Cloudflared: Response
Cloudflared-->>Cloudflare: Response via tunnel
Cloudflare-->>Client: HTTPS response
Cloudflared-->>CFedge: Response via tunnel
CFedge-->>Client: HTTPS response
```
### Middleware Chain
Every ingress created by the `ingress_factory` module follows this chain:
CrowdSec IP-reputation enforcement is **not** in this chain — it is out-of-band
(host nftables on direct hosts; the Cloudflare edge WAF `crowdsec_ban` rule on
proxied hosts), so banned IPs never reach the chain and there is no per-request
CrowdSec hop. Every ingress created by the `ingress_factory` module follows this
Traefik chain:
1. **CrowdSec Bouncer**: Checks IP against threat database. **Fail-open** mode — if LAPI is unreachable, traffic passes through to prevent outages.
1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`).
2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend.
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default limits are generous; services like Immich and Nextcloud have higher custom limits.
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads) and ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load).
4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors).
Additional middleware:
- **Anti-AI**: On by default via `ingress_factory`. Blocks common AI crawler user-agents.
- **HTTP/3 (QUIC)**: Enabled globally on Traefik.
### Entrypoint Transport Timeouts
@ -348,10 +361,10 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
| pfSense | `stacks/pfsense/` | VM + cloud-init config |
| Technitium | `stacks/technitium/` | Deployment, Service, PVC |
| Traefik | `stacks/platform/` (sub-module) | Helm release, IngressRoute CRDs |
| CrowdSec | `stacks/platform/` (sub-module) | Helm release, LAPI + bouncer |
| CrowdSec | `stacks/crowdsec/` (+ edge in `stacks/rybbit/`) | Helm release, LAPI + agent; `cs-firewall-bouncer` DaemonSet (nftables, direct hosts) + Cloudflare edge sync (proxied hosts) |
| Authentik | `stacks/authentik/` | Helm release, ingress, OIDC configs |
| MetalLB | `stacks/platform/` (sub-module) | Helm release, IPAddressPool |
| Cloudflared | `stacks/cloudflared/` | Deployment (3 replicas), tunnel config |
| Cloudflared | `stacks/cloudflared/` | Deployment (3 replicas), tunnel config; runs `--no-autoupdate` (in-place self-updates exited the pods and severed all tunnel WebSockets, 2026-06-09/10) |
| ingress_factory | `modules/ingress_factory/` | IngressRoute + middleware chain |
### Key Configuration Files
@ -436,13 +449,30 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
**Decision**: Technitium handles internal `.lan` domains with near-zero latency. Cloudflare handles public domains with global DNS. K8s nodes use Technitium as primary, which forwards non-.lan queries to Cloudflare.
### Why Fail-Open on CrowdSec Bouncer?
### Why CrowdSec Enforcement Is Out-of-Band (and Fails Open)
**Alternatives considered**:
1. **Fail-closed**: Maximum security, but LAPI downtime blocks all traffic.
2. **Redundant LAPI**: Already scaled to 3 replicas, but resource pressure can still cause outages.
CrowdSec used to enforce inline as a Traefik middleware (the
`crowdsec-bouncer-traefik-plugin`). On Traefik 3.7.5 the Yaegi plugin handler was
never invoked, so it enforced nothing; the plugin was removed and enforcement
moved off the request path entirely (full history in
`docs/architecture/security.md`). It now runs on two surfaces:
**Decision**: Availability > strict bot blocking. CrowdSec LAPI is scaled to 3 replicas for resilience, but during cluster-wide resource exhaustion (e.g., memory pressure), bouncer falls back to allowing traffic. This prevents a complete service outage due to a security add-on.
- **Direct hosts** → `cs-firewall-bouncer` DaemonSet drops banned IPs in the host
nftables, in **both the `input` and `forward` hooks**. The `forward` hook is
the load-bearing one: with Traefik on a dedicated LB IP at
`externalTrafficPolicy=Local`, client packets are DNAT'd to the Traefik **pod**
and transit the node's `forward` chain (not `input`) — which is exactly why the
ingress must preserve the **real client IP** end-to-end (ETP=Local + PROXY-v2
for IPv6; see the Traefik LB IP and IPv6 ingress notes above). Without the real
client IP the firewall-bouncer (and the CF edge rule) would have nothing to
match on.
- **Proxied hosts** → a Cloudflare edge WAF rule (`ip.src in $crowdsec_ban`) fed
by the `crowdsec-cf-sync` CronJob.
Both **fail open**: if LAPI is unreachable, the firewall-bouncer simply stops
receiving new decisions (existing drops persist) and the CF sync skips a run —
neither ever blocks legitimate traffic. Availability > strict bot blocking, and
out-of-band enforcement adds **zero per-request latency** (no Traefik hop).
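
The fail-open behaviour described above can be sketched as follows. This is a minimal illustration, not the bouncer's actual code: `reconcile_fail_open`, `lapi_down`, and the sample IPs are hypothetical names chosen for the example, and the real `cs-firewall-bouncer` streams decision deltas rather than fetching a full set.

```python
from typing import Callable, Iterable, Set


def reconcile_fail_open(current: Set[str], fetch: Callable[[], Iterable[str]]) -> Set[str]:
    """Return the ban set to enforce after one sync attempt (hypothetical helper)."""
    try:
        # Normal case: LAPI answers, so the enforced set follows its decisions.
        return set(fetch())
    except OSError:
        # LAPI unreachable: keep the drops already programmed and add nothing new.
        # Nothing outside the existing set is ever blocked by the outage itself.
        return set(current)


def lapi_down() -> Iterable[str]:
    """Stand-in for a LAPI call during an outage (assumption for the example)."""
    raise ConnectionError("LAPI unreachable")


enforced = {"203.0.113.7"}
enforced = reconcile_fail_open(enforced, lambda: ["203.0.113.7", "198.51.100.9"])
after_outage = reconcile_fail_open(enforced, lapi_down)
```

In this sketch an outage leaves `after_outage` identical to the last known set: existing bans persist, no new ones arrive, and no legitimate client is affected.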
### Why HTTP/3 (QUIC)?
@ -473,9 +503,10 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
**Symptoms**: All ingress routes return 503, Traefik dashboard shows no backends available.
**Diagnosis**: Middleware chain is blocking traffic. Check:
1. Authentik status: `kubectl get pod -n authentik`
2. CrowdSec LAPI status: `kubectl get pod -n crowdsec`
**Diagnosis**: Middleware chain is blocking traffic. (CrowdSec is **not** in the
chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Check:
1. Authentik status: `kubectl get pod -n authentik` (ForwardAuth fails closed if the auth server is unreachable)
2. `bot-block-proxy` status: `kubectl get pod -n traefik -l app=bot-block-proxy` (anti-AI ForwardAuth target — also fails closed if down)
3. Traefik logs: `kubectl logs -n kube-system deploy/traefik`
**Fix**: If Authentik is down and ingress uses forward-auth, pods won't pass health checks. Scale Authentik to 3 replicas or temporarily disable forward-auth middleware.
@ -515,11 +546,11 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
### Rate Limiter Blocks Legitimate Traffic
**Symptoms**: Users report 429 errors during normal usage (e.g., Immich uploads).
**Symptoms**: Users report 429 errors during normal usage (e.g., Immich uploads, ActualBudget's "Server returned an error while checking its status" boot screen).
**Diagnosis**: Check Traefik middleware config for the affected IngressRoute.
**Fix**: Increase rate limit in `ingress_factory` module. Default is 100 req/min per IP. Immich and Nextcloud use 500 req/min.
**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300.
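
To see why the shared default stalls a bursty client, the limits above can be modelled with a small token bucket. This is a sketch under the assumption that the limiter behaves like a plain token bucket (capacity equal to `burst`, refilled at `average` tokens per second); `TokenBucket` and `rejected` are hypothetical names, and the 70 parallel requests are the ActualBudget boot figure quoted in this document.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    average: float       # tokens refilled per second
    burst: int           # bucket capacity
    tokens: float = 0.0  # current fill level, set to `burst` on creation
    last: float = 0.0    # timestamp of the previous request, in seconds

    def __post_init__(self) -> None:
        # A fresh bucket starts full, so a client may spend the whole burst at once.
        self.tokens = float(self.burst)

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.average)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would answer 429 Too Many Requests


def rejected(average: float, burst: int, parallel: int) -> int:
    """Count requests rejected when `parallel` requests arrive at the same instant."""
    bucket = TokenBucket(average, burst)
    return sum(0 if bucket.allow(now=0.0) else 1 for _ in range(parallel))


default_rejects = rejected(average=10, burst=50, parallel=70)    # shared default
dedicated_rejects = rejected(average=50, burst=300, parallel=70)  # actualbudget-rate-limit
```

Under these assumptions the shared default rejects the last 20 of 70 simultaneous requests, while the dedicated 50/300 middleware admits all of them. That matches the symptom of the tail of the boot requests receiving 429s.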
### Large Downloads or Uploads Truncate / Fail Partway


@ -2,40 +2,50 @@
## Overview
The homelab implements defense-in-depth security at the application layer (L7) using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). All security components operate in graceful degradation mode (fail-open) to prevent cascading failures. Security policies are deployed in audit mode first, then selectively enforced after validation.
The homelab implements defense-in-depth security using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). CrowdSec enforcement is **out-of-band** (not a per-request Traefik hop — see the CrowdSec section): banned IPs are dropped in-kernel via nftables on direct hosts, and blocked at the Cloudflare edge on proxied hosts, so enforcement adds **zero per-request latency**. All security components fail open (a CrowdSec outage stops new bans but never blocks legitimate traffic). Security policies are deployed in audit mode first, then selectively enforced after validation.
## Architecture Diagram
CrowdSec enforcement is out-of-band (NOT an inline Traefik middleware hop). The
Traefik request chain is anti-AI → Authentik ForwardAuth → rate-limit → retry;
CrowdSec drops banned IPs *before* (direct hosts) or *off* (proxied hosts) that
chain entirely.
```mermaid
graph LR
graph TB
Internet[Internet]
CF[Cloudflare WAF]
subgraph "Proxied hosts (orange-cloud)"
CFedge[Cloudflare edge<br/>WAF rule: ip.src in $crowdsec_ban → block]
end
subgraph "Direct hosts (grey-cloud / internal)"
NFT[Host nftables<br/>table crowdsec/crowdsec6<br/>drop in input + forward]
end
Tunnel[Cloudflared Tunnel]
CrowdSec[CrowdSec Bouncer<br/>Traefik Plugin]
AntiAI[Anti-AI Check<br/>poison-fountain]
ForwardAuth[Authentik ForwardAuth]
RateLimit[Rate Limit Middleware]
Retry[Retry Middleware<br/>2 attempts, 100ms]
Traefik[Traefik<br/>anti-AI → Authentik → rate-limit → retry]
Backend[Backend Service]
LAPI[CrowdSec LAPI<br/>3 replicas]
Agent[CrowdSec Agent]
Agent[CrowdSec Agent<br/>parses Traefik logs]
FWB[cs-firewall-bouncer<br/>DaemonSet, every node]
CFsync[crowdsec-cf-sync<br/>CronJob, every 2 min]
Internet -->|1| CF
CF -->|2| Tunnel
Tunnel -->|3| CrowdSec
CrowdSec -.->|Query| LAPI
Agent -.->|Report| LAPI
CrowdSec -->|4. Pass/Block| AntiAI
AntiAI -->|5. Human/Bot| ForwardAuth
ForwardAuth -->|6. Authenticated| RateLimit
RateLimit -->|7. Under Limit| Retry
Retry -->|8. Success/Retry| Backend
Internet -->|proxied| CFedge
Internet -->|direct| NFT
CFedge -->|allowed| Tunnel
Tunnel --> Traefik
NFT -->|allowed| Traefik
Traefik --> Backend
style CrowdSec fill:#f9f,stroke:#333
style AntiAI fill:#ff9,stroke:#333
style ForwardAuth fill:#9f9,stroke:#333
style RateLimit fill:#99f,stroke:#333
Agent -.->|report| LAPI
LAPI -.->|all decisions incl. CAPI| FWB
FWB -.->|program drop rules| NFT
LAPI -.->|ban/captcha decisions, CAPI excluded| CFsync
CFsync -.->|push IP list| CFedge
style CFedge fill:#f9f,stroke:#333
style NFT fill:#f9f,stroke:#333
```
## Components
@ -44,7 +54,8 @@ graph LR
|-----------|---------|----------|---------|
| CrowdSec LAPI | Pinned | `stacks/crowdsec/` | Local API, threat intelligence aggregation (3 replicas) |
| CrowdSec Agent | Pinned | `stacks/crowdsec/` | Log parser, scenario detection |
| CrowdSec Traefik Bouncer | Plugin | Traefik config | Plugin-based IP reputation check |
| cs-firewall-bouncer | v0.0.34 | `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf` | In-kernel nftables drop on every node (DIRECT hosts). Bouncer key `firewall` |
| crowdsec-cf-sync | — | `stacks/rybbit/crowdsec_edge.tf` | LAPI→Cloudflare-IP-List sync CronJob (PROXIED hosts). Bouncer key `kvsync` |
| Kyverno | Pinned chart | `stacks/kyverno/` | Policy engine for K8s admission control |
| poison-fountain | Latest | `stacks/poison-fountain/` | Anti-AI bot detection and tarpit service |
| cert-manager/certbot | - | `stacks/cert-manager/` | TLS certificate management |
@ -54,11 +65,15 @@ graph LR
### Request Security Layers
Every incoming request passes through 6 security layers:
CrowdSec IP-reputation enforcement happens **before** a request reaches the
Traefik chain (banned IPs are dropped in-kernel on direct hosts, or blocked at
the Cloudflare edge on proxied hosts — see CrowdSec Threat Intelligence below).
A request that survives that out-of-band gate then passes through the Traefik
middleware chain:
1. **Cloudflare WAF** - DDoS protection, bot detection, firewall rules (external)
2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP
3. **CrowdSec Bouncer** - IP reputation check against LAPI (fail-open on error)
1. **Cloudflare WAF / edge** - DDoS protection, bot detection, firewall rules incl. the CrowdSec `crowdsec_ban` block rule (proxied hosts only)
2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP (proxied hosts)
3. **CrowdSec out-of-band drop** - nftables on direct hosts; *not* a Traefik hop (zero per-request latency)
4. **Anti-AI Scraping** - 3-layer bot defense (optional per service, updated 2026-04-17)
5. **Authentik ForwardAuth** - Authentication check (if `protected = true`)
6. **Rate Limiting** - Per-source IP rate limits (returns 429 on breach)
@ -80,11 +95,71 @@ CrowdSec operates in a hub-and-agent model:
- Reports malicious IPs to LAPI
- Shares threat intel with CrowdSec community (anonymized)
**Traefik Bouncer Plugin**:
- Integrated as Traefik middleware
- Queries LAPI for IP reputation on each request
- **Fail-open mode**: If LAPI unreachable, allows traffic (graceful degradation)
- Blocks IPs on ban list, allows others
Enforcement is split across **two out-of-band surfaces**, neither of which adds
any per-request latency. (See "Why the Traefik bouncer plugin was removed" below
for the supersession history — there is no longer an inline Traefik bouncer.)
**Surface 1 — DIRECT (non-Cloudflare-proxied) hosts → in-kernel nftables drop**
(`cs-firewall-bouncer` DaemonSet, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`):
- Runs on **every node** (no nodeSelector). Programs the HOST nftables — `table ip
crowdsec` / `table ip6 crowdsec6` — with drop rules in **both the `input` AND
the `forward` hooks**. The `forward` hook is required because Traefik is a
LoadBalancer with `externalTrafficPolicy=Local`: client traffic is DNAT'd to the
Traefik **pod** and transits the node's `forward` hook (not `input`) with the
real client IP preserved. Chains use `policy accept` (only set members drop —
it can never blackhole normal traffic).
- Pulls **all** decisions from LAPI, **including the CAPI community blocklist
(~31k IPs)**. Packets from banned IPs are dropped **in-kernel before reaching
Traefik** → zero per-request hops, no Traefik involvement at all.
- **Packaging**: cs-firewall-bouncer publishes no container image, so the
**v0.0.34** static binary is fetched at runtime by an initContainer onto a
`debian:bookworm-slim` runtime container. Needs `hostNetwork` +
`NET_ADMIN`/`NET_RAW` to talk netlink directly. Registered bouncer key:
**`firewall`**.
- **Fail-open**: if LAPI is unreachable it just stops receiving new decisions
(existing drop rules persist); it never blocks legitimate traffic.
**Surface 2 — PROXIED (Cloudflare orange-cloud) hosts → Cloudflare edge block**
(`stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py`):
- Proxied hosts terminate at the Cloudflare edge, so a host-level nftables drop
would never see them. Enforcement is instead a single Cloudflare Rules List
**`crowdsec_ban`** + a zone-scoped WAF custom rule `(ip.src in $crowdsec_ban)`
**block** action, which covers every proxied host in the zone.
- Fed by the **`crowdsec-cf-sync` CronJob** (namespace `rybbit`, every 2 min,
pure-stdlib Python in a ConfigMap). It pulls local **ban/captcha ip-scoped**
decisions and pushes them into the CF list, but **EXCLUDES the ~31k CAPI
community blocklist** — that set is far too large for a CF Rules List (the CF
account hard-limits to **one** list), and CAPI is already covered in-kernel on
direct hosts and by Cloudflare's own managed protections on proxied hosts.
Registered bouncer key: **`kvsync`**.
- **Block-only**: the single-list limit precludes a separate
captcha/managed-challenge list, so both ban and captcha decisions are enforced
as a plain block at the edge.
- **Auth carve-out:** the WAF rule excludes `authentik.viktorbarzin.me` +
`public-auth.viktorbarzin.me` (`… and not (http.host in {…})`). A CrowdSec hit
must never wall a user out of the login / WebAuthn flow they authenticate
through; auth keeps `traefik-rate-limit` for brute-force protection.
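
The selection rule for the edge list (local ban/captcha decisions scoped to a single IP, with CAPI excluded) can be sketched like this. It is not the contents of `lapi_kv_sync.py`: `edge_ips` is a hypothetical helper, and the `origin`, `scope`, `type`, and `value` keys are assumed to mirror the fields of a LAPI decision.

```python
from typing import Dict, List

# Remediation types that are pushed to the edge (both end up as a plain block).
EDGE_TYPES = {"ban", "captcha"}


def edge_ips(decisions: List[Dict[str, str]]) -> List[str]:
    """Pick the IPs that belong in the Cloudflare list (hypothetical helper)."""
    ips = {
        d["value"]
        for d in decisions
        if d.get("origin") != "CAPI"             # skip the community blocklist
        and d.get("scope", "").lower() == "ip"   # single IPs only, no ranges
        and d.get("type") in EDGE_TYPES
    }
    return sorted(ips)


sample = [
    {"origin": "crowdsec", "scope": "Ip", "type": "ban", "value": "203.0.113.7"},
    {"origin": "cscli", "scope": "Ip", "type": "captcha", "value": "198.51.100.9"},
    {"origin": "CAPI", "scope": "Ip", "type": "ban", "value": "192.0.2.1"},
    {"origin": "crowdsec", "scope": "Range", "type": "ban", "value": "192.0.2.0/24"},
]
selected = edge_ips(sample)
```

Only the two locally generated, IP-scoped decisions survive the filter; the CAPI entry and the range-scoped entry are left to the other enforcement surface.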
**Whitelist** (`stacks/crowdsec/whitelist.yaml`): a CrowdSec whitelist covers
RFC1918 + the tailnet + internal CIDRs (plus one specific external IP), so
internal users are never enforced. Internal access uses split-horizon DNS
straight to Traefik, and direct internal clients are RFC1918 — both whitelisted.
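
A quick way to reason about whether a given client falls under the whitelist is a CIDR membership check. The sketch below is an illustration only: the three RFC1918 ranges are standard, but `100.64.0.0/10` as the tailnet range is an assumption, and the real `whitelist.yaml` also lists internal CIDRs and one external IP that are not reproduced here.

```python
import ipaddress

# RFC1918 private ranges plus an assumed tailnet (CGNAT) range.
WHITELIST = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "100.64.0.0/10")
]


def is_whitelisted(ip: str) -> bool:
    """Return True if `ip` falls inside any whitelisted network (hypothetical helper)."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in WHITELIST)


internal_client = is_whitelisted("192.168.1.127")
external_client = is_whitelisted("203.0.113.7")
```

An internal client resolves to a whitelisted range and is never enforced, while an external address is still subject to bans.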
#### Why the Traefik bouncer plugin was removed
Enforcement used to run as an inline Traefik middleware — the
`crowdsec-bouncer-traefik-plugin` (a Go plugin interpreted by Yaegi), which queried LAPI on every
request and could serve a Cloudflare Turnstile captcha for soft remediations.
On **Traefik 3.7.5 the Yaegi handler was never invoked**, so the bouncer was
registered but enforced **nothing** despite appearing healthy. Rather than chase
the Yaegi runtime, the whole plugin path was **removed** (2026-06): the plugin
static config + initContainer download, the `crowdsec` Middleware CRD, the
`captcha.html` template + its ConfigMap and volume mount, and the Cloudflare
Turnstile widget (`cloudflare_turnstile_widget.crowdsec_captcha`). It was
replaced by the two out-of-band surfaces above, which add zero per-request
latency and fail open. (The earlier `crowdsec-cf-sync` cursor-pagination /
IP-List-capacity issues are also moot now that CAPI is excluded from the edge
list and dropped in-kernel instead.)
**Metabase** (disabled by default):
- Dashboard for CrowdSec analytics
@ -189,7 +264,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
| W1.2 Vault `x_forwarded_for_authorized_addrs = 10.10.0.0/16` | **LIVE** — applied via `tg apply -target=helm_release.vault` on 2026-05-18; all 3 vault pods restarted cleanly |
| W1.2 Vault audit log shipping to Loki | **LIVE**: `audit-tail` sidecar in vault pods + Alloy DaemonSet ships to Loki with `container="audit-tail"`. Verified via `{namespace="vault",container="audit-tail"}` LogQL query. |
| W1.1 K8s API audit policy + shipping | **LIVE** — kube-apiserver audit policy was already configured (Metadata level, `/var/log/kubernetes/audit.log`, 7d retention). Alloy DaemonSet now tolerates control-plane taint, scrapes the audit log file, ships to Loki with `job=kubernetes-audit`. K2-K9 alert rules in Loki ruler. |
| W1.3 Source-IP anomaly rules (K9, V7, S1) | **LIVE** (K9, V7); **S1 PENDING** — fires once promtail/Alloy on PVE host ships sshd journal with `job=sshd-pve`. |
| W1.3 Source-IP anomaly rules (K9, V7, S1) | **LIVE** (K9, V7, S1). **S1 activated 2026-06-10** — promtail on the PVE host now ships the journal to Loki (`scripts/pve-promtail.yaml`); sshd auth lands as `job=sshd-pve` (the S1 data source). The same shipper carries snoopy `execve()` command audit as `{job="pve-journal", identifier="snoopy"}` (forensic, not alerting). Deployed because emo's agent was given root SSH to the host (shared key) — see `docs/architecture/monitoring.md` → "External host: pve". |
| W1.4 Kyverno security policies → Enforce | **LIVE** — 3 policies in Enforce mode with 35-namespace exclude list. |
| W1.5 Kyverno trusted-registries → Enforce | **LIVE** — explicit allowlist (15 registries + 6 DockerHub library bare names + 56 DockerHub user repos). Verified by admission dry-run: `evilcorp.example/malware:v1` BLOCKED, `alpine:3.20` and `docker.io/library/alpine:3.20` ALLOWED. |
| W1.6 Calico observe-phase (pilot: recruiter-responder) | **LIVE** (2026-05-19) — GlobalNetworkPolicy `wave1-egress-observe-recruiter-responder` with rules `[action:Log, action:Allow]`. FelixConfiguration.flowLogsFileEnabled approach abandoned (Calico Enterprise-only field, rejected by OSS v3.26). Log action emits iptables LOG with prefix `calico-packet: ` → kernel → journald → Alloy → Loki. Verified: `{job="node-journal"} \|~ "calico-packet"` returns real packet metadata (SRC/DST/PROTO). Expand to more namespaces by adding to `namespaceSelector`. |
@ -205,7 +280,7 @@ Response model: **(I) Slack-only, daily skim.** All security alerts land in a ne
|---|---|---|---|
| K8s API audit log | Custom audit policy on kube-apiserver: drop `get`/`list`/`watch` at `None` for most resources, log writes at `Metadata`, secret reads at `Metadata`, `exec`/`portforward` at `RequestResponse`, exclude kubelet+controller-manager noise. Codified in `stacks/infra` kubeadm config templating. | Alloy DaemonSet tails `/var/log/kubernetes/audit/*.log` | `job=kube-audit` |
| Vault audit log | `file` audit device on existing Vault PVC. Vault listener config sets `x_forwarded_for_authorized_addrs` trusting Traefik pod CIDR so `remote_addr` is the real client IP, not Traefik's. | Alloy tails audit log file | `job=vault-audit` |
| PVE sshd auth log | journald `_SYSTEMD_UNIT=ssh.service` | promtail systemd unit on Proxmox host (192.168.1.127) | `job=sshd-pve` |
| PVE sshd auth log | journald (`_SYSTEMD_UNIT=ssh.service`, `SYSLOG_IDENTIFIER=sshd-session`); promtail relabels `identifier=~"sshd.*"` → `job=sshd-pve` | promtail systemd unit on Proxmox host (192.168.1.127), `scripts/pve-promtail.yaml`; **LIVE 2026-06-10** | `job=sshd-pve` |
| Calico flow log | `flowLogsFileEnabled: true` in Calico Felix config | Alloy (cluster-wide) | `job=calico-flow` (W1.6 only) |
#### Alert rules (16 total)
@ -255,6 +330,10 @@ Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same
**Policy: no public-IP access ever.** Vault, kube-apiserver, PVE sshd must transit a trusted LAN or Headscale. Anything else fires an alert.
**Documented exception: break-glass SSH (2026-06-11).** One deliberate carve-out. The Proxmox host's sshd listens on a WAN-exposed `:52222` (edge-router forward), **key-only**, trusting only a dedicated break-glass key (`Match LocalPort` → `authorized_keys.breakglass`), rate-limited (iptables hashlimit) + fail2ban. It is intentionally reachable from the public internet so that it survives a cluster/tunnel outage with no dependency on the cluster, which is the one case the "must transit LAN/Headscale" rule cannot serve. Key-only auth rules out password brute-forcing; the trade-off is Shodan visibility. As-built: `docs/runbooks/breakglass-ssh.md`; rationale: `docs/plans/2026-06-11-breakglass-ssh-redesign-design.md`. (Replaced the 2026-05-30 port-knock variant, which was non-scannable but had a circular Vault dependency that caused a lockout.)
**Two privileged footholds for the warm break-glass UI (2026-06-12):** the in-cluster `claude-breakglass` service (`breakglass.viktorbarzin.me`, warm case = devvm wedged, cluster healthy) holds one ed25519 key (Vault `secret/claude-breakglass/ssh_key`) authorising: (1) a `breakglass` user on the **devvm** with NOPASSWD sudo (`from="10.0.20.0/24"` — the Calico-SNAT node subnet); (2) a **PVE** `authorized_keys` entry pinned to `command="/usr/local/bin/breakglass-pve",restrict,from="192.168.1.2"` (pfSense's inter-VLAN SNAT IP) that only runs the verbs `status|forensics|reset|stop|start|cycle` against VM 102. The key is reachable ONLY by the breakglass pod (own namespace, no Vault role, ESO-synced); the shared `claude-agent` pod's `terraform-state` Vault policy is explicitly DENIED `secret/claude-breakglass/*`. Reset is autonomous (the agent may fire it), forensics-first. Reachable via Authentik or the basic-auth fallback — LAN-routed, not WAN-exposed. Runbook: `docs/runbooks/breakglass-ui.md`; ADR: `claude-agent-service/docs/adr/0001-breakglass-security-architecture.md`.
#### Why no canary tokens
Original plan included canary tokens (fake K8s Secret, Vault KV path, PVE file, sinkhole hostname). Rejected because Viktor routinely greps `secret/viktor` (135 keys) and lists `kubectl get secret -A` — any read-trigger canary self-fires. Use-based canaries (zero-RBAC SA tokens with audit alerts on use) were also considered but rejected in favor of cleaner source-IP anomaly detection (K9, V7) on REAL tokens — same threat model, no fake-token operational burden.
@ -326,10 +405,12 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
| Path | Purpose |
|------|---------|
| `stacks/crowdsec/` | CrowdSec LAPI, agent, bouncer config |
| `stacks/crowdsec/` | CrowdSec LAPI, agent config + `whitelist.yaml` |
| `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf` | cs-firewall-bouncer DaemonSet (in-kernel nftables drop, direct hosts) |
| `stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py` | Cloudflare IP-List + WAF block rule + LAPI→CF sync CronJob (proxied hosts) |
| `stacks/kyverno/` | Kyverno deployment + policies |
| `stacks/poison-fountain/` | Anti-AI service + CronJob |
| `stacks/platform/modules/traefik/middleware.tf` | Security middleware definitions |
| `stacks/traefik/modules/traefik/middleware.tf` | Security middleware definitions (no longer includes a CrowdSec bouncer) |
| `stacks/platform/modules/ingress_factory/` | Per-service security toggles |
### Vault Paths
@ -439,7 +520,11 @@ spec:
**Fix**:
1. Check LAPI decisions: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions list`
2. Remove ban: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions delete --ip <IP>`
3. Whitelist if needed: Add to `stacks/crowdsec/whitelist.yaml`
— the in-kernel drop clears as soon as `cs-firewall-bouncer` reconciles (direct
hosts); for proxied hosts the `crowdsec-cf-sync` CronJob removes it from the
`crowdsec_ban` CF list within ~2 min.
3. Whitelist if needed: Add to `stacks/crowdsec/whitelist.yaml` (RFC1918 + tailnet
+ internal CIDRs are already whitelisted, so internal clients are never banned).
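
The way a deleted decision disappears from the `crowdsec_ban` list on the next sync run can be pictured as a set difference between what LAPI currently holds and what the Cloudflare list contains. This is a sketch of that idea, not the CronJob's code; `plan_sync` and the sample addresses are hypothetical.

```python
from typing import Iterable, List, Tuple


def plan_sync(lapi_ips: Iterable[str], cf_list_ips: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (to_add, to_remove) so the edge list ends up matching LAPI."""
    lapi = set(lapi_ips)
    edge = set(cf_list_ips)
    to_add = sorted(lapi - edge)     # newly banned, not yet at the edge
    to_remove = sorted(edge - lapi)  # unbanned in LAPI, still blocked at the edge
    return to_add, to_remove


# After `cscli decisions delete --ip 203.0.113.7`, LAPI no longer lists that IP.
to_add, to_remove = plan_sync(
    lapi_ips=["198.51.100.9"],
    cf_list_ips=["198.51.100.9", "203.0.113.7"],
)
```

Here the deleted IP shows up in `to_remove`, so the next scheduled run would lift the edge block without any manual Cloudflare change.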
### Kyverno Policy Blocking Deployment


@ -17,7 +17,7 @@ All services storing sensitive data were migrated to `proxmox-lvm-encrypted` on
- **HDD NFS**: `/srv/nfs` on ext4 LV `pve/nfs-data` (4TB) — bulk media and backup targets
- **SSD NFS**: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB) — high-performance data (Immich ML)
Both `StorageClass: nfs-truenas` and `StorageClass: nfs-proxmox` point to the Proxmox host and are functionally identical. The `nfs-truenas` name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster.
`StorageClass: nfs-truenas` is the **only** NFS StorageClass and points to the Proxmox host. The name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster. (A short-lived parallel `nfs-proxmox` StorageClass was removed on 2026-04-25, commit 484b4c71, during the vault NFS-hostile migration.)
**Backup storage (sda)**: 1.1TB RAID1 SAS disk, VG `backup`, LV `data` (ext4), mounted at `/mnt/backup` on PVE host. Dedicated backup disk for weekly PVC file backups, auto SQLite backups, pfSense backups, and PVE config. NFS data syncs directly to Synology via inotify change tracking (not stored on sda). Independent of live storage (sdc).
@ -47,7 +47,7 @@ graph TB
end
subgraph K8s["Kubernetes Cluster"]
CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-proxmox (+ legacy nfs-truenas)<br/>soft,timeo=30,retrans=3"]
CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-truenas (historical name)<br/>soft,timeo=30,retrans=3"]
CSI_PVE["Proxmox CSI plugin<br/>StorageClass: proxmox-lvm<br/>StorageClass: proxmox-lvm-encrypted"]
NFS_PV["NFS PersistentVolumes<br/>RWX, ~100 volumes"]
@ -85,8 +85,7 @@ graph TB
| Proxmox NFS (HDD) | LV `pve/nfs-data`, 4TB ext4 | 192.168.1.127:/srv/nfs | Bulk NFS data for all services |
| Proxmox NFS (SSD) | LV `ssd/nfs-ssd-data`, 100GB ext4 | 192.168.1.127:/srv/nfs-ssd | High-performance data (Immich ML) |
| nfs-csi | Helm chart | Namespace: nfs-csi | NFS CSI driver |
| StorageClass `nfs-proxmox` | RWX, soft mount | Cluster-wide | NFS storage, points to Proxmox host |
| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | **Historical name** — functionally identical to `nfs-proxmox`, points to the Proxmox host. Kept because SC names are immutable on 48 bound PVs. |
| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | The only NFS StorageClass — **historical name**, points to the Proxmox host. Kept because SC names are immutable on 48 bound PVs. (Sibling `nfs-proxmox` SC removed 2026-04-25, commit 484b4c71.) |
| TF module `nfs_volume` | `modules/kubernetes/nfs_volume/` | Infra repo | Static NFS PV/PVC factory |
| ~~TrueNAS VM~~ | **DECOMMISSIONED 2026-04-13** | Was VM 9000 at 10.0.10.15 | Replaced by Proxmox NFS. VM still in stopped state pending deletion. |
| ~~democratic-csi-iscsi~~ | **REMOVED** | Was namespace: iscsi-csi | Replaced by Proxmox CSI (2026-04-02) |
@ -113,7 +112,7 @@ graph TB
**Note**: Some legacy PVs still reference `/mnt/main/<service>` paths. These work via compatibility symlinks/bind-mounts on the Proxmox host. New PVs should use `/srv/nfs/<service>` or `/srv/nfs-ssd/<service>`.
**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-proxmox` StorageClass (or the legacy `nfs-truenas` for existing PVs) via PVCs.
**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-truenas` StorageClass (historical name; it points at the Proxmox host) via PVCs.
### Block Storage Flow (Proxmox CSI) — NEW
