infra

Author	SHA1	Message	Date
Viktor Barzin	e002fddede	WIP: goldmane-edge-aggregator deploy stack + vault role + ghcr allowlist (infra #58) NOT APPLIED. Staged for a fresh-session finish (see memory runbook). Contains: - stacks/goldmane-edge-aggregator/{main.tf,terragrunt.hcl}: namespace, TF-minted mTLS client cert from tigera-ca-private, goldmane_edges PG DB-init Job, db + slack ExternalSecrets, aggregate Deployment + digest CronJob. - stacks/vault/main.tf: pg-goldmane-edges static rotation role (Tier-0). - stacks/kyverno/.../ghcr-credentials.tf: ns added to the private-image allowlist. KNOWN BLOCKER: the stack uses the hashicorp/tls provider (cert minting) but the root terragrunt.hcl generate "k8s_providers" block doesn't declare it, and a second required_providers (the removed versions.tf) is illegal. FIX = add tls to that global block (mirrors proxmox/kubectl). Then apply order: db_init (creates goldmane_edges role) -> kyverno -> vault (Tier-0, plan-review) -> stack ExternalSecrets (targeted, first-apply) -> stack full -> verify mTLS to goldmane:7443. Vault KV secret/goldmane-edge-aggregator already created. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 13:01:37 +00:00
Viktor Barzin	1d0388da12	Merge remote-tracking branch 'origin/master'	2026-06-24 12:22:58 +00:00
Viktor Barzin	92361f36db	calico: enable Goldmane + Whisker (Calico 3.30 OSS flow observability) Turns on Calico 3.30's native east-west flow observability so we can see which Service talks to which (ADR-0014, issue #57). Enabled via the operator CRs directly (kubectl_manifest Goldmane + Whisker, name=default) rather than the Helm goldmane/whisker flags, because the goldmanes/whiskers CRDs already exist and this sidesteps the helm-upgrade CR-before-CRD ordering issue. Whisker notifications=Disabled so the UI doesn't call the external Tigera endpoint. Applied supervised: creating the Goldmane CR re-rendered calico-node with the FELIX_FLOWLOGSGOLDMANESERVER env (operator auto-wires Felix — no manual FelixConfiguration); calico-node rolled cleanly 7/7, tigerastatus healthy, goldmane is receiving flows from all nodes, Whisker UI serves. Durable Loki persistence is NOT included here: the Goldmane emitter is Calico Cloud/Enterprise-gated with no OSS knob to aim it at Loki (the CR can override only name+resources, not env), so a durable trail needs a small custom gRPC consumer of goldmane:7443 — tracked in issue #58. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 12:22:48 +00:00
Viktor Barzin	e711b2f971	feat(monitoring): homelab vault traceability alerts (TOTP-fetch + volume) Adds a Loki ruler group (lane=security -> #security) for the homelab vault op-log: VaultwardenTOTPFetched (every 2nd-factor fetch is visible) and VaultwardenFetchVolumeHigh (>100 fetches/10m backstop). The audit spine (Vault audit device, reads of secret/data/workstation/claude-users/*) is already captured. True CLI-bypass detection needs cross-stream correlation (follow-up).	2026-06-24 10:31:32 +00:00
Viktor Barzin	64104e56e9	feat(devvm): install Bitwarden CLI for homelab vault	2026-06-24 10:29:57 +00:00
Viktor Barzin	15643d1f44	feat(cli): bare `homelab vault` help command	2026-06-24 10:29:32 +00:00
Viktor Barzin	772aed5370	fix(cli): vault security review fixes C1 (critical): setup wrote the master password + API client_secret as `vault kv patch key=value` argv, leaking them via /proc/<pid>/cmdline to same-UID processes. Now written via stdin (key=- form); only email + client_id (non-credentials) remain in argv. I1: `get --json` refused on a TTY (was dumping the secret to scrollback). M1: vaultLock now holds the per-user flock (it mutates bw state). M4: bw login-detection parses status JSON instead of substring matching. M5: clipboard path refuses when stderr is not a TTY (was silently failing). M6: realRunner trims only trailing newline, preserving secret whitespace; secret prompts likewise. Adds security-property tests: no secret in argv across the get flow, clipboard decision matrix, --json TTY gate, bw status parsing.	2026-06-24 10:28:31 +00:00
Viktor Barzin	5a864cf19c	feat(cli): homelab vault setup onboarding (one-time, self-service)	2026-06-24 10:21:57 +00:00
Viktor Barzin	e20033855d	feat(cli): vault list/search/code/status/lock	2026-06-24 10:21:07 +00:00
Viktor Barzin	365340b37d	feat(cli): homelab vault get with TTY-aware return	2026-06-24 10:20:05 +00:00
Viktor Barzin	2dd12fc6be	feat(cli): vault session bootstrap with per-user flock + no-coredump	2026-06-24 10:18:36 +00:00
Viktor Barzin	5bae2a3907	feat(cli): privacy-aware vault op-log (process, never the secret)	2026-06-24 10:17:50 +00:00
Viktor Barzin	81122f8607	feat(cli): TTY-aware return + OSC52 clipboard with terminal gating	2026-06-24 10:17:13 +00:00
Viktor Barzin	06f4b87af1	feat(cli): vault bw engine env/arg builders + unlock	2026-06-24 10:16:19 +00:00
Viktor Barzin	cd44ca5921	feat(cli): vault creds loading from per-user Vault path	2026-06-24 10:15:32 +00:00
Viktor Barzin	6c53ee10b1	feat(cli): register homelab vault command group skeleton	2026-06-24 10:14:24 +00:00
Viktor Barzin	ae0d7984c4	docs: ADR-0014 + glossary — service identity (namespace+label) & Calico Goldmane observability Records the design reached in a /grill-with-docs session: how to track which Service talks to which as more Services are added, using k8s-native options. Decision: service identity = the workload's namespace (primary) plus a `service-identity` label only in the few multi-Service namespaces; east-west observability = Calico 3.30 Goldmane/Whisker (already in our Calico v3.30.7, currently disabled) emitting to Loki for a durable trail; enforcement reuses the existing Wave 1 egress track. Dedicated per-Service ServiceAccounts deferred and a service mesh / mTLS / SPIFFE rejected — the trust model needs attribution-grade forensics on a trusted, etcd-constrained cluster, not cryptographic non-repudiation. This is the service-mesh evaluation the 2026-04-20 infra audit flagged as missing; rejected alternatives (Retina, Hubble, Kiali, a custom Alloy enricher) are recorded with rationale. Adds glossary terms (Service identity, Goldmane / Whisker) to CONTEXT.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 10:00:36 +00:00
Viktor Barzin	0293b5c634	android-emulator: fix idle-sleeper dying with SIGPIPE before it could sleep Caught live-testing the previous commit: every sleeper run exited 141 (SIGPIPE) in ~1s with no output, never reaching the scale-down. Cause: `set -o pipefail` + `dumpsys power | awk '...; exit'` — awk closes the pipe after the first match while `kubectl exec` is still streaming dumpsys, so the exec gets SIGPIPE, pipefail makes the pipeline 141, and set -e kills the script before any echo. (My earlier dry-run missed it because it didn't run under `set -euo pipefail`.) Fix: drop pipefail; capture each exec to a var (`|| true`) then parse with awk reading to END (no early `exit`), so nothing can SIGPIPE mid-stream and a failed/booting exec falls through to the fail-safe "do not sleep" branch. Also fetch the pod name via jsonpath instead of `-o name | head -1` (no pipe to SIGPIPE, no `pod/` prefix to strip), and exec `adb` directly without the `sh -c` wrapper. Verified live: ran the corrected script as the gate ServiceAccount against the stuck emulator (idle ~120h) — it logged "idle >= 6h ... scaling to zero" and patched the deployment to replicas=0. The 6+ day pod is now asleep. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 08:57:36 +00:00
Viktor Barzin	839fdb33c2	android-emulator: sleep after 6h idle (activity-based), fix never-sleeping The emulator was meant to scale to zero when idle but had been up 6+ days straight despite ~5 days with no real use. Two bugs: 1. The idle check counted ESTABLISHED TCP connections to the adb/noVNC ports. A forgotten `adb connect` (no disconnect) holds that transport open forever, so every 15-min run saw "active" and reset the counter -- it never reached the sleep branch. (Right now: 4 such stale transports from pods on k8s-node3/node4.) 2. Even when it did reach the sleep branch, `kubectl scale --replicas=0` failed Forbidden -- the gate ServiceAccount can patch `deployments` but not `deployments/scale`. Switch the sleeper to measure actual use: time since last user activity (taps/keys/app-launches, incl. noVNC clicks) from `dumpsys power` vs guest uptime. No interaction for 6h -> sleep. This ignores idle/forgotten connections entirely. Scale down with a direct replicas patch on the named deployment (same path the wake gate scales up), so it needs only the existing `deployments` patch grant -- no `deployments/scale`. Now stateless (drops the idle-counter annotation; gate.py no longer sets it) and lighter on etcd. Fail-safe: any read error (e.g. mid-boot) does not sleep. Requested by Viktor: turn the dev-only emulator off when it hasn't been used for 6h. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 08:49:23 +00:00
Viktor Barzin	566447a698	k8s-upgrade: preflight kubeadm-plan gate must pass explicit target (minor-upgrade fix) Last night's 1.34.9->1.35.6 run passed the ESO/kyverno compat gate (the migration worked!) but ABORTED at the kubeadm-plan-target gate: it ran `kubeadm upgrade plan` with NO version, so master's old 1.34.9 kubeadm auto-proposed only the current minor (Loki: "falling back to stable-1.34") and plan_target != 1.35.6 -> abort. That gate worked for patch upgrades but never for minors. Fix: pass the explicit `v$TARGET_VERSION` (verified on master: `kubeadm upgrade plan v1.35.6` emits "kubeadm upgrade apply v1.35.6"). Works for patches too. Applied live to the ConfigMap before tonight's run; deleted the failed preflight-1-35-6 job. Also: ESO 2.x took SSA ownership of .spec.refreshInterval, so terraform's apply of the k8s-upgrade-creds ExternalSecret hit a field-manager conflict. Added field_manager.force_conflicts=true (benign — interval is semantically identical). This pattern affects all 104 migrated ESs fleet-wide (follow-up). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 06:06:14 +00:00
Viktor Barzin	98d2b89614	calico: bump tigera-operator mem limit 256Mi -> 512Mi (OOM crashloop fix) The operator OOM-crashlooped on 2026-06-23: it idles at ~246Mi with a ~266Mi startup spike (re-listing resources to build informer caches), both at/over the 256Mi limit, so the first time the pod restarted it could never finish startup (exit 137 OOMKilled, leader-elect, OOM, repeat). A latent landmine — the limit was always too tight; it only bit once the pod restarted. Data plane was never affected (calico-node 7/7, tigerastatus green throughout). 512Mi gives headroom (now ~246Mi steady, verified stable 0 restarts). NOT caused by the ESO migration (which never touched calico); cluster churn was at most the trigger that exposed the tight limit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 12:46:28 +00:00
Viktor Barzin	68c240b8de	Merge remote-tracking branch 'origin/master'	2026-06-23 09:56:25 +00:00
Viktor Barzin	7d297dc6b1	eso: complete migration — chart 2.6.0, all CRs on v1, 1.35 gate cleared Phase 3 of the ESO 0.12->2.6 migration (the last k8s-1.35 compat-gate blocker). Climbed external-secrets 0.16.2 -> 0.17.0 -> ... -> 2.6.0 one minor at a time, each hop applied + verified (ES sync held at 109 Ready every hop; atomic=true rollback safety net). Crossed the 0.17 cutoff (v1beta1 serving removed) only after Phase 2 put all 104 ExternalSecrets + 2 ClusterSecretStores on external-secrets.io/v1. Result: compat-gate now returns "OK: cluster is safe to upgrade to 1.35.6" (EXIT 0) — the autonomous version-check chain will take k8s 1.34 -> 1.35 on its next nightly run. Also fixes the repo-wide stale-lock issue that broke CI pipeline 332: the terragrunt-generated providers.tf declares gavinbunney/kubectl + telmate/proxmox, but ~28-39 stacks' committed .terraform.lock.hcl predated that ("Inconsistent dependency lock file: no version selected"). Reconciled via `tg init -upgrade` and committed so `terragrunt apply`/CI work cleanly again. Docs: .claude/CLAUDE.md ESO line corrected (104 ESs, v1, chart 2.6.0); plan doc marked COMPLETE. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 09:55:51 +00:00
Viktor Barzin	ff4b01a674	state(external-secrets): update encrypted state	2026-06-23 09:53:36 +00:00
Viktor Barzin	e1a85dd727	state(external-secrets): update encrypted state	2026-06-23 09:52:30 +00:00
Viktor Barzin	af22416d6f	state(external-secrets): update encrypted state	2026-06-23 09:51:21 +00:00
Viktor Barzin	c75982f408	state(external-secrets): update encrypted state	2026-06-23 09:50:11 +00:00
Viktor Barzin	0407e3c578	state(external-secrets): update encrypted state	2026-06-23 09:48:33 +00:00
Viktor Barzin	dab8f9446f	state(external-secrets): update encrypted state	2026-06-23 09:47:24 +00:00
Viktor Barzin	e815bb0295	state(external-secrets): update encrypted state	2026-06-23 09:46:17 +00:00
Viktor Barzin	8412cd7d54	state(external-secrets): update encrypted state	2026-06-23 09:45:04 +00:00
Viktor Barzin	f2956e1e62	state(external-secrets): update encrypted state	2026-06-23 09:43:57 +00:00
Viktor Barzin	bf2f865eee	state(external-secrets): update encrypted state	2026-06-23 09:42:52 +00:00
Viktor Barzin	6f3cfb18c7	state(external-secrets): update encrypted state	2026-06-23 09:41:46 +00:00
Viktor Barzin	6e8e066215	state(external-secrets): update encrypted state	2026-06-23 09:40:14 +00:00
Viktor Barzin	de1fb04d9f	state(external-secrets): update encrypted state	2026-06-23 09:39:12 +00:00
Viktor Barzin	606cfdb544	state(external-secrets): update encrypted state	2026-06-23 09:38:12 +00:00
Viktor Barzin	72464e7880	state(external-secrets): update encrypted state	2026-06-23 09:37:11 +00:00
Viktor Barzin	e88ea50304	docs(multi-tenancy): document install_skills (vendored per-user agent skills) Record the new reconcile step alongside install_memory/install_playwright: vendored own-copies of the 16-skill set for the SKILL_USERS allowlist (emo), why it's vendored not npx (upstream drift), and that if-absent keys on the user's own copy so it heals a stale/cross-user ~/.claude/skills symlink (emo's grill-me pointed into the admin's home). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 09:30:27 +00:00
Viktor Barzin	1c8dc6bd6c	t3-provision-users: install_skills heals stale symlinks + owns ~/.agents Follow-up to the vendored-skills change, from verifying the emo rollout: - The if-absent guard treated ANY pre-existing ~/.claude/skills/<name> entry as "installed", so a manual cross-user symlink emo already had (grill-me -> /home/wizard/.claude/skills/grill-me) was skipped — leaving the requested skill depending on the admin's home instead of emo's own copy. The guard now keys on the user's OWN copy (a real dir under ~/.agents/skills) and (re)points the ~/.claude/skills symlink at it, healing a stale/cross-user link while still never clobbering a real dir. - install -d left the intermediate ~/.agents owned by root; now owned by the user. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 09:27:31 +00:00
Viktor Barzin	987fdd16db	t3-provision-users: vendor agent skills + per-user install_skills (emo) Make the admin's Claude Code agent skills available to the `emo` devvm user. Viktor asked to install Matt Pocock's skills for emo, starting with grill-me but covering the full set the admin already uses. The `npx skills` upstream has drifted off that set (diagnose -> diagnosing-bugs and write-a-skill -> writing-great-skills were renamed; caveman + zoom-out are no longer published), so reproducing it via npx is impossible and would also spray ~70 agent dirs into the user's home + add a GitHub-clone + unpinned-CLI dependency to the hourly root reconcile. Instead vendor a point-in-time snapshot of the 16 skills (scripts/workstation/claude-skills/) and copy them per-user, mirroring install_memory: install_skills() copies each skill into ~/.agents/skills/<name> (owned by the user) and symlinks ~/.claude/skills/<name> -> ../../.agents/skills/<name>. if-absent, additive, best-effort, scoped to the SKILL_USERS allowlist (emo). find-skills is from vercel-labs/skills (not Matt Pocock) but included since it is part of the admin's current set. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 09:23:37 +00:00
Viktor Barzin	59f2beda21	chrome-service: run real Google Chrome (H.264/AAC codecs) for the browser Point the chrome-service container at the new chrome-service-browser image and launch /opt/google/chrome/chrome instead of the bundled Chromium. Fixes MEDIA_ERR_SRC_NOT_SUPPORTED on H.264/AAC video (Instagram Reels etc.) in the noVNC view — bundled Chromium has those codecs compiled out; only real Chrome carries them. connect_over_cdp callers (tripit fare scrape, homelab browser, snapshot-harvester) attach over raw CDP (version-tolerant) — validated after rollout. Image is built off-infra on GHA (prior commit) → public ghcr. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 21:15:36 +00:00
Viktor Barzin	df1ec1879d	chrome-service: build a real-Chrome browser image (H.264/AAC codecs) Add an infra-owned image (Playwright base + google-chrome-stable) + its GHA build workflow. The bundled Chromium ships proprietary codecs compiled out, so H.264/AAC video (Instagram Reels, X, most .mp4) fails in the noVNC view with MEDIA_ERR_SRC_NOT_SUPPORTED; only real Google Chrome carries those codecs (libffmpeg swap + Chrome-for-Testing both ruled out). This commit only builds the image (→ ghcr.io/viktorbarzin/chrome-service-browser); a follow-up flips main.tf's launch to it once the image exists + is public. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 21:01:17 +00:00
Viktor Barzin	7061b1dfc6	state(external-secrets): update encrypted state	2026-06-22 20:55:27 +00:00
Viktor Barzin	e2f328ff4a	state(external-secrets): update encrypted state	2026-06-22 20:45:24 +00:00
Viktor Barzin	a735be9ba4	state(external-secrets): update encrypted state	2026-06-22 20:45:08 +00:00
Viktor Barzin	c670cb7118	eso: Phase 2 — migrate all 104 ExternalSecrets + 2 ClusterSecretStores to v1 The API rewrite half of the ESO 0.12->2.6 migration (last k8s-1.35 compat-gate blocker). Done on chart 0.16.2, which serves BOTH external-secrets.io/v1beta1 and v1, so this is the safe window — MUST land before 0.17 removes v1beta1 (there is no conversion webhook). Pure apiVersion bump, schema is byte-identical: 106 occurrences (104 ExternalSecrets + 2 ClusterSecretStores vault-kv/vault-database) across 73 .tf files, v1beta1 -> v1, no other field changes. Validated live first on tandoor (single, non-coupled, synced ES): the kubernetes_manifest apiVersion bump forces a REPLACE; the target Secret is cascade-GC'd for ONE ~0.3s poll then ESO recreates it (identical value re-synced from Vault, new UID) and the ES returns SecretSynced=True on v1. Running pods keep their mounted copy through the sub-second blip. All 110 target Secrets were snapshotted to /tmp first as a backstop. CI applies the changed stacks serially (staged rollout); watching aggregate ES sync back to 108 synced (2 pre-existing dead: instagram-poster, payslip-ingest). Next: Phase 3 climb 0.16.2 -> 2.6.0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 19:13:04 +00:00
Viktor Barzin	98cd535b97	authentik: lock chrome.viktorbarzin.me noVNC to Viktor only The chrome-service noVNC exposes Viktor's live logged-in browser sessions (Instagram etc. — he'll sign in there for homelab browser to reuse). It was auth="required" = any authenticated user, and "Home Server Admins" includes emo (emil.barzin@gmail.com), so the admin group is not a sufficient gate. Add a host-specific case to the domain-wide forward-auth restriction allowing only Viktor's accounts (vbarzin@gmail.com + akadmin break-glass); everyone else, incl. emo, is denied at the noVNC. emo's AGENT already can't reach the browser (read-only RBAC blocks port-forward); this closes the human noVNC path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 18:09:27 +00:00
Viktor Barzin	a3cdc0d6d0	chrome-service: size headed Chrome window to fill Xvfb (noVNC cut-off) The noVNC view showed the browser in the top-left with the rest of the framebuffer black. Cause: Chrome launched with no --window-size, and there's no window manager, so it opened at its profile-persisted (smaller) size inside the 1280x720 Xvfb. Add --window-size=1280,720 --window-position=0,0 so the window fills the screen on every launch (fresh pods/profiles too). Live windows were already resized via CDP as a stopgap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 18:00:20 +00:00
Viktor Barzin	c7ead032ec	chrome-service: fix noVNC stuck-"Connecting" (x11vnc fd-sweep under nofile=2^31) The noVNC view hung on "Connecting" forever then timed out. Root cause: x11vnc sweeps the entire fd table (fcntl per fd) on every client connection, and containerd grants pods RLIMIT_NOFILE=2^31, so the RFB handshake never completes (websockify accepts the WS and dials localhost:5900, but x11vnc never sends its banner — verified: handshake timed out at 8s, x11vnc had burned 1h41m CPU spinning). Same bug + fix the android-emulator stack already carries. Cap nofile before x11vnc starts, in two places: - files/novnc/entrypoint.sh: `ulimit -n 65536` (root fix, makes the image correct) - main.tf novnc container: `command = ["bash","-c","ulimit -n 65536; exec /entrypoint.sh"]` so the cap applies deterministically on rollout even though the image is :latest/IfNotPresent (a rebuilt entrypoint isn't guaranteed to be re-pulled). Also documents the gotcha + diagnosis in docs/architecture/chrome-service.md and notes the black-when-idle behaviour + the autoconnect URL. (A live x11vnc relaunch with the cap already unblocked the running pod; this makes it survive restarts.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 17:34:03 +00:00
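
The pipefail/SIGPIPE failure described in commit 0293b5c634 can be reproduced in isolation. This is a minimal sketch, not the actual sleeper script: `producer` is a hypothetical stand-in for the streaming `kubectl exec` call, and the function names `fragile` and `robust` are invented for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for the streaming exec: emits far more output than
# a single pipe buffer holds, so it is still writing when the reader exits.
producer() { seq 1 100000; }

# Fragile form: awk exits on the first match and closes the pipe while the
# producer is still writing. With pipefail, the producer's failure becomes
# the status of the whole pipeline, even though awk found its match.
# The body runs in a subshell so pipefail does not leak into the caller.
fragile() (
  set -o pipefail
  producer | awk '/^5$/ { print; exit }'
)

# Robust form: capture the full output first (tolerating a failed producer),
# then let awk read to END with no early exit, so nothing is cut off
# mid-stream. An empty capture prints nothing, which a caller can treat as
# the fail-safe case.
robust() {
  local out
  out="$(producer)" || true
  printf '%s\n' "$out" | awk '/^5$/ { v = $0 } END { if (v != "") print v }'
}
```

Both functions print the matching line, but `fragile` returns a non-zero status (typically 141, i.e. 128 + SIGPIPE, as reported in the commit), which would abort a script running under `set -e`. `robust` returns awk's own status.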


4557 commits