infra

Author	SHA1	Message	Date
Viktor Barzin	b858561bd0	Merge remote-tracking branch 'origin/master' [CI: ci/woodpecker/push/default pipeline failed]	2026-06-24 20:59:39 +00:00
Viktor Barzin	a7704f46a6	deploy goldmane-edge-aggregator: durable who-talks-to-whom edge trail (#58, ADR-0014) Infra side of ADR-0014: an mTLS gRPC consumer of Calico Goldmane's Flows API that records the namespace-pair edge-set in CNPG and posts a daily new-edge digest to #security. Adds the goldmane-edge-aggregator stack, the pg-goldmane-edges Vault rotation role (Tier-0 vault state updated here), and the namespace in the ghcr-credentials allowlist. Cert: REUSES the operator-minted, Tigera-CA-signed whisker-backend client cert (Goldmane verifies only the CA chain, not identity) instead of minting from the Tigera CA private key. This avoids putting the CA key in TF state AND the hashicorp/tls provider, which is incompatible with this repo's global generate-providers/lockfile pattern (it broke every stack's lockfile). Verified live: aggregator streaming flows, 174 edges in Postgres across 50x54 namespaces, db+slack ExternalSecrets synced, digest dry-run formats correctly, private image pulls via the Kyverno-synced ghcr-credentials. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:59:39 +00:00
Viktor Barzin	aa510e3600	instagram-poster: force_conflicts on ESO manifests (fix apply) [CI: ci/woodpecker/push/default pipeline successful] The ESO v1 migration (2026-06-22) made the external-secrets controller own .spec.refreshInterval via server-side apply, so terraform apply of the two ExternalSecret manifests fails with a field-manager conflict (Woodpecker #348), which blocked the replicas=0 scale-down from landing. Add force_conflicts=true to both, matching the grafana/woodpecker/traefik fix applied to other stacks the same day. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:49:53 +00:00
Viktor Barzin	53834deb24	instagram-poster: scale to 0 (unused, dead ExternalSecret) [CI: ci/woodpecker/push/default pipeline failed] Viktor confirmed the Instagram Graph poster isn't used. Its ExternalSecret has been dead on missing Vault keys (ig_graph_long_lived_token, ig_business_account_id), so the deployment sat at 0/1 firing DeploymentReplicasMismatch. Setting replicas=0 stops the alert and makes the scale-down durable (a bare kubectl scale reverts on the next stack apply). Re-set to 1 after minting a Meta long-lived token + populating the Vault keys. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:45:30 +00:00
Viktor Barzin	8dd9a3978d	Merge remote-tracking branch 'forgejo/master' into wizard/homelab-vault [CI: ci/woodpecker/push/default pipeline successful]	2026-06-24 12:25:52 +00:00
Viktor Barzin	65b2df1222	fix(monitoring): force_conflicts on grafana_db_creds ExternalSecret The external-secrets controller owns .spec.refreshInterval via SSA, so a plain terraform apply of the monitoring stack conflicts. Latent until 2026-06-24 (the homelab-vault loki-rules change was the first monitoring apply in a while and surfaced it). force_conflicts lets TF win (same pattern as woodpecker/traefik/k8s-version-upgrade stacks).	2026-06-24 12:25:36 +00:00
Viktor Barzin	1d0388da12	Merge remote-tracking branch 'origin/master' [CI: ci/woodpecker/push/default pipeline successful]	2026-06-24 12:22:58 +00:00
Viktor Barzin	92361f36db	calico: enable Goldmane + Whisker (Calico 3.30 OSS flow observability) Turns on Calico 3.30's native east-west flow observability so we can see which Service talks to which (ADR-0014, issue #57). Enabled via the operator CRs directly (kubectl_manifest Goldmane + Whisker, name=default) rather than the Helm goldmane/whisker flags, because the goldmanes/whiskers CRDs already exist and this sidesteps the helm-upgrade CR-before-CRD ordering issue. Whisker notifications=Disabled so the UI doesn't call the external Tigera endpoint. Applied supervised: creating the Goldmane CR re-rendered calico-node with the FELIX_FLOWLOGSGOLDMANESERVER env (operator auto-wires Felix — no manual FelixConfiguration); calico-node rolled cleanly 7/7, tigerastatus healthy, goldmane is receiving flows from all nodes, Whisker UI serves. Durable Loki persistence is NOT included here: the Goldmane emitter is Calico Cloud/Enterprise-gated with no OSS knob to aim it at Loki (the CR can override only name+resources, not env), so a durable trail needs a small custom gRPC consumer of goldmane:7443 — tracked in issue #58. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 12:22:48 +00:00
Viktor Barzin	e711b2f971	feat(monitoring): homelab vault traceability alerts (TOTP-fetch + volume) [CI: some checks failed; ci/woodpecker/push/default pipeline failed; Build infra CLI / build (push) cancelled] Adds a Loki ruler group (lane=security -> #security) for the homelab vault op-log: VaultwardenTOTPFetched (every 2nd-factor fetch is visible) and VaultwardenFetchVolumeHigh (>100 fetches/10m backstop). The audit spine (Vault audit device, reads of secret/data/workstation/claude-users/*) is already captured. True CLI-bypass detection needs cross-stream correlation (follow-up).	2026-06-24 10:31:32 +00:00
Viktor Barzin	64104e56e9	feat(devvm): install Bitwarden CLI for homelab vault	2026-06-24 10:29:57 +00:00
Viktor Barzin	15643d1f44	feat(cli): bare `homelab vault` help command	2026-06-24 10:29:32 +00:00
Viktor Barzin	772aed5370	fix(cli): vault security review fixes C1 (critical): setup wrote the master password + API client_secret as `vault kv patch key=value` argv, leaking them via /proc/<pid>/cmdline to same-UID processes. Now written via stdin (key=- form); only email + client_id (non-credentials) remain in argv. I1: `get --json` refused on a TTY (was dumping the secret to scrollback). M1: vaultLock now holds the per-user flock (it mutates bw state). M4: bw login-detection parses status JSON instead of substring matching. M5: clipboard path refuses when stderr is not a TTY (was silently failing). M6: realRunner trims only trailing newline, preserving secret whitespace; secret prompts likewise. Adds security-property tests: no secret in argv across the get flow, clipboard decision matrix, --json TTY gate, bw status parsing.	2026-06-24 10:28:31 +00:00
Viktor Barzin	5a864cf19c	feat(cli): homelab vault setup onboarding (one-time, self-service)	2026-06-24 10:21:57 +00:00
Viktor Barzin	e20033855d	feat(cli): vault list/search/code/status/lock	2026-06-24 10:21:07 +00:00
Viktor Barzin	365340b37d	feat(cli): homelab vault get with TTY-aware return	2026-06-24 10:20:05 +00:00
Viktor Barzin	2dd12fc6be	feat(cli): vault session bootstrap with per-user flock + no-coredump	2026-06-24 10:18:36 +00:00
Viktor Barzin	5bae2a3907	feat(cli): privacy-aware vault op-log (process, never the secret)	2026-06-24 10:17:50 +00:00
Viktor Barzin	81122f8607	feat(cli): TTY-aware return + OSC52 clipboard with terminal gating	2026-06-24 10:17:13 +00:00
Viktor Barzin	06f4b87af1	feat(cli): vault bw engine env/arg builders + unlock	2026-06-24 10:16:19 +00:00
Viktor Barzin	cd44ca5921	feat(cli): vault creds loading from per-user Vault path	2026-06-24 10:15:32 +00:00
Viktor Barzin	6c53ee10b1	feat(cli): register homelab vault command group skeleton	2026-06-24 10:14:24 +00:00
Viktor Barzin	ae0d7984c4	docs: ADR-0014 + glossary - service identity (namespace+label) & Calico Goldmane observability [CI: ci/woodpecker/push/default pipeline successful] Records the design reached in a /grill-with-docs session: how to track which Service talks to which as more Services are added, using k8s-native options. Decision: service identity = the workload's namespace (primary) plus a `service-identity` label only in the few multi-Service namespaces; east-west observability = Calico 3.30 Goldmane/Whisker (already in our Calico v3.30.7, currently disabled) emitting to Loki for a durable trail; enforcement reuses the existing Wave 1 egress track. Dedicated per-Service ServiceAccounts deferred and a service mesh / mTLS / SPIFFE rejected: the trust model needs attribution-grade forensics on a trusted, etcd-constrained cluster, not cryptographic non-repudiation. This is the service-mesh evaluation the 2026-04-20 infra audit flagged as missing; rejected alternatives (Retina, Hubble, Kiali, a custom Alloy enricher) are recorded with rationale. Adds glossary terms (Service identity, Goldmane / Whisker) to CONTEXT.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 10:00:36 +00:00
Viktor Barzin	0293b5c634	android-emulator: fix idle-sleeper dying with SIGPIPE before it could sleep [CI: ci/woodpecker/push/default pipeline successful] Caught live-testing the previous commit: every sleeper run exited 141 (SIGPIPE) in ~1s with no output, never reaching the scale-down. Cause: `set -o pipefail` + `dumpsys power | awk '...; exit'`: awk closes the pipe after the first match while `kubectl exec` is still streaming dumpsys, so the exec gets SIGPIPE, pipefail makes the pipeline 141, and set -e kills the script before any echo. (My earlier dry-run missed it because it didn't run under `set -euo pipefail`.) Fix: drop pipefail; capture each exec to a var (`|| true`) then parse with awk reading to END (no early `exit`), so nothing can SIGPIPE mid-stream and a failed/booting exec falls through to the fail-safe "do not sleep" branch. Also fetch the pod name via jsonpath instead of `-o name | head -1` (no pipe to SIGPIPE, no `pod/` prefix to strip), and exec `adb` directly without the `sh -c` wrapper. Verified live: ran the corrected script as the gate ServiceAccount against the stuck emulator (idle ~120h): it logged "idle >= 6h ... scaling to zero" and patched the deployment to replicas=0. The 6+ day pod is now asleep. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 08:57:36 +00:00
Viktor Barzin	839fdb33c2	android-emulator: sleep after 6h idle (activity-based), fix never-sleeping [CI: ci/woodpecker/push/default pipeline successful] The emulator was meant to scale to zero when idle but had been up 6+ days straight despite ~5 days with no real use. Two bugs: 1. The idle check counted ESTABLISHED TCP connections to the adb/noVNC ports. A forgotten `adb connect` (no disconnect) holds that transport open forever, so every 15-min run saw "active" and reset the counter -- it never reached the sleep branch. (Right now: 4 such stale transports from pods on k8s-node3/node4.) 2. Even when it did reach the sleep branch, `kubectl scale --replicas=0` failed Forbidden -- the gate ServiceAccount can patch `deployments` but not `deployments/scale`. Switch the sleeper to measure actual use: time since last user activity (taps/keys/app-launches, incl. noVNC clicks) from `dumpsys power` vs guest uptime. No interaction for 6h -> sleep. This ignores idle/forgotten connections entirely. Scale down with a direct replicas patch on the named deployment (same path the wake gate scales up), so it needs only the existing `deployments` patch grant -- no `deployments/scale`. Now stateless (drops the idle-counter annotation; gate.py no longer sets it) and lighter on etcd. Fail-safe: any read error (e.g. mid-boot) does not sleep. Requested by Viktor: turn the dev-only emulator off when it hasn't been used for 6h. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 08:49:23 +00:00
Viktor Barzin	566447a698	k8s-upgrade: preflight kubeadm-plan gate must pass explicit target (minor-upgrade fix) [CI: ci/woodpecker/push/default pipeline successful] Last night's 1.34.9->1.35.6 run passed the ESO/kyverno compat gate (the migration worked!) but ABORTED at the kubeadm-plan-target gate: it ran `kubeadm upgrade plan` with NO version, so master's old 1.34.9 kubeadm auto-proposed only the current minor (Loki: "falling back to stable-1.34") and plan_target != 1.35.6 -> abort. That gate worked for patch upgrades but never for minors. Fix: pass the explicit `v$TARGET_VERSION` (verified on master: `kubeadm upgrade plan v1.35.6` emits "kubeadm upgrade apply v1.35.6"). Works for patches too. Applied live to the ConfigMap before tonight's run; deleted the failed preflight-1-35-6 job. Also: ESO 2.x took SSA ownership of .spec.refreshInterval, so terraform's apply of the k8s-upgrade-creds ExternalSecret hit a field-manager conflict. Added field_manager.force_conflicts=true (benign: interval is semantically identical). This pattern affects all 104 migrated ESs fleet-wide (follow-up). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 06:06:14 +00:00
Viktor Barzin	98d2b89614	calico: bump tigera-operator mem limit 256Mi -> 512Mi (OOM crashloop fix) [CI: ci/woodpecker/push/default pipeline successful] The operator OOM-crashlooped on 2026-06-23: it idles at ~246Mi with a ~266Mi startup spike (re-listing resources to build informer caches), both at/over the 256Mi limit, so the first time the pod restarted it could never finish startup (exit 137 OOMKilled, leader-elect, OOM, repeat). A latent landmine: the limit was always too tight; it only bit once the pod restarted. Data plane was never affected (calico-node 7/7, tigerastatus green throughout). 512Mi gives headroom (now ~246Mi steady, verified stable 0 restarts). NOT caused by the ESO migration (which never touched calico); cluster churn was at most the trigger that exposed the tight limit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 12:46:28 +00:00
Viktor Barzin	68c240b8de	Merge remote-tracking branch 'origin/master' [CI: ci/woodpecker/push/default pipeline failed]	2026-06-23 09:56:25 +00:00
Viktor Barzin	7d297dc6b1	eso: complete migration — chart 2.6.0, all CRs on v1, 1.35 gate cleared Phase 3 of the ESO 0.12->2.6 migration (the last k8s-1.35 compat-gate blocker). Climbed external-secrets 0.16.2 -> 0.17.0 -> ... -> 2.6.0 one minor at a time, each hop applied + verified (ES sync held at 109 Ready every hop; atomic=true rollback safety net). Crossed the 0.17 cutoff (v1beta1 serving removed) only after Phase 2 put all 104 ExternalSecrets + 2 ClusterSecretStores on external-secrets.io/v1. Result: compat-gate now returns "OK: cluster is safe to upgrade to 1.35.6" (EXIT 0) — the autonomous version-check chain will take k8s 1.34 -> 1.35 on its next nightly run. Also fixes the repo-wide stale-lock issue that broke CI pipeline 332: the terragrunt-generated providers.tf declares gavinbunney/kubectl + telmate/proxmox, but ~28-39 stacks' committed .terraform.lock.hcl predated that ("Inconsistent dependency lock file: no version selected"). Reconciled via `tg init -upgrade` and committed so `terragrunt apply`/CI work cleanly again. Docs: .claude/CLAUDE.md ESO line corrected (104 ESs, v1, chart 2.6.0); plan doc marked COMPLETE. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 09:55:51 +00:00
Viktor Barzin	ff4b01a674	state(external-secrets): update encrypted state	2026-06-23 09:53:36 +00:00
Viktor Barzin	e1a85dd727	state(external-secrets): update encrypted state	2026-06-23 09:52:30 +00:00
Viktor Barzin	af22416d6f	state(external-secrets): update encrypted state	2026-06-23 09:51:21 +00:00
Viktor Barzin	c75982f408	state(external-secrets): update encrypted state	2026-06-23 09:50:11 +00:00
Viktor Barzin	0407e3c578	state(external-secrets): update encrypted state	2026-06-23 09:48:33 +00:00
Viktor Barzin	dab8f9446f	state(external-secrets): update encrypted state	2026-06-23 09:47:24 +00:00
Viktor Barzin	e815bb0295	state(external-secrets): update encrypted state	2026-06-23 09:46:17 +00:00
Viktor Barzin	8412cd7d54	state(external-secrets): update encrypted state	2026-06-23 09:45:04 +00:00
Viktor Barzin	f2956e1e62	state(external-secrets): update encrypted state	2026-06-23 09:43:57 +00:00
Viktor Barzin	bf2f865eee	state(external-secrets): update encrypted state	2026-06-23 09:42:52 +00:00
Viktor Barzin	6f3cfb18c7	state(external-secrets): update encrypted state	2026-06-23 09:41:46 +00:00
Viktor Barzin	6e8e066215	state(external-secrets): update encrypted state	2026-06-23 09:40:14 +00:00
Viktor Barzin	de1fb04d9f	state(external-secrets): update encrypted state	2026-06-23 09:39:12 +00:00
Viktor Barzin	606cfdb544	state(external-secrets): update encrypted state	2026-06-23 09:38:12 +00:00
Viktor Barzin	72464e7880	state(external-secrets): update encrypted state	2026-06-23 09:37:11 +00:00
Viktor Barzin	e88ea50304	docs(multi-tenancy): document install_skills (vendored per-user agent skills) [CI: ci/woodpecker/push/default pipeline successful] Record the new reconcile step alongside install_memory/install_playwright: vendored own-copies of the 16-skill set for the SKILL_USERS allowlist (emo), why it's vendored not npx (upstream drift), and that if-absent keys on the user's own copy so it heals a stale/cross-user ~/.claude/skills symlink (emo's grill-me pointed into the admin's home). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 09:30:27 +00:00
Viktor Barzin	1c8dc6bd6c	t3-provision-users: install_skills heals stale symlinks + owns ~/.agents [CI: ci/woodpecker/push/default pipeline successful] Follow-up to the vendored-skills change, from verifying the emo rollout: - The if-absent guard treated ANY pre-existing ~/.claude/skills/<name> entry as "installed", so a manual cross-user symlink emo already had (grill-me -> /home/wizard/.claude/skills/grill-me) was skipped, leaving the requested skill depending on the admin's home instead of emo's own copy. The guard now keys on the user's OWN copy (a real dir under ~/.agents/skills) and (re)points the ~/.claude/skills symlink at it, healing a stale/cross-user link while still never clobbering a real dir. - install -d left the intermediate ~/.agents owned by root; now owned by the user. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 09:27:31 +00:00
Viktor Barzin	987fdd16db	t3-provision-users: vendor agent skills + per-user install_skills (emo) [CI: ci/woodpecker/push/default pipeline successful] Make the admin's Claude Code agent skills available to the `emo` devvm user. Viktor asked to install Matt Pocock's skills for emo, starting with grill-me but covering the full set the admin already uses. The `npx skills` upstream has drifted off that set (diagnose -> diagnosing-bugs and write-a-skill -> writing-great-skills were renamed; caveman + zoom-out are no longer published), so reproducing it via npx is impossible and would also spray ~70 agent dirs into the user's home + add a GitHub-clone + unpinned-CLI dependency to the hourly root reconcile. Instead vendor a point-in-time snapshot of the 16 skills (scripts/workstation/claude-skills/) and copy them per-user, mirroring install_memory: install_skills() copies each skill into ~/.agents/skills/<name> (owned by the user) and symlinks ~/.claude/skills/<name> -> ../../.agents/skills/<name>. if-absent, additive, best-effort, scoped to the SKILL_USERS allowlist (emo). find-skills is from vercel-labs/skills (not Matt Pocock) but included since it is part of the admin's current set. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 09:23:37 +00:00
Viktor Barzin	59f2beda21	chrome-service: run real Google Chrome (H.264/AAC codecs) for the browser [CI: ci/woodpecker/push/default pipeline successful] Point the chrome-service container at the new chrome-service-browser image and launch /opt/google/chrome/chrome instead of the bundled Chromium. Fixes MEDIA_ERR_SRC_NOT_SUPPORTED on H.264/AAC video (Instagram Reels etc.) in the noVNC view: bundled Chromium has those codecs compiled out; only real Chrome carries them. connect_over_cdp callers (tripit fare scrape, homelab browser, snapshot-harvester) attach over raw CDP (version-tolerant); validated after rollout. Image is built off-infra on GHA (prior commit) → public ghcr. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 21:15:36 +00:00
Viktor Barzin	df1ec1879d	chrome-service: build a real-Chrome browser image (H.264/AAC codecs) [CI: some checks failed; ci/woodpecker/push/default pipeline successful; Build chrome-service-browser / build (push) cancelled] Add an infra-owned image (Playwright base + google-chrome-stable) + its GHA build workflow. The bundled Chromium ships proprietary codecs compiled out, so H.264/AAC video (Instagram Reels, X, most .mp4) fails in the noVNC view with MEDIA_ERR_SRC_NOT_SUPPORTED; only real Google Chrome carries those codecs (libffmpeg swap + Chrome-for-Testing both ruled out). This commit only builds the image (→ ghcr.io/viktorbarzin/chrome-service-browser); a follow-up flips main.tf's launch to it once the image exists + is public. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 21:01:17 +00:00
Viktor Barzin	7061b1dfc6	state(external-secrets): update encrypted state	2026-06-22 20:55:27 +00:00
Viktor Barzin	e2f328ff4a	state(external-secrets): update encrypted state	2026-06-22 20:45:24 +00:00
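
Note on 772aed5370 (C1): the fix moves secret values out of argv and onto stdin using the `key=-` form of `vault kv patch`. The sketch below illustrates that idea only; it is not the repo's CLI code, whose implementation is not shown in this log. The helper names `build_kv_patch` and `leaks_secret` and the example path are hypothetical, and one call carries one secret because a single stdin stream can feed only one `key=-` value.

```python
from typing import List, Tuple


def build_kv_patch(path: str, key: str, secret: str) -> Tuple[List[str], bytes]:
    """Return (argv, stdin_bytes) for one `vault kv patch` call.

    The value is passed as `key=-`, so the CLI is expected to read it from
    stdin and the secret never appears in /proc/<pid>/cmdline.
    """
    if not key or "=" in key:
        raise ValueError("key must be non-empty and must not contain '='")
    argv = ["vault", "kv", "patch", path, f"{key}=-"]
    # Encode the secret unchanged: no stripping, so whitespace is preserved.
    return argv, secret.encode("utf-8")


def leaks_secret(argv: List[str], secret: str) -> bool:
    """Return True if the secret text appears in any argv element."""
    return any(secret in arg for arg in argv)


if __name__ == "__main__":
    # Hypothetical path and key, for illustration only.
    argv, stdin_bytes = build_kv_patch(
        "secret/workstation/claude-users/example-user",
        "master_password",
        "correct horse battery staple",
    )
    print(argv)
    print("secret in argv:", leaks_secret(argv, "correct horse battery staple"))
    # To actually run it (requires a configured vault CLI):
    #   subprocess.run(argv, input=stdin_bytes, check=True)
```

A check like `leaks_secret` is roughly what a "no secret in argv" security-property test can assert against every command the tool builds.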

4562 commits