infra

Author	SHA1	Message	Date
Viktor Barzin	f92ab04dae	vault: grant emo read-only access to his own secret/emo emo (power-user tier) had no Vault policy granting his personal secret path, so `vault kv get secret/emo` failed. Viktor asked to give him that access. Adds a read-only `personal-emo` policy (read on secret/data/emo + metadata) and attaches it to emo's OIDC identity by adopting the entity/alias Vault auto-created on his first login. Scoped explicitly to emo; does not widen the power-user tier (which stays secret-less). Verified live: a personal-emo token reads secret/emo, is denied writes, and is denied other paths (secret/viktor -> 403). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 13:35:57 +00:00
Viktor Barzin	90f5425cdc	state(vault): update encrypted state	2026-06-27 13:33:34 +00:00
Emil Barzin	a7117e0bfe	immich(frame-emo): bump photo-frame Interval 30->45s Permissions-test change requested by Viktor: slow Emo's Sofia photo-frame slideshow from 30s to 45s per image. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 13:07:00 +00:00
Viktor Barzin	d50962b00e	immich: add Immich photo-frame for Emo's Portal (highlights-immich-emo) Second ImmichFrame instance cloned from the London frame (frame.tf), scoped to Emo's Immich account (emil.barzin) with Sofia weather coords and last-2-years photos. Drives Emo's Meta Portal Mini in Sofia via the portal-immich-frame app. Dedicated API key minted on Emo's account and stored in Vault (secret/immich -> frame_api_key_emo). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 12:40:29 +00:00
Viktor Barzin	e8b72019b5	paperless-ngx: deploy Tika + Gotenberg for Office ingest + raise PVC ceiling to 80Gi Emo's import scope now includes his work-PC document set (C/Documents, Project Management, Service & MRO, etc. on the NAS), which is ~4.9k Office files (.doc/.docx/.xls/.xlsx/.ppt/.pptx) on top of Emo shared. Paperless can only archive/OCR/index those if it can convert them, so add the standard Apache Tika (text+metadata) + Gotenberg (-> PDF) sidecar deployments + their services in the paperless-ngx namespace and point PAPERLESS_TIKA_* at them. Pinned images (gotenberg 8.25, tika 3.3.1.0), single replica, no PVC. Total in-scope document set across all NAS locations is now ~13,700 PDF+Office files / ~13.7GB source (~30GB once OCR'd + archived), so raise the data PVC autoresize ceiling 30Gi -> 80Gi for comfortable headroom. The topolvm autoresizer grows on demand up to the ceiling. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 12:02:04 +00:00
Viktor Barzin	041aedc486	Merge remote-tracking branch 'origin/master' into wizard/paperless-emo	2026-06-27 08:17:28 +00:00
Viktor Barzin	7988a690ed	paperless-ngx: add Bulgarian OCR (bul+eng) + raise data PVC ceiling to 30Gi Preparing Paperless for Emo's document import from the NAS. His archive is Bulgarian (Cyrillic) + English, but OCR was English-only (tesseract had no 'bul' pack and PAPERLESS_OCR_LANGUAGE was unset/defaulted to eng), so scanned BG documents would OCR to garbage and be unsearchable. Add bul to the install list and set OCR_LANGUAGE=bul+eng. Also raise the data PVC autoresize ceiling from 5Gi to 30Gi: everything (originals + archive via PAPERLESS_MEDIA_ROOT=../data) lives on the single encrypted PVC, and the ~2.7GB in-scope import would blow past the 5Gi cap mid-ingest. The topolvm autoresizer grows the volume on demand up to the ceiling; 30Gi gives ample headroom. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 08:17:13 +00:00
Viktor Barzin	6415f77fed	Merge remote-tracking branch 'origin/master' into wizard/emo-vault-onboard	2026-06-27 08:17:06 +00:00
Viktor Barzin	b371ae6eee	homelab vault: install bw system-wide + onboarding runbook Two remaining gaps to let non-admins (emo) use `homelab vault`: - setup-devvm.sh installed `@bitwarden/cli` only when `command -v bw` failed, which an admin's own ~/.local/bin/bw satisfied — so the system-wide copy was never installed and non-admins had no `bw` backend. Install to the npm /usr prefix and guard on the system path (/usr/bin/bw) instead. - Add docs/runbooks/homelab-vault-onboarding.md (per-user setup, the shared Organization/Collection flow for sharing passwords, admin deploy + verification, security model) and repoint the two code comments that cited a design-spec path which never existed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 08:16:52 +00:00
Viktor Barzin	51dc5d031c	homelab vault: make it work for non-admin workstation users `homelab vault` was effectively admin-only: two bugs blocked every non-admin (e.g. emo) from using it for their own Vaultwarden vault. 1. Token: the CLI relied purely on ambient `vault` auth (~/.vault-token / $VAULT_TOKEN), which only admins have. Non-admins carry a scoped token at ~/.config/claude-auth-sync/vault-token (policy workstation-claude-<user>). Add ensureVaultToken(): explicit env > ~/.vault-token > scoped fallback, wired into every vault verb. Admins are unaffected (their ambient token wins). 2. Write capability: `homelab vault setup` used plain `vault kv patch`, which needs the `patch` capability the scoped policy does not grant (only create/read/update) — so setup 403'd for non-admins. Switch to `kv patch -method=rw` (read-modify-write; same approach claude-auth-sync already uses), with `kv put` only when the path doesn't exist yet. Preserves co-located keys (claude_ai_oauth_json). Enables onboarding emo onto the per-user Vaultwarden access tool. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 08:15:42 +00:00
Viktor Barzin	82a7b2585b	chrome-service: reconcile state after pipeline #366 was killed mid-apply + document cancel-previous hazard Pipeline #366 (the SHA-pin apply, commit `7b4a8ba8`) was SIGKILLed mid-apply by Woodpecker cancel-previous when I pushed the next commit (#367, docs) while it was still running — the apply log ends at '[chrome-service] Starting apply...' with no 'Apply complete!', so the terraform state write did not finish. The live deployment is correct (image = the supervised SHA, verified, self-healing), but the stored state may be stale; this commit re-triggers a clean changed-stack apply to reconcile it (comment-only change → 0 resource changes, no rollout). Also adds a caution to the novnc image comment: after bumping the SHA, WAIT for the apply pipeline to finish before pushing again (memory id=1957). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 08:15:41 +00:00
Viktor Barzin	006f97ef58	docs: bless local terragrunt apply, but require committing every applied change Viktor asked to change the infra apply guidance: instead of 'never apply locally, always rely on CI', the policy is now 'you MAY apply locally, but always commit the change to the infra repo'. - .claude/CLAUDE.md (Critical Rule: Terraform Only): new bullet making local apply explicit (scripts/tg apply / homelab tf apply) from the MAIN checkout (not a worktree — git-crypt'd tfvars read as ciphertext there), with a hard requirement that every applied change is committed + pushed to master the same session so the repo stays the source of truth and CI drift-detection doesn't revert it. Spells out the apply<->commit ordering both ways. - AGENTS.md (non-admin workstation land steps): step 5 now notes local apply as an option alongside CI auto-apply, with the same 'always committed, never applied uncommitted' rule. Note: the org-managed settings block also frames CI auto-apply but is not editable from a workstation clone. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 08:10:20 +00:00
Viktor Barzin	7b4a8ba867	chrome-service: pin noVNC image to the x11vnc-supervision build Deploys the self-heal fix from the previous commit. Keel is off for this deployment (keel.sh/policy=never, because the browser container's playwright image is version-pinned to f1-stream) and the novnc image was :latest with imagePullPolicy=IfNotPresent, so a rebuilt :latest would NOT be re-pulled on a rollout — the supervised entrypoint would never reach the running pod. Pin novnc to :`19d0f0933a` (the build of the prior commit; ghcr digest sha256:5b783ac6, == :latest) so the stack apply rolls the sidecar onto the new image. Future novnc entrypoint changes deploy by bumping this digest after build-chrome-service-novnc.yml publishes a new SHA tag. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 08:04:55 +00:00
Viktor Barzin	19d0f0933a	chrome-service: supervise x11vnc in noVNC sidecar so the VNC view self-heals The noVNC view at chrome.viktorbarzin.me went black: x11vnc (in the novnc sidecar) attaches to the browser container's Xvfb over localhost:6099, and when that container restarted (~8h ago, Chrome exited cleanly) x11vnc lost its X connection and exited. Because the entrypoint ran x11vnc as an unsupervised background child and then exec'd websockify as PID 1, the dead x11vnc was never relaunched — :5900 stayed dead (a defunct zombie), websockify kept returning 'Connection refused', and the view was black until a manual pod restart. Fix: the entrypoint now runs both x11vnc and websockify as supervised background children and exits non-zero via 'wait -n' if either dies, so the kubelet restarts the novnc container, which re-waits for Xvfb and relaunches x11vnc. The bridge now self-heals across browser-container restarts. Mirrors the android-emulator stack's supervision pattern. Architecture doc updated with the new failure mode, diagnosis, immediate-recovery, and SHA-pin deploy note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 08:03:29 +00:00
Viktor Barzin	abb15cd49d	devvm: personalize emo's cluster-health skill for ha-sofia emo cares about ha-sofia + his Sofia smart-home devices (Tuya, the MPPT ATS, the Барзини → Статус dashboard), and only about the cluster when it's breaking those. Rewrite his vendored cluster-health into an ha-sofia-focused, read-only variant: - leads with ha-sofia's in-cluster dependency chain (tuya-bridge + the cloudflared/Traefik/DNS/TLS reachability path), all checkable read-only; - fixes the script path to emo's own clone (/home/emo/code) — he can't read wizard's tree — and runs it --no-fix (he's cluster read-only); - loads emo's own HA token (see below) so the ha-sofia checks (26-29, 45) actually run for him; documents the host-SSH/Vault checks that skip; - triages: cluster FAIL/WARN matters only if on his chain; everything else is a one-line "admin's area"; escalate via /file-issue since he can't fix. This snapshot copy is now an emo-specific variant, intentionally diverged from the canonical 47-check admin skill — README updated to say "do not re-sync from canonical". Token: a dedicated long-lived HA token (client_name emo-cluster-health) was minted on ha-sofia via the admin account and stored emo-readable at /home/emo/.config/cluster-health/haos_token (600). It carries admin HA scope (HA only mints tokens for the authenticating account); independently revocable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 16:03:14 +00:00
Viktor Barzin	fc83595f5e	devvm: vendor cluster-health into per-user agent-skill snapshot Make cluster-health a user-global skill for emo (the lone entry in the provisioner's SKILL_USERS allowlist), so it's available from any directory — not only when working inside the infra clone where it already exists as a project skill (.claude/skills/cluster-health). install_skills() in t3-provision-users.sh copies the vendored snapshot into ~/.agents/skills/ and symlinks ~/.claude/skills/, so this is the durable, rebuild-surviving path. cluster-health is homelab-local (vendored from this repo's own .claude/skills/), unlike the other snapshot entries which mirror upstream mattpocock/skills + vercel-labs/skills; README documents its provenance and the explicit re-sync step so the vendored copy doesn't silently drift. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 15:20:19 +00:00
Viktor Barzin	fd33d1a447	monitoring: consolidate all Slack alerting to #alerts, abandon #security The dedicated #security Slack channel was unreachable: the shared incoming webhook (Vault secret/viktor -> alertmanager_slack_api_url) belongs to a Slack app that isn't a member of #security, so any channel override on it returns HTTP 404 channel_not_found. The goldmane-edges-digest was silently failing for that reason. Per request ("dump the security channel, post in an existing one"), route everything to #alerts instead: - alertmanager slack-security receiver -> #alerts (keeps its [SECURITY/<sev>] title styling so security-lane alerts still stand out in the shared channel) - goldmane-edges-digest CronJob SLACK_CHANNEL -> #alerts (comment only; value was already switched and applied last change) - AggregatorDown / DigestFailing alert summaries reworded to say #alerts - docs swept (security.md, monitoring.md, ADR-0014, goldmane runbook, .claude/CLAUDE.md, service-catalog, CONTEXT.md) to drop the "invite the app / flip back to #security" caveats and state the #security abandonment + #alerts consolidation as the current routing. Monitoring stack applied (alertmanager rolled, live config verified: slack-security channel is now #alerts). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 13:29:44 +00:00
Viktor Barzin	196d0db4bd	rbac/apiserver-oidc: back up the apiserver manifest OUTSIDE /etc/kubernetes/manifests The SSO restore script backed up the live manifest with `cp "$MANIFEST" "$MANIFEST.bak.$TS"` — i.e. INSIDE /etc/kubernetes/manifests/. The kubelet treats every file in that dir as a static pod, so the .bak became a SECOND kube-apiserver static pod. While both copies were identical it was harmless, but the instant `kubeadm upgrade` changed the real manifest's image to v1.35.6, the kubelet saw two same-named pods with different specs and flip-flopped (pod attempt count hit 13) — the new apiserver never stabilised, so kubeadm timed out on "static Pod hash did not change after 5m" and rolled back. THIS was the real cause of the 1.34->1.35 upgrade stalling for days (not etcd IO, which was a downstream symptom of the flip-flopping apiserver hammering etcd). Fix: write backups to a dedicated dir OUTSIDE the static-pod dir (/etc/kubernetes/apiserver-oidc-bak/) and read the rollback copy from there. The stray .bak that planted the landmine on 2026-06-18 was moved out manually 2026-06-26; this prevents the SSO script (and the upgrade chain's restore.sh, which is the same script) from ever re-creating it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 10:29:19 +00:00
Viktor Barzin	5d33327c30	postiz: repoint postgres-backup CronJob at CNPG (was failing on removed host) The postiz-postgres-backup CronJob still dumped from the chart's bundled `postiz-postgresql` host with a hardcoded `postiz-password`. That bundled PostgreSQL was removed when postiz migrated to the shared CNPG cluster, so the host no longer resolves (NXDOMAIN) and every nightly run failed — firing BackupCronJobFailed, and leaving the postiz DB with no logical dump in the offsite pipeline. Connect via the app's own DATABASE_URL (from the postiz-secrets Secret, postgresql://postiz:…@pg-cluster-rw.dbaas.svc.cluster.local/postiz) instead of a hardcoded host/user/password, so the backup tracks the live DB and credentials. Verified with a one-off test job: psql + pg_dump 16.4 connect to CNPG 16.9 and produce a 180K custom-format dump. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 09:34:42 +00:00
Viktor Barzin	1bca799bb4	monitoring: give kube-state-metrics a 512Mi memory limit (Burstable) kube-state-metrics had no explicit resources, so the monitoring-namespace LimitRange pinned it to requests=limits=256Mi (Guaranteed QoS). KSM idles around 45Mi but momentarily spikes past 256Mi during a full object relist (450+ pods, 150+ jobs, all secrets/endpoints) and gets OOMKilled. Each OOM blacks out the KSM-exported series that ~10 alert rules read, so they all fire false "<svc>Down" criticals at once and self-resolve when KSM recovers ~5 min later — exactly the alert storm seen at 2026-06-26 08:42 UTC. Set explicit Burstable resources: keep the request low (64Mi, just above idle) so we don't reserve memory we don't use, and raise only the limit to 512Mi to absorb the relist peak. No CPU limit, per the cluster-wide policy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 09:06:31 +00:00
Viktor Barzin	d105713ae7	fix(workstation): claude-auth-sync must merge, not overwrite, the shared Vault path cas_backup did `vault kv put secret/workstation/claude-users/<user>`, a full KV-v2 replace that rewrote the document with only its 3 OAuth keys. Because `homelab vault setup` co-locates the user's vaultwarden_* credentials on that same path, every six-hourly sync silently deleted them — so `homelab vault` reported "not configured" within hours of each setup. (Reported as: homelab vault "keeps getting reset / logged out", set up 3 times.) Switch the backup to a merge: `kv patch -method=rw` (read+update, needs no `patch` capability) when the path exists, and `kv put` only to create it on the first backup. Add a regression test with a fake vault asserting a pre-existing sibling key survives a backup, and document the merge requirement in the renewal runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 08:33:41 +00:00
Viktor Barzin	6f1951af93	fix(workstation): carry OS/sudo authz policy into managed-settings source + multi-tenancy doc ADR-0015's policy change was applied live to /etc/claude-code/managed-settings.json, but that file self-deploys from the repo source scripts/workstation/managed-settings.json via the hourly reconcile (sync_managed_config). Without updating the source the next reconcile would REVERT /etc to the old 'never read other homes' rule. This updates the source-of-truth claudeMd (now byte-identical to /etc) so the change is durable + canonical, and refresh_codex_mirror propagates it to every user's ~/.codex/AGENTS.md. Also notes the access-model change in the multi-tenancy architecture doc (pointer to ADR-0015). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 08:25:33 +00:00
Viktor Barzin	8121d8a4ac	docs(adr): add ADR-0015 (OS/sudo is the authorization boundary), supersede ADR-0011 privacy norm Viktor (owner) wants agents to stop refusing file reads the OS already permits. wizard holds passwordless root ((ALL) NOPASSWD: ALL), so the managed-settings rule 'never read another user's ~/.claude' was stricter than the OS itself. The managed-settings policy (/etc/claude-code/managed-settings.json) was updated out-of-band to defer to OS/sudo authorization with no extra prompt; backup kept at .bak-2026-06-26. This ADR records the decision, its symmetry across sudo-holders, and the larger blast radius. ADR-0011's usage-telemetry design is unchanged; only the cross-user privacy norm it referenced is superseded. The original ask was to delete ADR-0011 — superseded instead to preserve the audit trail and the ADR-0012/0013 references. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 08:22:29 +00:00
Viktor Barzin	ebc8b6588f	ESO: add force_conflicts to all ExternalSecret manifests (fleet sweep) The 2026-06-22 external-secrets v1 migration made the ESO controller the server-side-apply owner of .spec.refreshInterval on every ExternalSecret, so any stack defining one via kubernetes_manifest fails `terraform apply` with a field-manager conflict the next time it's applied (instagram-poster + grafana hit this on 2026-06-24; it was latent across the whole fleet). Add field_manager { force_conflicts = true } to all 101 remaining ExternalSecret manifests across 70 stacks, matching the fix already on grafana / woodpecker / traefik / k8s-version-upgrade / instagram-poster. TF and ESO set the same value, so it's stable (no perpetual drift). Defuses the landmine before each stack's next apply trips it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 21:28:11 +00:00
Viktor Barzin	6c5288998f	goldmane-trail: polish follow-ups #57/#59/#61/#62/#63 + digest→#alerts Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a subagent workflow (distinct stacks in parallel, docs last): - #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me, auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081 (the operator's whisker NP default-denies ingress). calico stack. - #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts (prometheus_chart_values.tpl) + cluster-health check #48. - #59 service-identity labels on the multi-Service namespaces (monitoring's 5 TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they update in-place. - #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog, security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md. #62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the edge table (feeds code-8ywc; enforce-flips out of scope). Also fixes the digest's Slack target: #security override 404s channel_not_found because the shared alertmanager_slack_api_url webhook's app isn't a member of #security (this likely also breaks alertmanager's slack-security receiver — flagged in the runbook). Routed to #alerts (the webhook's working channel) until the app is invited; verified a real digest run posts cleanly (360 edges). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 17:49:25 +00:00
Viktor Barzin	306cdd4cb3	state(dbaas): update encrypted state	2026-06-25 17:31:03 +00:00
Viktor Barzin	9c68d147e0	k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC) Digging into "why did the apiserver crash" disproved the earlier OIDC explanation. An isolated v1.35.6 apiserver repro with authentik reachable initialises OIDC cleanly (oidc.go:313, no error) and runs fine — so the --authentication-config -> --oidc-* revert is NOT what crashed it. etcd's surviving crash-window log is the real cause: 1180 "apply request took too long" warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the shared sdc HDD (beads code-oflt). A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before every etcd upgrade and never cleans it up — 145 dirs / 28GB had accumulated, driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live (73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate. Also corrected the OIDC handling: the kubeadm-config drift is real but only breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the chain's restore.sh + the kubeadm-config reconciliation) — it does not crash the apiserver. So the preflight check is now an ALERT, not a block (was added on the wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected. Per Viktor: reclaim the disk and automate so the manual cleanup never recurs; the durable IO fix remains code-oflt (etcd off the shared HDD). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 15:23:15 +00:00
Viktor Barzin	60a1cb9a25	k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade Last night's autonomous 1.34->1.35 run reached the master control-plane phase for the first time (preflight passed, etcd snapshot taken, etcd upgraded), then the kube-apiserver upgrade to v1.35.6 crash-looped and kubeadm auto-rolled-back to 1.34.9. The cluster stayed healthy but the master was left cordoned and the chain wedged on in_flight. Root cause: kubeadm upgrade regenerates the apiserver static-pod manifest from the kubeadm-config ConfigMap. apiserver auth was switched on 2026-06-19 to a structured multi-issuer --authentication-config (kubectl + dashboard SSO), but kubeadm-config still carried the legacy single-issuer --oidc-* extraArgs, so the regenerated manifest reverted structured auth and the new apiserver crash-looped. Proven via `kubeadm upgrade diff`. The existing post-upgrade OIDC restore step never ran because the upgrade itself never succeeded. Fix: - rbac/apiserver-oidc.tf: the remote script now also reconciles kubeadm-config (kubeadm init phase upload-config: drop --oidc-*, add --authentication-config) so a future kubeadm upgrade regenerates a correct manifest. Delivered to the cluster via the apiserver-oidc-restore ConfigMap the chain re-runs (CI needs no ssh key); trigger deliberately not script-hashed since CI cannot ssh. - k8s-version-upgrade/upgrade-step.sh: new preflight gate runs `kubeadm upgrade diff` and BLOCKS+alerts (never drains the master) if --authentication-config would still be dropped. - Post-mortem + runbook updated. The live kubeadm-config was reconciled directly on the master and verified (`kubeadm upgrade diff` now shows only the control-plane image bump), so tonight's run can complete the 1.34->1.35 upgrade. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 14:16:04 +00:00
Viktor Barzin	c6bba1da6e	home-assistant skill: refresh ha-london map (HAOS 2026.5.2, Cowboy revived, Overview redesign) Viktor asked to redesign the ha-london dashboards and fix the broken integrations (the Cowboy one). The skill's ha-london knowledge map had drifted badly from reality, so this brings it current: it claimed HA 2025.9.1 on a docker-run container (it's HAOS 2026.5.2, managed); listed the now-dead jdejaegh Cowboy integration with sensor.bike_* entities (revived via elsbrock/cowboy-ha v1.2.0 -> entities are sensor.classic_performance_*); and didn't flag that met/metoffice/roomba/hildebrandglow are user-disabled (not broken) or that Tapo P100 is failing. Also documents the redesigned Overview (Home+More sections, Mushroom+mini-graph-card), the dashboard/view/card glossary, the parked-bike 'unknown battery' gotcha, and that london is API-only from the Sofia devvm (no SSH). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 22:03:15 +00:00
Viktor Barzin	b858561bd0	Merge remote-tracking branch 'origin/master'	2026-06-24 20:59:39 +00:00
Viktor Barzin	a7704f46a6	deploy goldmane-edge-aggregator: durable who-talks-to-whom edge trail (#58, ADR-0014) Infra side of ADR-0014: an mTLS gRPC consumer of Calico Goldmane's Flows API that records the namespace-pair edge-set in CNPG and posts a daily new-edge digest to #security. Adds the goldmane-edge-aggregator stack, the pg-goldmane-edges Vault rotation role (Tier-0 vault state updated here), and the namespace in the ghcr-credentials allowlist. Cert: REUSES the operator-minted, Tigera-CA-signed whisker-backend client cert (Goldmane verifies only the CA chain, not identity) instead of minting from the Tigera CA private key. This avoids putting the CA key in TF state AND the hashicorp/tls provider, which is incompatible with this repo's global generate-providers/lockfile pattern (it broke every stack's lockfile). Verified live: aggregator streaming flows, 174 edges in Postgres across 50x54 namespaces, db+slack ExternalSecrets synced, digest dry-run formats correctly, private image pulls via the Kyverno-synced ghcr-credentials. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:59:39 +00:00
Viktor Barzin	aa510e3600	instagram-poster: force_conflicts on ESO manifests (fix apply) The ESO v1 migration (2026-06-22) made the external-secrets controller own .spec.refreshInterval via server-side apply, so terraform apply of the two ExternalSecret manifests fails with a field-manager conflict (Woodpecker #348), which blocked the replicas=0 scale-down from landing. Add force_conflicts=true to both, matching the grafana/woodpecker/traefik fix applied to other stacks the same day. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:49:53 +00:00
Viktor Barzin	53834deb24	instagram-poster: scale to 0 (unused, dead ExternalSecret) Viktor confirmed the Instagram Graph poster isn't used. Its ExternalSecret has been dead on missing Vault keys (ig_graph_long_lived_token, ig_business_account_id), so the deployment sat at 0/1 firing DeploymentReplicasMismatch. Setting replicas=0 stops the alert and makes the scale-down durable (a bare kubectl scale reverts on the next stack apply). Re-set to 1 after minting a Meta long-lived token + populating the Vault keys. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:45:30 +00:00
Viktor Barzin	8dd9a3978d	Merge remote-tracking branch 'forgejo/master' into wizard/homelab-vault	2026-06-24 12:25:52 +00:00
Viktor Barzin	65b2df1222	fix(monitoring): force_conflicts on grafana_db_creds ExternalSecret The external-secrets controller owns .spec.refreshInterval via SSA, so a plain terraform apply of the monitoring stack conflicts. Latent until 2026-06-24 (the homelab-vault loki-rules change was the first monitoring apply in a while and surfaced it). force_conflicts lets TF win — same pattern as woodpecker/traefik/k8s-version-upgrade stacks.	2026-06-24 12:25:36 +00:00
Viktor Barzin	1d0388da12	Merge remote-tracking branch 'origin/master'	2026-06-24 12:22:58 +00:00
Viktor Barzin	92361f36db	calico: enable Goldmane + Whisker (Calico 3.30 OSS flow observability) Turns on Calico 3.30's native east-west flow observability so we can see which Service talks to which (ADR-0014, issue #57). Enabled via the operator CRs directly (kubectl_manifest Goldmane + Whisker, name=default) rather than the Helm goldmane/whisker flags, because the goldmanes/whiskers CRDs already exist and this sidesteps the helm-upgrade CR-before-CRD ordering issue. Whisker notifications=Disabled so the UI doesn't call the external Tigera endpoint. Applied supervised: creating the Goldmane CR re-rendered calico-node with the FELIX_FLOWLOGSGOLDMANESERVER env (operator auto-wires Felix — no manual FelixConfiguration); calico-node rolled cleanly 7/7, tigerastatus healthy, goldmane is receiving flows from all nodes, Whisker UI serves. Durable Loki persistence is NOT included here: the Goldmane emitter is Calico Cloud/Enterprise-gated with no OSS knob to aim it at Loki (the CR can override only name+resources, not env), so a durable trail needs a small custom gRPC consumer of goldmane:7443 — tracked in issue #58. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 12:22:48 +00:00
Viktor Barzin	e711b2f971	feat(monitoring): homelab vault traceability alerts (TOTP-fetch + volume) Adds a Loki ruler group (lane=security -> #security) for the homelab vault op-log: VaultwardenTOTPFetched (every 2nd-factor fetch is visible) and VaultwardenFetchVolumeHigh (>100 fetches/10m backstop). The audit spine (Vault audit device, reads of secret/data/workstation/claude-users/*) is already captured. True CLI-bypass detection needs cross-stream correlation (follow-up).	2026-06-24 10:31:32 +00:00
Viktor Barzin	64104e56e9	feat(devvm): install Bitwarden CLI for homelab vault	2026-06-24 10:29:57 +00:00
Viktor Barzin	15643d1f44	feat(cli): bare `homelab vault` help command	2026-06-24 10:29:32 +00:00
Viktor Barzin	772aed5370	fix(cli): vault security review fixes C1 (critical): setup wrote the master password + API client_secret as `vault kv patch key=value` argv, leaking them via /proc/<pid>/cmdline to same-UID processes. Now written via stdin (key=- form); only email + client_id (non-credentials) remain in argv. I1: `get --json` refused on a TTY (was dumping the secret to scrollback). M1: vaultLock now holds the per-user flock (it mutates bw state). M4: bw login-detection parses status JSON instead of substring matching. M5: clipboard path refuses when stderr is not a TTY (was silently failing). M6: realRunner trims only trailing newline, preserving secret whitespace; secret prompts likewise. Adds security-property tests: no secret in argv across the get flow, clipboard decision matrix, --json TTY gate, bw status parsing.	2026-06-24 10:28:31 +00:00
Viktor Barzin	5a864cf19c	feat(cli): homelab vault setup onboarding (one-time, self-service)	2026-06-24 10:21:57 +00:00
Viktor Barzin	e20033855d	feat(cli): vault list/search/code/status/lock	2026-06-24 10:21:07 +00:00
Viktor Barzin	365340b37d	feat(cli): homelab vault get with TTY-aware return	2026-06-24 10:20:05 +00:00
Viktor Barzin	2dd12fc6be	feat(cli): vault session bootstrap with per-user flock + no-coredump	2026-06-24 10:18:36 +00:00
Viktor Barzin	5bae2a3907	feat(cli): privacy-aware vault op-log (process, never the secret)	2026-06-24 10:17:50 +00:00
Viktor Barzin	81122f8607	feat(cli): TTY-aware return + OSC52 clipboard with terminal gating	2026-06-24 10:17:13 +00:00
Viktor Barzin	06f4b87af1	feat(cli): vault bw engine env/arg builders + unlock	2026-06-24 10:16:19 +00:00
Viktor Barzin	cd44ca5921	feat(cli): vault creds loading from per-user Vault path	2026-06-24 10:15:32 +00:00
Viktor Barzin	6c53ee10b1	feat(cli): register homelab vault command group skeleton	2026-06-24 10:14:24 +00:00

4591 commits