Compare commits

90 commits

Author SHA1 Message Date
Viktor Barzin
cf42042cba monitoring: re-trigger apply to persist state after CI cancel-race
All checks were successful
ci/woodpecker/push/default Pipeline was successful
No-op comment touch in loki.tf to force a clean `terragrunt apply monitoring`.
The pfSense egress-monitoring apply (commit 7fe2d978, CI pipeline #414) was
cancelled by a newer push and SIGKILLed mid-helm-upgrade: the live resources
applied (probes green, rules loaded) but the Terraform state write and the helm
release finalize were lost, leaving the prometheus release stuck in
pending-upgrade (manually unstuck). This commit re-applies the unchanged
monitoring stack so state matches live, with zero resource changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:58:49 +00:00
Viktor Barzin
f92075b7c5 fire-planner: solve FIRE targets to age 100 (horizon 60→72)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor plans to live to 100, so the portfolio must last that long. The
fire-targets CronJob was solving a 60-year horizon (to roughly age 88); set it to 72
(retire ~age 28 → age 100). Raises every case's FIRE number modestly (more years
to fund). A one-off in-cluster job re-solves the existing rows at the new horizon.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:49:20 +00:00
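
The horizon arithmetic in f92075b7c5 can be checked in a few lines. This is a minimal sketch, assuming the retirement age of about 28 that the message implies; `end_age` is a hypothetical helper and is not part of fire-planner.

```python
# Hypothetical helper: the age the portfolio must last to, given a retirement
# age and a solve horizon in years. The retirement age (~28) is taken from the
# commit message; nothing here comes from the fire-planner code.
def end_age(retire_age: int, horizon_years: int) -> int:
    return retire_age + horizon_years


RETIRE_AGE = 28

old_end = end_age(RETIRE_AGE, 60)  # previous horizon
new_end = end_age(RETIRE_AGE, 72)  # new horizon
print(old_end, new_end)
```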
Viktor Barzin
7fe2d9780e monitoring: add pfSense WAN/egress alerting + probes
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for
~20 min while internal routing + Unbound stayed up; recovery needed a manual
reboot and NOTHING alerted — there was no egress probe and the cloudflared
replica metric stayed green. Add first-class egress monitoring so the next
occurrence pages in ~2 min instead of being noticed by a human.

- blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW
  so ICMP can use raw sockets).
- Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that
  failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 +
  1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers).
- Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable,
  InternetEgressDown (both providers dead), ExternalDNSResolutionDown,
  EgressOnlyDivergence (reuses the existing t3-probe legs — the incident's
  exact "external down while internal up" signature), PfSenseVMDown.
- Loki ruler: CloudflaredTunnelConnLoss — the canary that fired first; the
  cloudflared replica metric is blind to tunnel-connection loss. Threshold
  calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident).
- Alertmanager inhibit: WAN/egress-down suppresses the downstream egress
  symptom alerts so one root alert pages, not a storm.
- Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md.

All metric names + the cloudflared threshold verified against live
Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening
(dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred
and documented in the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:46:30 +00:00
Viktor Barzin
279b88d2bc docs: add MetalLB L2Status-immutable PG-VIP-flap post-mortem (code-aoxk)
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Post-mortem for the 2026-04/05 SEV3 where a stuck MetalLB ServiceL2Status
CR (immutable status.node) flapped the PG load-balancer VIP and silently
broke Tier-1 Woodpecker terragrunt applies for ~5 days (the wrapper error
"Cannot read PG creds" masked the real cause for ~25 days). Written when
the incident closed (beads code-aoxk, 2026-05-26) but never committed;
landing it so the RCA + stuck-CR cleanup procedure live in the repo.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:25:10 +00:00
Viktor Barzin
6f042ee239 fix(fire-planner): grafana fire-planner-pg datasource survives pw rotation
Some checks failed
ci/woodpecker/push/default Pipeline failed
The fire-planner-pg Grafana datasource baked the rotating fire_planner DB
password into its provisioning ConfigMap at terraform plan-time, so on every
7-day static-role rotation the password went stale and ALL fire-planner-pg
dashboards (fire-planner, cost-of-living, and the new wealth FIRE Countdown)
silently failed with "password authentication failed for user fire_planner"
until the next stack apply.

Switch to the same live-env pattern wealth-pg / payslips-pg already use:
- new ExternalSecret grafana-fire-planner-pg-creds (monitoring ns, Reloader
  match) mirrors the rotating Vault static-creds/pg-fire-planner password
- datasource ConfigMap now references $__env{FIRE_PLANNER_PG_PASSWORD}
- Grafana mounts it via envFromSecrets; reloader (auto) restarts Grafana on
  rotation so the provisioned datasource never goes stale

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:14:42 +00:00
Viktor Barzin
35c0057d83 chrome-service: raise noVNC sidecar memory limit 96Mi->256Mi (fix OOMKill)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The noVNC sidecar (x11vnc + websockify) was OOMKilled (exit 137) repeatedly
whenever someone actively opened chrome.viktorbarzin.me — the view connected
then froze/hung. Idle usage is ~37Mi, but x11vnc + websockify
framebuffer/encode buffers spike past the 96Mi cap when streaming the
1280x720 screen to a client. Raised request 32Mi->64Mi, limit 96Mi->256Mi
(Burstable, aux tier). Already applied live via a transient kubectl patch
(Recreate rollout, verified 0 restarts since); this lands the durable state
so the next apply / daily drift-detection doesn't revert it to 96Mi.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:39:17 +00:00
Viktor Barzin
2e50c1235c chrome-service: grant emo shared browser access (noVNC + homelab browser CLI)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to give emo access to the cluster's headed Chrome so he can fill
in forms and get past anti-bot / captcha pages. emo was deliberately locked
out of chrome-service (noVNC Authentik allowlist was Viktor-only + his
power-user RBAC has no pods/portforward). Viktor's explicit decision: SHARE
his existing browser rather than stand up an isolated per-user instance,
accepting that emo can therefore reach Viktor's warmed logged-in sessions
(CDP has no per-context auth, so the single shared persistent profile is
reachable by anyone who can drive the browser). emo's CLI use is hands-off
(his agent can run it unattended).

- authentik: add emo (emil.barzin / emil.barzin@gmail.com) to CHROME_ALLOWED
  so the admin-services-restriction policy admits him to chrome.viktorbarzin.me
  (noVNC). Reverses the prior Viktor-only lock; comment updated to record why.
- chrome-service/rbac.tf (new): emo-browser ServiceAccount + long-lived token
  (dashboard-sa.tf pattern), a chrome-service-portforward Role granting
  pods/portforward, and a cluster read-only binding (oidc-power-user-readonly)
  so the SA can resolve the Service and emo's normal read access doesn't regress.
- t3-provision-users.sh: install_browser_kubeconfig installs a dual-context
  kubeconfig for any user with a <user>-browser SA — SA token as the default
  context (non-interactive, works headless), personal OIDC retained as the
  oidc@homelab named context. emo's OIDC-only kubeconfig can't authenticate the
  headless agent session that homelab browser needs.
- docs/architecture/chrome-service.md: document the shared-browser multi-user
  access model, the session-exposure trade-off, and how to grant/revoke a user.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:20:07 +00:00
Viktor Barzin
50077b43d4 paperless-ngx: drop TASK_WORKERS 6->4 (6 OOMKilled the pod mid-import)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
6 OCR workers crept past the 8Gi per-container memory cap over ~6h and
OOMKilled paperless at 15:00 during the Emo bulk import. The import
auto-recovered (the consume dir lives on the PVC, so a restart re-scans
and reprocesses — nothing lost), but it left the queue inflated with
re-queued duplicates and spiked etcd on each restart.

The 8Gi cap is the shared edge-tier `tier-defaults` LimitRange, not worth
raising for one namespace. 4 workers fit with headroom (4 measured
~1.3Gi). Matches the value applied live via `kubectl set env` during
incident response; this removes the drift so the next apply keeps it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:06:46 +00:00
Viktor Barzin
8236ae309d postiz: reconcile HCL to live (adopt unmerged stack config), keep parked
All checks were successful
ci/woodpecker/push/default Pipeline was successful
postiz's live deployment (Helm + Temporal + Elasticsearch + Authentik
OIDC + static-DB password) came from the never-merged branch
`wizard/postiz-cnpg-oidc`, so master's HCL was stale and a `terragrunt
apply` would have DESTROYED the stack. This lands that postiz config to
master so HCL == state == live (CI green; destroy-landmine gone).

Kept PARKED (postiz + temporal replicas = 0): IG-via-postiz is Meta-
blocked (it hardcodes retired Instagram scopes → OAuth "Invalid Scopes"),
which is why it was parked; IG runs via the instagram-poster service. To
revive later: flip postiz `replicaCount` + temporal `replicas` back to 1
and re-check image pins.

Notes captured in this reconcile:
- ES image pinned to 7.17.28 (the branch's 7.17.24 was a DOWNGRADE vs the
  live data → ES refused to start "cannot downgrade node 7.17.28→7.17.24";
  caught + rolled back during this work).
- The 4 Authentik resources (app/provider/group/binding) were re-imported
  into state (adopted, not recreated — no duplicate AK objects); the
  obsolete `external_secret_jwt` ExternalSecret was removed (Retain → its
  synced secret was kept).
- Vault-side cleanup (removing the unused pg-postiz rotated role) is
  deliberately NOT included here — deferred, postiz uses a static
  secret/postiz database_url.

State was already reconciled by a local `scripts/tg apply`; this commit is
the HCL catch-up (CI re-apply is a no-op).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 12:54:59 +00:00
Viktor Barzin
250d0fc334 docs(authentik): document SFE forced-WebAuthn escape hatches (TOTP + social)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Old-browser users on the SFE who have a password but no MFA device hit the
default-authentication-flow's forced WebAuthn passkey enrolment, which the SFE
cannot render (the 'unsupported state: ak-stage-authenticator-webauthn' error).
emo (Google-only, iPadOS 15) hit this on the password path.

Document the two no-MFA-downgrade fixes: (1) social login, whose source flow
(default-source-authentication) has no MFA stage, so the SFE's social button
always completes; (2) enrolling TOTP, which the SFE can validate (unlike
WebAuthn) and which flips the MFA stage from force-enrol to validate. TOTP was
enrolled for emo and stored in his Vaultwarden authentik item; verified
end-to-end (a Bitwarden-generated code is accepted by authentik).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 12:24:40 +00:00
Viktor Barzin
e518ada3d4 authentik: repoint to overlay patch3 (all-iOS SFE + SFE social links) + docs
All checks were successful
ci/woodpecker/push/default Pipeline was successful
global.image -> 2026.2.4-patch3. Old iPad Chrome (and any iOS browser) now gets
the SFE too, and the SFE login shows social-login buttons (emo is Google-only with
no password, so the password form alone was a dead end). Docs: .claude/CLAUDE.md +
authentication.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:53:26 +00:00
Viktor Barzin
4fc09b7a61 Merge remote-tracking branch 'origin/master' into wizard/authentik-sfe-social
Some checks failed
Build Custom Authentik Image / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was canceled
2026-06-28 11:53:04 +00:00
Viktor Barzin
916516eeab authentik overlay patch3: SFE for ALL old iOS browsers + social-login links
Two follow-ups to patch2 (both in patch-compat-sfe.py, guarded):

1. compat_needs_sfe() now also serves the SFE to ANY iOS browser on iOS<=16.3,
   not just Safari. iOS Chrome/Firefox are WebKit skins (Apple mandate) reporting
   a non-Safari UA family, so the Safari-only check missed them and they still got
   the blank modern SPA. Added an os.family=="iOS" + version<=16.3 branch.

2. Inject static social-login <a> links (Continue with Google/GitHub/Facebook ->
   /source/oauth/login/<slug>/) into the SFE shell (flow-sfe.html). The SFE
   architecturally can't render Identification-stage sources (authentik docs), and
   emo's account (emil.barzin@gmail.com) is Google-only with NO password — so the
   SFE's username/password form was a dead end. The links are plain redirects that
   work on any browser. Slugs are static; re-verify on source changes.

Tag -> 2026.2.4-patch3; values repoint + docs land once GHA builds it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:53:03 +00:00
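
The two SFE branches described in f10bb71562 and 916516eeab can be sketched as below. This is a simplified stand-in, not authentik's code: the real `compat_needs_sfe()` works on a parsed User-Agent and also covers IE, old Edge, and PKeyAuth, which are left out here. The function name `needs_sfe`, its arguments, and the family strings are illustrative assumptions.

```python
# Simplified stand-in for the check described in the commits above. All names
# and the example family strings are assumptions made for illustration.
OLD_WEBKIT_MAX = (16, 3)


def needs_sfe(ua_family: str, ua_version: tuple, os_family: str, os_version: tuple) -> bool:
    # patch2 branch: old Safari itself.
    if ua_family in ("Safari", "Mobile Safari") and ua_version[:2] <= OLD_WEBKIT_MAX:
        return True
    # patch3 branch: any browser on old iOS is WebKit underneath, whatever
    # User-Agent family it reports.
    if os_family == "iOS" and os_version[:2] <= OLD_WEBKIT_MAX:
        return True
    return False
```

Comparing `(major, minor)` tuples avoids the string-comparison trap where "16.10" would sort before "16.3".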
Viktor Barzin
08bdf32aa0 feat(fire-planner): FIRE Countdown dashboard section + monthly target solve
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Add a "FIRE Countdown" section to the wealth Grafana dashboard plus a monthly
CronJob that computes the targets it reads.

Viktor wanted a £ countdown to retirement in today's money, per life-case
(Solo / Household / Family) and per country, with progress, a projected date,
runway, and his safety guardrails — so he can see how close he is to FIRE
(ideally lean) without ever coming back to work.

- wealth.json: new country / with_home / savings_per_year template vars + a
  per-Case row (target NW at the 99% GK bar, progress gauge, still-needed,
  projected FIRE date, runway) and safety-valve panels (re-entry trigger vs
  £1.0M, 2.5yr cash buffer, pension tranche @57, Anca-bridge note). Reads
  fire_planner.fire_target via the fire-planner-pg datasource (Mixed).
- fire-planner stack: fire-planner-fire-targets CronJob (monthly, 2nd 10:00
  UTC) runs `recompute-fire-targets --countries all`.

Targets come from the solver shipped in fire-planner edb4d11.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:52:17 +00:00
Viktor Barzin
6ba60cbb2d authentik: repoint to overlay patch2 (SFE for old Safari) + docs
All checks were successful
ci/woodpecker/push/default Pipeline was successful
global.image -> 2026.2.4-patch2 (adds the compat_needs_sfe SFE patch on top of the
SLOW-1a query patch). Old Safari/WebKit (<=16.3) now gets authentik's no-JS SFE
login instead of a blank page — fixes emo's iPadOS-15.8 iPad with no auth
downgrade. Docs: .claude/CLAUDE.md Authentik row + docs/architecture/authentication.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:39:29 +00:00
Viktor Barzin
5fb2004de5 Merge remote-tracking branch 'origin/master' into wizard/authentik-perf-fix
Some checks are pending
Build Custom Authentik Image / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
2026-06-28 11:38:07 +00:00
Viktor Barzin
f10bb71562 authentik overlay: serve the no-JS SFE login to old Safari (patch #2)
Old Safari/WebKit (<=16.3, e.g. iPadOS<=16.3) can't parse authentik's modern
ES2022 flow SPA and gets a COMPLETELY BLANK login — exactly what emo's iPadOS-15.8
iPad hit. authentik already ships a no-JS Simplified Flow Executor (SFE, ES5) and
serves it via compat_needs_sfe(), but only for IE/old-Edge/PKeyAuth. Extend that
to old Safari so those clients get the REAL authentik login (password + MFA +
reputation, identity preserved — NO auth downgrade, no new credential store).

Chosen over a Traefik basic-auth fallback after an adversarial review: that route
would put a single, spoofable-UA password in front of vbarzin->wizard (passwordless
root on the cluster-controlling devvm) — an MFA->single-factor path to cluster root.
SFE keeps full authentik auth and is generic for any old browser.

Shipped as patch #2 in the existing overlay image (patch-compat-sfe.py — guarded:
asserts the upstream anchor + ast-parses; verified against the live interface.py).
Tag -> 2026.2.4-patch2; the values repoint lands once GHA builds the image.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:38:05 +00:00
Viktor Barzin
ec681ba6e1 ci(infra): stop double-apply + stop counting PG lock-waits as failures
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The infra terragrunt-apply pipeline (.woodpecker/default.yml) was going
red ~20% of the time. Root causes (verified from the failure logs, not
guessed):

1. infra is registered in Woodpecker TWICE — canonical Forgejo (repo 82)
   AND legacy GitHub mirror (repo 1) — and BOTH run `default.yml` on every
   push. The two applies race each other for the per-stack PG state lock →
   "Error acquiring the state lock" failures + push-supersede "killed" runs.
2. The skip-not-fail lock guard only matched the Tier-0 Vault lock string
   ("is locked by"); the Tier-1 PG-backend lock ("Error acquiring the state
   lock") fell through and was counted as a hard FAILURE.
3. Transient provider-registry download timeouts (and Vault 5xx) failed the
   whole pipeline with no retry.

Fixes (all in default.yml):
- Forge guard: the push-apply runs ONLY on the canonical Forgejo forge; on
  the GitHub mirror it no-ops (exit 0). The mirror keeps running the crons
  (they live on repo 1), so we de-dup the apply without deactivating the
  registration. Fail-open on unknown forge.
- Lock-skip now matches BOTH tiers (Vault + PG) → lock-waits are SKIPPED.
- Bounded retry (3x) ONLY on transient signatures (provider download
  timeout, Vault 5xx). Config errors + helm atomic-timeouts fail fast.

Rejected (documented in docs/architecture/ci-cd.md): an off-infra GHA
validate gate (catches ~0 of the real, runtime/Vault-data/SSA/lock
failures; reproduced `terraform validate` passing the exact stacks that
fail at apply) and lock-reaping/force-unlock (PG advisory locks are
session-scoped + auto-release; force-unlock can't free them and would
corrupt a live concurrent apply).

Shell logic + the classification regexes were unit-tested locally against
the real decoded error strings (#359 PG lock, #353 provider timeout, #360
missing-arg, helm atomic timeout); `bash -n` clean; YAML parses.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:37:18 +00:00
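
The skip / retry / fail classification in ec681ba6e1 could look roughly like the sketch below. The two lock strings are quoted in the commit message. The transient patterns are hypothetical stand-ins, since the real regexes in `.woodpecker/default.yml` are shell and are not shown here. "3x" is read as three total attempts, which is also an assumption.

```python
import re

# Lock signatures quoted in the commit message.
LOCK_PATTERNS = [
    re.compile(r"is locked by"),                    # Tier-0 Vault lock
    re.compile(r"Error acquiring the state lock"),  # Tier-1 PG-backend lock
]
# Assumed transient signatures; the real patterns are not in the message.
TRANSIENT_PATTERNS = [
    re.compile(r"provider.*timeout", re.IGNORECASE),
    re.compile(r"vault.*\b5\d\d\b", re.IGNORECASE),
]


def classify(log: str) -> str:
    """Return "skip", "retry", or "fail" for one apply's combined output."""
    if any(p.search(log) for p in LOCK_PATTERNS):
        return "skip"
    if any(p.search(log) for p in TRANSIENT_PATTERNS):
        return "retry"
    return "fail"


def run_with_retry(apply, max_attempts: int = 3) -> str:
    """Call apply() -> (ok, log), retrying only on transient signatures."""
    for _ in range(max_attempts):
        ok, log = apply()
        if ok:
            return "success"
        verdict = classify(log)
        if verdict != "retry":
            return verdict  # "skip" or "fail": never retried
    return "fail"  # transient error persisted through every attempt
```

Checking lock patterns first means a lock-wait is skipped even if the same log also contains a transient-looking line.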
Viktor Barzin
69e35efd95 Merge remote-tracking branch 'origin/master' into wizard/vault-kv
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
2026-06-28 11:09:38 +00:00
Viktor Barzin
e03e4719ad vault: distinguish Vaultwarden vs HashiCorp Vault, add vault kv
`homelab vault` only spoke to Vaultwarden (the password manager), but the
name reads as HashiCorp Vault (the infra secrets store — actually OpenBao
here). Make the two unmistakable and support both.

Distinction (no breakage — the existing Vaultwarden verbs are unchanged):
- bare `homelab vault` help now LEADS with the two-stores split;
- every verb summary is tagged `[vaultwarden]` or `[hashicorp-vault]`;
- HashiCorp Vault/OpenBao lives under a clearly-named `vault kv` group.

New `vault kv` (HashiCorp Vault / OpenBao, the secret/… KV store):
- `kv get <path> [--field K]` — read; --field → one value (TTY-aware
  clipboard/stdout), no field → full secret JSON (refuses a bare TTY).
- `kv list <path>` — list sub-paths (no values).
- `kv put <path> <key>` — write one key; value via stdin (piped or
  no-echo prompt, never argv); creates the path or merges (never
  clobbers siblings; uses kv patch -method=rw so no `patch` cap needed).

Critical: `kv` uses the caller's OWN Vault token (OIDC ~/.vault-token /
$VAULT_TOKEN), NOT the per-user scoped Vaultwarden token (bound only to
claude-users/<user>, which would 403 elsewhere) — handlers set VAULT_ADDR
but never inject the scoped token. Access is whatever the policy grants.

Logic in cmd_vault_kv.go (pure cores extractKVData/parseKVList/arg
builders/kvGet/List/Put; file header documents the credential split).
CLI v0.11.0. Tests: no value in put argv, create-then-merge, KV-v2
envelope strip, help names both systems. Verified e2e against live Vault
(read key-names-only + a scratch put/merge/cleanup).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:09:33 +00:00
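
Two of the pure cores named in e03e4719ad, the KV-v2 envelope strip and the create-then-merge write, can be illustrated as follows. The sketch assumes the JSON shape printed by `vault kv get -format=json` for a KV-v2 mount (`data.data` plus `data.metadata`). The Python function names are hypothetical counterparts of the Go helpers, and the KV-v1 fallback is an added assumption.

```python
# Assumed KV-v2 envelope: {"data": {"data": {...secret...}, "metadata": {...}}}
def extract_kv_data(envelope: dict) -> dict:
    outer = envelope.get("data") or {}
    inner = outer.get("data")
    if isinstance(inner, dict) and "metadata" in outer:
        return inner  # KV-v2: strip the envelope
    return outer      # assumed KV-v1 style: "data" is the secret itself


def merge_key(existing, key: str, value: str) -> dict:
    """Create the secret if missing, otherwise set one key and keep siblings."""
    merged = dict(existing or {})  # copy, so the caller's dict is not mutated
    merged[key] = value
    return merged
```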
Viktor Barzin
460f2ad42f state(vault): update encrypted state
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-28 11:07:22 +00:00
Viktor Barzin
87a450e9a3 vault: grant emo full read/write on his own secret/emo tree
Viktor asked that emo be able to edit his own secrets with full access.
emo's personal-emo policy was read-only (read on data, read/list on
metadata), so he could view but not change his personal secrets.

Widen it to the same self-service capability set every namespace-owner
already has over their own tree: create/read/update/delete/list on
secret/data/emo(+/*) and list/read/delete on secret/metadata/emo(+/*).
Scope is unchanged — still only emo's own secret/emo subtree, still a
named exception that does not widen the power-user tier in general.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:07:22 +00:00
Viktor Barzin
a1cf7ccaf6 authentik: repoint to the SLOW-1a overlay image + un-enroll Keel
All checks were successful
ci/woodpecker/push/default Pipeline was successful
GHA built ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch1 (public, verified
anonymously pullable). Point global.image at it (repository + tag pinned
explicitly so neither helm's appVersion default nor Keel can downgrade it — the
2026-06-10 boot-storm class) and remove keel.sh/enrolled from the namespace so
Keel won't auto-bump the custom tag. authentik is now manual-upgrade: bump the
Dockerfile FROM + this tag together on each authentik version bump.

Net effect once rolled: the identification-stage query drops ~1.4s -> ~14ms, so
the cold login-flow first-load stops being slow. (Does NOT affect old-browser
clients — iPadOS<=15/Safari<=15.6 still can't run the SPA; that's unfixable
server-side.) Docs: .claude/CLAUDE.md Authentik row.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:46:21 +00:00
Viktor Barzin
7ec64ed5ff authentik: custom-image overlay to fix the 1.4s login-flow query (SLOW-1a)
Some checks are pending
Build Custom Authentik Image / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
The login flow's identification stage runs a bare select_subclasses() that
LEFT-JOINs every Source subtype table — ~1.4s server-side on every cold login
(verified live: 1527ms vs 14ms). Narrow it to only the subtypes that render a UI
login button (oauth/saml/plex/telegram/kerberos — not the sync-only ldap/scim),
via django-model-utils string accessors so no import is needed. Byte-identical
output, ~100x faster, robust to adding new login source types.

Shipped as a thin overlay over the official image (mirrors the diun/excalidraw
precedent): stacks/authentik/Dockerfile (FROM ghcr.io/goauthentik/server:2026.2.4
+ a guarded sed) built by .github/workflows/build-authentik.yml -> ghcr.io/
viktorbarzin/authentik-server:2026.2.4-patch1. The values repoint + Keel freeze
land in a follow-up commit once the image is built. Upstream bug still present in
main (no fix/PR) — drop this overlay once upstream narrows the query.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:42:58 +00:00
Viktor Barzin
12a45fa94e vault: bw sync on every read so reads show the latest values
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
`bw unlock` only decrypts the LOCAL cache, so a persisted (already
logged-in) session served stale data — a password changed in the web
vault wouldn't appear until the next fresh login. Add a best-effort
`bw sync` in openSession (the chokepoint every read shares: get, get
--all, list, code, status), so reads reflect current server-side values.

Best-effort by design: a transient sync failure warns on stderr and
falls back to the cached vault rather than failing the read (an AFK
agent shouldn't break on a network blip). status keeps its own explicit
sync so a reachability failure still surfaces in its report.

CLI v0.10.1. Tests assert the sync runs after unlock and before the read,
and that a read still succeeds when sync fails.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:19:54 +00:00
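
The best-effort ordering described in 12a45fa94e (unlock, then sync, then read, with a sync failure downgraded to a warning) might be sketched like this. `unlock`, `sync`, and `warn` are injected hypothetical callables standing in for the real `bw` invocations; this is not the CLI's Go code.

```python
import sys


def _warn_stderr(msg: str) -> None:
    print(msg, file=sys.stderr)


def open_session(unlock, sync, warn=_warn_stderr):
    """Unlock, then try to sync; a sync failure warns and falls back to the cache."""
    session = unlock()
    try:
        sync(session)
    except Exception as exc:  # best-effort: a sync error must not fail the read
        warn(f"warning: sync failed, using cached vault: {exc}")
    return session
```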
Viktor Barzin
3d948c7033 Merge remote-tracking branch 'origin/master' into wizard/upgrade-gate-held
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-28 10:09:42 +00:00
Viktor Barzin
2880fe1c29 docs: update k8s-version-upgrade runbook for actionable-vs-held gate
Reflect the classification change in the operational runbook: the gate's three
refusal classes (actionable/waiting/pinned), held wins on a mix, refusals now
Complete cleanly (no Failed Job), k8s_upgrade_held gauge + the deliberate
no-alert-for-held, the dropped K8sUpgradeChainJobFailed suppression clause, the
nightly report ⏸️ HELD outcome, and the detector's silent nightly re-evaluation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:09:34 +00:00
Viktor Barzin
eebb6c8594 k8s-upgrade: classify compat-gate blocks as actionable vs held; quiet the held case
The nightly upgrade chain detected 1.36, the preflight compat-gate refused it,
and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY
night — even though the block is unactionable (no kyverno/ESO release supports
1.36 yet, and gpu-operator is pinned to its current version because bumping it
needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor
asked to teach the checker to tell 'we can fix this' apart from 'nothing to do
but wait', and stop the nightly Failed-Job + alert noise for the latter.

compat-gate.py now classifies each blocker:
  - ACTIONABLE: a newer addon version in addon-compat.json supports the target
    -> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the
    nightly report).
  - WAITING-on-upstream: no released version supports the target yet -> held.
  - PINNED: addon marked pinned in the matrix (gpu-operator) -> held.
Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert.

Tidy the block path (Viktor's scope choice): deliberate gate decisions now make
the preflight Job Complete cleanly (HALT_CHAIN stops chain progression without a
non-zero exit), so they no longer create Failed Jobs. Dropped the now-obsolete
'unless k8s_upgrade_blocked==1' suppression from K8sUpgradeChainJobFailed. Gauge
is pushed definitively once per run (no 1->0->1 flap that re-notifies). The
detector re-spawns a refused-but-Complete preflight nightly (silently) so a
standing hold still re-evaluates, and only announces genuine new/Failed spawns.

nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class.
gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason).

Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned;
Calico the lone actionable piece) — no nightly Failed Job, no alert, just the
morning report's HELD line. Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:08:20 +00:00
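
The classification rule in eebb6c8594 reduces to a small decision function. The sketch below uses hypothetical blocker records, because the real schema of `addon-compat.json` is not shown. Exit codes 2 and 4 and the gauge names come from the message. Exit 0 for "no blockers", the zero values of the opposite gauge, and checking `pinned` first are assumptions.

```python
ACTIONABLE, WAITING, PINNED = "actionable", "waiting", "pinned"


def classify_blocker(blocker: dict) -> str:
    # Field names are hypothetical, chosen only for this illustration.
    if blocker.get("pinned"):
        return PINNED
    if blocker.get("newer_version_supports_target"):
        return ACTIONABLE
    return WAITING


def gate_decision(blockers: list) -> tuple:
    """Return (exit_code, gauges) for one preflight run."""
    classes = [classify_blocker(b) for b in blockers]
    if not classes:
        return 0, {"k8s_upgrade_blocked": 0, "k8s_upgrade_held": 0}
    if any(c in (WAITING, PINNED) for c in classes):
        # Held wins on a mix: fixing the actionable part alone would not unblock.
        return 4, {"k8s_upgrade_blocked": 0, "k8s_upgrade_held": 1}
    return 2, {"k8s_upgrade_blocked": 1, "k8s_upgrade_held": 0}
```

For the 1.36 case in the message (kyverno and ESO waiting, gpu-operator pinned, Calico actionable) this returns exit code 4 with only the held gauge set.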
Viktor Barzin
ccee443790 vault: add get --all to browse every field of an item
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
`homelab vault get` could only fetch one of five allow-listed fields and
had no way to see what fields an item even has — in particular it could
not reach arbitrary user-defined custom fields. Add a `--all` flag that
dumps the whole item as a normalized JSON object
(`{name, username?, password?, uris?, totp?, notes?, fields?}`), so a
Claude session can discover and read every field, custom ones included,
in a single call.

Security model preserved:
- Like `get --json`, the dump is all secret values, so it refuses a bare
  TTY (pipe it, e.g. `| jq`); the machine/agent path is stdout.
- The TOTP *seed* is reduced to a presence flag (`"totp": true`) and
  never emitted — the seed is more powerful than a one-time code, so the
  only seed-derived path stays the specially-audited `vault code`. Tests
  assert the seed and password-history never appear in the dump.
- Op-log uses a distinct `get-all` verb (item name still never logged) so
  a bulk dump is distinguishable from a single-field read.

`normalizeItem` is a pure, unit-tested core; `getItem` is the
session+fetch seam. CLI bumped to v0.10.0. Docs: README changelog,
onboarding runbook, design spec §16.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:01:49 +00:00
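
The normalization described in ccee443790 can be illustrated with a small pure function. The output keys are the ones listed in the message. The input shape (`login.username`, `login.totp`, `fields[]`, `passwordHistory`) is an assumption modelled on Bitwarden CLI item JSON, and rendering `fields` as a name-to-value map is a guess at the format. `normalize_item` is a hypothetical Python counterpart, not the Go `normalizeItem`.

```python
def normalize_item(item: dict) -> dict:
    login = item.get("login") or {}
    out = {"name": item.get("name", "")}
    if login.get("username"):
        out["username"] = login["username"]
    if login.get("password"):
        out["password"] = login["password"]
    uris = [u["uri"] for u in (login.get("uris") or []) if u.get("uri")]
    if uris:
        out["uris"] = uris
    if login.get("totp"):
        out["totp"] = True  # presence flag only; the seed is never copied
    if item.get("notes"):
        out["notes"] = item["notes"]
    fields = {f["name"]: f.get("value") for f in (item.get("fields") or []) if f.get("name")}
    if fields:
        out["fields"] = fields
    return out  # passwordHistory is deliberately not carried over
```

Building the output from an explicit allow-list of keys, rather than deleting sensitive keys from a copy, means a new upstream field cannot leak by default.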
Viktor Barzin
afcd463f39 k8s-upgrade: design doc for actionable-vs-held compat-gate classification
The nightly upgrade chain fails a preflight Job and raises K8sUpgradeBlocked
every night for the 1.36 target, even though the block is unactionable: no
kyverno/ESO release supports 1.36 yet and gpu-operator is deliberately pinned
(NVIDIA driver/Ubuntu coupling). Viktor asked to teach the checker to tell
'we can fix this' apart from 'nothing to do but wait', and stop the nightly
Failed-Job + alert noise for the latter.

This documents the design: classify each blocker as actionable / waiting-
upstream / pinned, keep the alert only for actionable, quiet the held case to
the nightly report, and make deliberate gate decisions Complete cleanly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:01:36 +00:00
Viktor Barzin
b3c419e108 Merge remote-tracking branch 'origin/master' into wizard/authentik-perf-fix
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-28 09:55:25 +00:00
Viktor Barzin
9a1ab6247b cli: add homelab edges — who-talks-to-whom investigation helper (v0.9.0)
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Makes the goldmane_edges east-west trail (ADR-0014) reachable during incident
investigations without remembering the DB/creds/SQL. New top-level verb:

  homelab edges --ns <ns>         edges touching <ns> (either direction)
  homelab edges --src/--dst <ns>  directional egress / ingress peers
  homelab edges --peers-of <ns>   distinct peer namespaces of <ns>
  homelab edges --new-since 24h   first seen since a duration or date (YYYY-MM-DD)
  homelab edges --denied          only action='deny' (blocked / lateral movement)
  homelab edges --json --limit N  machine-readable / row cap (default 200)

Filters render to a single read-only SELECT against the `edge` table, run via
the dbaas CNPG primary pod (same exec path as `k8s db`). Namespace values are
validated to the k8s name charset (injection guard) before they reach SQL.

TDD: edges_test.go covers flag parsing, query building (each filter, AND
combination, peers-of shape, JSON wrapper), the new-since duration/date parser,
and namespace-validation / injection rejection. Smoke-tested live: --peers-of,
--new-since 24h, --denied, and --json all return correct rows.

Docs: runbook query section now leads with the CLI; cli/README gains a v0.9
section. VERSION v0.8.2 -> v0.9.0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:51:41 +00:00
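
The validate-then-build flow in 9a1ab6247b might look like the sketch below. The `edge` table and `action='deny'` come from the message. The column names `src_ns` and `dst_ns` are assumptions. The sketch additionally passes values as `%s` placeholders, which is a swapped-in technique: the real CLI runs its query through a pod exec and may inline the validated values instead.

```python
import re

# RFC 1123 label: the charset Kubernetes requires for namespace names.
NS_RE = re.compile(r"[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?")


def validate_ns(ns: str) -> str:
    if not NS_RE.fullmatch(ns):
        raise ValueError(f"invalid namespace: {ns!r}")
    return ns


def build_query(ns=None, src=None, dst=None, denied=False, limit=200):
    """Build one read-only SELECT; src_ns/dst_ns are assumed column names."""
    where, params = [], []
    if ns is not None:
        where.append("(src_ns = %s OR dst_ns = %s)")
        params += [validate_ns(ns)] * 2
    if src is not None:
        where.append("src_ns = %s")
        params.append(validate_ns(src))
    if dst is not None:
        where.append("dst_ns = %s")
        params.append(validate_ns(dst))
    if denied:
        where.append("action = 'deny'")
    sql = "SELECT * FROM edge"
    if where:
        sql += " WHERE " + " AND ".join(where)
    sql += " LIMIT %s"
    params.append(int(limit))
    return sql, params
```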
Viktor Barzin
0fa5852ec6 homelab v0.8.2: fix memory recall truncating multibyte UTF-8 mid-character
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
emo's Claude Code sessions hit "UserPromptSubmit hook error" on almost every
prompt. Root cause: the homelab-memory-recall.py UserPromptSubmit hook runs
`homelab memory recall <prompt>` and strict-decodes its stdout. printMemories
truncated each memory's preview with a BYTE slice (c[:240]), which cuts through
the middle of a 2-byte Cyrillic character and emits invalid UTF-8 (a dangling
0xd0 lead byte). The hook's subprocess.run(text=True) then raised
UnicodeDecodeError — not caught by its `except (TimeoutExpired, OSError)` — so
the hook exited non-zero and Claude surfaced the error. It only affects
multibyte text (ASCII has no multibyte chars to split), so it bit emo (Bulgarian
prompts) every turn while English users almost never saw it.

Two-layer fix:
- cli: truncatePreview() now counts RUNES, not bytes, so the preview never
  splits a character. Regression test asserts valid UTF-8 on a long Cyrillic
  string. Fixes the root for every consumer of `memory recall` / `memory list`.
- hook: subprocess.run gains errors="replace" and the except is broadened to
  honor the script's own "best-effort, exit 0" contract — so a truncated or
  otherwise odd payload can never again surface as a hook error.
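
The failure mode and both layers of the fix can be reproduced in a few lines of Python. This is a sketch, not the CLI's Go code: the helper names are hypothetical, and the 240 limit is reused from the `c[:240]` slice quoted above.

```python
def truncate_bytes(text: str, limit: int) -> bytes:
    """Old behaviour: slice the UTF-8 encoding at a byte offset."""
    return text.encode("utf-8")[:limit]


def truncate_chars(text: str, limit: int) -> bytes:
    """Fixed behaviour: slice by code point first, then encode."""
    return text[:limit].encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# One ASCII byte followed by 2-byte Cyrillic letters, so byte offset 240
# lands between a lead byte and its continuation byte.
sample = "a" + "Здравей" * 50

broken = truncate_bytes(sample, 240)
fixed = truncate_chars(sample, 240)

# Hook-side safety net: a lenient decode never raises on a bad payload.
lenient = broken.decode("utf-8", errors="replace")
```

Here `broken` fails a strict decode while `fixed` does not, and `lenient` ends in U+FFFD instead of raising, which is the behaviour `errors="replace"` buys the hook.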

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:40:51 +00:00
Viktor Barzin
a3eb309e26 calico: fix empty Whisker UI — allow whisker egress to the kube-dns ClusterIP
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Real root cause of the 2026-06-28 "Whisker UI empty" incident (the watchdog
added in 8d1d2fb9 was treating a symptom). The tigera operator's own `whisker`
NetworkPolicy is policyTypes:[Ingress,Egress]; its egress allows DNS only to the
kube-dns *pods* (podSelector k8s-app=kube-dns). But whisker-backend resolves
goldmane.calico-system.svc via the kube-dns *ClusterIP* (10.96.0.10), and Calico
drops UDP DNS to a ClusterIP under a podSelector-only egress rule.

Verified in an isolated repro: from the whisker pod's netns, ClusterIP DNS = 100%
timeout while direct kube-dns pod-IP DNS = OK; a pod with no egress policy
resolves fine; a test pod with the operator's podSelector-only egress rule
reproduces the failure, and adding an ipBlock(ClusterIP) egress rule flips it to
100% ok. whisker-backend resolves goldmane once in the brief startup window
before the policy programs, holds its long-lived gRPC stream, and only
re-resolves when that stream breaks (e.g. a node-reboot blip) — then the blocked
ClusterIP DNS wedges its Go resolver and the UI goes empty. The durable
aggregator (separate pod, unrestricted namespace) was never affected.

Fix: additive egress NetworkPolicy whisker-allow-dns-clusterip
(whisker -> 10.96.0.10/32 on 53 UDP+TCP); k8s egress policies are additive so
the operator NP is untouched. The whisker-watchdog CronJob is kept as a backstop
(repurposed comment). Applied + verified: ClusterIP DNS from the whisker netns
now 8/8 ok, whisker-backend 0 errors, flow API returns 828 flows / the namespace
list. Docs (runbook + CLAUDE.md) updated to the real root cause.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:32:28 +00:00
Viktor Barzin
385dfff0e7 authentik: fix episodic blank-screen + 30s-hang login (reliability R2)
The login screen would sometimes hang/blank for everyone for ~30s at a time.
Root-caused: the readiness probe (/-/health/ready/) queries the DB, and on a
transient PG/pgbouncer blip it 503s; with the chart-default ~30s tolerance all 3
goauthentik-server pods dropped out of the Service at once, so Traefik had no
healthy backend -> 502/503/504. Compounded by a silent drift: the repo set the
rollout strategy under `strategy:`, but the chart reads `deploymentStrategy:` —
so live ran the chart-default 25%/25% and dropped a pod out of rotation on every
roll. (Redis was removed upstream in authentik 2026.2, so sessions+cache are on
PostgreSQL and request-serving is coupled to PG — verified there is no
external-cache option to put back, so a SHORT transient is now survived but a
total CNPG outage still takes authentik down.)

Reliability package (R2, approved):
- readinessProbe.failureThreshold 3->8 (~80s) — absorbs a full CNPG failover
  reconnect without dropping the whole fleet from the Service.
- rename server+worker `strategy:` -> `deploymentStrategy:` (the real chart key)
  and set maxSurge:1/maxUnavailable:0 so a roll never dips below 3 ready.
- gunicorn AUTHENTIK_WEB__MAX_REQUESTS 1000->10000 / JITTER 50->1000 so the 9
  workers' recycles don't cluster on a DB blip.
- / and /static ingresses switch to the dedicated authentik-rate-limit (100/1000)
  from the previous commit (skip_default_rate_limit) — fixes the cold-load 429
  blank screen.
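
The probe numbers above follow from simple arithmetic, sketched here in Python. The 10s probe period is an assumption inferred from the "~30s" and "~80s" figures; it is not stated in this message.

```python
def readiness_tolerance_seconds(failure_threshold: int, period_seconds: int = 10) -> int:
    """Approximate time a pod may fail readiness before leaving the Service."""
    return failure_threshold * period_seconds


before = readiness_tolerance_seconds(3)  # 30
after = readiness_tolerance_seconds(8)   # 80
```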

Liveness intentionally left DB-coupled-but-shallow (LiveView always returns 200,
so it can't kill a DB-blocked pod). CONN_MAX_AGE intentionally NOT set (pins the
pgbouncer pool, reverted 2026-06-10). Docs: .claude/CLAUDE.md + authentication.md
(also corrected a stale "60s persistent DB connections" note).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:17:05 +00:00
Viktor Barzin
b84b0021c2 authentik: dedicated rate-limit carve-out + per-router 5xx observability
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Unauthenticated users were getting a blank login screen (and the screen would
sometimes just hang). Root-caused via a read-only fan-out + adversarial verify:
the login SPA cold-loads ~70 flow-executor JS/CSS chunks from /static through
the SHARED 10/50 Traefik limiter, so a fresh/empty-cache load 429s the tail and
a failed ES-module import aborts SPA bootstrap -> permanent blank. authentik was
the only first-party SPA still on the default limiter (8 siblings already have a
carve-out). NAT-shared clients trip it especially easily (shared per-IP bucket).

- traefik: new `authentik-rate-limit` Middleware (average 100 / burst 1000,
  mirroring the existing health/tripit carve-outs). The authentik / and /static
  ingresses switch to it in the authentik-stack commit.
- monitoring: the `traefik` scrape job's drop-regex was a blanket
  `traefik_router_.*`, which also dropped `traefik_router_requests_total` — so
  per-router 4xx/5xx (incl. 429/503) was neither queryable nor alertable.
  Narrowed it to keep the counter while still dropping the high-cardinality
  `*_duration_seconds_bucket` histogram, and added `AuthentikRootRouter5xxHigh`
  for the episodic all-3-server-pods-NotReady 502/503/504 cascade.

Docs updated (networking.md rate-limit list, .claude/CLAUDE.md). GitOps CI applies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:10:34 +00:00
Viktor Barzin
65a09dcbc4 docs(homelab-vault): rebuild snippet uses cli/VERSION, not git describe
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The onboarding runbook's "rebuild the binary" command stamped the version
from `git describe --tags --always`, but setup-devvm.sh stamps it from
`cli/VERSION`. The v0.8.1 tag is no longer reachable from master, so the
describe form silently produced a bare commit sha — diverging from what a
provisioner reconcile stamps. Match the canonical source.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:05:49 +00:00
Viktor Barzin
c53e7839e1 Merge remote-tracking branch 'origin/master' into wizard/vault-addr-default
Some checks failed
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was canceled
2026-06-28 09:04:43 +00:00
Viktor Barzin
0525f0b12d homelab vault: self-default VAULT_ADDR + prefer scoped token over ~/.vault-token
Setting up emo's Bitwarden access via `homelab vault`, his one-time
`homelab vault setup` failed with an opaque "exit status 2". Two latent
CLI bugs, both of which any non-admin AFK invocation can hit:

1. The CLI set VAULT_TOKEN but never VAULT_ADDR, relying on the ambient
   value. It IS in /etc/environment (login shells), but emo runs his
   agents from long-lived tmux / non-login shells that never sourced it,
   so every `vault` child hit the 127.0.0.1:8200 default -> connection
   refused. claude-auth-sync already self-defaults VAULT_ADDR; the CLI
   now does the same.

2. Token precedence was env > ~/.vault-token > scoped. A power-user who
   ran `vault login -method=oidc` carries a read-only ~/.vault-token
   (policy `default`, capability `deny` on their workstation path), which
   shadowed the purpose-built scoped token -> 403 permission denied on
   the user's OWN path. This tool only ever touches
   secret/workstation/claude-users/<user>, which the scoped token covers
   exactly, so precedence is now env > scoped > ~/.vault-token. Verified
   the scoped tokens for both emo and wizard hold create/read/update on
   their own paths, so admins are unaffected.
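
A rough Python sketch of the two defaults described in (1) and (2). The CLI itself is not written in Python, `DEFAULT_VAULT_ADDR` is a placeholder, and `resolve_vault_env` and `read_token` are hypothetical names; the scoped-token path is the one used by claude-auth-sync.

```python
from typing import Callable, Mapping, Optional

DEFAULT_VAULT_ADDR = "https://vault.example.internal"  # placeholder, not the real address
SCOPED_TOKEN = ".config/claude-auth-sync/vault-token"  # relative to $HOME
AMBIENT_TOKEN = ".vault-token"                         # relative to $HOME


def resolve_vault_env(
    env: Mapping[str, str],
    read_token: Callable[[str], Optional[str]],
) -> dict:
    """Pick VAULT_ADDR and VAULT_TOKEN for the shelled `vault` child.

    Precedence for the token: env > scoped > ~/.vault-token.
    """
    addr = env.get("VAULT_ADDR") or DEFAULT_VAULT_ADDR
    token = (
        env.get("VAULT_TOKEN")
        or read_token(SCOPED_TOKEN)
        or read_token(AMBIENT_TOKEN)
    )
    if not token:
        raise RuntimeError("no Vault token found (env, scoped, or ~/.vault-token)")
    return {"VAULT_ADDR": addr, "VAULT_TOKEN": token}
```

With both token files present and neither variable set, the scoped token wins and the address falls back to the built-in default, which is the case emo hit.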

Also stop swallowing the shelled `vault`/`bw` stderr: errors now carry
the real message (connection refused / permission denied) instead of a
bare "exit status N" — without that, (1) and (2) were indistinguishable.

Verified end-to-end as emo (VAULT_ADDR unset + his read-only
~/.vault-token present): writeCreds now succeeds.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:04:28 +00:00
Viktor Barzin
8d1d2fb999 calico: add whisker-watchdog CronJob to self-heal a wedged whisker-backend
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Whisker showed an empty UI on 2026-06-28. Root cause: whisker-backend dials
goldmane:7443 over a long-lived gRPC stream; when that stream dropped during a
transient CNI/DNS blip (right after k8s-node5 finished its v1.35.6 upgrade, its
pod resolver briefly timed out on the kube-dns ClusterIP) the Go gRPC resolver
got WEDGED — spamming "failed to stream flows" / "code = Unavailable: dns ...
i/o timeout" forever, never reconnecting. The operator ships whisker-backend
with NO liveness probe, so nothing restarted it; the live UI stayed blank until
a manual `kubectl delete pod`. (The durable aggregator is a separate pod and
was unaffected — only Whisker's ~60-min live view went dark.)

Whisker is operator-managed (Whisker CR), so we can't inject a liveness probe.
Instead add a watchdog so this never needs a manual restart again:
- whisker-watchdog CronJob (every 10 min) + least-privilege SA/Role/RoleBinding
  (calico-system only: pods get/list/delete, pods/log get).
- It restarts the whisker pod only when whisker-backend logs >=10 goldmane-
  connection errors in 11m AND Goldmane is Ready (the Goldmane-Ready guard
  avoids restart-thrash during a real Goldmane outage).
- Self-tested: a manual run reports "whisker-backend healthy: 0 ... errors"
  and does not restart.
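
The restart decision boils down to two conditions, sketched below in Python. The actual CronJob script is not shown here; the function names and the exact log substring matched are assumptions, while the threshold of 10 and the Goldmane-Ready guard come from the bullets above.

```python
from typing import Iterable

ERROR_MARKER = "failed to stream flows"  # assumed match string, taken from the quoted log line


def count_goldmane_errors(log_lines: Iterable[str]) -> int:
    """Count whisker-backend log lines that report a goldmane stream failure."""
    return sum(1 for line in log_lines if ERROR_MARKER in line)


def should_restart(error_count: int, goldmane_ready: bool, threshold: int = 10) -> bool:
    """Restart only when errors pile up AND Goldmane itself is healthy.

    If Goldmane is down, restarting whisker cannot help and would only thrash.
    """
    return goldmane_ready and error_count >= threshold
```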

Docs: runbook gains a "Whisker UI empty" troubleshooting entry + a self-heal
note; the stale 2026-06-25 "digest never posted" known-state block is updated
to Resolved (digest posts to #alerts, lastSuccessfulTime current); CLAUDE.md
flow-trail bullet gains the whisker-wedge gotcha.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 08:59:31 +00:00
Viktor Barzin
c70810a51b workstation: per-user long-lived Claude token to end concurrent-refresh logout
All checks were successful
ci/woodpecker/push/default Pipeline was successful
A heavy user (emo) runs 8+ always-on `claude` agents + their t3-serve instance,
all sharing one ~/.claude/.credentials.json. When the shared access token expires
the processes refresh simultaneously; OAuth refresh-token rotation makes the
losing writer persist an EMPTY refresh token, logging the user out roughly every
access-token lifetime (~8h). Re-issuing the credential never sticks — the race
recurs (this is why emo's "standalone token" fix kept regressing).

Fix: an opt-in, per-user, non-rotating setup-token (sk-ant-oat01, ~1y, scope
user:inference) kept in the user's OWN Vault path (field `setup_token`).
claude-auth-sync materializes it to a user-owned
~/.config/claude-auth-sync/claude-oauth.env and, while it is present, SKIPS the
rotating-credential validate/backup/restore (so no false
WorkstationClaudeAuthInvalid). start-claude.sh and t3-serve@.service load it as
CLAUDE_CODE_OAUTH_TOKEN, so every session of that user uses the non-rotating
token and there is nothing to race on.

Fail-safe + opt-in: with no `setup_token` in Vault, every path is a no-op, so
users on the normal per-user Enterprise-SSO flow are unaffected. This is each
user's OWN identity, never the forbidden shared CLAUDE_CODE_OAUTH_TOKEN. Runbook
documents enable/disable/rotate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 08:07:43 +00:00
Viktor Barzin
3cc8f9f661 paperless-ngx: keep mem limit at 8Gi (tier LimitRange caps containers)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The prior commit set the limit to 10Gi, but the shared tier-defaults
LimitRange caps per-container memory at 8Gi, so the rollout's new pod was
forbidden (FailedCreate) and paperless was briefly down. 8Gi is ample for
6 workers anyway (4 workers measured ~1.3Gi under full OCR load). Restored
service live via kubectl patch; this commit matches TF to the live 8Gi so
drift detection won't re-revert it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 19:37:59 +00:00
Viktor Barzin
21d20dccf8 paperless-ngx: bulk-import via PVC consume dir (restart-safe) + 6 workers
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Emo's ~13.7k-document import was going through the API upload path, which
stages each file on the pod's EPHEMERAL scratch before queuing it. Any
paperless pod or redis restart therefore destroyed all in-flight work
(the "File not found" failures we hit) and required manual re-uploads.

Move bulk ingest to paperless's consume directory placed on the encrypted
PVC, with PAPERLESS_CONSUMER_POLLING so the whole folder is re-scanned
periodically (and on startup) with a file-stability check. Files now live
on durable storage and survive any restart — the folder is the queue and
self-heals, so we can copy everything in fast and let it process over
time with zero retry/integrity risk. RECURSIVE preserves the source tree
(avoids basename collisions); owner+tag come from a consumption workflow.

Bump TASK_WORKERS 4->6 to speed the OCR/convert-bound processing (node6
has the core headroom for one pod) and mem limit 8->10Gi for the extra
workers. Revert workers/mem/consume envs to defaults once the import ends.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 19:35:10 +00:00
Viktor Barzin
2cb37d51d4 paperless-ngx: scale Gotenberg x3 + Tika x2, 4 workers, skip-archive — speed the Emo import
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Bottleneck found: single Gotenberg 503s under concurrent workers (office docs
failing + slow). Cluster is otherwise idle (sdc 0.5% util, etcd ~1/min), so:
- Gotenberg 1->3 + Tika 1->2 (Service load-balances; fixes the 503s, parallel
  office conversion).
- paperless TASK_WORKERS 2->4, THREADS_PER_WORKER 2->1, mem limit 4->8Gi (avoid
  OOM with 4 concurrent OCR). Requests kept low to stay within tier-quota
  (requests.memory 3840/4096Mi).
- PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text: skip redundant archive for born-
  digital/office docs (big IO saver for the work-doc set).
Guard + etcd watch stay in place; revert to defaults after the import.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 18:45:25 +00:00
Viktor Barzin
d6bd9486e3 Merge remote-tracking branch 'origin/master' into wizard/portal-onboarding-paths
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build k8s-portal / build (push) Has been cancelled
2026-06-27 16:34:44 +00:00
Viktor Barzin
fca948a23d k8s-portal: document all three cluster-access paths in onboarding
The Getting Started portal only walked through the heaviest path (local VPN +
kubectl + Vault + sops install) and never mentioned the two zero-setup routes
that users actually reach first. Restructure onboarding to lead with all three,
recommendation first:
- (A) the t3 web terminal, which drops you into a ready shell with
  kubectl/Vault/repos preinstalled;
- (B) the k8s web dashboard, auto-authenticated per user; and
- (C) the existing own-machine setup.

Flag the dashboard/terminal as the fallback when CLI OIDC login is unavailable,
reframe the misleading home-page 'VPN required' banner (only path C needs it),
add the access endpoints to the service catalog, and fix a stale Vaultwarden URL
(was vault.viktorbarzin.me, which is actually HashiCorp Vault; Vaultwarden is
vaultwarden.viktorbarzin.me).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 16:34:36 +00:00
Viktor Barzin
9599beadc9 paperless-ngx: 2 task workers + 2 threads/worker + 4Gi limit for the Emo bulk import
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Emo's ~13.7k-doc import is OCR-bound on a single celery worker (~10s/doc =
multi-day). Bump PAPERLESS_TASK_WORKERS=2 + THREADS_PER_WORKER=2 for ~2x
throughput, and the memory limit 2Gi->4Gi to fit two concurrent OCR jobs.
Kept deliberately modest: archive writes hit the shared sdc HDD that etcd
also lives on (IO-storm risk, code-oflt) — watch etcd apply latency and
revert workers to 1 if it degrades. Revert to defaults once the import is done.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 16:33:43 +00:00
d4f564e8d5 Merge pull request 'docs(ci-cd): plotting-book build→ghcr→deploy flow diagram' (#16) from wizard/plotting-doc into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-27 15:50:02 +00:00
Viktor Barzin
0097bddf9f docs(ci-cd): add plotting-book build→ghcr→deploy flow diagram
ASCII flow of the migrated plotting-book pipeline (GHA build in Anca's
repo → private ghcr.io/passionprojectsanca/book-plotter → Woodpecker
redeploy hook → in-cluster pull via ghcr-credentials), plus the Kyverno
admission / Keel backstop / Vault pull-cred supporting cast and the
serving path. Appended to the existing plotting-book section.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:49:58 +00:00
Viktor Barzin
bbc797b30e ci(woodpecker): stop applying/planning the Tier-0 vault stack in CI
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The nightly drift-detection cron and every vault-touching push apply have
been failing because CI runs terragrunt plan/apply on the Tier-0 `vault`
stack, which manages Vault's own transit mount + ACL policies. The CI
`ci` Vault role intentionally lacks those admin perms (sys/mounts,
sys/policies/acl), so the run always errors:
  - apply: 403 on vault_mount.transit + vault_policy.personal_emo, plus an
    Invalid for_each (local.k8s_users from secret/platform is deferred)
  - drift: terragrunt plan exits 1 → fails the whole nightly run

vault is Tier-0 = human-applied via OIDC. Skip it in both pipelines:
- default.yml: skip `vault` in the platform-apply loop (kept in
  PLATFORM_STACKS so the app-stack detector still excludes it)
- drift-detection.yml: skip `vault` in the per-stack plan loop
- ci-cd.md: document the exclusion on both pipeline rows

Found during a CI-health sweep (user reported many failures): GitHub
Actions all green; all Woodpecker repos green except this recurring
infra-repo failure, doubled by the legacy repo-1 + repo-82 dual
registration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:48:41 +00:00
81c2b14e29 Merge pull request 'plotting-book: pull image from private ghcr instead of public DockerHub' (#15) from wizard/plotting-ghcr into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-27 15:32:35 +00:00
Viktor Barzin
c13a3f1694 plotting-book: pull image from private ghcr instead of public DockerHub
Anca's plotting-book app now builds its image in her own GitHub repo to
the private package ghcr.io/passionprojectsanca/book-plotter (off public
DockerHub viktorbarzin/book-plotter). Wire the cluster to pull it:

- stacks/plotting-book: point the deployment baseline image at the ghcr
  package and add imagePullSecrets {ghcr-credentials} so the pod can pull
  the private image (the live tag is still CI-owned via ignore_changes).
- stacks/kyverno: add the plotting-book namespace to the ghcr-credentials
  allowlist so the Kyverno generate policy clones the pull secret into it.
  Verified the shared ghcr_pull_token (Viktor, repo-admin on Anca's repo)
  can read the private package before wiring this.

Docs: correct ci-cd.md (it wrongly listed plotting-book as already on
ghcr — it was on DockerHub) and note the special arrangement; amend
ADR-0003 to record that this GitHub-first repo builds to its own org's
ghcr namespace.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:32:19 +00:00
Viktor Barzin
bf40409141 docs(security): note crowdsec-cf-sync rate-limit resilience
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Document the backoff_limit=0 + CF-429 soft-skip hardening alongside the
cf-sync architecture description, with the why (the backoff_limit=2
retry-storm that escalated Cloudflare's Lists-API throttle into a stuck
state). Follow-up to 5b49634f.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:27:44 +00:00
Viktor Barzin
5b49634fe0 rybbit/crowdsec-cf-sync: stop Cloudflare Lists-API retry-storm (429 self-DoS)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The edge-ban sync was failing every 2 min on Cloudflare HTTP 429
(rate-limited) and never recovering, leaving the crowdsec_ban list empty.

Root cause: backoff_limit=2 made k8s re-run a failing pod up to 3x within
seconds, so each */2 cycle fired a burst of POSTs into Cloudflare's
per-60s Lists-API write limit. That kept the throttle perpetually tripped
(it stopped clearing even after minutes of quiet) — a self-inflicted DoS.

Two changes make the sync gentle and self-healing:
- backoff_limit 2 -> 0: one attempt per */2 cycle (the schedule IS the
  retry cadence), no rapid-fire burst.
- lapi_kv_sync.py: treat a CF 429 as a soft-skip (exit 0, retry next
  cycle) like the existing LAPI fail-safe, instead of fail-loud + k8s
  retry. Any other CF error still fails loud.
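
Both changes can be sketched in Python as follows. This is an illustration only: `lapi_kv_sync.py` itself is not shown, and the function names are hypothetical.

```python
def attempts_per_cycle(backoff_limit: int) -> int:
    """A Job with backoffLimit N may run its pod up to N + 1 times."""
    return backoff_limit + 1


def exit_code_for_cf_status(status: int) -> int:
    """Map a Cloudflare Lists-API response status to the sync's exit code."""
    if 200 <= status < 300:
        return 0  # success
    if status == 429:
        # Rate-limited: soft-skip and let the next scheduled cycle retry.
        return 0
    return 1  # any other Cloudflare error still fails loud


burst_before = attempts_per_cycle(2)  # 3 attempts per cycle
burst_after = attempts_per_cycle(0)   # 1 attempt per cycle
```

Treating 429 as success keeps the Job from being marked failed for a condition that only waiting can clear.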

Found during a cluster health check (AIOStreams CSI + pfSense SSH issues
handled separately).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:23:42 +00:00
Viktor Barzin
7c72368243 state(vault): update encrypted state
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-27 13:54:23 +00:00
Viktor Barzin
f92ab04dae vault: grant emo read-only access to his own secret/emo
emo (power-user tier) had no Vault policy granting his personal secret
path, so `vault kv get secret/emo` failed. Viktor asked to give him that
access. Adds a read-only `personal-emo` policy (read on secret/data/emo +
metadata) and attaches it to emo's OIDC identity by adopting the
entity/alias Vault auto-created on his first login. Scoped explicitly to
emo; does not widen the power-user tier (which stays secret-less).

Verified live: a personal-emo token reads secret/emo, is denied writes,
and is denied other paths (secret/viktor -> 403).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 13:35:57 +00:00
Viktor Barzin
90f5425cdc state(vault): update encrypted state
2026-06-27 13:33:34 +00:00
a7117e0bfe immich(frame-emo): bump photo-frame Interval 30->45s
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Permissions-test change requested by Viktor: slow Emo's Sofia photo-frame
slideshow from 30s to 45s per image.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 13:07:00 +00:00
Viktor Barzin
d50962b00e immich: add Immich photo-frame for Emo's Portal (highlights-immich-emo)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Second ImmichFrame instance cloned from the London frame (frame.tf), scoped to
Emo's Immich account (emil.barzin) with Sofia weather coords and last-2-years
photos. Drives Emo's Meta Portal Mini in Sofia via the portal-immich-frame app.
Dedicated API key minted on Emo's account and stored in Vault (secret/immich ->
frame_api_key_emo).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 12:40:29 +00:00
Viktor Barzin
e8b72019b5 paperless-ngx: deploy Tika + Gotenberg for Office ingest + raise PVC ceiling to 80Gi
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Emo's import scope now includes his work-PC document set (C/Documents,
Project Management, Service & MRO, etc. on the NAS), which is ~4.9k Office
files (.doc/.docx/.xls/.xlsx/.ppt/.pptx) on top of Emo shared. Paperless
can only archive/OCR/index those if it can convert them, so add the standard
Apache Tika (text+metadata) + Gotenberg (-> PDF) sidecar deployments + their
services in the paperless-ngx namespace and point PAPERLESS_TIKA_* at them.
Pinned images (gotenberg 8.25, tika 3.3.1.0), single replica, no PVC.

Total in-scope document set across all NAS locations is now ~13,700 PDF+Office
files / ~13.7GB source (~30GB once OCR'd + archived), so raise the data PVC
autoresize ceiling 30Gi -> 80Gi for comfortable headroom. The topolvm
autoresizer grows on demand up to the ceiling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 12:02:04 +00:00
Viktor Barzin
041aedc486 Merge remote-tracking branch 'origin/master' into wizard/paperless-emo
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-27 08:17:28 +00:00
Viktor Barzin
7988a690ed paperless-ngx: add Bulgarian OCR (bul+eng) + raise data PVC ceiling to 30Gi
Preparing Paperless for Emo's document import from the NAS. His archive is
Bulgarian (Cyrillic) + English, but OCR was English-only (tesseract had no
'bul' pack and PAPERLESS_OCR_LANGUAGE was unset/defaulted to eng), so scanned
BG documents would OCR to garbage and be unsearchable. Add bul to the install
list and set OCR_LANGUAGE=bul+eng.

Also raise the data PVC autoresize ceiling from 5Gi to 30Gi: everything
(originals + archive via PAPERLESS_MEDIA_ROOT=../data) lives on the single
encrypted PVC, and the ~2.7GB in-scope import would blow past the 5Gi cap
mid-ingest. The topolvm autoresizer grows the volume on demand up to the
ceiling; 30Gi gives ample headroom.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:17:13 +00:00
Viktor Barzin
6415f77fed Merge remote-tracking branch 'origin/master' into wizard/emo-vault-onboard
Some checks failed
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was canceled
2026-06-27 08:17:06 +00:00
Viktor Barzin
b371ae6eee homelab vault: install bw system-wide + onboarding runbook
Two remaining gaps to let non-admins (emo) use `homelab vault`:

- setup-devvm.sh installed `@bitwarden/cli` only when `command -v bw`
  failed, which an admin's own ~/.local/bin/bw satisfied — so the
  system-wide copy was never installed and non-admins had no `bw`
  backend. Install to the npm /usr prefix and guard on the system path
  (/usr/bin/bw) instead.

- Add docs/runbooks/homelab-vault-onboarding.md (per-user setup, the
  shared Organization/Collection flow for sharing passwords, admin
  deploy + verification, security model) and repoint the two code
  comments that cited a design-spec path which never existed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:16:52 +00:00
Viktor Barzin
51dc5d031c homelab vault: make it work for non-admin workstation users
`homelab vault` was effectively admin-only: two bugs blocked every
non-admin (e.g. emo) from using it for their own Vaultwarden vault.

1. Token: the CLI relied purely on ambient `vault` auth (~/.vault-token
   / $VAULT_TOKEN), which only admins have. Non-admins carry a scoped
   token at ~/.config/claude-auth-sync/vault-token (policy
   workstation-claude-<user>). Add ensureVaultToken(): explicit env >
   ~/.vault-token > scoped fallback, wired into every vault verb. Admins
   are unaffected (their ambient token wins).

2. Write capability: `homelab vault setup` used plain `vault kv patch`,
   which needs the `patch` capability the scoped policy does not grant
   (only create/read/update) — so setup 403'd for non-admins. Switch to
   `kv patch -method=rw` (read-modify-write; same approach
   claude-auth-sync already uses), with `kv put` only when the path
   doesn't exist yet. Preserves co-located keys (claude_ai_oauth_json).
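
Why read-modify-write preserves co-located keys can be shown with plain dictionaries. This is a Python sketch, not the CLI's code: `read_modify_write` and the `bw_session` key are hypothetical, and `None` stands in for "path doesn't exist yet".

```python
from typing import Optional


def read_modify_write(existing: Optional[dict], updates: dict) -> dict:
    """Merge new fields over the current secret instead of replacing it.

    Needs only read + create/update on the path, not the `patch` capability.
    """
    if existing is None:
        # Path does not exist yet: the write is effectively a plain put.
        return dict(updates)
    merged = dict(existing)
    merged.update(updates)
    return merged


current = {"claude_ai_oauth_json": "{...}"}
written = read_modify_write(current, {"bw_session": "example"})
```

`written` still contains `claude_ai_oauth_json`, whereas a blind put of only the new fields would have dropped it.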

Enables onboarding emo onto the per-user Vaultwarden access tool.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:15:42 +00:00
Viktor Barzin
82a7b2585b chrome-service: reconcile state after pipeline #366 was killed mid-apply + document cancel-previous hazard
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Pipeline #366 (the SHA-pin apply, commit 7b4a8ba8) was SIGKILLed mid-apply by
Woodpecker cancel-previous when I pushed the next commit (#367, docs) while it
was still running — the apply log ends at '[chrome-service] Starting apply...'
with no 'Apply complete!', so the terraform state write did not finish. The live
deployment is correct (image = the supervised SHA, verified, self-healing), but
the stored state may be stale; this commit re-triggers a clean changed-stack
apply to reconcile it (comment-only change → 0 resource changes, no rollout).

Also adds a caution to the novnc image comment: after bumping the SHA, WAIT for
the apply pipeline to finish before pushing again (memory id=1957).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:15:41 +00:00
Viktor Barzin
006f97ef58 docs: bless local terragrunt apply, but require committing every applied change
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to change the infra apply guidance: instead of 'never apply
locally, always rely on CI', the policy is now 'you MAY apply locally, but
always commit the change to the infra repo'.

- .claude/CLAUDE.md (Critical Rule: Terraform Only): new bullet making local
  apply explicit (scripts/tg apply / homelab tf apply) from the MAIN checkout
  (not a worktree — git-crypt'd tfvars read as ciphertext there), with a hard
  requirement that every applied change is committed + pushed to master the same
  session so the repo stays the source of truth and CI drift-detection doesn't
  revert it. Spells out the apply<->commit ordering both ways.
- AGENTS.md (non-admin workstation land steps): step 5 now notes local apply as
  an option alongside CI auto-apply, with the same 'always committed, never
  applied uncommitted' rule.

Note: the org-managed settings block also frames CI auto-apply but is not
editable from a workstation clone.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:10:20 +00:00
Viktor Barzin
7b4a8ba867 chrome-service: pin noVNC image to the x11vnc-supervision build
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Deploys the self-heal fix from the previous commit. Keel is off for this
deployment (keel.sh/policy=never, because the browser container's playwright
image is version-pinned to f1-stream) and the novnc image was :latest with
imagePullPolicy=IfNotPresent, so a rebuilt :latest would NOT be re-pulled on a
rollout — the supervised entrypoint would never reach the running pod.

Pin novnc to :19d0f0933a (the build of the prior
commit; ghcr digest sha256:5b783ac6, == :latest) so the stack apply rolls the
sidecar onto the new image. Future novnc entrypoint changes deploy by bumping
this digest after build-chrome-service-novnc.yml publishes a new SHA tag.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:04:55 +00:00
Viktor Barzin
19d0f0933a chrome-service: supervise x11vnc in noVNC sidecar so the VNC view self-heals
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build chrome-service-novnc / build (push) Has been cancelled
The noVNC view at chrome.viktorbarzin.me went black: x11vnc (in the novnc
sidecar) attaches to the browser container's Xvfb over localhost:6099, and when
that container restarted (~8h ago, Chrome exited cleanly) x11vnc lost its X
connection and exited. Because the entrypoint ran x11vnc as an unsupervised
background child and then exec'd websockify as PID 1, the dead x11vnc was never
relaunched — :5900 stayed dead (a defunct zombie), websockify kept returning
'Connection refused', and the view was black until a manual pod restart.

Fix: the entrypoint now runs both x11vnc and websockify as supervised background
children and exits non-zero via 'wait -n' if either dies, so the kubelet restarts
the novnc container, which re-waits for Xvfb and relaunches x11vnc. The bridge
now self-heals across browser-container restarts. Mirrors the android-emulator
stack's supervision pattern. Architecture doc updated with the new failure mode,
diagnosis, immediate-recovery, and SHA-pin deploy note.
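
A minimal sketch of the supervision pattern described above, assuming bash 4.3 or newer for `wait -n`. The `supervise` helper and its command-string arguments are hypothetical; per the message, the real entrypoint also waits for Xvfb before launching x11vnc.

```shell
#!/usr/bin/env bash
# Sketch only: run two commands as background children and fail as soon as
# either one exits, so the container runtime restarts the whole container.
# `supervise` is a hypothetical helper, not the actual entrypoint.
supervise() {
  local first="$1" second="$2"

  bash -c "$first" &
  local pid_a=$!
  bash -c "$second" &
  local pid_b=$!

  # wait -n returns when one background child exits (bash >= 4.3)
  wait -n
  local status=$?

  # Stop the surviving child so the container exits cleanly
  kill "$pid_a" "$pid_b" 2>/dev/null || true
  wait "$pid_a" "$pid_b" 2>/dev/null || true

  echo "a supervised child exited (status $status); exiting non-zero" >&2
  return 1
}
```

Returning non-zero regardless of the child's own exit status matters here: the original failure involved a clean exit, so a supervisor that propagated status 0 would not trigger a restart under every restart policy.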

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:03:29 +00:00
Viktor Barzin
abb15cd49d devvm: personalize emo's cluster-health skill for ha-sofia
All checks were successful
ci/woodpecker/push/default Pipeline was successful
emo cares about ha-sofia + his Sofia smart-home devices (Tuya, the MPPT
ATS, the Барзини → Статус dashboard), and only about the cluster when it's
breaking those. Rewrite his vendored cluster-health into an ha-sofia-focused,
read-only variant:
- leads with ha-sofia's in-cluster dependency chain (tuya-bridge + the
  cloudflared/Traefik/DNS/TLS reachability path), all checkable read-only;
- fixes the script path to emo's own clone (/home/emo/code) — he can't read
  wizard's tree — and runs it --no-fix (he's cluster read-only);
- loads emo's own HA token (see below) so the ha-sofia checks (26-29, 45)
  actually run for him; documents the host-SSH/Vault checks that skip;
- triages: cluster FAIL/WARN matters only if on his chain; everything else is
  a one-line "admin's area"; escalate via /file-issue since he can't fix.

This snapshot copy is now an emo-specific variant, intentionally diverged
from the canonical 47-check admin skill — README updated to say "do not
re-sync from canonical".

Token: a dedicated long-lived HA token (client_name emo-cluster-health) was
minted on ha-sofia via the admin account and stored emo-readable at
/home/emo/.config/cluster-health/haos_token (600). It carries admin HA scope
(HA only mints tokens for the authenticating account); independently revocable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 16:03:14 +00:00
Viktor Barzin
fc83595f5e devvm: vendor cluster-health into per-user agent-skill snapshot
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Make cluster-health a user-global skill for emo (the lone entry in the
provisioner's SKILL_USERS allowlist), so it's available from any directory
— not only when working inside the infra clone where it already exists as a
project skill (.claude/skills/cluster-health). install_skills() in
t3-provision-users.sh copies the vendored snapshot into ~/.agents/skills/ and
symlinks ~/.claude/skills/, so this is the durable, rebuild-surviving path.

cluster-health is homelab-local (vendored from this repo's own
.claude/skills/), unlike the other snapshot entries which mirror upstream
mattpocock/skills + vercel-labs/skills; README documents its provenance and
the explicit re-sync step so the vendored copy doesn't silently drift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 15:20:19 +00:00
Viktor Barzin
fd33d1a447 monitoring: consolidate all Slack alerting to #alerts, abandon #security
Some checks are pending
ci/woodpecker/push/default Pipeline is running
The dedicated #security Slack channel was unreachable: the shared incoming
webhook (Vault secret/viktor -> alertmanager_slack_api_url) belongs to a
Slack app that isn't a member of #security, so any channel override on it
returns HTTP 404 channel_not_found. The goldmane-edges-digest was silently
failing for that reason.

Per request ("dump the security channel, post in an existing one"), route
everything to #alerts instead:
- alertmanager slack-security receiver -> #alerts (keeps its [SECURITY/<sev>]
  title styling so security-lane alerts still stand out in the shared channel)
- goldmane-edges-digest CronJob SLACK_CHANNEL -> #alerts (comment only; value
  was already switched and applied last change)
- AggregatorDown / DigestFailing alert summaries reworded to say #alerts
- docs swept (security.md, monitoring.md, ADR-0014, goldmane runbook,
  .claude/CLAUDE.md, service-catalog, CONTEXT.md) to drop the
  "invite the app / flip back to #security" caveats and state the
  #security abandonment + #alerts consolidation as the current routing.

Monitoring stack applied (alertmanager rolled, live config verified:
slack-security channel is now #alerts).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 13:29:44 +00:00
Viktor Barzin
196d0db4bd rbac/apiserver-oidc: back up the apiserver manifest OUTSIDE /etc/kubernetes/manifests
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The SSO restore script backed up the live manifest with
`cp "$MANIFEST" "$MANIFEST.bak.$TS"` — i.e. INSIDE /etc/kubernetes/manifests/.
The kubelet treats every file in that dir as a static pod, so the .bak became a
SECOND kube-apiserver static pod. While both copies were identical it was
harmless, but the instant `kubeadm upgrade` changed the real manifest's image to
v1.35.6, the kubelet saw two same-named pods with different specs and flip-flopped
(pod attempt count hit 13) — the new apiserver never stabilised, so kubeadm timed
out on "static Pod hash did not change after 5m" and rolled back. THIS was the
real cause of the 1.34->1.35 upgrade stalling for days (not etcd IO, which was a
downstream symptom of the flip-flopping apiserver hammering etcd).

Fix: write backups to a dedicated dir OUTSIDE the static-pod dir
(/etc/kubernetes/apiserver-oidc-bak/) and read the rollback copy from there. The
stray .bak that planted the landmine on 2026-06-18 was moved out manually
2026-06-26; this prevents the SSO script (and the upgrade chain's restore.sh,
which is the same script) from ever re-creating it.
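
The fix can be sketched roughly as follows. `backup_manifest` is a hypothetical helper, not the actual restore script; the default backup directory is the one named above, and the default manifest path assumes the standard kubeadm layout.

```shell
# Sketch only: keep manifest backups out of the kubelet's static-pod dir.
# backup_manifest is a hypothetical helper; both default paths are assumptions
# based on the commit message and the usual kubeadm layout.
backup_manifest() {
  local manifest="${1:-/etc/kubernetes/manifests/kube-apiserver.yaml}"
  local backup_dir="${2:-/etc/kubernetes/apiserver-oidc-bak}"
  local ts target

  ts="$(date +%Y%m%d-%H%M%S)"
  target="$backup_dir/$(basename "$manifest").bak.$ts"

  mkdir -p "$backup_dir" || return 1
  # The copy lands outside the manifests dir, so the kubelet never treats it
  # as a second static pod
  cp "$manifest" "$target" || return 1

  # Print the backup path so a rollback step can read from the same place
  printf '%s\n' "$target"
}
```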

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 10:29:19 +00:00
Viktor Barzin
5d33327c30 postiz: repoint postgres-backup CronJob at CNPG (was failing on removed host)
Some checks failed
ci/woodpecker/push/default Pipeline failed
The postiz-postgres-backup CronJob still dumped from the chart's bundled
`postiz-postgresql` host with a hardcoded `postiz-password`. That bundled
PostgreSQL was removed when postiz migrated to the shared CNPG cluster, so
the host no longer resolves (NXDOMAIN) and every nightly run failed —
firing BackupCronJobFailed, and leaving the postiz DB with no logical dump
in the offsite pipeline.

Connect via the app's own DATABASE_URL (from the postiz-secrets Secret,
postgresql://postiz:…@pg-cluster-rw.dbaas.svc.cluster.local/postiz) instead
of a hardcoded host/user/password, so the backup tracks the live DB and
credentials. Verified with a one-off test job: psql + pg_dump 16.4 connect
to CNPG 16.9 and produce a 180K custom-format dump.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 09:34:42 +00:00
Viktor Barzin
1bca799bb4 monitoring: give kube-state-metrics a 512Mi memory limit (Burstable)
Some checks failed
ci/woodpecker/push/default Pipeline failed
kube-state-metrics had no explicit resources, so the monitoring-namespace
LimitRange pinned it to requests=limits=256Mi (Guaranteed QoS). KSM idles
around 45Mi but momentarily spikes past 256Mi during a full object relist
(450+ pods, 150+ jobs, all secrets/endpoints) and gets OOMKilled. Each OOM
blacks out the KSM-exported series that ~10 alert rules read, so they all
fire false "<svc>Down" criticals at once and self-resolve when KSM recovers
~5 min later — exactly the alert storm seen at 2026-06-26 08:42 UTC.

Set explicit Burstable resources: keep the request low (64Mi, just above
idle) so we don't reserve memory we don't use, and raise only the limit to
512Mi to absorb the relist peak. No CPU limit, per the cluster-wide policy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 09:06:31 +00:00
Viktor Barzin
d105713ae7 fix(workstation): claude-auth-sync must merge, not overwrite, the shared Vault path
All checks were successful
ci/woodpecker/push/default Pipeline was successful
cas_backup did `vault kv put secret/workstation/claude-users/<user>`, a full
KV-v2 replace that rewrote the document with only its 3 OAuth keys. Because
`homelab vault setup` co-locates the user's vaultwarden_* credentials on that
same path, every six-hourly sync silently deleted them — so `homelab vault`
reported "not configured" within hours of each setup. (Reported as: homelab
vault "keeps getting reset / logged out", set up 3 times.)

Switch the backup to a merge: `kv patch -method=rw` (read+update, needs no
`patch` capability) when the path exists, and `kv put` only to create it on the
first backup. Add a regression test with a fake vault asserting a pre-existing
sibling key survives a backup, and document the merge requirement in the
renewal runbook.
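
A minimal sketch of the merge-versus-create branch described above. `backup_merge` is a hypothetical name (the real function is `cas_backup`), and the existence check via `vault kv get` is an assumption about how the script detects a first backup.

```shell
# Sketch only: merge keys into an existing KV-v2 path instead of replacing it.
# backup_merge is a hypothetical name; it calls whatever `vault` resolves to.
backup_merge() {
  local path="$1"
  shift

  if vault kv get "$path" >/dev/null 2>&1; then
    # Path exists: read + update leaves sibling keys (e.g. vaultwarden_*) intact
    vault kv patch -method=rw "$path" "$@"
  else
    # First backup: nothing to merge with, so create the document
    vault kv put "$path" "$@"
  fi
}
```

Because the helper resolves `vault` at call time, a test can shadow it with a shell function, which is roughly how a fake-vault regression test like the one mentioned above could work.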

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:33:41 +00:00
Viktor Barzin
6f1951af93 fix(workstation): carry OS/sudo authz policy into managed-settings source + multi-tenancy doc
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ADR-0015's policy change was applied live to /etc/claude-code/managed-settings.json, but that file self-deploys from the repo source scripts/workstation/managed-settings.json via the hourly reconcile (sync_managed_config). Without updating the source the next reconcile would REVERT /etc to the old 'never read other homes' rule. This updates the source-of-truth claudeMd (now byte-identical to /etc) so the change is durable + canonical, and refresh_codex_mirror propagates it to every user's ~/.codex/AGENTS.md. Also notes the access-model change in the multi-tenancy architecture doc (pointer to ADR-0015).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:25:33 +00:00
Viktor Barzin
8121d8a4ac docs(adr): add ADR-0015 (OS/sudo is the authorization boundary), supersede ADR-0011 privacy norm
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor (owner) wants agents to stop refusing file reads the OS already permits. wizard holds passwordless root ((ALL) NOPASSWD: ALL), so the managed-settings rule 'never read another user's ~/.claude' was stricter than the OS itself. The managed-settings policy (/etc/claude-code/managed-settings.json) was updated out-of-band to defer to OS/sudo authorization with no extra prompt; backup kept at .bak-2026-06-26. This ADR records the decision, its symmetry across sudo-holders, and the larger blast radius.

ADR-0011's usage-telemetry design is unchanged; only the cross-user privacy norm it referenced is superseded. The original ask was to delete ADR-0011 — superseded instead to preserve the audit trail and the ADR-0012/0013 references.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:22:29 +00:00
Viktor Barzin
ebc8b6588f ESO: add force_conflicts to all ExternalSecret manifests (fleet sweep)
Some checks failed
ci/woodpecker/push/default Pipeline failed
The 2026-06-22 external-secrets v1 migration made the ESO controller the
server-side-apply owner of .spec.refreshInterval on every ExternalSecret, so any
stack defining one via kubernetes_manifest fails `terraform apply` with a
field-manager conflict the next time it's applied (instagram-poster + grafana hit
this on 2026-06-24; it was latent across the whole fleet). Add
field_manager { force_conflicts = true } to all 101 remaining ExternalSecret
manifests across 70 stacks, matching the fix already on grafana / woodpecker /
traefik / k8s-version-upgrade / instagram-poster. TF and ESO set the same value,
so it's stable (no perpetual drift). Defuses the landmine before each stack's
next apply trips it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 21:28:11 +00:00
Viktor Barzin
6c5288998f goldmane-trail: polish follow-ups #57/#59/#61/#62/#63 + digest→#alerts
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a
subagent workflow (distinct stacks in parallel, docs last):

- #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me,
  auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081
  (the operator's whisker NP default-denies ingress). calico stack.
- #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts
  (prometheus_chart_values.tpl) + cluster-health check #48.
- #59 service-identity labels on the multi-Service namespaces (monitoring's 5
  TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they
  update in-place.
- #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog,
  security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md.
  #62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the
  edge table (feeds code-8ywc; enforce-flips out of scope).

Also fixes the digest's Slack target: #security override 404s channel_not_found
because the shared alertmanager_slack_api_url webhook's app isn't a member of
#security (this likely also breaks alertmanager's slack-security receiver — flagged
in the runbook). Routed to #alerts (the webhook's working channel) until the app
is invited; verified a real digest run posts cleanly (360 edges).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 17:49:25 +00:00
Viktor Barzin
306cdd4cb3 state(dbaas): update encrypted state 2026-06-25 17:31:03 +00:00
Viktor Barzin
9c68d147e0 k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC)
Some checks failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline failed
Digging into "why did the apiserver crash" disproved the earlier OIDC
explanation. An isolated v1.35.6 apiserver repro with authentik reachable
initialises OIDC cleanly (oidc.go:313, no error) and runs fine — so the
--authentication-config -> --oidc-* revert is NOT what crashed it. etcd's
surviving crash-window log is the real cause: 1180 "apply request took too long"
warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as
kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the
shared sdc HDD (beads code-oflt).

A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full
~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
every etcd upgrade and never cleans it up — 145 dirs / 28GB had accumulated,
driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live
(73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate.
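
A rough sketch of what such a preflight prune could look like. `prune_kubeadm_backups` is a hypothetical helper; the directory, the `kubeadm-backup-etcd-` prefix, and the 3-day cutoff come from the paragraph above, while the use of `find -mtime` is an assumption.

```shell
# Sketch only: preflight prune of stale kubeadm etcd backup directories.
# prune_kubeadm_backups is a hypothetical helper, not the real preflight step.
prune_kubeadm_backups() {
  local tmp_dir="${1:-/etc/kubernetes/tmp}"
  local max_age_days="${2:-3}"

  # Nothing to do if kubeadm has never created the directory
  [ -d "$tmp_dir" ] || return 0

  # -mtime +N matches entries whose age in whole days is greater than N;
  # the name filter leaves unrelated directories alone
  find "$tmp_dir" -mindepth 1 -maxdepth 1 -type d \
    -name 'kubeadm-backup-etcd-*' -mtime "+$max_age_days" \
    -exec rm -rf {} +
}
```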

Also corrected the OIDC handling: the kubeadm-config drift is real but only
breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the
chain's restore.sh + the kubeadm-config reconciliation) — it does not crash the
apiserver. So the preflight check is now an ALERT, not a block (was added on the
wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected.

Per Viktor: reclaim the disk and automate so the manual cleanup never recurs;
the durable IO fix remains code-oflt (etcd off the shared HDD).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 15:23:15 +00:00
Viktor Barzin
60a1cb9a25 k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Last night's autonomous 1.34->1.35 run reached the master control-plane phase
for the first time (preflight passed, etcd snapshot taken, etcd upgraded), then
the kube-apiserver upgrade to v1.35.6 crash-looped and kubeadm auto-rolled-back
to 1.34.9. The cluster stayed healthy but the master was left cordoned and the
chain wedged on in_flight.

Root cause: kubeadm upgrade regenerates the apiserver static-pod manifest from
the kubeadm-config ConfigMap. apiserver auth was switched on 2026-06-19 to a
structured multi-issuer --authentication-config (kubectl + dashboard SSO), but
kubeadm-config still carried the legacy single-issuer --oidc-* extraArgs, so the
regenerated manifest reverted structured auth and the new apiserver crash-looped.
Proven via `kubeadm upgrade diff`. The existing post-upgrade OIDC restore step
never ran because the upgrade itself never succeeded.

Fix:
- rbac/apiserver-oidc.tf: the remote script now also reconciles kubeadm-config
  (kubeadm init phase upload-config: drop --oidc-*, add --authentication-config)
  so a future kubeadm upgrade regenerates a correct manifest. Delivered to the
  cluster via the apiserver-oidc-restore ConfigMap the chain re-runs (CI needs no
  ssh key); trigger deliberately not script-hashed since CI cannot ssh.
- k8s-version-upgrade/upgrade-step.sh: new preflight gate runs `kubeadm upgrade
  diff` and BLOCKS+alerts (never drains the master) if --authentication-config
  would still be dropped.
- Post-mortem + runbook updated.
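
The preflight gate in the list above could be approximated like this. `drops_authentication_config` is a hypothetical helper, and it assumes `kubeadm upgrade diff` emits a unified diff in which a dropped flag appears on a line beginning with a single `-`.

```shell
# Sketch only: report whether a unified diff removes --authentication-config.
# drops_authentication_config is hypothetical; it reads diff text on stdin so
# it can be fed from a live command or from a saved file.
drops_authentication_config() {
  # A removed line starts with one '-'; the '---' file header is skipped
  grep -Eq -e '^-[^-].*--authentication-config'
}
```

A gate would pipe the diff output for the target version into this function and stop, before draining the master, when it returns success.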

The live kubeadm-config was reconciled directly on the master and verified
(`kubeadm upgrade diff` now shows only the control-plane image bump), so tonight's
run can complete the 1.34->1.35 upgrade.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 14:16:04 +00:00
Viktor Barzin
c6bba1da6e home-assistant skill: refresh ha-london map (HAOS 2026.5.2, Cowboy revived, Overview redesign)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to redesign the ha-london dashboards and fix the broken integrations (the Cowboy one). The skill's ha-london knowledge map had drifted badly from reality, so this brings it current: it claimed HA 2025.9.1 on a docker-run container (it's HAOS 2026.5.2, managed); listed the now-dead jdejaegh Cowboy integration with sensor.bike_* entities (revived via elsbrock/cowboy-ha v1.2.0 -> entities are sensor.classic_performance_*); and didn't flag that met/metoffice/roomba/hildebrandglow are user-disabled (not broken) or that Tapo P100 is failing. Also documents the redesigned Overview (Home+More sections, Mushroom+mini-graph-card), the dashboard/view/card glossary, the parked-bike 'unknown battery' gotcha, and that london is API-only from the Sofia devvm (no SSH).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 22:03:15 +00:00
Viktor Barzin
b858561bd0 Merge remote-tracking branch 'origin/master'
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-24 20:59:39 +00:00
Viktor Barzin
a7704f46a6 deploy goldmane-edge-aggregator: durable who-talks-to-whom edge trail (#58, ADR-0014)
Infra side of ADR-0014: an mTLS gRPC consumer of Calico Goldmane's Flows API
that records the namespace-pair edge-set in CNPG and posts a daily new-edge
digest to #security. Adds the goldmane-edge-aggregator stack, the
pg-goldmane-edges Vault rotation role (Tier-0 vault state updated here), and the
namespace in the ghcr-credentials allowlist.

Cert: REUSES the operator-minted, Tigera-CA-signed whisker-backend client cert
(Goldmane verifies only the CA chain, not identity) instead of minting from the
Tigera CA private key. This avoids putting the CA key in TF state AND the
hashicorp/tls provider, which is incompatible with this repo's global
generate-providers/lockfile pattern (it broke every stack's lockfile).

Verified live: aggregator streaming flows, 174 edges in Postgres across 50x54
namespaces, db+slack ExternalSecrets synced, digest dry-run formats correctly,
private image pulls via the Kyverno-synced ghcr-credentials.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:59:39 +00:00
Viktor Barzin
aa510e3600 instagram-poster: force_conflicts on ESO manifests (fix apply)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The ESO v1 migration (2026-06-22) made the external-secrets controller own
.spec.refreshInterval via server-side apply, so terraform apply of the two
ExternalSecret manifests fails with a field-manager conflict (Woodpecker #348),
which blocked the replicas=0 scale-down from landing. Add force_conflicts=true
to both, matching the grafana/woodpecker/traefik fix applied to other stacks
the same day.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:49:53 +00:00
Viktor Barzin
53834deb24 instagram-poster: scale to 0 (unused, dead ExternalSecret)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor confirmed the Instagram Graph poster isn't used. Its ExternalSecret
has been dead on missing Vault keys (ig_graph_long_lived_token,
ig_business_account_id), so the deployment sat at 0/1 firing
DeploymentReplicasMismatch. Setting replicas=0 stops the alert and makes the
scale-down durable (a bare kubectl scale reverts on the next stack apply).
Re-set to 1 after minting a Meta long-lived token + populating the Vault keys.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:45:30 +00:00
Viktor Barzin
8dd9a3978d Merge remote-tracking branch 'forgejo/master' into wizard/homelab-vault
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-24 12:25:52 +00:00
Viktor Barzin
65b2df1222 fix(monitoring): force_conflicts on grafana_db_creds ExternalSecret
The external-secrets controller owns .spec.refreshInterval via SSA, so a plain
terraform apply of the monitoring stack conflicts. Latent until 2026-06-24 (the
homelab-vault loki-rules change was the first monitoring apply in a while and
surfaced it). force_conflicts lets TF win — same pattern as woodpecker/traefik/
k8s-version-upgrade stacks.
2026-06-24 12:25:36 +00:00
163 changed files with 12834 additions and 4484 deletions

View file

@@ -16,6 +16,7 @@
**ALL infrastructure changes MUST go through Terraform/Terragrunt.** Never use `kubectl apply/edit/patch/set`, `helm install/upgrade`, or any manual cluster mutation as the final state.
- **No exceptions for "quick fixes"** — even one-line changes must be in `.tf` files and applied via `scripts/tg apply`
- **Apply locally OR let CI do it — but ALWAYS commit.** You don't have to wait for CI: with apply access you MAY run the apply yourself (`scripts/tg apply <stack>` / `homelab tf apply <stack>`), but **from the main checkout, never a worktree** (git-crypt'd `*.tfvars` come through as ciphertext under the worktree filter-bypass, so a worktree apply reads garbage). **Every applied change MUST be committed and pushed to `master` the same session** — the repo is the source of truth, so applied-but-uncommitted HCL is drift that the next CI apply / daily drift-detection will try to revert. Order either way: apply locally then commit + push (CI's changed-stack apply then no-ops), or commit + push and let CI apply. Never apply an uncommitted edit; never leave a committed change unapplied.
- **kubectl is for read-only operations and temporary debugging only** (get, describe, logs, exec, port-forward)
- **If a resource isn't in Terraform yet**, evaluate whether it can be added before making manual changes. If manual change is unavoidable (e.g., emergency), document it immediately and create the Terraform resource in the same session
- **kubectl scale/patch during migrations is acceptable** as a transient step, but the final state must be in Terraform and applied via `scripts/tg apply`
@@ -203,7 +204,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
- **PDBs**: minAvailable=2 on Traefik and Authentik.
- **Fallback proxies**: basicAuth when Authentik is down, fail-open when poison-fountain is down.
- **CrowdSec enforcement is out-of-band** (no Traefik plugin/middleware — the dead Yaegi `crowdsec-bouncer-traefik-plugin` was removed on Traefik 3.7.5): banned IPs are dropped **in-kernel via nftables** by the `cs-firewall-bouncer` DaemonSet on **direct** hosts (drops in BOTH the `input` and `forward` hooks — Traefik is ETP=Local so client traffic is DNAT'd to the pod via `forward`; pulls ALL decisions incl. the ~31k CAPI blocklist), and **blocked at the Cloudflare edge** for **proxied** hosts (one `crowdsec_ban` Rules List + a zone WAF block rule, fed by the `crowdsec-cf-sync` CronJob in `rybbit` ns every 2 min — excludes CAPI). Zero per-request latency; **fails open** (LAPI down → no new bans, existing drops persist, legit traffic never blocked). Whitelist covers RFC1918 + tailnet + internal CIDRs. Full as-built: `docs/architecture/security.md`.
- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations).
- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations), authentik 100/1000 on `/`+`/static` (login SPA cold-loads ~70 flow chunks from `/static`; default burst 429'd them → blank login screen).
- **Retry middleware**: 2 attempts, 100ms — in default ingress chain.
- **Entrypoint transport timeouts** (`websecure` `respondingTimeouts`): `writeTimeout=0` (unlimited download duration), `readTimeout=3600s` (uploads ≤1h), `idleTimeout=600s`. These are **HARD total-duration caps**, not nginx-style per-read idle timeouts — a finite `writeTimeout` truncates *any* large download at that wall-clock mark (a prior `writeTimeout=60s` silently cut Immich videos at 60s). **Do NOT re-tighten `writeTimeout`**; keep `readTimeout` finite (slow-loris backstop) but ≥ longest expected upload. Full rationale: `docs/architecture/networking.md` → "Entrypoint Transport Timeouts".
- **HTTP/3 (QUIC)**: Enabled on Traefik. Works for **direct (non-proxied) apps** via the dedicated LB IP below (ETP=Local). Proxied apps get QUIC at the Cloudflare edge.
@@ -218,7 +219,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. 
**Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). 
**Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob. **Enforcement is out-of-band, NOT a Traefik plugin** (the Yaegi `crowdsec-bouncer-traefik-plugin` was dead on Traefik 3.7.5 and removed): `cs-firewall-bouncer` DaemonSet drops in-kernel via nftables on direct hosts (bouncer key `firewall`, v0.0.34 binary fetched at runtime, hostNetwork+NET_ADMIN, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`); `crowdsec-cf-sync` CronJob blocks at the CF edge for proxied hosts (bouncer key `kvsync`, `stacks/rybbit/crowdsec_edge.tf`). Both fail open. See `docs/architecture/security.md` |
| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control. |
| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control; `/`+`/static` use a dedicated `authentik-rate-limit` (100/1000) so the cold-load chunk burst isn't 429'd into a blank screen. **Reliability (2026-06-28): the chart key is `deploymentStrategy`, NOT `strategy`** — the old `strategy:` key was inert, so live ran the chart default 25%/25% and dropped a server pod out of rotation on every roll; now `maxSurge:1/maxUnavailable:0`. Readiness `failureThreshold:8` (~80s, was 30s): the DB-coupled `/-/health/ready/` returns 503 on a PG/pgbouncer blip, and with too-tight tolerance all 3 server pods left the Service at once → Traefik 502/504 (the episodic blank-screen + 30s-hang). gunicorn `max_requests=10000`/jitter=1000 decorrelates worker recycles from DB blips. Redis is GONE since 2026.2 (sessions+cache+channels on PostgreSQL, no external-cache option) — a short PG transient is now survived, but a TOTAL CNPG outage still takes authentik down. 
**Custom overlay image (2026-06-28):** server+worker run `ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3` (built by `.github/workflows/build-authentik.yml` from `stacks/authentik/Dockerfile` + `patch-compat-sfe.py`) with TWO guarded patches: **#1 SLOW-1a** — narrows the identification-stage `select_subclasses()` query (~1.4s→~14ms; bare upstream call LEFT-JOINs every source subtype); **#2 old-browser blank login** — `patch-compat-sfe.py` (a) extends `compat_needs_sfe()` to serve authentik's built-in no-JS **SFE** login to old Safari/WebKit AND **any iOS browser** (Chrome/CriOS, Firefox/FxiOS — all share the system WebKit) on iOS≤16.3, and (b) **injects static social-login `<a>` links into the SFE shell** (`flow-sfe.html`) since the SFE can't render Identification-stage sources — required for password-less accounts (e.g. emo = Google-only). The modern flow SPA is ES2022 (needs Safari 16.4+) and renders BLANK on older WebKit; every iOS browser shares that WebKit, so it's not browser-choice (emo's iPadOS-15.8 iPad hit this). SFE = the *real* authentik login (password + MFA + reputation, no auth downgrade) — chosen over a Traefik basic-auth fallback which would have put a spoofable-UA single password in front of `vbarzin→wizard` passwordless-root. Social link = plain redirect to `/source/oauth/login/<slug>/` (works on any browser); slugs (google/github/facebook) are static — re-verify on source changes. **Keel un-enrolled** for the ns → image pinned in `global.image` (repo+tag), **upgraded manually**: bump the Dockerfile `FROM` + the values tag (+ re-verify both patches) together, GHA rebuilds, then apply. |
| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
| MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
| phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
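As a rough worked example of the Immich bitrate cap described in the table above: a stream's bitrate in Mbps divided by 8 gives the sustained read rate in MB/s that playback pulls from disk. The helper name `mbps_to_mbytes` is hypothetical and only illustrates the arithmetic; it is not part of the repo.

```shell
# Convert a video bitrate in Mbps to a sustained disk read rate in MB/s.
# Hypothetical helper for illustration: 8 bits per byte, decimal units.
mbps_to_mbytes() {
  awk -v mbps="$1" 'BEGIN { printf "%.1f\n", mbps / 8 }'
}

# The capped setting maxBitrate=20000k is 20 Mbps.
mbps_to_mbytes 20   # prints 2.5
```

Under that cap, a single stream needs about 2.5 MB/s from the shared spindle, which leaves headroom for other readers.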
@@ -231,9 +232,10 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
- Alertmanager is now scraped (`extraScrapeConfigs` job `alertmanager`) → `alertmanager_notifications_total`/`_alerts`/`_notifications_failed_total` available; it had no `prometheus.io/scrape` annotation so notification volume was previously unmeasurable.
- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
- **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction), AuthentikRootRouter5xxHigh (all-3-server-pods-NotReady cascade → 502/503/504 on the authentik `/` router). **The Traefik scrape keeps `traefik_router_requests_total`** (per-router `code` label) — the drop-regex in the `traefik` scrape job drops only the high-cardinality `*_duration_seconds_bucket` histogram, NOT the request counter, so per-router 429/5xx is queryable + alertable.
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → `#security` Slack). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green; only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → posts to `#alerts` via the `slack-security` receiver, which keeps its `[SECURITY]` styling; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
- **pfSense egress / WAN monitoring** (added 2026-06-28 after the 2026-06-27 egress-only incident — pfSense VMID 101 stopped passing internet egress for ~20 min while internal routing + Unbound stayed up, and NOTHING alerted: no egress probe existed and the cloudflared replica metric stayed green): `blackbox-exporter` gained `icmp_egress` + `dns_external` modules (+ `NET_RAW` on the pod) in `authentik_walloff_probe.tf`. Three in-cluster probe jobs (`wan-gateway-icmp` → 192.168.1.1, `internet-egress-icmp` → 9.9.9.9/1.1.1.1, `internet-egress-dns` → cloudflare.com via both) traverse the pod→node→pfSense-NAT path that fails. Alerts (group `Egress / pfSense` in `alerting_rules.yml`): `WANGatewayUnreachable`, `InternetEgressDown` (`max()==0` = both providers dead, not a single-provider blip), `ExternalDNSResolutionDown`, `EgressOnlyDivergence` (t3-probe `cloudflare` leg down WHILE `internal` leg up — the incident signature, reuses the existing t3-probe), `PfSenseVMDown` (`pve_up{id="qemu/101"}==0` while host up — does NOT catch a guest-internal reboot, `pve_up` tracks the qemu process). Plus Loki ruler `CloudflaredTunnelConnLoss` (>20 edge-conn failures/5m; calibrated live: steady-state ~2/6h vs 37-85/5m in-incident; the cloudflared replica metric is blind to tunnel-connection loss). `WANGatewayUnreachable`/`InternetEgressDown` **inhibit** the downstream egress symptoms (ExternalDNSResolutionDown/EgressOnlyDivergence/CloudflaredTunnelConnLoss/Email*/ExternalAccessDivergence). Runbook: `docs/runbooks/pfsense-egress.md`. **Deferred (needs a live-pfSense change, not in this monitoring-only change):** point dpinger's monitor at the local gateway + widen thresholds, disable `gw_down_kill_states`, add a failover gateway group + auto-recovery watchdog, and ship pfSense system/gateway/routing syslog to Loki (today only filterlog → CrowdSec; those logs are NOT centrally queryable — id #6717). 
No Uptime-Kuma egress monitor was added (the `external-monitor-sync` is purpose-built for `*.viktorbarzin.me` Cloudflare-path discovery; the blackbox probes cover egress directly).
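The manual pre-check in the walling-off guard bullet (`curl -sI '<url>'` must not show a `Location` pointing at Authentik) could be scripted roughly as follows. This is a minimal sketch: the function name `redirects_to_authentik` is hypothetical, and it only inspects header text passed on stdin.

```shell
# Reads HTTP response headers on stdin. Exits 0 if a Location header
# points at the Authentik host, 1 otherwise. Hypothetical helper name.
redirects_to_authentik() {
  grep -i '^location:' | grep -qi 'authentik\.viktorbarzin\.me'
}

# Usage sketch against a live carve-out URL (replace <url>):
#   if curl -sI '<url>' | redirects_to_authentik; then
#     echo "still walled off: fix the carve-out before adding the target"
#   fi
```

Keeping the check as a filter on header text means it can be exercised with canned responses, without touching the live ingress.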
## Security Posture (Wave 1 — locked 2026-05-18)
@@ -241,9 +243,10 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se
- **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming.
- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.)
- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts` (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out). Severity labels carried in the alert (critical/warning/info). No paging. The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`.
- **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. **WHISKER-WEDGE GOTCHA** (2026-06-28): the operator's `whisker` NetworkPolicy allows DNS egress only to kube-dns *pods*, but whisker-backend resolves goldmane via the kube-dns *ClusterIP* — Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule, so when whisker-backend's gRPC stream breaks and it re-resolves, it wedges and the UI goes **empty** (the aggregator, a separate pod, is unaffected). FIX = additive egress NP `whisker-allow-dns-clusterip` (`stacks/calico`, allows whisker→10.96.0.10/32:53); the `whisker-watchdog` CronJob is a backstop. Manual heal `kubectl -n calico-system delete pod -l k8s-app=whisker`. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
- **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
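The allowlist-derivation query quoted in the NetworkPolicy bullet above can be wrapped in a small helper so the namespace is validated before it is inlined into SQL. This is only a sketch: `edge_allowlist_sql` is a hypothetical name, and the connection variable in the usage comment is an assumption, not something defined in the repo.

```shell
# Print the allowlist-derivation query for one source namespace.
# Hypothetical helper: rejects anything outside the lowercase
# alphanumeric-and-hyphen charset of Kubernetes namespace names,
# so the value is safe to inline.
edge_allowlist_sql() {
  case "$1" in
    ''|*[!a-z0-9-]*) echo "invalid namespace: $1" >&2; return 1 ;;
  esac
  printf "SELECT DISTINCT dst_ns FROM edge WHERE src_ns='%s' AND action='allow';\n" "$1"
}

# Usage sketch (GOLDMANE_EDGES_DSN is an assumed connection string):
#   psql "$GOLDMANE_EDGES_DSN" -At -c "$(edge_allowlist_sql recruiter-responder)"
```

The result covers only the namespace-to-namespace half of an allowlist; external egress is not in the `edge` table.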
## Storage & Backup Architecture


@@ -13,6 +13,8 @@
| authentik | Identity provider (SSO) | authentik |
| cloudflared | Cloudflare tunnel | cloudflared |
| authelia | Auth middleware (may be merged into ebooks or removed) | platform |
| goldmane | Calico 3.30 OSS flow aggregator (`goldmane.calico-system.svc:7443`, gRPC/mTLS). Stamps identity (ns/pod/workload/labels + allow-deny) on every flow from Felix into a ~60-min in-memory ring buffer — no etcd/API writes. East-west "who-talks-to-whom" source (ADR-0014). Enabled via operator CR (`kubectl_manifest.goldmane`). | calico |
| whisker | Calico 3.30 OSS live flow-observability UI (`whisker.calico-system.svc:8081`) at `whisker.viktorbarzin.me` (Authentik-gated, `auth=required` — no own login; additive NP ORs Traefik past the operator default-deny). ~60-min live view of Goldmane flows, NOT history. Enabled via operator CR (`kubectl_manifest.whisker`). | calico |
| monitoring | Prometheus/Grafana/Loki stack | monitoring |
## Storage & Security (Tier: cluster)
@@ -37,6 +39,7 @@
## Active Use
| Service | Description | Stack |
|---------|-------------|-------|
| goldmane-edge-aggregator | Durable who-talks-to-whom audit trail (ADR-0014 / #58). Go service: `aggregate` Deployment streams Goldmane's gRPC `Flows.Stream` (mTLS) and upserts the low-cardinality namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`) into CNPG DB `goldmane_edges`; `goldmane-edges-digest` CronJob posts first-seen edges daily to `#alerts` (the `#security` channel was abandoned 2026-06-25 — shared webhook's app isn't a member of it). mTLS client cert REUSES the operator's `whisker-backend-key-pair` (re-apply if rotated). Tier-4-aux. Image `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (private). Runbook: [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md). | goldmane-edge-aggregator |
| mailserver | Email (docker-mailserver) | mailserver |
| shadowsocks | Proxy | shadowsocks |
| webhook_handler | Webhook processing | webhook_handler |
@@ -161,3 +164,4 @@ procedures) are documented in `infra/docs/runbooks/`:
| pfSense + Unbound DNS | [pfsense-unbound.md](../../docs/runbooks/pfsense-unbound.md) |
| Mailserver PROXY-protocol / HAProxy | [mailserver-pfsense-haproxy.md](../../docs/runbooks/mailserver-pfsense-haproxy.md) |
| Technitium apply flow | [technitium-apply.md](../../docs/runbooks/technitium-apply.md) |
| Goldmane flow trail (east-west who-talks-to-whom) | [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md) |


@@ -11,8 +11,8 @@ description: |
There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
Always use Home Assistant for smart home control.
author: Claude Code
version: 2.0.0
date: 2026-02-07
version: 2.1.0
date: 2026-06-24
---
# Home Assistant Control
@@ -395,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
## ha-london Knowledge Map
### Overview
- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
- **Location**: London, UK
- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
- **Config path**: `/config/` (requires `sudo` for file access)
- **Platform**: Raspberry Pi 4, HA OS
- **Access from the Sofia devvm**: london is **remote**, so `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
- **Config path**: `/config/`
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
- **Zone**: London (home)
### Dashboards (redesigned 2026-06-24)
**Glossary** (HA terms — keep distinct):
- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
- **Card** = a widget inside a view.
- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
- **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
- **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed` → `detailed` path typo fixed 2026-06-24.)
- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
### Key Systems
#### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@@ -424,10 +437,15 @@ Named plugs with power/energy tracking:
- PM1.0/2.5/4.0/10 particulate sensors
- VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
#### 3. Cowboy E-Bike
- `sensor.bike_state_of_charge`: Battery %
- `sensor.bike_total_distance`: Total km
- `sensor.bike_total_co2_saved`: CO2 saved (grams)
#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
- `sensor.classic_performance_remaining_range`: Range km
- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
#### 4. Uptime Monitoring (UptimeRobot)
- `sensor.blog`: blog uptime
@@ -446,12 +464,17 @@ Named plugs with power/energy tracking:
- Scripts: `script.start_netflix`, `script.start_stremio`
- Scene: `scene.night` (turns off Livia + Michelle plugs)
### Custom Components
- **cowboy**: Cowboy e-bike integration (HACS)
- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
### Custom Components (HACS integrations)
- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
### HACS frontend cards (plugins)
- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
### Integrations
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
### AI / Voice Assistants
- 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@@ -466,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL
- Anca arrival/departure notifications
- Night scene: turns off Livia + Michelle
### Docker Setup
```bash
docker run -d --name homeassistant --privileged \
-e TZ=Europe/London \
-v /home/pi/docker/homeAssistant:/config \
-v /run/dbus:/run/dbus:ro \
--network=host --restart=unless-stopped \
homeassistant/home-assistant:2025.9
```
### Platform (HAOS — ignore any legacy `docker run` snippet)
ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart`. A HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker).
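
The API-driven restart mentioned above might look roughly like this from the Sofia devvm. It is a sketch under assumptions: the helper names are hypothetical, and it assumes `homelab ha token --instance london` prints a long-lived access token on stdout. The endpoint and bearer-token header follow the standard Home Assistant REST API.

```shell
# Base URL is taken from the notes above; helper names are hypothetical.
HA_LONDON_BASE="https://ha-london.viktorbarzin.me"

# Join the base URL with an API path.
ha_london_url() {
  printf '%s%s\n' "$HA_LONDON_BASE" "$1"
}

# Restart Home Assistant through the REST service-call endpoint.
# Assumes the homelab CLI prints the access token on stdout.
ha_london_restart() {
  token="$(homelab ha token --instance london)" || return 1
  curl -fsS -X POST \
    -H "Authorization: Bearer ${token}" \
    -H "Content-Type: application/json" \
    -d '{}' \
    "$(ha_london_url /api/services/homeassistant/restart)"
}
```

After calling `ha_london_restart`, polling `sensor.uptime` for a reset is one way to confirm the instance is back.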
### SSH Access
```bash

.github/workflows/build-authentik.yml

@@ -0,0 +1,39 @@
name: Build Custom Authentik Image

# ADR-0002: infra-owned image built off-infra on GHA → ghcr.
# Thin SLOW-1a overlay over the official authentik server (narrows the login
# identification stage's select_subclasses() to the login-capable source subtypes;
# see stacks/authentik/Dockerfile). Rebuild only when the Dockerfile changes; on
# every authentik bump, edit the FROM tag + the patchN suffix here + the image tag
# in modules/authentik/values.yaml together.
on:
  push:
    branches: [master]
    paths:
      - 'stacks/authentik/Dockerfile'
  workflow_dispatch: {}

permissions:
  contents: read
  packages: write

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: stacks/authentik
          platforms: linux/amd64
          provenance: false
          push: true
          tags: |
            ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3
            ghcr.io/viktorbarzin/authentik-server:latest


@@ -65,6 +65,21 @@ steps:
# don't need explicit token propagation.
VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200
commands:
# ── Forge guard: apply ONLY on the canonical Forgejo forge ──
# infra is registered in Woodpecker on BOTH the Forgejo canonical repo and
# the legacy GitHub mirror, and BOTH fire this push pipeline. Without this
# guard both run `terragrunt apply` on every push and race each other for
# the per-stack PG state lock — the dominant cause of the "Error acquiring
# the state lock" failures + push-supersede "killed" runs. The GitHub-mirror
# registration keeps running the CRONS (drift-detection, renew-tls, …) — only
# its duplicate push-apply no-ops here. Fail-open: an unknown forge (neither
# env var set) still applies, preserving prior behaviour.
- |
if echo "${CI_REPO_URL:-}${CI_FORGE_URL:-}" | grep -qi 'github\.com'; then
echo "[forge-guard] GitHub-mirror push — apply runs only on the Forgejo canonical repo (avoids double-apply + state-lock races). Skipping."
exit 0
fi
# ── Skip CI commits ──
- |
if echo "$CI_COMMIT_MESSAGE" | grep -q '\[CI SKIP\]\|\[ci skip\]'; then
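
The forge guard added in this hunk can be read as a single predicate. The sketch below restates it as a function so the fail-open behaviour is easy to check in isolation; the name `is_github_mirror` is hypothetical, while the two variables are the ones the pipeline step reads.

```shell
# Succeeds (exit 0) when either forge variable mentions github.com,
# meaning the push came from the GitHub mirror and apply should be skipped.
# Unset variables expand to an empty string, so an unknown forge falls
# through to "apply" (fail-open). Hypothetical function name.
is_github_mirror() {
  printf '%s%s\n' "${CI_REPO_URL:-}" "${CI_FORGE_URL:-}" | grep -qi 'github\.com'
}

# Usage sketch inside a pipeline step:
#   if is_github_mirror; then
#     echo "[forge-guard] skipping apply on the GitHub mirror"
#     exit 0
#   fi
```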
@@ -213,23 +228,40 @@ steps:
if [ -s .platform_apply ]; then
echo "=== Applying platform stacks (serial, locked) ==="
while read -r stack; do
# Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role
# lacks Vault-admin perms (sys/mounts + sys/policies/acl), so a CI
# apply always 403s and fails the pipeline. Kept in PLATFORM_STACKS
# (so the app-stack detector still excludes it) but skipped here.
# (2026-06-27 — see docs/architecture/ci-cd.md)
if [ "$stack" = "vault" ]; then echo "[vault] SKIPPED (Tier-0, human-applied via OIDC)"; continue; fi
echo "[$stack] Starting apply..."
ATTEMPT=0
while :; do
ATTEMPT=$((ATTEMPT + 1))
set +e
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
EXIT=$?
set -e
if [ $EXIT -ne 0 ]; then
if echo "$OUTPUT" | grep -q "is locked by"; then
echo "[$stack] SKIPPED (locked by another session)"
else
echo "$OUTPUT" | tail -50
echo "[$stack] FAILED (exit $EXIT)"
FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
if [ $EXIT -eq 0 ]; then
echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
fi
else
echo "$OUTPUT" | tail -3
echo "[$stack] OK"
# Lock contention → SKIP, not fail. Match BOTH the Tier-0 Vault lock
# ("is locked by", from scripts/tg) AND the Tier-1 PG-backend lock
# ("Error acquiring the state lock" / "already locked"). The PG case
# was previously counted as a failure — the #1 source of false reds.
if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
echo "[$stack] SKIPPED (locked by another session/run)"; break
fi
# Transient: provider-registry download timeout / Vault 5xx → bounded
# retry. Deliberately NOT helm atomic-timeouts or config errors
# (missing arg, invalid index) — those must fail fast, retry can't fix
# them and can worsen a stuck helm release.
if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
fi
echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"; break
done
done < .platform_apply
fi
# Deferred until after app stacks so both lists get a chance to run.
@@ -242,22 +274,27 @@ steps:
echo "=== Applying app stacks (serial, locked) ==="
while read -r stack; do
echo "[$stack] Starting apply..."
ATTEMPT=0
while :; do
ATTEMPT=$((ATTEMPT + 1))
set +e
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
EXIT=$?
set -e
if [ $EXIT -ne 0 ]; then
if echo "$OUTPUT" | grep -q "is locked by"; then
echo "[$stack] SKIPPED (locked by another session)"
else
echo "$OUTPUT" | tail -50
echo "[$stack] FAILED (exit $EXIT)"
FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"
if [ $EXIT -eq 0 ]; then
echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
fi
else
echo "$OUTPUT" | tail -3
echo "[$stack] OK"
# Lock contention → SKIP, not fail (Tier-0 Vault + Tier-1 PG; see platform loop).
if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
echo "[$stack] SKIPPED (locked by another session/run)"; break
fi
# Transient provider-download / Vault 5xx → bounded retry (see platform loop).
if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
fi
echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"; break
done
done < .app_apply
fi
# Fail the step loudly so the pipeline `default` workflow state


@ -85,6 +85,13 @@ steps:
stack=$(basename "$stack_dir")
[ -f "$stack_dir/terragrunt.hcl" ] || continue
# Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role lacks
# Vault-admin perms (sys/mounts + sys/policies/acl), so `terragrunt plan`
# on it ERRORs (detailed-exitcode 1) and fails the whole nightly drift
# run. Skip it — drift on Tier-0 vault is caught at human apply time.
# (2026-06-27)
[ "$stack" = "vault" ] && continue
echo -n "[$stack] planning... "
OUTPUT=$(cd "$stack_dir" && terragrunt plan -detailed-exitcode -input=false 2>&1)
EXIT=$?


@ -273,8 +273,11 @@ To land a finished change from such a clone:
Slack audit feed; a no-op CI apply on a docs-only commit is harmless.
4. Leave the clone on clean `master` so auto-refresh keeps working.
5. Tell the user in plain language what happened. Stack changes are
auto-applied by CI — verify the live result with the user's read-only
kubectl before saying "it's live".
auto-applied by CI on push — or, with apply access, applied locally yourself
(`scripts/tg apply`, from the main checkout, not a worktree); either path is
fine, but the change must always be committed here, never applied
uncommitted. Verify the live result with the user's read-only kubectl before
saying "it's live".
If a push to `master` is rejected by branch protection (user not on the
whitelist — e.g. new users before Viktor grants it), fall back to a


@ -125,7 +125,7 @@ How a **Service** is named in flow/audit data — its **namespace** is the prima
_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
**Goldmane / Whisker**:
Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). Durable history requires emitting Goldmane flows to **Loki**; the in-memory buffer alone is not an audit trail.
Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#alerts` digest (the `#security` channel was abandoned 2026-06-25). As-built: `docs/runbooks/goldmane-flow-trail.md`.
_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
### Storage


@ -202,6 +202,69 @@ runs on the devvm, `setInputFiles` streams local files to the remote browser ove
CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md`
and `docs/adr/0013`.
### v0.9 verbs — edges (east-west "who-talks-to-whom" trail)
Read-only investigation helper over the `goldmane_edges` CNPG trail (ADR-0014):
filters render to a single safe `SELECT` (namespace values validated to the k8s
name charset) run via the dbaas primary pod — the same exec path as `k8s db`.
| Command | Tier | What it does |
| --- | --- | --- |
| `edges --ns <ns>` | read | edges touching `<ns>` (either direction) |
| `edges --src <ns>` / `--dst <ns>` | read | directional: `<ns>`'s egress / ingress peers |
| `edges --peers-of <ns>` | read | distinct peer namespaces of `<ns>` (both directions) |
| `edges --new-since <24h\|7d\|YYYY-MM-DD>` | read | edges first seen since a duration or date |
| `edges --denied` | read | only `action='deny'` edges (blocked / lateral-movement) |
| `edges --json` / `--limit N` | read | JSON array output / row cap (default 200) |
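
The "validated to the k8s name charset" step above can be sketched as follows. This is a minimal illustration, not the CLI's actual `buildEdgesQuery`: the helper names `validNamespace` and `nsFilter` and the column names `src_ns` / `dst_ns` are assumptions made for the example. The point it shows is that a value restricted to the RFC 1123 label charset cannot carry a quote, so it is safe to embed in a string literal.

```go
package main

import (
	"fmt"
	"regexp"
)

// nsPattern is the RFC 1123 DNS-label charset Kubernetes uses for namespace names.
var nsPattern = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// validNamespace reports whether ns looks like a Kubernetes namespace name.
func validNamespace(ns string) bool {
	return len(ns) > 0 && len(ns) <= 63 && nsPattern.MatchString(ns)
}

// nsFilter renders a hypothetical "--ns" filter clause.
// The column names src_ns and dst_ns are assumptions, not the real schema.
func nsFilter(ns string) (string, error) {
	if !validNamespace(ns) {
		return "", fmt.Errorf("invalid namespace %q", ns)
	}
	return fmt.Sprintf("(src_ns = '%s' OR dst_ns = '%s')", ns, ns), nil
}

func main() {
	for _, ns := range []string{"immich", "x'; DROP TABLE edges; --"} {
		clause, err := nsFilter(ns)
		fmt.Printf("clause=%q err=%v\n", clause, err)
	}
}
```

Rejecting the value outright, rather than escaping it, keeps the rendered `SELECT` easy to audit.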
### v0.10 — `vault get --all` (browse every field)
`vault get <name> --all` returns the **whole item** as a normalized JSON object,
so an agent can discover and read fields the single-field `--field` allowlist
can't reach — notably arbitrary **custom fields**.
| Command | Tier | What it does |
| --- | --- | --- |
| `vault get <name> --all` | read | all fields as JSON: `{name, username?, password?, uris?, totp?, notes?, fields?}` |
Shape notes: present standard fields only (empty ones omitted); `fields` is a
custom `name→value` map (duplicate names → last-wins; `linked` fields skipped).
The TOTP **seed is never emitted**: `totp` is a presence flag (`true`), so the
only seed-derived path stays the specially-audited `vault code`. Like
`get --json`, the dump is all secret values, so it **refuses a terminal** — pipe
it (`homelab vault get <name> --all | jq`).
### v0.10.1 — reads `bw sync` first (always fresh)
Every vault read (`get`, `get --all`, `list`, `code`, `status`) now runs `bw
sync` when opening its session, so it reflects the latest server-side values.
`bw unlock` only decrypts the *local* cache, so without this a persisted
(already-logged-in) session served stale data — a password changed in the web
vault wouldn't show up until the next login. The sync is **best-effort**: a
transient failure warns on stderr and falls back to the cached vault rather than
failing the read.
### v0.11 — `vault kv` (HashiCorp Vault / OpenBao infra secrets)
`homelab vault` now fronts **two unrelated stores**, made explicit in the bare
`homelab vault` help and via `[vaultwarden]` / `[hashicorp-vault]` summary tags:
- **Vaultwarden** — your personal password manager (`vault get/list/code/…`, unchanged).
- **HashiCorp Vault / OpenBao** — homelab infra secrets, the `secret/…` KV store, under `vault kv`.
| Command | Tier | What it does |
| --- | --- | --- |
| `vault kv get <path> [--field K]` | read | read a secret: `--field K` → one value (TTY-aware clipboard/stdout); no field → all fields as JSON (refuses a bare TTY) |
| `vault kv list <path>` | read | list sub-paths under `<path>` (no values) |
| `vault kv put <path> <key>` | write | write one key; **value via stdin** (piped or no-echo prompt, never argv); creates the path or **merges** (never clobbers siblings) |
**Different credentials:** the Vaultwarden verbs use the per-user *scoped* token
(bound to `claude-users/<user>`); `vault kv` uses your **own** Vault token
(`vault login -method=oidc` → `~/.vault-token`, or `$VAULT_TOKEN`); the kv
handlers set `VAULT_ADDR` but never inject the scoped token (which would 403 off
its own path). Access is whatever your policy grants. Writes are merge-only;
`put` (replace) / `delete` are out of scope — use the raw `vault` CLI.
## Build / install
Built from source to `/usr/local/bin/homelab` during devvm provisioning


@ -1 +1 @@
v0.8.1
v0.11.0

cli/cmd_edges.go Normal file

@ -0,0 +1,69 @@
package main
import "fmt"
func edgesCommands() []Command {
return []Command{
{Path: []string{"edges"}, Tier: TierRead,
Summary: "who-talks-to-whom trail: edges [--ns|--src|--dst|--peers-of N] [--new-since 24h] [--denied] [--json] [--limit N]",
Run: edgesRun},
}
}
// edgesRun renders the filter flags to SQL and runs it read-only against the
// goldmane_edges CNPG DB via the dbaas primary pod (same exec path as `k8s db`).
func edgesRun(args []string) error {
for _, a := range args {
if a == "-h" || a == "--help" {
fmt.Print(edgesUsage())
return nil
}
}
o, err := parseEdgesArgs(args)
if err != nil {
return fmt.Errorf("%w\n\n%s", err, edgesUsage())
}
sql, err := buildEdgesQuery(o)
if err != nil {
return err
}
// pg-cluster-rw is a Service (not exec-able); resolve the primary POD.
pod, err := kubectlCapture("dbaas", "get", "pod", "-l", "cnpg.io/instanceRole=primary",
"-o", "jsonpath={.items[0].metadata.name}")
if err != nil || pod == "" {
return fmt.Errorf("could not resolve CNPG primary pod in dbaas: %v", err)
}
exec := []string{"exec", pod, "-c", "postgres", "--", "psql", "-U", "postgres", "-d", "goldmane_edges"}
if o.asJSON {
exec = append(exec, "-tAc", sql) // raw tuple → the JSON array
} else {
exec = append(exec, "-P", "pager=off", "-c", sql) // aligned table for humans
}
return kubectlStream("dbaas", exec...)
}
func edgesUsage() string {
return `homelab edges: query the who-talks-to-whom trail (goldmane_edges, ADR-0014)
Usage: homelab edges [filters]
Filters (AND-combined; namespace values are validated to the k8s name charset):
--ns NAME edges touching NAME (either direction)
--src NAME edges where source namespace = NAME
--dst NAME edges where destination namespace = NAME
--peers-of NAME distinct peer namespaces of NAME (both directions)
--new-since SPEC first seen since SPEC: a duration (24h, 7d, 30m, 90s) or a date (YYYY-MM-DD)
--denied only denied (action='deny') edges: blocked / lateral-movement attempts
--json output a JSON array (for agents/pipelines)
--limit N cap rows (default 200)
Examples:
homelab edges --ns immich # everything immich talks to / is talked to by
homelab edges --peers-of authentik # authentik's peer namespaces
homelab edges --src recruiter-responder # that namespace's egress peers
homelab edges --new-since 24h # edges first seen in the last day
homelab edges --denied --json # blocked flows, machine-readable
Read-only SELECT against CNPG DB goldmane_edges via the dbaas primary pod.
`
}


@ -54,10 +54,7 @@ func printMemories(raw []byte, jsonOut bool) error {
return nil
}
for _, m := range r.Memories {
c := strings.ReplaceAll(m.Content, "\n", " ")
if len(c) > 240 {
c = c[:240] + "…"
}
c := truncatePreview(strings.ReplaceAll(m.Content, "\n", " "), 240)
fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
if m.Tags != "" {
fmt.Printf(" tags: %s\n", m.Tags)
@ -66,6 +63,21 @@ func printMemories(raw []byte, jsonOut bool) error {
return nil
}
// truncatePreview shortens s to at most maxRunes RUNES, appending "…" when it
// trims. Counting runes (not bytes) is load-bearing: a byte slice like s[:240]
// can cut through the middle of a multibyte UTF-8 character (e.g. 2-byte
// Cyrillic), leaving a dangling lead byte = invalid UTF-8. That crashed strict
// decoders downstream — notably the homelab-memory-recall.py UserPromptSubmit
// hook (subprocess text=True), which surfaced as a recurring "UserPromptSubmit
// hook error" for Cyrillic-language users.
func truncatePreview(s string, maxRunes int) string {
r := []rune(s)
if len(r) <= maxRunes {
return s
}
return string(r[:maxRunes]) + "…"
}
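
The byte-versus-rune distinction described in the comment above can be reproduced in isolation. The sketch below is illustrative only: `truncateBytes` mimics the old `c[:240]` slice, `truncateRunes` mirrors `truncatePreview`, and `validUTF8` is a helper added for the demo.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateBytes mimics the old behaviour: slice by byte offset.
func truncateBytes(s string, max int) string {
	if len(s) <= max {
		return s
	}
	return s[:max] + "…"
}

// truncateRunes mirrors truncatePreview: slice by rune count.
func truncateRunes(s string, maxRunes int) string {
	r := []rune(s)
	if len(r) <= maxRunes {
		return s
	}
	return string(r[:maxRunes]) + "…"
}

// validUTF8 is a demo helper wrapping utf8.ValidString.
func validUTF8(s string) bool { return utf8.ValidString(s) }

func main() {
	s := "привет" // 6 runes, 12 bytes: each Cyrillic letter is 2 bytes
	fmt.Println("byte slice valid:", validUTF8(truncateBytes(s, 3)))
	fmt.Println("rune slice valid:", validUTF8(truncateRunes(s, 3)))
}
```

Cutting at byte 3 lands inside the second letter, so the byte-sliced result is not valid UTF-8, while the rune-sliced result is.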
func memoryRecall(args []string) error {
req := memRecallReq{}
jsonOut := false


@ -4,6 +4,7 @@ import (
"bufio"
"encoding/base64"
"encoding/json"
"errors"
"fmt"
"os"
"os/exec"
@ -15,43 +16,60 @@ import (
// Identity is the kernel UID; per-user creds live in that user's isolated Vault
// path (secret/workstation/claude-users/<user>) read via their scoped token, and
// decryption is done by the official `bw` CLI. See
// docs/superpowers/specs/2026-06-24-homelab-vault-design.md.
// docs/runbooks/homelab-vault-onboarding.md.
func vaultCommands() []Command {
return []Command{
cmds := []Command{
// Vaultwarden — your personal password manager (logins/passwords/TOTP).
{Path: []string{"vault", "setup"}, Tier: TierWrite,
Summary: "one-time: store your Vaultwarden master password + API key in your Vault path", Run: vaultSetup},
Summary: "[vaultwarden] one-time: store your master password + API key in your Vault path", Run: vaultSetup},
{Path: []string{"vault", "status"}, Tier: TierRead,
Summary: "show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
Summary: "[vaultwarden] show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
{Path: []string{"vault", "list"}, Tier: TierRead,
Summary: "list your item names: vault list [--search Q]", Run: vaultList},
Summary: "[vaultwarden] list your item names: vault list [--search Q]", Run: vaultList},
{Path: []string{"vault", "get"}, Tier: TierRead,
Summary: "fetch one item: vault get <name> [--field password|username|uri|notes|totp] [--json]", Run: vaultGet},
Summary: "[vaultwarden] fetch one login: vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]", Run: vaultGet},
{Path: []string{"vault", "search"}, Tier: TierRead,
Summary: "search your item names: vault search <query>", Run: vaultSearch},
Summary: "[vaultwarden] search your item names: vault search <query>", Run: vaultSearch},
{Path: []string{"vault", "code"}, Tier: TierRead,
Summary: "current TOTP code for an item: vault code <name>", Run: vaultCode},
Summary: "[vaultwarden] current TOTP code for an item: vault code <name>", Run: vaultCode},
{Path: []string{"vault", "lock"}, Tier: TierWrite,
Summary: "lock/log out the local bw session", Run: vaultLock},
Summary: "[vaultwarden] lock/log out the local bw session", Run: vaultLock},
{Path: []string{"vault"}, Tier: TierRead,
Summary: "Vaultwarden access for your own vault (run `homelab vault` for help)",
Summary: "two stores: Vaultwarden (logins) + HashiCorp Vault/OpenBao kv (infra secrets) — run `homelab vault` for help",
Run: func([]string) error { fmt.Print(vaultHelp()); return nil }},
}
// HashiCorp Vault / OpenBao — homelab INFRA secrets (the secret/… KV store).
return append(cmds, vaultKVCommands()...)
}
// vaultHelp is shown for bare `homelab vault`.
// vaultHelp is shown for bare `homelab vault`. It LEADS with the distinction
// between the two unrelated "vaults" this command fronts, because the name
// collides: Vaultwarden (a password manager) vs HashiCorp Vault / OpenBao (the
// infra secrets store).
func vaultHelp() string {
return `homelab vault read YOUR OWN Vaultwarden logins (no-HITL after one-time setup)
return `homelab vault: two different secret stores under one command:
Vaultwarden: your personal PASSWORD MANAGER (logins / passwords / TOTP)
HashiCorp Vault / OpenBao: homelab INFRA secrets (the secret/… KV store) → 'vault kv …'
Vaultwarden (reads YOUR OWN vault; no-HITL after one-time setup)
homelab vault setup one-time: store your master password + API key in your Vault path
homelab vault status configured / unlocked / reachable (no secrets)
homelab vault list [--search Q] list your item names (no secrets)
homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
TTY → clipboard (auto-clears); piped → stdout
homelab vault get <name> --all all fields (incl. custom) as JSON; piped only.
TOTP shown as presence flag; use 'vault code' for a code.
homelab vault code <name> current TOTP code
homelab vault lock lock / log out the local bw session
Creds live only in your own Vault path; the admin never sees them. Identity is
your unix UID. Security model: docs/superpowers/specs/2026-06-24-homelab-vault-design.md
HashiCorp Vault / OpenBao (infra secrets; uses your own OIDC vault token)
homelab vault kv get <path> [--field K] read an infra KV secret
homelab vault kv list <path> list sub-paths
homelab vault kv put <path> <key> write one key (value via stdin)
Vaultwarden creds live only in your own Vault path; the admin never sees them.
Security model: docs/runbooks/homelab-vault-onboarding.md
(note: anything running as your user can decrypt your vault; this is the accepted no-HITL trade).
`
}
@ -79,7 +97,33 @@ func realRunner(name string, argv, envv []string) (string, error) {
out, err := cmd.Output()
// Trim only the trailing newline the tool appends — NOT all whitespace, so a
// fetched secret with significant leading/trailing spaces is preserved.
return strings.TrimRight(string(out), "\r\n"), err
return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err))
}
// exitStderr returns the stderr captured by cmd.Output() on a failed exec (it
// stows it on *exec.ExitError), or nil. The tools we shell out to (vault, bw)
// write the actionable message there — "connection refused", "permission
// denied" — which the caller would otherwise never see behind a bare
// "exit status N".
func exitStderr(err error) []byte {
var ee *exec.ExitError
if errors.As(err, &ee) {
return ee.Stderr
}
return nil
}
// augmentErr appends captured stderr to an error so failures are diagnosable
// (not just "exit status 2"). Returns nil when err is nil, and err unchanged
// when there's no stderr; preserves the wrapped error for errors.Is/As.
func augmentErr(err error, stderr []byte) error {
if err == nil {
return nil
}
if s := strings.TrimSpace(string(stderr)); s != "" {
return fmt.Errorf("%w: %s", err, s)
}
return err
}
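
The contract described above (nil stays nil, empty stderr leaves the error unchanged, and the original error stays reachable through `errors.Is`) can be checked with a small standalone sketch. `errExit` is a stand-in sentinel invented for the demo in place of a real `*exec.ExitError`, and `isExit` is a demo helper.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errExit stands in for the *exec.ExitError a failed child process would return.
var errExit = errors.New("exit status 2")

// augmentErr appends captured stderr to err, keeping err wrapped via %w.
func augmentErr(err error, stderr []byte) error {
	if err == nil {
		return nil
	}
	if s := strings.TrimSpace(string(stderr)); s != "" {
		return fmt.Errorf("%w: %s", err, s)
	}
	return err
}

// isExit is a demo helper: does err still wrap the sentinel?
func isExit(err error) bool { return errors.Is(err, errExit) }

func main() {
	wrapped := augmentErr(errExit, []byte("connection refused\n"))
	fmt.Println(wrapped)         // message now carries the stderr text
	fmt.Println(isExit(wrapped)) // the original error is still matchable
}
```

Using `%w` rather than `%v` is what keeps callers that branch on the underlying error type working after the message is enriched.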
// realRunnerStdin runs a command feeding `stdin` to it, for secret values that
@ -92,7 +136,7 @@ func realRunnerStdin(name string, argv, envv []string, stdin string) (string, er
}
cmd.Stdin = strings.NewReader(stdin)
out, err := cmd.Output()
return strings.TrimRight(string(out), "\r\n"), err
return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err))
}
func vwCredsPath(user string) string { return vwUserPathPrefix + user }
@ -128,6 +172,89 @@ func loadCreds(run cmdRunner, user string) (vwCreds, error) {
var vaultCurrentUser = func() string { return os.Getenv("USER") }
var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) }
// scopedTokenPath is where claude-auth-sync keeps the user's scoped Vault token.
// MUST match CAS_VAULT_TOKEN_FILE in scripts/workstation/claude-auth-sync.sh.
func scopedTokenPath(home string) string {
return home + "/.config/claude-auth-sync/vault-token"
}
// vaultTokenSource decides which Vault token the `vault` child processes should
// use. Precedence: an explicit $VAULT_TOKEN (deliberate override), then the
// per-user scoped token claude-auth-sync maintains at scopedTokenPath(HOME)
// (policy workstation-claude-<user>, which grants exactly the create/read/update
// this tool needs on the user's own path), then a native ~/.vault-token.
//
// The scoped token MUST beat ~/.vault-token: this tool only ever touches the
// caller's own secret/workstation/claude-users/<user> path, and a power-user who
// ran `vault login -method=oidc` carries a read-only ~/.vault-token whose
// capability on that path is `deny` — letting it win shadows the scoped token
// and every op fails 403/deny (emo, 2026-06-28). ~/.vault-token is only the
// right credential when there is no scoped token (admins). Returns the token to
// export — "" when the vault CLI should read the ambient/native credential —
// plus a source tag for tests/logging.
func vaultTokenSource(envToken string, haveVaultTokenFile bool, scopedToken string) (token, source string) {
switch {
case envToken != "":
return "", "env"
case strings.TrimSpace(scopedToken) != "":
return strings.TrimSpace(scopedToken), "scoped"
case haveVaultTokenFile:
return "", "file"
default:
return "", "none"
}
}
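
The precedence rules above are easy to misread, so here is a table-style walk-through. The function body is copied from the diff; the sample token strings are made up for the demo.

```go
package main

import (
	"fmt"
	"strings"
)

// vaultTokenSource: explicit env token, then scoped token, then ~/.vault-token.
// It returns a token to export only in the "scoped" case.
func vaultTokenSource(envToken string, haveVaultTokenFile bool, scopedToken string) (token, source string) {
	switch {
	case envToken != "":
		return "", "env"
	case strings.TrimSpace(scopedToken) != "":
		return strings.TrimSpace(scopedToken), "scoped"
	case haveVaultTokenFile:
		return "", "file"
	default:
		return "", "none"
	}
}

func main() {
	cases := []struct {
		env    string
		file   bool
		scoped string
	}{
		{"hvs.explicit", true, "hvs.scoped"}, // explicit override wins
		{"", true, "hvs.scoped\n"},           // scoped beats ~/.vault-token
		{"", true, ""},                       // no scoped token: native file
		{"", false, "   "},                   // nothing usable
	}
	for _, c := range cases {
		tok, src := vaultTokenSource(c.env, c.file, c.scoped)
		fmt.Printf("env=%q file=%v scoped=%q -> token=%q source=%s\n",
			c.env, c.file, c.scoped, tok, src)
	}
}
```

Only the "scoped" source returns a non-empty token; in the other cases the `vault` CLI is left to read the ambient credential itself.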
// vaultAddrDefault is the cluster Vault the workstation talks to. The bw server
// is likewise hardcoded (openSession), so a sane default here is consistent.
const vaultAddrDefault = "https://vault.viktorbarzin.me"
// vaultAddrToSet returns the VAULT_ADDR to export when the caller's environment
// doesn't already set one, else "". homelab vault is invoked by AFK agent
// sessions — frequently non-login shells (tmux panes, agent subprocesses) that
// never sourced /etc/environment — so, like claude-auth-sync, the CLI must NOT
// depend on an ambient VAULT_ADDR; otherwise every `vault` child falls back to
// the 127.0.0.1:8200 default and fails "connection refused" (exit 2).
func vaultAddrToSet(envAddr string) string {
if strings.TrimSpace(envAddr) == "" {
return vaultAddrDefault
}
return ""
}
// ensureVaultAddr exports the default VAULT_ADDR when none is set, so the vault
// child processes reach the cluster Vault regardless of the caller's shell. An
// explicit VAULT_ADDR (admins, CI) is left untouched.
func ensureVaultAddr() {
if a := vaultAddrToSet(os.Getenv("VAULT_ADDR")); a != "" {
os.Setenv("VAULT_ADDR", a)
}
}
// fileNonEmpty reports whether path exists and has content.
func fileNonEmpty(path string) bool {
fi, err := os.Stat(path)
return err == nil && fi.Size() > 0
}
// ensureVaultToken wires vaultTokenSource to the real environment: when the user
// has no ambient Vault credential, it exports the claude-auth-sync scoped token
// so the `vault` child processes authenticate as workstation-claude-<user>. It
// is idempotent and safe for admins, whose explicit $VAULT_TOKEN / ~/.vault-token
// take precedence and are left untouched.
func ensureVaultToken() {
// Every vault verb funnels through here, so this is the one place that also
// guarantees VAULT_ADDR is set (see vaultAddrToSet for why it can't be
// assumed from the caller's shell).
ensureVaultAddr()
home := os.Getenv("HOME")
scoped, _ := os.ReadFile(scopedTokenPath(home))
tok, src := vaultTokenSource(os.Getenv("VAULT_TOKEN"), home != "" && fileNonEmpty(home+"/.vault-token"), string(scoped))
if src == "scoped" {
os.Setenv("VAULT_TOKEN", tok)
}
}
// bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately
// do NOT inherit the full parent env (keeps stray secrets out of the child).
func bwBaseEnv(appdata string) []string {
@ -160,7 +287,9 @@ func bwSecretEnv(appdata string, c vwCreds, session string) []string {
func bwLoginArgs() []string { return []string{"login", "--apikey"} }
func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} }
func bwGetArgs(field, name string) []string { return []string{"get", field, name} }
func bwItemArgs(name string) []string { return []string{"get", "item", name} }
func bwStatusArgs() []string { return []string{"status"} }
func bwSyncArgs() []string { return []string{"sync"} }
// bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is
// required. Unparseable/empty output → true (safer to attempt login).
@ -327,13 +456,23 @@ func openSession(run cmdRunner, user, uid string) (session, error) {
if err != nil {
return session{}, err
}
return session{env: bwSecretEnv(appdata, creds, sess)}, nil
sessEnv := bwSecretEnv(appdata, creds, sess)
// Pull the latest server-side state so reads reflect current values. `bw
// unlock` only decrypts the LOCAL cache, so a persisted (already-logged-in)
// session would otherwise serve stale data until the next login. Best-effort:
// a transient sync failure must not break a read — fall back to the cached
// vault and warn (status reports reachability separately).
if _, err := run("bw", bwSyncArgs(), sessEnv); err != nil {
fmt.Fprintln(os.Stderr, "homelab vault: warning: bw sync failed; using cached vault (values may be stale): "+err.Error())
}
return session{env: sessEnv}, nil
}
type getOpts struct {
name string
field string
json bool
all bool // dump every field (incl. custom) as normalized JSON
}
var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true}
@ -345,6 +484,8 @@ func parseGetArgs(args []string) (getOpts, error) {
switch {
case a == "--json":
o.json = true
case a == "--all":
o.all = true
case a == "--field" && i+1 < len(args):
o.field = args[i+1]
i++
@ -355,9 +496,10 @@ func parseGetArgs(args []string) (getOpts, error) {
}
}
if o.name == "" {
return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json]")
return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]")
}
if !validGetFields[o.field] {
// --all dumps the whole item, so --field is irrelevant — skip its allowlist.
if !o.all && !validGetFields[o.field] {
return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field)
}
return o, nil
@ -373,6 +515,81 @@ func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) {
return bwGet(run, s.env, o.field, o.name)
}
// getItem opens a session and returns the whole item as raw `bw get item` JSON.
// Used by `get --all`; normalization is a separate, pure step (normalizeItem).
func getItem(run cmdRunner, user, uid, name string) (string, error) {
s, err := openSession(run, user, uid)
if err != nil {
return "", err
}
return run("bw", bwItemArgs(name), s.env)
}
// normalizedItem is the browse-all-fields projection of a Vaultwarden item: the
// standard login fields that are present, notes, and a flat map of custom field
// name→value. bw internals (id, object, reprompt, passwordHistory) are dropped,
// and the TOTP *seed* is reduced to a presence flag — the only seed-derived path
// stays the specially-audited `vault code` (see the design §10/§16).
type normalizedItem struct {
Name string `json:"name"`
Username string `json:"username,omitempty"`
Password string `json:"password,omitempty"`
URIs []string `json:"uris,omitempty"`
TOTP bool `json:"totp,omitempty"` // presence only, never the seed
Notes string `json:"notes,omitempty"`
Fields map[string]string `json:"fields,omitempty"` // custom field name→value
}
// bwFieldLinked is the Bitwarden custom-field type for a "linked" field: it
// references another field and carries a null value, so it is not real data.
const bwFieldLinked = 3
// normalizeItem parses a `bw get item` payload into the browse projection. It is
// pure (no I/O), so it is the unit-tested heart of `get --all`.
func normalizeItem(raw string) (normalizedItem, error) {
var it struct {
Name string `json:"name"`
Notes string `json:"notes"`
Login *struct {
Username string `json:"username"`
Password string `json:"password"`
Totp string `json:"totp"`
URIs []struct {
URI string `json:"uri"`
} `json:"uris"`
} `json:"login"`
Fields []struct {
Name string `json:"name"`
Value string `json:"value"`
Type int `json:"type"`
} `json:"fields"`
}
if err := json.Unmarshal([]byte(raw), &it); err != nil {
return normalizedItem{}, fmt.Errorf("parse bw item: %w", err)
}
n := normalizedItem{Name: it.Name, Notes: it.Notes}
if it.Login != nil {
n.Username = it.Login.Username
n.Password = it.Login.Password
n.TOTP = it.Login.Totp != ""
for _, u := range it.Login.URIs {
if u.URI != "" {
n.URIs = append(n.URIs, u.URI)
}
}
}
for _, f := range it.Fields {
if f.Type == bwFieldLinked {
continue // references another field, no value of its own
}
if n.Fields == nil {
n.Fields = map[string]string{}
}
n.Fields[f.Name] = f.Value // duplicate names: last-wins (rare; documented)
}
return n, nil
}
// clipboardDecision picks how to return a secret value. "stdout" prints it (a
// pipe/agent — the intended machine path); "clipboard" copies via OSC52;
// "refuse" emits nothing sensitive (would otherwise risk dumping the secret's
@ -443,6 +660,7 @@ func runList(run cmdRunner, user, uid, search string) ([]string, error) {
func vaultList(args []string) error {
hardenProcess()
ensureVaultToken()
search := ""
for i := 0; i < len(args); i++ {
if args[i] == "--search" && i+1 < len(args) {
@ -477,6 +695,7 @@ func vaultSearch(args []string) error {
func vaultCode(args []string) error {
hardenProcess()
ensureVaultToken()
if len(args) == 0 {
return fmt.Errorf("usage: homelab vault code <name>")
}
@ -508,7 +727,9 @@ func statusSummary(run cmdRunner, user, uid string) string {
if err != nil {
return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error()
}
if _, err := run("bw", []string{"sync"}, s.env); err != nil {
// openSession already did a best-effort sync; status re-runs it explicitly so
// a reachability failure surfaces in this report rather than only on stderr.
if _, err := run("bw", bwSyncArgs(), s.env); err != nil {
return "vault: configured + unlocked, but sync/reachability failed: " + err.Error()
}
return "vault: configured, unlocked, reachable ✓"
@ -516,6 +737,7 @@ func statusSummary(run cmdRunner, user, uid string) string {
func vaultStatus(args []string) error {
hardenProcess()
ensureVaultToken()
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
@ -542,32 +764,61 @@ func vaultLock(args []string) error {
return nil // lock/logout best-effort; never error the caller
}
// vaultPatchPublicArgs writes the non-secret identifiers via argv. Neither the
// kvWriteVerb selects the KV write semantics. merge=true → `kv patch -method=rw`
// (read-modify-write: needs only read+update, NOT the `patch` capability the
// scoped workstation-claude-<user> policy lacks, and preserves co-located keys
// such as claude-auth-sync's claude_ai_oauth_json). merge=false → `kv put`
// (creates the path on first use, before any sibling keys exist).
func kvWriteVerb(merge bool) []string {
if merge {
return []string{"kv", "patch", "-method=rw"}
}
return []string{"kv", "put"}
}
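
To make the two write modes concrete, the sketch below prints the argv each one produces. `kvWriteVerb` is copied from the diff; `secretArgs`, `joined`, and the path `secret/demo/alice` are hypothetical stand-ins for `vaultWriteSecretArgs` and `vwCredsPath(user)`, whose prefix constant is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// kvWriteVerb picks read-modify-write merge or a plain create.
func kvWriteVerb(merge bool) []string {
	if merge {
		return []string{"kv", "patch", "-method=rw"}
	}
	return []string{"kv", "put"}
}

// secretArgs is a hypothetical analogue of vaultWriteSecretArgs that takes the
// KV path directly. The "key=-" form tells the vault CLI to read the value
// from stdin, so the secret never appears in argv.
func secretArgs(merge bool, path, key string) []string {
	return append(kvWriteVerb(merge), path, key+"=-")
}

// joined is a demo helper that renders argv as one line.
func joined(argv []string) string { return strings.Join(argv, " ") }

func main() {
	const path = "secret/demo/alice" // made-up path for illustration
	fmt.Println("vault", joined(secretArgs(false, path, "api_key")))
	fmt.Println("vault", joined(secretArgs(true, path, "api_key")))
}
```

Because `kvWriteVerb` returns a fresh slice literal on every call, appending to its result does not alias state between calls.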
// vaultWritePublicArgs writes the non-secret identifiers via argv. Neither the
// email nor the API client_id is a usable credential on its own.
func vaultPatchPublicArgs(user, email, clientID string) []string {
return []string{"kv", "patch", vwCredsPath(user),
func vaultWritePublicArgs(merge bool, user, email, clientID string) []string {
return append(kvWriteVerb(merge), vwCredsPath(user),
"vaultwarden_email="+email,
"vaultwarden_client_id="+clientID,
}
)
}
// vaultPatchSecretArgs writes ONE secret value via the `key=-` stdin form, so
// the value never appears in argv (ps / /proc/<pid>/cmdline). The value is fed
// on stdin by realRunnerStdin.
func vaultPatchSecretArgs(user, key string) []string {
return []string{"kv", "patch", vwCredsPath(user), key + "=-"}
// vaultWriteSecretArgs writes ONE secret value via the `key=-` stdin form, so the
// value never appears in argv (ps / /proc/<pid>/cmdline). Fed on stdin by
// realRunnerStdin.
func vaultWriteSecretArgs(merge bool, user, key string) []string {
return append(kvWriteVerb(merge), vwCredsPath(user), key+"=-")
}
// writeCreds stores all four fields in the user's Vault path. The two real
// secrets (master password, API client_secret) go via stdin — never argv.
func writeCreds(user string, c vwCreds) error {
if _, err := realRunner("vault", vaultPatchPublicArgs(user, c.Email, c.ClientID), nil); err != nil {
// credsPathExists reports whether the user's KV path already holds data. Used to
// pick create (`kv put`) vs merge (`kv patch -method=rw`) for the first write:
// claude-auth-sync usually creates the path first (Claude OAuth backup), but a
// user could run `homelab vault setup` before that ever happens.
func credsPathExists(run cmdRunner, user string) bool {
_, err := run("vault", []string{"kv", "get", "-format=json", vwCredsPath(user)}, nil)
return err == nil
}
// cmdRunnerStdin is realRunnerStdin's shape, injected so writeCreds is testable.
type cmdRunnerStdin func(name string, argv, envv []string, stdin string) (string, error)
// writeCreds stores all four fields in the user's Vault path using only the
// capabilities the scoped policy grants (create/read/update — NOT `patch`). The
// first (public) write creates the path when absent; the two real secrets then
// merge in via read-modify-write so the public keys — and any claude-auth-sync
// keys already present — survive. Secret values travel on stdin, never argv.
+func writeCreds(run cmdRunner, runStdin cmdRunnerStdin, user string, c vwCreds) error {
+merge := credsPathExists(run, user)
+if _, err := run("vault", vaultWritePublicArgs(merge, user, c.Email, c.ClientID), nil); err != nil {
return err
}
-if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
+// The path now exists regardless of the branch above → merge the secrets in.
+if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
return err
}
-if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
+if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
return err
}
return nil
@@ -593,6 +844,7 @@ func promptLine(prompt string) (string, error) {
func vaultSetup(args []string) error {
hardenProcess()
ensureVaultToken()
fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.")
fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.")
email, err := promptLine("Vaultwarden email: ")
@@ -615,7 +867,7 @@ func vaultSetup(args []string) error {
return fmt.Errorf("all fields are required")
}
c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret}
-if err := writeCreds(vaultCurrentUser(), c); err != nil {
+if err := writeCreds(realRunner, realRunnerStdin, vaultCurrentUser(), c); err != nil {
return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err)
}
fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…")
@@ -634,6 +886,7 @@ func vaultSetup(args []string) error {
func vaultGet(args []string) error {
hardenProcess()
ensureVaultToken()
o, err := parseGetArgs(args)
if err != nil {
return err
@@ -645,6 +898,9 @@ func vaultGet(args []string) error {
}
defer unlock()
user := vaultCurrentUser()
if o.all {
return getAllFields(user, uid, o.name)
}
val, err := getValue(realRunner, user, uid, o)
if err != nil {
return err
@@ -661,3 +917,28 @@ func vaultGet(args []string) error {
return nil
}
// getAllFields prints every field of one item as normalized JSON. Like
// `get --json`, the payload is all secret values, so it refuses a terminal
// (pipe it). The TOTP seed is never emitted — only a presence flag — so no extra
// TOTP audit is needed; the op-log uses a distinct verb so a bulk dump is
// distinguishable from a single-field get (the item name is still never logged).
func getAllFields(user, uid, name string) error {
if !jsonToStdoutOK(stdoutIsTTY()) {
return fmt.Errorf("refusing to print all fields as JSON to a terminal; pipe it (e.g. | jq)")
}
raw, err := getItem(realRunner, user, uid, name)
if err != nil {
return err
}
item, err := normalizeItem(raw)
if err != nil {
return err
}
out, err := json.Marshal(item)
if err != nil {
return err
}
writeOpLog(opRecord{User: user, Verb: "get-all", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
fmt.Println(string(out))
return nil
}

248
cli/cmd_vault_kv.go Normal file

@@ -0,0 +1,248 @@
package main
import (
"encoding/json"
"fmt"
"io"
"os"
"strings"
)
// The `vault kv` verbs talk to HashiCorp Vault / OpenBao — the homelab INFRA
// secrets store (the `secret/…` KV-v2 mount at vault.viktorbarzin.me) — NOT
// Vaultwarden. They are a thin, TTY-aware wrapper over the `vault` CLI that adds
// the same conveniences as the Vaultwarden verbs: a self-defaulted VAULT_ADDR
// (so non-login agent shells work) and clipboard/refuse-on-TTY secret handling.
//
// CREDENTIALS DIFFER FROM THE VAULTWARDEN VERBS. Those use the per-user *scoped*
// token (bound only to secret/workstation/claude-users/<user>). A general kv read
// of e.g. secret/viktor must use the caller's OWN Vault token (the OIDC
// ~/.vault-token or an explicit $VAULT_TOKEN) — the scoped token has `deny`
// everywhere else and would 403. So the kv handlers call ensureVaultAddr() to
// guarantee VAULT_ADDR but deliberately do NOT call ensureVaultToken() (which
// injects the scoped token). Access is then whatever the caller's policy grants.
func vaultKVCommands() []Command {
return []Command{
{Path: []string{"vault", "kv", "get"}, Tier: TierRead,
Summary: "[hashicorp-vault] read an infra KV secret: vault kv get <path> [--field K]", Run: vaultKVGet},
{Path: []string{"vault", "kv", "list"}, Tier: TierRead,
Summary: "[hashicorp-vault] list infra KV sub-paths: vault kv list <path>", Run: vaultKVList},
{Path: []string{"vault", "kv", "put"}, Tier: TierWrite,
Summary: "[hashicorp-vault] write one KV key (value via stdin): vault kv put <path> <key>", Run: vaultKVPut},
{Path: []string{"vault", "kv"}, Tier: TierRead,
Summary: "[hashicorp-vault] infra secrets (run `homelab vault kv` for help)",
Run: func([]string) error { fmt.Print(vaultKVHelp()); return nil }},
}
}
func vaultKVHelp() string {
return `homelab vault kv: HashiCorp Vault / OpenBao (homelab INFRA secrets, the secret/ KV store)
  homelab vault kv get <path> [--field K]   read a secret
      --field K    one value (TTY: clipboard; piped: stdout)
      no --field   all fields as JSON (piped only)
  homelab vault kv list <path>              list sub-paths under <path> (no values)
  homelab vault kv put <path> <key>         write one key; value read from stdin
                                            (piped, or no-echo prompt); merges, never clobbers siblings
Uses YOUR Vault token (vault login -method=oidc, stored in ~/.vault-token); access is
whatever your policy grants. This is NOT Vaultwarden; for your personal logins
use 'homelab vault get' (see 'homelab vault').
`
}
// --- arg builders (pure; values never travel via argv) --------------------
func vaultKVGetFieldArgs(path, field string) []string {
return []string{"kv", "get", "-field=" + field, path}
}
func vaultKVGetJSONArgs(path string) []string { return []string{"kv", "get", "-format=json", path} }
func vaultKVListArgs(path string) []string { return []string{"kv", "list", "-format=json", path} }
// vaultKVPutArgs builds the write argv. merge=true → `kv patch -method=rw`
// (read-modify-write: merges, needs only read+update — not the `patch` capability
// — and preserves sibling keys); merge=false → `kv put` (creates the path on
// first write). The value is ALWAYS read from stdin via the `<key>=-` form, so it
// never appears in argv (visible via ps / /proc/<pid>/cmdline to same-UID procs).
func vaultKVPutArgs(merge bool, path, key string) []string {
return append(kvWriteVerb(merge), path, key+"=-")
}
// --- pure parsers ----------------------------------------------------------
// extractKVData returns the inner secret object from a `vault kv get -format=json`
// envelope (`{"data":{"data":{…},"metadata":{…}}}`), dropping the metadata/request
// wrapper so only the secret's own key→value data is emitted.
func extractKVData(jsonOut string) (string, error) {
var env struct {
Data struct {
Data json.RawMessage `json:"data"`
} `json:"data"`
}
if err := json.Unmarshal([]byte(jsonOut), &env); err != nil {
return "", fmt.Errorf("parse vault kv json: %w", err)
}
if len(env.Data.Data) == 0 {
return "", fmt.Errorf("no secret data at that path")
}
return string(env.Data.Data), nil
}
// parseKVList parses the JSON array `vault kv list -format=json` prints.
func parseKVList(jsonOut string) ([]string, error) {
var keys []string
if err := json.Unmarshal([]byte(jsonOut), &keys); err != nil {
return nil, fmt.Errorf("parse vault kv list json: %w", err)
}
return keys, nil
}
// --- testable cores (injected cmdRunner) -----------------------------------
func kvGetField(run cmdRunner, path, field string) (string, error) {
return run("vault", vaultKVGetFieldArgs(path, field), nil)
}
func kvGetJSON(run cmdRunner, path string) (string, error) {
out, err := run("vault", vaultKVGetJSONArgs(path), nil)
if err != nil {
return "", err
}
return extractKVData(out)
}
func kvList(run cmdRunner, path string) ([]string, error) {
out, err := run("vault", vaultKVListArgs(path), nil)
if err != nil {
return nil, err
}
return parseKVList(out)
}
// kvPathExists reports whether the KV path already holds data, to pick create
// (`kv put`) vs merge (`kv patch -method=rw`) — so a write never clobbers
// sibling keys on an existing path.
func kvPathExists(run cmdRunner, path string) bool {
_, err := run("vault", vaultKVGetJSONArgs(path), nil)
return err == nil
}
// kvPut writes one key, creating the path when absent and merging when present.
// The value travels on stdin only (never argv).
func kvPut(run cmdRunner, runStdin cmdRunnerStdin, path, key, value string) error {
merge := kvPathExists(run, path)
_, err := runStdin("vault", vaultKVPutArgs(merge, path, key), nil, value)
return err
}
// --- handlers --------------------------------------------------------------
func vaultKVGet(args []string) error {
hardenProcess()
ensureVaultAddr() // own token, NOT the scoped one (see file header)
var path, field string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--field" && i+1 < len(args):
field = args[i+1]
i++
case strings.HasPrefix(a, "--field="):
field = strings.TrimPrefix(a, "--field=")
case !strings.HasPrefix(a, "-") && path == "":
path = a
}
}
if path == "" {
return fmt.Errorf("usage: homelab vault kv get <path> [--field <key>]")
}
if field != "" {
val, err := kvGetField(realRunner, path, field)
if err != nil {
return err
}
emitSecret(val) // TTY-aware: clipboard on a terminal, stdout when piped
return nil
}
// No --field → the whole secret. All values, so refuse a bare TTY (like
// `vault get --json`): pick a --field for the clipboard path, or pipe it.
if !jsonToStdoutOK(stdoutIsTTY()) {
return fmt.Errorf("refusing to print all KV fields as JSON to a terminal; use --field <key>, or pipe it (e.g. | jq)")
}
out, err := kvGetJSON(realRunner, path)
if err != nil {
return err
}
fmt.Println(out)
return nil
}
func vaultKVList(args []string) error {
ensureVaultAddr()
var path string
for _, a := range args {
if !strings.HasPrefix(a, "-") {
path = a
break
}
}
if path == "" {
return fmt.Errorf("usage: homelab vault kv list <path>")
}
keys, err := kvList(realRunner, path)
if err != nil {
return err
}
for _, k := range keys {
fmt.Println(k)
}
return nil
}
func vaultKVPut(args []string) error {
hardenProcess()
ensureVaultAddr()
var path, key string
for _, a := range args {
if strings.HasPrefix(a, "-") {
continue
}
switch {
case path == "":
path = a
case key == "":
key = a
}
}
if path == "" || key == "" {
return fmt.Errorf("usage: homelab vault kv put <path> <key> (value read from stdin)")
}
value, err := readSecretValue("Value for " + key + ": ")
if err != nil {
return err
}
if value == "" {
return fmt.Errorf("empty value; aborting (nothing written)")
}
if err := kvPut(realRunner, realRunnerStdin, path, key, value); err != nil {
return fmt.Errorf("writing %q to %s failed (does your token have write access? path correct?): %w", key, path, err)
}
fmt.Fprintln(os.Stderr, "wrote "+key+" to "+path)
return nil
}
// readSecretValue obtains a secret value WITHOUT putting it in argv: piped stdin
// is read verbatim (trailing newline trimmed, internal newlines preserved so
// multi-line values like PEM keys survive); an interactive TTY is prompted
// without echo.
func readSecretValue(prompt string) (string, error) {
fi, err := os.Stdin.Stat()
if err == nil && fi.Mode()&os.ModeCharDevice == 0 {
b, rerr := io.ReadAll(os.Stdin)
if rerr != nil {
return "", rerr
}
return strings.TrimRight(string(b), "\r\n"), nil
}
return promptNoEcho(prompt)
}


@@ -2,6 +2,8 @@ package main
import (
"encoding/base64"
"encoding/json"
"errors"
"fmt"
"os"
"reflect"
@@ -233,12 +235,181 @@ func TestStatusSummaryUnconfigured(t *testing.T) {
}
}
-func TestVaultPatchPublicArgs(t *testing.T) {
-got := vaultPatchPublicArgs("emo", "e@x.me", "user.ci")
-want := []string{"kv", "patch", "secret/workstation/claude-users/emo",
func TestEnsureVaultTokenSetsScopedFallback(t *testing.T) {
dir := t.TempDir()
cfg := dir + "/.config/claude-auth-sync"
if err := os.MkdirAll(cfg, 0o700); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK\n"), 0o600); err != nil {
t.Fatal(err)
}
t.Setenv("HOME", dir)
t.Setenv("VAULT_TOKEN", "") // no ambient token
ensureVaultToken()
if got := os.Getenv("VAULT_TOKEN"); got != "SCOPED-TOK" {
t.Fatalf("VAULT_TOKEN = %q, want scoped fallback to be exported", got)
}
}
func TestEnsureVaultTokenKeepsExplicitEnv(t *testing.T) {
dir := t.TempDir()
cfg := dir + "/.config/claude-auth-sync"
if err := os.MkdirAll(cfg, 0o700); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK"), 0o600); err != nil {
t.Fatal(err)
}
t.Setenv("HOME", dir)
t.Setenv("VAULT_TOKEN", "ADMIN-TOK")
ensureVaultToken()
if got := os.Getenv("VAULT_TOKEN"); got != "ADMIN-TOK" {
t.Fatalf("VAULT_TOKEN = %q, must not override an explicit token", got)
}
}
func TestEnsureVaultTokenPrefersScopedOverFile(t *testing.T) {
// Regression: a power-user's read-only OIDC ~/.vault-token must NOT shadow the
// purpose-built scoped token (emo's setup hit 403 because it did, 2026-06-28).
dir := t.TempDir()
cfg := dir + "/.config/claude-auth-sync"
if err := os.MkdirAll(cfg, 0o700); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK"), 0o600); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(dir+"/.vault-token", []byte("STALE-OIDC-TOK"), 0o600); err != nil {
t.Fatal(err)
}
t.Setenv("HOME", dir)
t.Setenv("VAULT_TOKEN", "")
ensureVaultToken()
if got := os.Getenv("VAULT_TOKEN"); got != "SCOPED-TOK" {
t.Fatalf("VAULT_TOKEN = %q, want the scoped token to win over a stale ~/.vault-token", got)
}
}
func TestScopedTokenPath(t *testing.T) {
if got := scopedTokenPath("/home/emo"); got != "/home/emo/.config/claude-auth-sync/vault-token" {
t.Fatalf("scopedTokenPath = %q", got)
}
}
func TestVaultTokenSource(t *testing.T) {
// Precedence: explicit $VAULT_TOKEN > the claude-auth-sync per-user scoped
// token > a native ~/.vault-token. Scoped beats the file so a power-user's
// read-only OIDC ~/.vault-token can't shadow the scoped token on the user's
// own path (emo, 2026-06-28).
cases := []struct {
name string
env string
haveVaultToken bool
scoped string
wantTok, wantSrc string
}{
{"explicit env wins", "abc", true, "S", "", "env"},
{"scoped beats a stale ~/.vault-token", "", true, "S-TOK", "S-TOK", "scoped"},
{"scoped used when no file", "", false, "S-TOK", "S-TOK", "scoped"},
{"native ~/.vault-token only when no scoped", "", true, "", "", "file"},
{"scoped value is trimmed", "", false, " S-TOK\n", "S-TOK", "scoped"},
{"whitespace-only scoped falls back to file", "", true, " \n", "", "file"},
{"nothing configured", "", false, "", "", "none"},
}
for _, c := range cases {
tok, src := vaultTokenSource(c.env, c.haveVaultToken, c.scoped)
if tok != c.wantTok || src != c.wantSrc {
t.Errorf("%s: vaultTokenSource(%q,%v,%q) = (%q,%q), want (%q,%q)",
c.name, c.env, c.haveVaultToken, c.scoped, tok, src, c.wantTok, c.wantSrc)
}
}
}
func TestVaultAddrToSet(t *testing.T) {
// homelab vault is invoked by AFK agent sessions (non-login shells that
// never sourced /etc/environment), so the CLI must self-default VAULT_ADDR
// rather than rely on the ambient env — else every `vault` child hits the
// 127.0.0.1:8200 default and fails "connection refused" (exit 2).
cases := []struct {
name, env, want string
}{
{"unset -> default", "", vaultAddrDefault},
{"whitespace-only -> default", " \n", vaultAddrDefault},
{"explicit kept (empty = leave alone)", "https://vault.example.com", ""},
}
for _, c := range cases {
if got := vaultAddrToSet(c.env); got != c.want {
t.Errorf("%s: vaultAddrToSet(%q) = %q, want %q", c.name, c.env, got, c.want)
}
}
}
func TestEnsureVaultTokenSetsDefaultAddr(t *testing.T) {
dir := t.TempDir() // no scoped token, no ~/.vault-token
t.Setenv("HOME", dir)
t.Setenv("VAULT_TOKEN", "")
t.Setenv("VAULT_ADDR", "") // emo's non-login-shell situation
ensureVaultToken()
if got := os.Getenv("VAULT_ADDR"); got != vaultAddrDefault {
t.Fatalf("VAULT_ADDR = %q, want default %q to be exported", got, vaultAddrDefault)
}
}
func TestEnsureVaultTokenKeepsExplicitAddr(t *testing.T) {
dir := t.TempDir()
t.Setenv("HOME", dir)
t.Setenv("VAULT_TOKEN", "")
t.Setenv("VAULT_ADDR", "https://vault.example.com")
ensureVaultToken()
if got := os.Getenv("VAULT_ADDR"); got != "https://vault.example.com" {
t.Fatalf("VAULT_ADDR = %q, must not override an explicit addr", got)
}
}
func TestAugmentErrSurfacesStderr(t *testing.T) {
if got := augmentErr(nil, []byte("ignored")); got != nil {
t.Fatalf("augmentErr(nil, …) = %v, want nil", got)
}
base := errors.New("exit status 2")
got := augmentErr(base, []byte(" dial tcp 127.0.0.1:8200: connect: connection refused\n"))
if got == nil || !strings.Contains(got.Error(), "connection refused") || !strings.Contains(got.Error(), "exit status 2") {
t.Fatalf("augmentErr did not surface stderr: %v", got)
}
if !errors.Is(got, base) {
t.Fatal("augmentErr lost the wrapped error (errors.Is failed)")
}
if got := augmentErr(base, []byte(" ")); got != base {
t.Fatalf("augmentErr with blank stderr = %v, want the original error unchanged", got)
}
}
func TestKvWriteVerb(t *testing.T) {
// merge=true → read-modify-write patch (needs only read+update, NOT the
// `patch` capability the scoped workstation policy lacks).
if got := kvWriteVerb(true); !reflect.DeepEqual(got, []string{"kv", "patch", "-method=rw"}) {
t.Fatalf("kvWriteVerb(true) = %v", got)
}
// merge=false → put (creates the path on first use)
if got := kvWriteVerb(false); !reflect.DeepEqual(got, []string{"kv", "put"}) {
t.Fatalf("kvWriteVerb(false) = %v", got)
}
}
func TestVaultWritePublicArgs(t *testing.T) {
got := vaultWritePublicArgs(true, "emo", "e@x.me", "user.ci")
want := []string{"kv", "patch", "-method=rw", "secret/workstation/claude-users/emo",
"vaultwarden_email=e@x.me", "vaultwarden_client_id=user.ci"}
if !reflect.DeepEqual(got, want) {
-t.Fatalf("vaultPatchPublicArgs = %v", got)
+t.Fatalf("vaultWritePublicArgs(merge) = %v", got)
}
if got := vaultWritePublicArgs(false, "emo", "e@x.me", "user.ci"); got[0] != "kv" || got[1] != "put" {
t.Fatalf("vaultWritePublicArgs(create) must use `kv put`, got %v", got)
}
for _, a := range got {
if strings.Contains(a, "master_password") || strings.Contains(a, "client_secret") {
@@ -247,12 +418,12 @@ func TestVaultPatchPublicArgs(t *testing.T) {
}
}
-func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) {
+func TestVaultWriteSecretArgsNoValueInArgv(t *testing.T) {
for _, key := range []string{"vaultwarden_master_password", "vaultwarden_client_secret"} {
-got := vaultPatchSecretArgs("emo", key)
-want := []string{"kv", "patch", "secret/workstation/claude-users/emo", key + "=-"}
+got := vaultWriteSecretArgs(true, "emo", key)
+want := []string{"kv", "patch", "-method=rw", "secret/workstation/claude-users/emo", key + "=-"}
if !reflect.DeepEqual(got, want) {
-t.Fatalf("vaultPatchSecretArgs(%q) = %v", key, got)
+t.Fatalf("vaultWriteSecretArgs(%q) = %v", key, got)
}
if got[len(got)-1] != key+"=-" {
t.Fatalf("secret value must be read from stdin (`%s=-`), got %v", key, got)
@@ -260,6 +431,90 @@ func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) {
}
}
// recStdin records a stdin-bearing call for assertions.
type recStdin struct {
argv []string
stdin string
}
// TestWriteCredsCreatesThenMerges: when the path is ABSENT the first (public)
// write must `kv put` (create), and the two secrets must merge via patch -rw
// with values on stdin only — never the buggy plain `kv patch` (needs `patch`).
func TestWriteCredsCreatesThenMerges(t *testing.T) {
var calls [][]string
var stdinCalls []recStdin
run := func(name string, argv, envv []string) (string, error) {
calls = append(calls, append([]string{name}, argv...))
if len(argv) >= 2 && argv[0] == "kv" && argv[1] == "get" {
return "", fmt.Errorf("no value found") // path absent
}
return "", nil
}
runStdin := func(name string, argv, envv []string, stdin string) (string, error) {
stdinCalls = append(stdinCalls, recStdin{append([]string{name}, argv...), stdin})
return "", nil
}
c := vwCreds{Email: "e@x.me", MasterPassword: "PW", ClientID: "user.ci", ClientSecret: "CS"}
if err := writeCreds(run, runStdin, "emo", c); err != nil {
t.Fatalf("writeCreds: %v", err)
}
var sawPut, sawPlainPatch bool
for _, cl := range calls {
j := strings.Join(cl, " ")
if strings.Contains(j, "kv put") {
sawPut = true
}
if strings.Contains(j, "kv patch") && !strings.Contains(j, "-method=rw") {
sawPlainPatch = true
}
}
if !sawPut {
t.Fatalf("path absent → public write must be `kv put`; calls=%v", calls)
}
if sawPlainPatch {
t.Fatalf("must never use plain `kv patch` (needs `patch` capability); calls=%v", calls)
}
if len(stdinCalls) != 2 {
t.Fatalf("want 2 stdin secret writes, got %d", len(stdinCalls))
}
for _, sc := range stdinCalls {
if !strings.Contains(strings.Join(sc.argv, " "), "kv patch -method=rw") {
t.Errorf("secret write must use patch -method=rw: %v", sc.argv)
}
for _, a := range sc.argv {
if strings.Contains(a, "PW") || strings.Contains(a, "CS") {
t.Errorf("secret leaked into argv: %v", sc.argv)
}
}
}
if stdinCalls[0].stdin != "PW" || stdinCalls[1].stdin != "CS" {
t.Errorf("stdin values wrong: %q,%q", stdinCalls[0].stdin, stdinCalls[1].stdin)
}
}
// TestWriteCredsMergesWhenPresent: when the path EXISTS, every write must merge
// (patch -rw) — a `kv put` would wipe sibling keys (e.g. claude_ai_oauth_json).
func TestWriteCredsMergesWhenPresent(t *testing.T) {
var calls [][]string
run := func(name string, argv, envv []string) (string, error) {
calls = append(calls, append([]string{name}, argv...))
return "{}", nil // get succeeds → path exists
}
runStdin := func(name string, argv, envv []string, stdin string) (string, error) {
calls = append(calls, append([]string{name}, argv...))
return "", nil
}
c := vwCreds{Email: "e@x.me", MasterPassword: "PW", ClientID: "user.ci", ClientSecret: "CS"}
if err := writeCreds(run, runStdin, "emo", c); err != nil {
t.Fatalf("writeCreds: %v", err)
}
for _, cl := range calls {
if strings.Contains(strings.Join(cl, " "), "kv put") {
t.Fatalf("path exists → must NOT `kv put` (wipes siblings): %v", cl)
}
}
}
// TestNoSecretInArgvAcrossFlow is the load-bearing security test: across the
// whole get flow (vault reads, bw config/status/login/unlock/get) NO secret
// value may appear in any command's argv — secrets travel via env/stdin only.
@@ -366,3 +621,437 @@ func TestGetValueFlow(t *testing.T) {
t.Fatalf("getValue = %q, %v", val, err)
}
}
// --- vault get --all (browse all fields) ----------------------------------
func TestParseGetArgsAll(t *testing.T) {
o, err := parseGetArgs([]string{"github", "--all"})
if err != nil || o.name != "github" || !o.all {
t.Fatalf("parseGetArgs(--all) = %+v err=%v", o, err)
}
// --all must skip --field validation (field is irrelevant for a full dump).
if _, err := parseGetArgs([]string{"github", "--all", "--field", "evil"}); err != nil {
t.Fatalf("--all must ignore an otherwise-invalid --field, got err=%v", err)
}
// A name is still required.
if _, err := parseGetArgs([]string{"--all"}); err == nil {
t.Fatal("get --all with no name must error")
}
// Without --all, the field allowlist still applies.
if _, err := parseGetArgs([]string{"github", "--field", "evil"}); err == nil {
t.Fatal("invalid --field without --all must still error")
}
}
func TestBwItemArgs(t *testing.T) {
argv := bwItemArgs("github")
if !reflect.DeepEqual(argv, []string{"get", "item", "github"}) {
t.Fatalf("bwItemArgs = %v", argv)
}
for _, a := range argv {
if strings.Contains(a, "SESSION") || a == "--session" {
t.Fatalf("session must travel via env, not argv: %v", argv)
}
}
}
// a representative `bw get item` payload: login fields, multiple URIs, a TOTP
// seed, notes, custom fields (text/hidden/boolean), plus bw internals that MUST
// be dropped (id/object/reprompt/passwordHistory).
const sampleLoginItemJSON = `{
"object":"item","id":"abc-123","folderId":null,"type":1,"reprompt":0,
"name":"GitHub","notes":"my notes","favorite":false,
"fields":[
{"name":"PIN","value":"1234","type":1},
{"name":"endpoint","value":"https://api.gh","type":0},
{"name":"enabled","value":"true","type":2}
],
"login":{
"username":"octocat","password":"hunter2",
"totp":"otpauth://totp/GitHub:octocat?secret=SEEDSEEDSEED",
"uris":[{"match":null,"uri":"https://github.com"},{"match":null,"uri":"https://gist.github.com"}]
},
"passwordHistory":[{"password":"OLD-PASSWORD-XYZ"}]
}`
func TestNormalizeItemLogin(t *testing.T) {
n, err := normalizeItem(sampleLoginItemJSON)
if err != nil {
t.Fatalf("normalizeItem: %v", err)
}
if n.Name != "GitHub" || n.Username != "octocat" || n.Password != "hunter2" || n.Notes != "my notes" {
t.Fatalf("standard fields wrong: %+v", n)
}
if !n.TOTP {
t.Fatal("TOTP presence flag must be true when a seed exists")
}
if !reflect.DeepEqual(n.URIs, []string{"https://github.com", "https://gist.github.com"}) {
t.Fatalf("URIs = %v", n.URIs)
}
want := map[string]string{"PIN": "1234", "endpoint": "https://api.gh", "enabled": "true"}
if !reflect.DeepEqual(n.Fields, want) {
t.Fatalf("custom fields = %v want %v", n.Fields, want)
}
}
// The load-bearing security test: the raw TOTP seed (more powerful than a
// one-time code) and the password history must NEVER appear in the dump.
func TestNormalizeItemNeverLeaksSeedOrHistory(t *testing.T) {
n, err := normalizeItem(sampleLoginItemJSON)
if err != nil {
t.Fatalf("normalizeItem: %v", err)
}
out, err := json.Marshal(n)
if err != nil {
t.Fatalf("marshal: %v", err)
}
for _, leak := range []string{"SEEDSEEDSEED", "otpauth", "OLD-PASSWORD-XYZ", "passwordHistory", "abc-123"} {
if strings.Contains(string(out), leak) {
t.Fatalf("dump leaked %q: %s", leak, out)
}
}
}
func TestNormalizeItemNoTOTP(t *testing.T) {
n, err := normalizeItem(`{"name":"X","type":1,"login":{"username":"u","password":"p"}}`)
if err != nil {
t.Fatalf("normalizeItem: %v", err)
}
if n.TOTP {
t.Fatal("TOTP must be false when no seed present")
}
out, _ := json.Marshal(n)
if strings.Contains(string(out), "totp") {
t.Fatalf("no-totp item must omit the totp key entirely: %s", out)
}
}
func TestNormalizeItemEmptyStandardFieldsOmitted(t *testing.T) {
n, err := normalizeItem(`{"name":"Bare","type":1,"login":{"username":"","password":"","totp":"","uris":[]},"fields":[{"name":"only","value":"x","type":0}]}`)
if err != nil {
t.Fatalf("normalizeItem: %v", err)
}
out, _ := json.Marshal(n)
for _, k := range []string{"username", "password", "uris", "notes", "totp"} {
if strings.Contains(string(out), `"`+k+`"`) {
t.Fatalf("empty standard field %q must be omitted: %s", k, out)
}
}
if !strings.Contains(string(out), `"name":"Bare"`) || !strings.Contains(string(out), `"only":"x"`) {
t.Fatalf("name + custom field must survive: %s", out)
}
}
func TestNormalizeItemSecureNoteNullLogin(t *testing.T) {
// type 2 (secure note): login is null — must not panic; notes + custom fields survive.
n, err := normalizeItem(`{"name":"SN","type":2,"notes":"secret note","login":null,"fields":[{"name":"k","value":"v","type":1}]}`)
if err != nil {
t.Fatalf("normalizeItem(null login): %v", err)
}
if n.Name != "SN" || n.Notes != "secret note" || n.Fields["k"] != "v" {
t.Fatalf("secure-note normalize wrong: %+v", n)
}
if n.Username != "" || n.Password != "" || n.TOTP {
t.Fatalf("login fields must be empty for a login-less item: %+v", n)
}
}
func TestNormalizeItemDuplicateCustomNames(t *testing.T) {
// Bitwarden permits duplicate custom-field names; a JSON object can't hold
// dups, so last-wins (documented).
n, err := normalizeItem(`{"name":"D","fields":[{"name":"k","value":"first","type":0},{"name":"k","value":"second","type":0}]}`)
if err != nil {
t.Fatalf("normalizeItem: %v", err)
}
if n.Fields["k"] != "second" {
t.Fatalf("duplicate custom names must be last-wins, got %q", n.Fields["k"])
}
}
func TestNormalizeItemLinkedFieldSkipped(t *testing.T) {
// type 3 (linked) fields reference another field and carry a null value —
// they are not real data and must be skipped.
n, err := normalizeItem(`{"name":"L","login":{"username":"u"},"fields":[{"name":"linked","value":null,"type":3},{"name":"real","value":"r","type":0}]}`)
if err != nil {
t.Fatalf("normalizeItem: %v", err)
}
if _, ok := n.Fields["linked"]; ok {
t.Fatalf("linked field must be skipped: %v", n.Fields)
}
if n.Fields["real"] != "r" {
t.Fatalf("real custom field dropped: %v", n.Fields)
}
}
func TestNormalizeItemMalformed(t *testing.T) {
if _, err := normalizeItem("not json"); err == nil {
t.Fatal("malformed item JSON must error")
}
}
// getItem opens a session and runs `bw get item <name>`, returning raw JSON.
func TestGetItemFlow(t *testing.T) {
f := &fakeRunner{out: map[string]string{
"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x",
"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs",
"bw status": `{"status":"locked"}`,
"bw unlock": "SESS",
"bw get item github": sampleLoginItemJSON,
}}
uid := fmt.Sprintf("%d", os.Getuid())
raw, err := getItem(f.run, "emo", uid, "github")
if err != nil || !strings.Contains(raw, `"name":"GitHub"`) {
t.Fatalf("getItem = %q, %v", raw, err)
}
// The session key must reach bw via env, never argv.
for _, call := range f.calls {
for _, arg := range call {
if strings.Contains(arg, "SESS") {
t.Errorf("session leaked into argv: %v", call)
}
}
}
}
func TestVaultHelpMentionsAll(t *testing.T) {
if !strings.Contains(vaultHelp(), "--all") {
t.Error("vault help must document --all")
}
}
// --- bw sync on read (freshness) ------------------------------------------
func TestBwSyncArgs(t *testing.T) {
if got := bwSyncArgs(); !reflect.DeepEqual(got, []string{"sync"}) {
t.Fatalf("bwSyncArgs = %v", got)
}
}
// Every read opens a session that first `bw sync`s, so reads reflect the latest
// server-side values: `bw unlock` is local-only, so without a sync a persisted
// (already-logged-in) session serves a stale local cache.
func TestOpenSessionSyncsBeforeRead(t *testing.T) {
f := &fakeRunner{out: map[string]string{
"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x",
"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs",
"bw status": `{"status":"locked"}`,
"bw unlock": "SESS",
"bw sync": "Syncing complete.",
"bw get password github": "p@ss",
}}
uid := fmt.Sprintf("%d", os.Getuid())
if _, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"}); err != nil {
t.Fatalf("getValue: %v", err)
}
idx := func(prefix string) int {
for i, c := range f.calls {
if strings.HasPrefix(strings.Join(c, " "), prefix) {
return i
}
}
return -1
}
syncAt, unlockAt, getAt := idx("bw sync"), idx("bw unlock"), idx("bw get password github")
if syncAt < 0 {
t.Fatal("expected a `bw sync` before the read")
}
if !(unlockAt < syncAt && syncAt < getAt) {
t.Fatalf("order wrong: unlock=%d sync=%d get=%d (want unlock<sync<get)", unlockAt, syncAt, getAt)
}
}
// Sync is best-effort: a transient sync failure must NOT fail the read — the
// cached value is still returned (a stderr warning is emitted, not asserted here).
func TestReadSucceedsWhenSyncFails(t *testing.T) {
f := &fakeRunner{
out: map[string]string{
"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x",
"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs",
"bw status": `{"status":"locked"}`,
"bw unlock": "SESS",
"bw get password github": "p@ss",
},
err: map[string]error{"bw sync": errors.New("Failed to sync: network error")},
}
uid := fmt.Sprintf("%d", os.Getuid())
val, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"})
if err != nil || val != "p@ss" {
t.Fatalf("read must succeed despite a sync failure: val=%q err=%v", val, err)
}
}
// --- vault kv (HashiCorp Vault / OpenBao infra secrets) --------------------
func TestVaultKVCommandsRegistered(t *testing.T) {
want := map[string]Tier{
"vault kv get": TierRead,
"vault kv list": TierRead,
"vault kv put": TierWrite,
}
got := map[string]Tier{}
for _, c := range vaultCommands() {
got[c.name()] = c.Tier
}
for name, tier := range want {
if got[name] != tier {
t.Errorf("command %q: tier=%q, want %q", name, got[name], tier)
}
}
}
func TestVaultKVArgs(t *testing.T) {
if got := vaultKVGetFieldArgs("secret/viktor", "github_pat"); !reflect.DeepEqual(got, []string{"kv", "get", "-field=github_pat", "secret/viktor"}) {
t.Fatalf("vaultKVGetFieldArgs = %v", got)
}
if got := vaultKVGetJSONArgs("secret/viktor"); !reflect.DeepEqual(got, []string{"kv", "get", "-format=json", "secret/viktor"}) {
t.Fatalf("vaultKVGetJSONArgs = %v", got)
}
if got := vaultKVListArgs("secret/"); !reflect.DeepEqual(got, []string{"kv", "list", "-format=json", "secret/"}) {
t.Fatalf("vaultKVListArgs = %v", got)
}
// create (path absent) → put; merge (path present) → patch -method=rw. Either
// way the VALUE travels via the `key=-` stdin form, never argv.
create := vaultKVPutArgs(false, "secret/x", "api_key")
if !reflect.DeepEqual(create, []string{"kv", "put", "secret/x", "api_key=-"}) {
t.Fatalf("vaultKVPutArgs(create) = %v", create)
}
merge := vaultKVPutArgs(true, "secret/x", "api_key")
if !reflect.DeepEqual(merge, []string{"kv", "patch", "-method=rw", "secret/x", "api_key=-"}) {
t.Fatalf("vaultKVPutArgs(merge) = %v", merge)
}
for _, args := range [][]string{create, merge} {
for _, a := range args {
if strings.Contains(a, "SECRETVALUE") {
t.Fatalf("value must not appear in argv: %v", args)
}
}
}
}
func TestExtractKVData(t *testing.T) {
// `vault kv get -format=json` wraps the secret in {"data":{"data":{...},"metadata":{...}}}.
env := `{"request_id":"x","data":{"data":{"github_pat":"ghp_abc","email":"e@x.me"},"metadata":{"version":3}}}`
out, err := extractKVData(env)
if err != nil {
t.Fatalf("extractKVData: %v", err)
}
// Round-trip to a map so key order doesn't matter.
var m map[string]string
if err := json.Unmarshal([]byte(out), &m); err != nil {
t.Fatalf("result not a JSON object: %q (%v)", out, err)
}
if m["github_pat"] != "ghp_abc" || m["email"] != "e@x.me" {
t.Fatalf("extractKVData inner data wrong: %v", m)
}
// metadata must NOT leak into the output.
if strings.Contains(out, "metadata") || strings.Contains(out, "request_id") {
t.Fatalf("envelope internals leaked: %s", out)
}
if _, err := extractKVData("not json"); err == nil {
t.Fatal("malformed envelope must error")
}
}
func TestParseKVList(t *testing.T) {
keys, err := parseKVList(`["app1","app2/","viktor"]`)
if err != nil {
t.Fatalf("parseKVList: %v", err)
}
if !reflect.DeepEqual(keys, []string{"app1", "app2/", "viktor"}) {
t.Fatalf("parseKVList = %v", keys)
}
if _, err := parseKVList("not json"); err == nil {
t.Fatal("malformed list must error")
}
}
func TestKVGetFieldFlow(t *testing.T) {
f := &fakeRunner{out: map[string]string{
"vault kv get -field=github_pat secret/viktor": "ghp_secret",
}}
val, err := kvGetField(f.run, "secret/viktor", "github_pat")
if err != nil || val != "ghp_secret" {
t.Fatalf("kvGetField = %q, %v", val, err)
}
}
func TestKVListFlow(t *testing.T) {
f := &fakeRunner{out: map[string]string{
"vault kv list -format=json secret/": `["app1","app2/"]`,
}}
keys, err := kvList(f.run, "secret/")
if err != nil || !reflect.DeepEqual(keys, []string{"app1", "app2/"}) {
t.Fatalf("kvList = %v, %v", keys, err)
}
}
// kvPut creates the path on first write and merges thereafter, with the value on
// stdin only (mirrors writeCreds). Never plain `kv patch` (needs the patch cap).
func TestKVPutCreatesThenMerges(t *testing.T) {
for _, tc := range []struct {
name string
exists bool
wantCreate bool
}{
{"absent path → create (put)", false, true},
{"present path → merge (patch -rw)", true, false},
} {
t.Run(tc.name, func(t *testing.T) {
var stdinCalls []recStdin
run := func(name string, argv, envv []string) (string, error) {
if len(argv) >= 2 && argv[0] == "kv" && argv[1] == "get" {
if tc.exists {
return `{"data":{"data":{}}}`, nil
}
return "", fmt.Errorf("No value found at secret/x")
}
return "", nil
}
runStdin := func(name string, argv, envv []string, stdin string) (string, error) {
stdinCalls = append(stdinCalls, recStdin{append([]string{name}, argv...), stdin})
return "", nil
}
if err := kvPut(run, runStdin, "secret/x", "api_key", "SECRETVALUE"); err != nil {
t.Fatalf("kvPut: %v", err)
}
if len(stdinCalls) != 1 {
t.Fatalf("want exactly 1 stdin write, got %d", len(stdinCalls))
}
sc := stdinCalls[0]
joined := strings.Join(sc.argv, " ")
if tc.wantCreate && !strings.Contains(joined, "kv put") {
t.Fatalf("absent path must use `kv put`: %v", sc.argv)
}
if !tc.wantCreate && !strings.Contains(joined, "kv patch -method=rw") {
t.Fatalf("present path must merge via `kv patch -method=rw`: %v", sc.argv)
}
if strings.Contains(joined, "kv patch") && !strings.Contains(joined, "-method=rw") {
t.Fatalf("must never use plain `kv patch`: %v", sc.argv)
}
if sc.stdin != "SECRETVALUE" {
t.Fatalf("value must travel via stdin, got %q", sc.stdin)
}
for _, a := range sc.argv {
if strings.Contains(a, "SECRETVALUE") {
t.Fatalf("value leaked into argv: %v", sc.argv)
}
}
})
}
}
func TestVaultHelpMentionsBothSystems(t *testing.T) {
h := vaultHelp()
for _, want := range []string{"Vaultwarden", "vault kv"} {
if !strings.Contains(h, want) {
t.Errorf("vault help must mention %q (distinguish the two systems)", want)
}
}
// Must name the infra-secrets system so the distinction is unambiguous.
if !strings.Contains(h, "HashiCorp") && !strings.Contains(h, "OpenBao") {
t.Error("vault help must name HashiCorp Vault / OpenBao (the infra secrets store)")
}
}

cli/edges.go Normal file

@ -0,0 +1,164 @@
package main
import (
"fmt"
"regexp"
"strconv"
"strings"
)
// edgesOpts is the parsed filter set for `homelab edges` (the who-talks-to-whom
// investigation helper over the goldmane_edges trail; see ADR-0014).
type edgesOpts struct {
ns string // edges touching this namespace (either direction)
src string // edges where src_ns = this
dst string // edges where dst_ns = this
peersOf string // distinct peers of this namespace (both directions)
newSince string // first_seen >= duration (24h/7d/30m) or date (YYYY-MM-DD)
denied bool // action = 'deny' only
asJSON bool // wrap result as a JSON array
limit int // row cap (default 200)
}
// parseEdgesArgs parses the edges flag surface. Unknown flags error out so a
// typo surfaces instead of silently dumping the whole table.
func parseEdgesArgs(args []string) (edgesOpts, error) {
o := edgesOpts{limit: 200}
i := 0
for i < len(args) {
a := args[i]
key, inline, hasInline := a, "", false
if eq := strings.IndexByte(a, '='); eq >= 0 {
key, inline, hasInline = a[:eq], a[eq+1:], true
}
needVal := func() (string, error) {
if hasInline {
return inline, nil
}
if i+1 < len(args) {
i++
return args[i], nil
}
return "", fmt.Errorf("flag %s needs a value", key)
}
var err error
switch key {
case "--ns":
o.ns, err = needVal()
case "--src":
o.src, err = needVal()
case "--dst":
o.dst, err = needVal()
case "--peers-of":
o.peersOf, err = needVal()
case "--new-since":
o.newSince, err = needVal()
case "--denied":
o.denied = true
case "--json":
o.asJSON = true
case "--limit":
var v string
if v, err = needVal(); err == nil {
if o.limit, err = strconv.Atoi(v); err != nil {
err = fmt.Errorf("--limit must be an integer: %q", v)
}
}
default:
return o, fmt.Errorf("unknown flag: %s", a)
}
if err != nil {
return o, err
}
i++
}
return o, nil
}
// nsRE is the safe namespace-token charset (k8s names + "Global"). Used as the
// injection guard — anything else is rejected rather than quoted-and-hoped.
var nsRE = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`)
func validateNS(s string) error {
if s == "" || len(s) > 63 || !nsRE.MatchString(s) {
return fmt.Errorf("invalid namespace name: %q", s)
}
return nil
}
// sqlStr renders a SQL string literal (belt-and-suspenders on top of validateNS).
func sqlStr(s string) string { return "'" + strings.ReplaceAll(s, "'", "''") + "'" }
var (
durRE = regexp.MustCompile(`^(\d+)([smhd])$`)
dateRE = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}([ T]\d{2}:\d{2}(:\d{2})?)?$`)
)
// newSinceCond turns a duration (24h/7d/30m/90s) or a date (YYYY-MM-DD[ HH:MM])
// into a first_seen predicate.
func newSinceCond(v string) (string, error) {
if m := durRE.FindStringSubmatch(v); m != nil {
unit := map[string]string{"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}[m[2]]
return fmt.Sprintf("first_seen >= now() - interval '%s %s'", m[1], unit), nil
}
if dateRE.MatchString(v) {
return "first_seen >= " + sqlStr(v), nil
}
return "", fmt.Errorf("--new-since must be a duration (e.g. 24h, 7d, 30m) or a date (YYYY-MM-DD): %q", v)
}
// buildEdgesQuery renders the SQL for the given filters against the `edge` table.
func buildEdgesQuery(o edgesOpts) (string, error) {
limit := o.limit
if limit <= 0 {
limit = 200
}
// peers-of is a distinct-peer summary, a different shape from the row list.
if o.peersOf != "" {
if err := validateNS(o.peersOf); err != nil {
return "", err
}
p := sqlStr(o.peersOf)
return fmt.Sprintf("SELECT DISTINCT peer, action FROM ("+
"SELECT dst_ns AS peer, action FROM edge WHERE src_ns = %s "+
"UNION SELECT src_ns AS peer, action FROM edge WHERE dst_ns = %s"+
") t ORDER BY peer LIMIT %d", p, p, limit), nil
}
var conds []string
for _, f := range []struct{ val, tmpl string }{
{o.ns, "(src_ns = %[1]s OR dst_ns = %[1]s)"},
{o.src, "src_ns = %s"},
{o.dst, "dst_ns = %s"},
} {
if f.val == "" {
continue
}
if err := validateNS(f.val); err != nil {
return "", err
}
conds = append(conds, fmt.Sprintf(f.tmpl, sqlStr(f.val)))
}
if o.denied {
conds = append(conds, "action = 'deny'")
}
if o.newSince != "" {
c, err := newSinceCond(o.newSince)
if err != nil {
return "", err
}
conds = append(conds, c)
}
q := "SELECT src_ns, dst_ns, action, flow_count, first_seen, last_seen FROM edge"
if len(conds) > 0 {
q += " WHERE " + strings.Join(conds, " AND ")
}
q += fmt.Sprintf(" ORDER BY first_seen DESC LIMIT %d", limit)
if o.asJSON {
q = "SELECT coalesce(json_agg(row_to_json(t)), '[]') FROM (" + q + ") t"
}
return q, nil
}

cli/edges_test.go Normal file

@ -0,0 +1,163 @@
package main
import (
"strings"
"testing"
)
func TestParseEdgesArgs(t *testing.T) {
cases := []struct {
name string
args []string
want edgesOpts
}{
{"defaults", nil, edgesOpts{limit: 200}},
{"ns", []string{"--ns", "immich"}, edgesOpts{ns: "immich", limit: 200}},
{"ns equals", []string{"--ns=immich"}, edgesOpts{ns: "immich", limit: 200}},
{"src dst", []string{"--src", "a", "--dst", "b"}, edgesOpts{src: "a", dst: "b", limit: 200}},
{"peers-of", []string{"--peers-of", "authentik"}, edgesOpts{peersOf: "authentik", limit: 200}},
{"denied json", []string{"--denied", "--json"}, edgesOpts{denied: true, asJSON: true, limit: 200}},
{"new-since", []string{"--new-since", "24h"}, edgesOpts{newSince: "24h", limit: 200}},
{"limit", []string{"--limit", "50"}, edgesOpts{limit: 50}},
}
for _, c := range cases {
t.Run(c.name, func(t *testing.T) {
got, err := parseEdgesArgs(c.args)
if err != nil {
t.Fatalf("parseEdgesArgs(%v) error: %v", c.args, err)
}
if got != c.want {
t.Fatalf("parseEdgesArgs(%v) = %+v, want %+v", c.args, got, c.want)
}
})
}
}
func TestParseEdgesArgsErrors(t *testing.T) {
for _, args := range [][]string{
{"--limit", "abc"},
{"--bogus"},
} {
if _, err := parseEdgesArgs(args); err == nil {
t.Errorf("parseEdgesArgs(%v) expected error, got nil", args)
}
}
}
func TestBuildEdgesQueryDefaults(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{limit: 200})
if err != nil {
t.Fatal(err)
}
for _, want := range []string{"FROM edge", "ORDER BY first_seen DESC", "LIMIT 200"} {
if !strings.Contains(q, want) {
t.Errorf("query %q missing %q", q, want)
}
}
if strings.Contains(q, "WHERE") {
t.Errorf("no-filter query should have no WHERE: %q", q)
}
}
func TestBuildEdgesQueryFilters(t *testing.T) {
cases := []struct {
name string
o edgesOpts
want string
}{
{"ns both directions", edgesOpts{ns: "immich", limit: 10}, "(src_ns = 'immich' OR dst_ns = 'immich')"},
{"src only", edgesOpts{src: "authentik", limit: 10}, "src_ns = 'authentik'"},
{"dst only", edgesOpts{dst: "dbaas", limit: 10}, "dst_ns = 'dbaas'"},
{"denied", edgesOpts{denied: true, limit: 10}, "action = 'deny'"},
}
for _, c := range cases {
t.Run(c.name, func(t *testing.T) {
q, err := buildEdgesQuery(c.o)
if err != nil {
t.Fatal(err)
}
if !strings.Contains(q, "WHERE") || !strings.Contains(q, c.want) {
t.Errorf("query %q missing WHERE/%q", q, c.want)
}
})
}
}
func TestBuildEdgesQueryCombinedFiltersAnded(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{src: "a", denied: true, limit: 5})
if err != nil {
t.Fatal(err)
}
if !strings.Contains(q, "src_ns = 'a' AND action = 'deny'") {
t.Errorf("combined filters not AND'd: %q", q)
}
}
func TestBuildEdgesQueryPeersOf(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{peersOf: "authentik", limit: 100})
if err != nil {
t.Fatal(err)
}
for _, want := range []string{"DISTINCT", "src_ns = 'authentik'", "dst_ns = 'authentik'", "UNION"} {
if !strings.Contains(q, want) {
t.Errorf("peers-of query %q missing %q", q, want)
}
}
}
func TestBuildEdgesQueryJSON(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{asJSON: true, limit: 200})
if err != nil {
t.Fatal(err)
}
if !strings.Contains(q, "json_agg") || !strings.Contains(q, "row_to_json") {
t.Errorf("json query missing json_agg wrapper: %q", q)
}
}
func TestBuildEdgesQueryRejectsInjection(t *testing.T) {
for _, bad := range []string{"a'; DROP TABLE edge;--", "a b", "a;b", "a\"b"} {
if _, err := buildEdgesQuery(edgesOpts{ns: bad, limit: 10}); err == nil {
t.Errorf("buildEdgesQuery(ns=%q) expected validation error, got nil", bad)
}
}
}
func TestNewSinceCond(t *testing.T) {
cases := []struct {
in string
want string
}{
{"24h", "first_seen >= now() - interval '24 hours'"},
{"7d", "first_seen >= now() - interval '7 days'"},
{"30m", "first_seen >= now() - interval '30 minutes'"},
{"2026-06-28", "first_seen >= '2026-06-28'"},
}
for _, c := range cases {
got, err := newSinceCond(c.in)
if err != nil {
t.Fatalf("newSinceCond(%q) error: %v", c.in, err)
}
if got != c.want {
t.Errorf("newSinceCond(%q) = %q, want %q", c.in, got, c.want)
}
}
for _, bad := range []string{"yesterday", "1y", "'; DROP", ""} {
if _, err := newSinceCond(bad); err == nil {
t.Errorf("newSinceCond(%q) expected error, got nil", bad)
}
}
}
func TestValidateNS(t *testing.T) {
for _, ok := range []string{"immich", "calico-system", "kube-system", "Global", "pg-cluster-rw"} {
if err := validateNS(ok); err != nil {
t.Errorf("validateNS(%q) unexpected error: %v", ok, err)
}
}
for _, bad := range []string{"", "a b", "a'b", "a;b", "../x", "a$b"} {
if err := validateNS(bad); err == nil {
t.Errorf("validateNS(%q) expected error, got nil", bad)
}
}
}


@ -20,6 +20,7 @@ func buildRegistry() []Command {
reg = append(reg, deployCommands()...)
reg = append(reg, netCommands()...)
reg = append(reg, obsCommands()...)
reg = append(reg, edgesCommands()...)
reg = append(reg, usageCommands()...)
reg = append(reg, haCommands()...)
reg = append(reg, browserCommands()...)


@ -5,8 +5,31 @@ import (
"os"
"strings"
"testing"
"unicode/utf8"
)
func TestTruncatePreviewKeepsValidUTF8(t *testing.T) {
// Byte-slicing Cyrillic text at a fixed byte offset can split a 2-byte rune and
// emit invalid UTF-8 (the bug that crashed the recall hook). truncatePreview must
// count runes, cut on a rune boundary, and always stay valid UTF-8.
long := strings.Repeat("я", 300) // 300 runes / 600 bytes
got := truncatePreview(long, 240)
if !utf8.ValidString(got) {
t.Fatalf("truncatePreview produced invalid UTF-8: %q", got)
}
if r := []rune(got); len(r) != 241 || string(r[:240]) != strings.Repeat("я", 240) || r[240] != '…' {
t.Fatalf("truncatePreview = %d runes, want 240 Cyrillic + ellipsis", len(r))
}
// Short multibyte strings pass through untouched (no ellipsis).
if got := truncatePreview("кратко", 240); got != "кратко" {
t.Fatalf("short string altered: %q", got)
}
// ASCII boundary still works.
if got := truncatePreview(strings.Repeat("a", 500), 240); got != strings.Repeat("a", 240)+"…" {
t.Fatalf("ascii truncation wrong: %q", got)
}
}
func TestResolveMemoryBase(t *testing.T) {
old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL")
defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }()


@ -13,7 +13,7 @@ The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling
Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`:
- Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.**
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`. `Plotting-Your-Dream-Book` (owned by Anca, dev in her org) keeps its GHA build in-place and pushes the image to **its own org's ghcr** (`ghcr.io/passionprojectsanca/book-plotter`, private) via the workflow's built-in `GITHUB_TOKEN` — no Forgejo mirror, no `viktorbarzin`-namespace push, no shared PAT in her repo (2026-06-27, migrated off DockerHub).
- `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge.
- Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror."


@ -5,6 +5,14 @@ exists to answer the question that drove the whole CLI — *which verbs are wort
adding next* — with data instead of one maintainer's habits (the earlier mining
covered a single user's ~51k commands, so the surface is shaped to that user).
> **Update (2026-06-26) — the cross-user privacy *norm* below is superseded by
> [ADR-0015](0015-os-is-the-authorization-boundary.md).** The prohibition this
> ADR leaned on ("reading another user's `~/.claude` is off-limits even for an
> owner in-session") no longer holds: the managed-settings policy now **defers
> to OS/sudo authorization**. The `usage top` telemetry design itself is
> unchanged and still current — only the "never read homes" framing in the
> third decision below is overtaken.
## Decisions
- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows


@ -27,3 +27,9 @@ As the Service count grows we want an audit-grade record of which Service talks
- **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
- **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
- **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
## As-built (2026-06-25)
Implemented across infra issues #57#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (all Slack consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.


@ -0,0 +1,57 @@
# OS is the authorization boundary: agents defer to Unix/sudo, not a stricter in-policy rule
Supersedes the cross-user privacy *norm* that the devvm managed-settings policy
carried and that ADR-0011 leaned on ("never read another user's home /
`~/.claude`, off-limits even for an owner in-session"). ADR-0011's actual
subject — `usage top` telemetry and its emit design — is unchanged and still
current; only the privacy prohibition it referenced is superseded here.
## Context
The devvm managed-settings policy (`/etc/claude-code/managed-settings.json`,
`claudeMd`) carried two rules that were, in practice, *stricter than the OS*:
"you are not the admin, do not escalate privileges" and "never read another
user's home directory, credentials, tokens, or `~/.claude`." The OS told a
different story: `wizard` holds `(ALL) NOPASSWD: ALL` — full passwordless root.
The kernel had already granted total read access; the policy was layering an
artificial refusal on top of an authorization the OS already permits, and the
"not the admin" framing was factually wrong for a NOPASSWD-root user.
Two honest ways to resolve the inconsistency: tighten sudo to match the policy,
or loosen the policy to match the OS. The owner chose the latter on 2026-06-26,
for analytics/debugging across the shared box.
## Decision
- **Authorization follows the OS, not this policy.** Agents may access whatever
their OS user can access — directly or via `sudo` where they hold sudo rights
— and must not impose restrictions stricter than the OS. On this box that
includes other users' home directories and `~/.claude` for users who hold
broad sudo.
- **No separate prompt or carve-out** for OS-authorized access. The Unix
permission model + sudoers is the single source of truth for who may read
what. Other homes are mode `0750`, so a cross-home read necessarily transits
`sudo` and is therefore captured in the sudo/auth audit log.
- **Cluster/infra RBAC tiering is unchanged.** kubectl / Vault / infra access
stays scoped to each user's RBAC tier; "defer to the OS" is about OS-level
file access, not a licence to exceed cluster RBAC.
- **Scope is symmetric and multi-user.** The rule lives in the *shared*
managed-settings, so every user's agents defer to that user's own sudo grant.
Any user with broad sudo gets the same cross-home read capability over other
users' files. Accepted by the owner with that understanding; emo's and
ancamilea's `~/.claude` directories are now agent-readable by sudo-holders.
- **Takes effect in a fresh session.** managed-settings loads at session start;
the session that made the change keeps running under the old policy.
## Consequences
- The privacy-preserving telemetry rationale in ADR-0011 (`usage top` as the
"cross-user analytics without reading homes" answer) remains useful but is no
longer the *only* sanctioned path; direct reads via `sudo` are now permitted.
- Larger blast radius: if an agent session running as a sudo-holder is
prompt-injected or otherwise compromised, it can now read every user's secrets
with no in-agent friction (sudo here is passwordless). The sudo/auth audit log
is the remaining accountability control.
- Reversible: restore the prior `claudeMd` bullets (backup kept at
`/etc/claude-code/managed-settings.json.bak-2026-06-26`) and start a fresh
session.


@ -86,10 +86,56 @@ Signin latency is dominated by screen count and round trips, not server time
use the explicit-consent flow (it re-prompted every 4 weeks per app).
- **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
15m policy cache, 60s persistent DB connections.
15m policy cache, gunicorn `max_requests=10000`/jitter=1000 (recycle
hardening — decorrelates the 9 workers' recycles from PG blips). **No
`CONN_MAX_AGE`** — persistent Django connections pin a PgBouncer server conn
1:1 and saturate the session-mode pool (reverted 2026-06-10).
- **Static assets cached immutable**: `/static` ingress carve-out adds
`Cache-Control: public, max-age=31536000, immutable` (assets are
version-fingerprinted; authentik itself sends no max-age).
- **Rate-limit carve-out** (2026-06-28): `/` and `/static` use a dedicated
`authentik-rate-limit` (100/1000) instead of the shared 10/50 default — the
login SPA cold-loads ~70 flow-executor chunks from `/static`; the default
burst 429'd the tail and a failed ES-module import left a blank login screen.
- **Readiness tolerance** (2026-06-28): server `readinessProbe.failureThreshold:8`
(~80s, was the chart-default ~30s). The probe (`/-/health/ready/`) queries the
DB; too-tight tolerance let a sub-60s PG/pgbouncer transient return 503 on all
3 server pods at once → Traefik had no healthy backend → 502/503/504 (episodic
blank login + 30s hangs). 80s absorbs a full CNPG failover reconnect. Sessions
+ cache are PostgreSQL-only since Redis was removed in 2026.2 (no external-cache
option), so request-serving is coupled to PG — this survives a short transient,
not a total CNPG outage.
- **Rolling-update strategy** (2026-06-28): the chart key is `deploymentStrategy`
(the repo's old `strategy:` key was silently inert → live ran the chart-default
25%/25% and dropped a server pod out of rotation on every roll). Now
`maxSurge:1/maxUnavailable:0` keeps all 3 ready throughout a roll.
- **Old-browser login (SFE)** (2026-06-28): authentik's modern flow SPA is ES2022
and renders a **blank login** on Safari/WebKit ≤16.3 (every iOS browser shares
the system WebKit, so it's not browser-choice — e.g. iPadOS ≤15). The overlay
image patches `flows/views/interface.py::compat_needs_sfe()` to also serve
authentik's built-in no-JS **Simplified Flow Executor** (SFE, ES5) to old Safari
**and any iOS browser** (Chrome/Firefox on iOS are WebKit skins) on iOS ≤16.3,
so those clients get the *real* authentik login (password + MFA + reputation —
no auth downgrade). The SFE can't render Identification-stage **sources**
(authentik limitation), so the patch also injects static social-login `<a>`
links into `flow-sfe.html` (→ `/source/oauth/login/<slug>/`, plain redirects) —
required for password-less accounts (e.g. Google-only users). A Traefik
basic-auth fallback was rejected: it would have put a single spoofable-UA
password in front of `vbarzin→wizard` (passwordless root on the devvm). See
`stacks/authentik/patch-compat-sfe.py`.
- **SFE + forced-WebAuthn MFA gotcha** (2026-06-28): the `default-authentication-flow`
MFA stage (`not_configured_action=configure`, `conf_stages=[webauthn]`) force-enrols
a WebAuthn passkey for any **password**-path user with no MFA device — but the SFE
**cannot render WebAuthn** (enrol *or* validate), so that user gets
`unsupported state: ak-stage-authenticator-webauthn`. Two escape hatches, **no MFA
downgrade**: (1) **social login** — sources run `default-source-authentication`
(UserLoginStage only, **no MFA stage**), so the SFE's "Continue with <provider>"
button always completes; (2) **enrol TOTP** — the SFE *can* validate TOTP codes, and
≥1 confirmed device flips the stage from force-enrol to validate. User MFA devices are
runtime data (not Terraform): enrol via `ak shell`
(`TOTPDevice.objects.create(user=…, confirmed=True)`) and store the secret in the
user's own Vaultwarden item. (Done for emo — the Google-only iPadOS-15 case: TOTP in
his `authentik.viktorbarzin.me` Bitwarden item; e2e-verified the BW code is accepted.)
- **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
- **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
TCP setup on the forward-auth subrequest path.
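
The rate-limit carve-out above comes down to token-bucket arithmetic, sketched here under stated assumptions: the two numbers in `10/50` and `100/1000` are read as refill rate per second and burst capacity, the ~70 chunk requests are modelled as arriving at the same instant, and `bucket`, `allow`, and `rejected` are hypothetical names rather than anything from Traefik or this repo.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// bucket is a plain token bucket. rate is tokens added per second and burst is
// the capacity (assumed to be the first and second number of "10/50").
type bucket struct {
	rate, burst, tokens float64
	last                time.Duration // arrival time of the previous request
}

func newBucket(rate, burst float64) *bucket {
	return &bucket{rate: rate, burst: burst, tokens: burst}
}

// allow refills the bucket for the time elapsed since the previous request and
// then spends one token if one is available.
func (b *bucket) allow(at time.Duration) bool {
	elapsed := (at - b.last).Seconds()
	b.last = at
	b.tokens = math.Min(b.burst, b.tokens+elapsed*b.rate)
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// rejected counts how many of n requests, spaced gap apart, would be refused.
func rejected(rate, burst float64, n int, gap time.Duration) int {
	b := newBucket(rate, burst)
	refused := 0
	for i := 0; i < n; i++ {
		if !b.allow(time.Duration(i) * gap) {
			refused++
		}
	}
	return refused
}

func main() {
	const chunks = 70 // approximate cold-load request count from the text
	fmt.Println("shared default, refused:", rejected(10, 50, chunks, 0))
	fmt.Println("dedicated limit, refused:", rejected(100, 1000, chunks, 0))
}
```

Under these assumptions the shared default refuses the last 20 of the 70 requests while the dedicated limit refuses none, which matches the "default burst 429'd the tail" symptom. Real page loads spread requests over time, so the exact count in production will differ.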


@ -205,6 +205,43 @@ healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts*
wrapper in `main.tf` (so it applies deterministically even though the image is
`:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix
as the android-emulator stack.
### noVNC black after a browser-container restart (x11vnc supervision)
A **distinct** failure from the fd-sweep gotcha above: the noVNC client *connects*
but the view is **black**, and the novnc container logs spew
`connecting to: localhost:5900` → `Failed to connect ... [Errno 111] Connection
refused` (x11vnc is **down**, not slow). Cause: `x11vnc` and `websockify` both run
in the **novnc** container, but x11vnc attaches to the **chrome-service** (browser)
container's Xvfb over `localhost:6099` (shared pod network). When the browser
container restarts — Chrome exits cleanly (exit 0, "Completed") or crashes — its
Xvfb vanishes and x11vnc loses its X connection and exits.
`entrypoint.sh` **supervises** x11vnc: it launches x11vnc and websockify as
background children and `wait -n`s on them, exiting non-zero if **either** dies, so
the kubelet restarts the novnc container, which re-waits for Xvfb on `:6099` and
relaunches x11vnc — the bridge **self-heals** across browser-container restarts.
(Before 2026-06-27, x11vnc was an unsupervised background child of an `exec`ed
websockify; a dead x11vnc was never relaunched, leaving `:5900` dead — a
`<defunct>` zombie — and the view black until a manual pod restart. Same
supervision pattern as the android-emulator stack's entrypoint.)
**Diagnose:** `kubectl exec -n chrome-service deploy/chrome-service -c novnc -- ps aux | grep x11vnc` (a `<defunct>`/Z
entry = the bug); or the RFB-banner probe from a sibling container (`python3 -c
"import socket;s=socket.socket();s.settimeout(2);s.connect(('127.0.0.1',5900));print(s.recv(12))"`
— healthy returns `b'RFB 003.008\n'`, broken = `ConnectionRefused`). **Immediate
recovery** (no image change): restart just the novnc container with `kubectl exec
-n chrome-service deploy/chrome-service -c novnc -- kill 1` — re-runs its entrypoint
and relaunches x11vnc **without** touching the browser session/in-flight CDP jobs.
> **Deploying a rebuilt novnc entrypoint:** Keel is **off** for this deployment
> (`keel.sh/policy=never`, because the browser container's playwright image is
> version-pinned to f1-stream) and the image is `:latest`/`IfNotPresent`, so a
> rebuilt `:latest` will **not** redeploy on its own. After the
> `build-chrome-service-novnc.yml` GHA build pushes `:latest` + `:<sha>`,
> **SHA-pin** the novnc `image` in `main.tf` to the new `:<sha>` to force the pull
> and rollout (the novnc image is TF-managed — not in the deployment's
> `lifecycle.ignore_changes`).
- **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088
@ -256,6 +293,42 @@ Key facts:
byte-identical copy of `files/stealth.js`, guarded by a drift test — so the
CLI's stealth never diverges from the in-cluster callers'.
## Multi-user access (sharing the browser)
There is ONE chrome-service browser with ONE persistent profile, warmed with
**Viktor's** logged-in sessions. CDP has no per-context auth, so anyone who can
drive the browser — over the noVNC view OR the CDP/`homelab browser` path — can
reach the persistent profile (`browser.contexts[0]`) and therefore Viktor's
sessions. Access is gated accordingly, per user.
**Decision (2026-06-28):** emo (`emil.barzin` / `emil.barzin@gmail.com`) SHARES
Viktor's browser for form-filling + captcha solving, rather than getting an
isolated instance. The session-exposure trade-off above was explicitly accepted.
Two independent grants make up "browser access" for a user:
1. **noVNC (interactive view, `chrome.viktorbarzin.me`)** — gated by the Authentik
`admin-services-restriction` policy: the `CHROME_ALLOWED` set
(`stacks/authentik/admin-services-restriction.tf`) matches the user's Authentik
username OR email. Add the user there. No kubeconfig/RBAC needed.
2. **CLI (`homelab browser`, CDP over port-forward)** — needs `pods/portforward`
in `chrome-service` PLUS a non-interactive credential (a normal devvm user's
kubeconfig is interactive-OIDC-only and can't authenticate a headless agent
session). Provided by a per-user **ServiceAccount** with a long-lived token
(`stacks/chrome-service/rbac.tf`, e.g. `emo-browser`): `pods/portforward` in
this namespace + cluster read-only (`oidc-power-user-readonly`, so it can also
resolve the Service and doesn't regress the user's normal read). The devvm
provisioner (`scripts/t3-provision-users.sh` → `install_browser_kubeconfig`)
reads that token and installs it as the user's DEFAULT kubeconfig context
(`<user>-browser@homelab`), keeping their personal OIDC login as the
`oidc@homelab` named context. The SA's existence is the source of truth for who
gets the CLI — the provisioner no-ops for users without a `<user>-browser` SA.
**To grant another user:** add them to `CHROME_ALLOWED` (noVNC) and/or add a
`<user>-browser` SA + bindings mirroring `emo-browser` in `rbac.tf` (CLI), then run
the provisioner. To revoke: remove from `CHROME_ALLOWED` and delete the SA (rotate
a token by deleting its `<user>-browser-token` Secret).
## Limits + risks
- **Anti-bot vs stealth arms race** — when an upstream beats us (DRM


@ -115,9 +115,67 @@ claude-agent-service, claude-memory-mcp, kms-website, Freedify,
instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`),
fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original
pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website,
k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify,
k8s-portal, apple-health-data, audiblez-web, insta2spotify,
audiobook-search) now also land on ghcr.
**plotting-book** is a special case (a GitHub-first repo owned by Anca,
ADR-0003): the build runs in *her* GitHub repo
(`PassionProjectsAnca/Plotting-Your-Dream-Book`) and pushes to **private
`ghcr.io/passionprojectsanca/book-plotter`** — under her org's ghcr namespace,
not `viktorbarzin`, using the workflow's built-in `GITHUB_TOKEN` (no shared
PAT). The cluster pulls it via the Kyverno-synced `ghcr-credentials` secret (the
`plotting-book` namespace is on the allowlist; the shared `ghcr_pull_token` has
read access). Migrated off public DockerHub (`viktorbarzin/book-plotter`) on
2026-06-27. The Woodpecker deploy hook (repo 43, registered to Anca's repo) is
unchanged. Flow:
```text
DEVELOP ───────────────────────────────────────────────────────────────────────
Anca (Codex / t3 web agent)
│ git push → main
┌──────────────────────────────────────────────────────────────┐
│ GitHub: PassionProjectsAnca/Plotting-Your-Dream-Book (private)│ ← canonical
│ .github/workflows/build-and-deploy.yml on: push → main │
└───────────────────────────┬──────────────────────────────────┘
│ GitHub Actions runner (off-infra build · ADR-0002)
┌────────────────────┴─────────────────────────────────┐
▼ ▼
┌─────────────────────────────────────────────┐ ╔═══════════════════════════════════════╗
│ build job │ push ║ GHCR · PRIVATE package ║
│ • svu next --always → tag vX.Y.Z (→ repo) │═════▶║ ghcr.io/passionprojectsanca/ ║
│ • buildx linux/amd64, provenance:false │ tags ║ book-plotter :vX.Y.Z :latest ║
│ • login ghcr (GITHUB_TOKEN, packages:write)│ ╚═══════════════════╤═══════════════════╝
│ • delete-package-versions (keep newest 10) │ │
└───────────────────────┬─────────────────────┘ │ pull (private,
▼ deploy job [gate: repo var DEPLOY_ENABLED ≠ "false"] via secret)
POST ci.viktorbarzin.me/api/repos/43/pipelines {IMAGE_TAG, IMAGE_NAME} │
▼ │
┌─────────────────────────────────────────────────────────────┐ │
│ Woodpecker repo 43 · .woodpecker/deploy.yml (event: manual) │ │
│ kubectl set image deployment/plotting-book = <ghcr>:vX.Y.Z │ │
│ kubectl rollout status │ │
└───────────────────────────┬─────────────────────────────────┘ │
▼ │
═══════════════ Kubernetes · ns: plotting-book ════════════════════════════ │
┌─────────────────────────────────────────────────────────────┐ │
│ Deployment plotting-book (Recreate · image = ignore_changes)│ │
│ imagePullSecrets: ghcr-credentials ────────pull───────────┼─────────────────┘
│ Pod → Express :3001 + SQLite on PVC (proxmox-lvm) │
└─────────────────────────────────────────────────────────────┘
guards / supporting:
• Kyverno require-trusted-registries [Enforce] → ghcr.io/* ALLOWED (admission)
• Keel policy=patch @1h → watches GHCR via ghcr-credentials (backstop)
• ghcr-credentials ⇐ Kyverno generate-clone ⇐ Vault secret/viktor/ghcr_pull_token
═══════════════ Serving path (unchanged) ══════════════════════════════════
Browser ─▶ plotting-book.viktorbarzin.me (non-proxied DNS → Traefik .203)
─▶ Authentik forward-auth (gate) ─▶ Service :80 ─▶ Pod :3001
```
Governance: the Deployment + Kyverno allowlist are Terraform (`stacks/plotting-book`,
`stacks/kyverno`); the live image *tag* is CI-owned (`ignore_changes`).
### Infra-owned images (issues #29 / #30)
Images owned by the infra repo build on GHA workflows **in the infra repo's own
@ -163,9 +221,9 @@ Woodpecker is **deploy + cluster-touching steps only**:
| Pipeline | File | Purpose |
|----------|------|---------|
| per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) |
| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) |
| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`). **Skips Tier-0 `vault`** — it's human-applied via OIDC; the CI `ci` role lacks Vault-admin perms (`sys/mounts`, `sys/policies/acl`) so a CI apply 403s |
| certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`). **Skips Tier-0 `vault`** (its `plan` 403s under the `ci` role and would fail the whole run) |
| provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change |
| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE |
@ -176,6 +234,38 @@ Woodpecker is **deploy + cluster-touching steps only**:
**No build/test pipeline exists on any repo.** Do not (re)introduce one.
### `default.yml` apply: dual-registration de-dup + reliability (2026-06-28)
infra is registered in Woodpecker on **both** the canonical Forgejo repo (id 82)
and the legacy GitHub mirror (id 1), and **both fire `default.yml` on every
push**. Left unguarded, two `terragrunt apply` runs race each other for the
per-stack PG state lock — historically the #1 source of `Error acquiring the
state lock` failures and push-supersede "killed" runs.
- **Forge guard** (first command in the `apply` step): the push-apply runs **only
on the canonical Forgejo forge**; on the GitHub mirror it logs `[forge-guard]`
and `exit 0`s. Detection: `CI_REPO_URL`/`CI_FORGE_URL` contains `github.com` →
skip. Fail-open (unknown forge still applies). The mirror keeps running the
**crons** (drift-detection, renew-tls, …), which live on repo 1 — only its
duplicate push-apply no-ops. (Crons were NOT moved; deactivating repo 1 would
have killed them.)
- **Lock-skip matches both tiers**: a stack whose apply hits a lock is SKIPPED,
not failed. The grep now matches the Tier-0 Vault message (`is locked by`) **and**
the Tier-1 PG-backend message (`Error acquiring the state lock` / `already
locked`) — the PG case was previously miscounted as a hard failure.
- **Transient retry** (bounded, 3 attempts): only provider-registry download
timeouts (`Failed to install provider` / `Client.Timeout`) and Vault 5xx are
retried. Config errors (missing arg, invalid index) and helm `atomic` timeouts
are NOT retried — they fail fast.
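The three behaviours above (forge guard, lock-skip, bounded transient retry) can be sketched as follows. This is an illustration, not the actual `default.yml` step, which is shell. The message substrings come from the text; the Vault 5xx pattern (`Code: 5xx`), the function names, and the order in which lock and transient messages are checked are assumptions.

```python
import re

# Substrings quoted in the section above.
LOCK_PATTERNS = ("is locked by", "Error acquiring the state lock", "already locked")
TRANSIENT_PATTERNS = ("Failed to install provider", "Client.Timeout")
# Assumed shape of a Vault 5xx error line; adjust to the real log output.
VAULT_5XX = re.compile(r"Code: 5\d\d")


def forge_guard_skips(repo_url: str, forge_url: str) -> bool:
    """True only for the GitHub mirror; any unknown forge still applies (fail-open)."""
    return "github.com" in repo_url or "github.com" in forge_url


def classify_apply(output: str) -> str:
    """Map failed-apply output to 'skip' (locked), 'retry' (transient) or 'fail'."""
    if any(p in output for p in LOCK_PATTERNS):
        return "skip"
    if any(p in output for p in TRANSIENT_PATTERNS) or VAULT_5XX.search(output):
        return "retry"
    return "fail"


def run_with_retry(apply, max_attempts: int = 3) -> str:
    """Call apply() -> (ok, output) up to max_attempts times, retrying only transients."""
    for _ in range(max_attempts):
        ok, output = apply()
        if ok:
            return "ok"
        verdict = classify_apply(output)
        if verdict != "retry":
            return verdict  # locked stacks are skipped, config errors fail fast
    return "fail"  # transient error persisted through every attempt
```

Keeping the classifier separate from the retry loop makes it easy to check that a helm `atomic` timeout or a config error never lands in the retry branch.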
A pre-apply off-infra validate gate was evaluated and rejected: `terraform
validate` runs without state but catches ~0 of the observed failures (they are
provider-config-from-Vault-data, server-side-apply conflicts, helm installs, and
lock contention — all invisible to static validate), and `plan` cannot run
off-infra (no Vault/PG access). `terragrunt apply` already fails at its plan
phase without mutating on config errors, so a separate in-pipeline plan-gate was
also dropped as redundant.
### Woodpecker API
Uses **numeric repo IDs** (`/api/repos/<id>/pipelines`), NOT owner/name paths


@ -286,7 +286,7 @@ Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (por
#### Security Alerts (Wave 1 — planned, beads `code-8ywc`)
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there). Same handling path as infra alerts; severity labels carried in the alert (critical/warning/info). The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
| # | Source | Event | Severity |
|---|---|---|---|
@ -318,9 +318,20 @@ IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-poli
Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out.
- **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m).
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → posts to **`#alerts`** via the `slack-security` receiver, which keeps its `[SECURITY]` styling (Slack-only, no paging; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
- **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)
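To check a candidate URL by hand before adding it to the map, the probe's decision can be approximated locally. A minimal sketch: the regex is copied from the Mechanism bullet, while the function name and the assumption that the exporter does an unanchored search on the `Location` header are mine.

```python
import re
from typing import Optional

# Regex quoted in the Mechanism bullet above.
AUTHENTIK_LOCATION = re.compile(
    r"authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize"
)


def walled_off(location: Optional[str]) -> bool:
    """True when a response's Location header points at Authentik."""
    # No Location header (a 200 or 404, say) can never count as a wall-off.
    return bool(location and AUTHENTIK_LOCATION.search(location))
```

Feed it the `%{redirect_url}` value printed by the `curl` command above: a short-link 302 to some other host stays `False`, while a bounce to the Authentik authorize endpoint is `True`.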
#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both route the default `slack-warning` receiver → **`#alerts`**.
| Alert | Expr (abridged) | For | Severity |
|---|---|---|---|
| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning |
| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning |
The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`).
#### Backup Alerts
- **PostgreSQLBackupStale**: >36h since last backup
- **MySQLBackupStale**: >36h since last backup


@ -541,7 +541,7 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1
**RBAC tiers:** `admin` (Viktor — cluster-admin, unlocked tree, secrets) · `power-user` (cluster-wide read-only, NO Secrets, via a dedicated `oidc-power-user-readonly` ClusterRole) · `namespace-owner` (admin in own namespace only). Each session acts as the user's **own** OIDC identity (kubelogin), never the admin's.
**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.
**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **(2026-06-26: the managed `claudeMd` now defers OS-level file access to the OS/sudo — a user holding broad `sudo` may read other users' files incl. `~/.claude`; the mode-600 / no-symlink posture is unchanged but is no longer reinforced by an agent "never read other homes" rule. See [ADR-0015](../adr/0015-os-is-the-authorization-boundary.md).)** **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.
**Memory — homelab CLI hooks (rolled out 2026-06-21, deploy-fixed 2026-06-22):** the per-user `claude_memory` MCP was retired for the **homelab-memory hooks** — the reconcile's `install_memory` (re)installs four scripts into `~/.claude/hooks/` each run (`homelab-memory-recall.py` UserPromptSubmit recall, `auto-learn.py` Stop-hook extraction, `pre-compact-backup.sh`/`post-compact-recovery.sh`), wires them into `settings.json` if-absent + additive, and removes the old `claude_memory` MCP. **The provisioner binary itself now self-deploys from the repo** (step 0: `bash -n`-gated `install` + re-exec when `scripts/t3-provision-users.sh` differs from `/usr/local/bin/t3-provision-users`, guarded against re-exec loops / DRY_RUN mutation) — added after this very rollout sat committed-but-undeployed for a day (only the manual `setup-devvm.sh` had ever deployed the binary), so the hourly reconcile kept running the pre-memory version and emo/anca silently lost memory (recall + auto-learn never wired). A latent `set -e` abort in `install_memory` (a bare `[[ -d plugin-dir ]] && …` returning non-zero) was also fixed; it had killed the reconcile after the first user the first time it actually ran. The hooks need a `MEMORY_API_KEY` (or `CLAUDE_MEMORY_API_KEY`) in the user's `settings.json` env — the `homelab` CLI defaults the API URL, so **the key is the only hard requirement**; `install_memory` reuses an existing key and only WARNs if absent (it does NOT mint one — that's an admin Vault step, see Remaining). wizard + emo carry a key from their original MCP setup; **ancamilea is keyless → her memory no-ops until a key is minted.** (`auto-learn.py`'s passive store calls the API directly, so it additionally needs `*_API_URL` in env to avoid its local-SQLite fallback; recall + manual `homelab memory store` go through the URL-defaulting CLI and need only the key.)


@ -261,7 +261,7 @@ Traefik chain:
1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`).
2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend.
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads) and ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load).
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), and authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients).
4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors).
Additional middleware:
@ -550,7 +550,7 @@ chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Che
**Diagnosis**: Check Traefik middleware config for the affected IngressRoute.
**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300.
**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, authentik 100/1000 (login SPA `/static` chunk burst → blank screen).
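Why a cold page load of about 70 parallel requests trips the shared default but not the dedicated limits can be seen with a simplified token-bucket model. This is a sketch of the general technique with illustrative names, not Traefik's actual implementation; only the average/burst numbers come from the text.

```python
class TokenBucket:
    """Simplified token bucket: `burst` is the capacity, `average` the refill rate per second."""

    def __init__(self, average: float, burst: int) -> None:
        self.rate = float(average)
        self.capacity = float(burst)
        self.tokens = float(burst)  # the bucket starts full
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would answer 429 here


def count_429s(average: float, burst: int, parallel_requests: int) -> int:
    """Requests rejected when `parallel_requests` arrive at the same instant."""
    bucket = TokenBucket(average, burst)
    return sum(0 if bucket.allow(0.0) else 1 for _ in range(parallel_requests))
```

Under this model `count_429s(10, 50, 70)` rejects the last 20 requests of the burst, while the ActualBudget (50/300) and authentik (100/1000) limits admit all 70.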
### Large Downloads or Uploads Truncate / Fail Partway


@ -132,6 +132,13 @@ for the supersession history — there is no longer an inline Traefik bouncer.)
account hard-limits to **one** list), and CAPI is already covered in-kernel on
direct hosts and by Cloudflare's own managed protections on proxied hosts.
Registered bouncer key: **`kvsync`**.
- **Rate-limit resilient (2026-06-27):** Cloudflare's Lists-API *write* endpoint
is throttled (~per-60s; `429 retry-after`). The CronJob runs `backoff_limit=0`
(one POST per cycle — the `*/2` schedule IS the retry cadence) and treats a CF
`429` as a soft-skip (exit 0, retry next cycle), the same fail-safe pattern it
uses for LAPI. An earlier `backoff_limit=2` fired 3 rapid POSTs/cycle and
escalated the throttle into a stuck state that left the list empty — a
self-inflicted DoS that this change prevents.
- **Block-only**: the single-list limit precludes a separate
captcha/managed-challenge list, so both ban and captcha decisions are enforced
as a plain block at the edge.
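The rate-limit handling in the list above comes down to two small rules, sketched here. The function names are illustrative, and treating every other non-2xx status as a hard failure is an assumption; the text only specifies the `429` soft-skip and the attempt counts.

```python
def exit_code_for_cf_status(status: int) -> int:
    """Exit code the sync job would return for a Cloudflare Lists-API response."""
    if 200 <= status < 300:
        return 0
    if status == 429:
        # Soft-skip: exit cleanly and let the next scheduled cycle retry.
        return 0
    return 1  # assumed: any other failure is reported as a failed Job


def max_posts_per_cycle(backoff_limit: int) -> int:
    """A Kubernetes Job runs the first attempt plus up to backoff_limit retries."""
    return backoff_limit + 1
```

With `backoff_limit=0` there is at most one POST per cycle; the earlier `backoff_limit=2` allowed up to three, which is the rapid re-posting the change removes.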
@ -272,7 +279,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
The block below documents the locked design.
Response model: **(I) Slack-only, daily skim.** All security alerts land in a new `#security` Slack channel via Alertmanager. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
Response model: **(I) Slack-only, daily skim.** All security alerts post to **`#alerts`** via Alertmanager (the `slack-security` receiver keeps its distinct `[SECURITY/<sev>]` title styling so security-lane alerts still stand out). The dedicated `#security` channel was abandoned (2026-06-25) — the shared `alertmanager_slack_api_url` incoming webhook's Slack app isn't a member of it, so a channel override there returns HTTP `404 channel_not_found`; everything consolidated to `#alerts`. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
#### Detection sources
@ -285,7 +292,7 @@ Response model: **(I) Slack-only, daily skim.** All security alerts land in a ne
#### Alert rules (16 total)
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) inside the single `#security` channel.
Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it). Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) carried in the alert.
**K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):**
@ -364,6 +371,69 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
- Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
- Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).
#### Deriving the per-namespace egress allowlist from the edge trail (Wave 1 W1.7)
The durable **east-west flow trail** (below) is now the preferred data source for
the *internal* (namespace-to-namespace) half of each Wave-1 egress allowlist —
faster and identity-stamped vs the original iptables-`LOG`→journald→Loki path
(ADR-0014: "Enforcement gains a better data source"). The unique observed
namespace pairs live in CNPG DB `goldmane_edges`, table `edge`. To derive the
namespaces a source is observed talking to (the `allow` set that seeds its
NetworkPolicy):
```sql
SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow' ORDER BY dst_ns;
```
The full SQL recipe (whole-cluster matrix, deny sanity-checks, the ≥7-day
observation caveat) is in
[runbooks/goldmane-flow-trail.md → Deriving the Wave-1 egress allowlist](../runbooks/goldmane-flow-trail.md#deriving-the-wave-1-egress-allowlist-from-the-edge-table-infra-62).
**External / public-internet egress is NOT in this table** (empty-namespace flows
are dropped) — for those destinations keep using the Calico flow-log observation
(the W1.6 snapshot, `wave1-egress-observation-2026-05-22.md`). This feeds the
existing observe-then-enforce effort (beads `code-8ywc`); **enforce-flips remain
out of scope** of the trail — it is observe-and-derive only.
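The allowlist query above can be tried against a throwaway database before running it on the real one. In this sketch an in-memory SQLite database stands in for the CNPG PostgreSQL instance, the column types are guessed, and every row is made-up sample data; only the table name, column names and query come from the text.

```python
import sqlite3


def allowed_destinations(conn: sqlite3.Connection, src_ns: str) -> list:
    """Namespaces that src_ns was observed talking to with action='allow'."""
    rows = conn.execute(
        "SELECT DISTINCT dst_ns FROM edge "
        "WHERE src_ns = ? AND action = 'allow' ORDER BY dst_ns",
        (src_ns,),
    ).fetchall()
    return [row[0] for row in rows]


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory stand-in for the edge table with hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE edge (src_ns TEXT, dst_ns TEXT, action TEXT, "
        "first_seen TEXT, last_seen TEXT, flow_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO edge VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("app-a", "monitoring", "allow", "2026-06-20", "2026-06-28", 40),
            ("app-a", "dbaas", "allow", "2026-06-21", "2026-06-28", 12),
            ("app-a", "vault", "deny", "2026-06-22", "2026-06-22", 1),
            ("app-b", "dbaas", "allow", "2026-06-20", "2026-06-27", 7),
        ],
    )
    return conn
```

`allowed_destinations(demo_connection(), "app-a")` returns the two allowed namespaces in sorted order and leaves out the denied `vault` edge, which is the set that would seed the NetworkPolicy for that namespace.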
### East-west flow observability (Goldmane / Whisker + edge trail) (ADR-0014)
The "who-talks-to-whom" data plane that succeeds raw iptables-`LOG` lines (which
carried no identity). **Service identity = the workload's namespace** (primary),
refined by a `service-identity` label in the few multi-Service namespaces
(`monitoring`, `kube-system`, `dbaas`). End-to-end trail, three layers:
1. **Calico Goldmane + Whisker** (`calico-system`) — Goldmane aggregates
identity-stamped flows (ns/pod/workload/labels + allow-deny + policy-trace)
streamed from Felix over gRPC into a **~60-min in-memory ring buffer** (no
etcd/API writes — the etcd-cost constraint that drove the design). **Whisker**
is its live web UI at `whisker.viktorbarzin.me` (Authentik-gated,
`auth = "required"` — Whisker has no own login; an additive NetworkPolicy ORs
Traefik past the operator's default-deny `whisker` NP). The ring buffer is
**not** a trail (lost on Goldmane restart). Enabled via operator CRs in
`stacks/calico/main.tf`; reversible toggle (Goldmane is OSS tech-preview).
2. **`goldmane-edge-aggregator`** (`stacks/goldmane-edge-aggregator`) — streams
Goldmane's gRPC `Flows.Stream` over **mTLS** and upserts the low-cardinality
namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,
flow_count)`) into CNPG DB `goldmane_edges`. Self-edges and empty-namespace
(public-internet) flows are dropped — in-cluster relationships only. The mTLS
client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`**
(Goldmane verifies CA-chain only, not identity) rather than copying the CA
private key into TF state — **re-apply the stack if the operator rotates that
Secret**.
3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to
**`#alerts`** (reuses the alert-digest webhook). All Slack now consolidates to
`#alerts`; the `#security` channel was abandoned 2026-06-25 because that
webhook's Slack app isn't a member of it (a `#security` override 404s). See
runbook.
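The aggregator's filtering and upsert rules, and the digest's first-seen selection, can be sketched in memory. The field names mirror the `edge(...)` tuple above; the function names, the dict used in place of the database, and counting one `flow_count` increment per observed flow are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

EdgeKey = Tuple[str, str, str]  # (src_ns, dst_ns, action)


@dataclass
class Edge:
    first_seen: float
    last_seen: float
    flow_count: int


def upsert_edge(edges: Dict[EdgeKey, Edge], src_ns: str, dst_ns: str,
                action: str, ts: float) -> bool:
    """Record one flow; returns False when the flow is dropped."""
    if not src_ns or not dst_ns:
        return False  # empty-namespace (public-internet) flow
    if src_ns == dst_ns:
        return False  # self-edge
    key = (src_ns, dst_ns, action)
    edge = edges.get(key)
    if edge is None:
        edges[key] = Edge(first_seen=ts, last_seen=ts, flow_count=1)
    else:
        edge.last_seen = max(edge.last_seen, ts)
        edge.flow_count += 1
    return True


def first_seen_since(edges: Dict[EdgeKey, Edge], since: float) -> List[EdgeKey]:
    """Edges a daily digest would report as new since the given timestamp."""
    return sorted(key for key, edge in edges.items() if edge.first_seen >= since)
```

Because the key is the namespace pair plus action, repeated flows only bump `last_seen` and `flow_count`, which keeps the edge set low-cardinality.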
The trail is **attribution-grade, not cryptographic** (reconstructs events in a
trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model
limit; east-west stays plaintext, no mTLS between app pods). Health is covered by
the **`AggregatorDown`** + **`DigestFailing`** alerts and cluster-health check #48
(see monitoring.md). Full as-built, query recipes, and troubleshooting:
[runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md). Decision:
[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md); glossary
`CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
### TLS & HTTP/3
**Traefik** handles TLS termination:


@ -0,0 +1,117 @@
# k8s-upgrade compat-gate: classify "actionable" vs "held" blocks
**Date:** 2026-06-28
**Status:** design → implementation
**Stack:** `stacks/k8s-version-upgrade` (+ `stacks/monitoring` alert rules)
## Problem
The cluster is on k8s 1.35.6. The nightly `k8s-version-check` chain detects the
next minor (1.36.2), runs the preflight compat-gate, and the gate **refuses**
it — because no released kyverno/ESO supports k8s 1.36 yet, and gpu-operator is
deliberately pinned (its 26.3 bump needs a newer NVIDIA driver image + Ubuntu
release we're not ready for). The result, **every single night**:
- a **Failed** preflight Job (`block()` exits 1), and
- `k8s_upgrade_blocked=1` → the **K8sUpgradeBlocked** alert.
But this block is **not actionable** — there's nothing we can upgrade to clear
it; we can only wait for upstream (kyverno/ESO) and, separately, do the
gpu-operator/Ubuntu work. The gate is crying wolf: a "blocked, needs attention"
signal that's indistinguishable from a block we could actually fix.
## Goal
Make the gate **classify** each blocker and behave accordingly:
| Class | Definition | Behaviour |
|-------|-----------|-----------|
| **actionable** | the compat matrix has a newer version of the addon whose `max_k8s >= target`, and the running version is older — upgrading it would clear the block | **alert** (`k8s_upgrade_blocked=1` → K8sUpgradeBlocked), with the specific "upgrade X → Y" remediation in the nightly report |
| **waiting-upstream** | **no** matrix version of the addon supports the target yet (kyverno/ESO for 1.36) | **quiet** (`k8s_upgrade_held=1`, no alert) — nightly report only |
| **pinned** | a supporting version exists but the addon carries `"pinned": true` in the matrix (gpu-operator) | **quiet** (held) |
Removed-API and containerd blocks are always **actionable**. **Held wins:** if
*any* blocker is waiting-or-pinned, the whole target is **HELD** (quiet) —
acting on the actionable blockers wouldn't unblock it yet. The nightly report
still lists everything so the full eventual scope is visible.
Also (scope decision: "tidy the block path"): deliberate gate decisions
(actionable-block **and** held) now make the preflight Job **Complete cleanly**
(exit 0) instead of Failing. Chain progression is gated on the verdict, not the
exit code. Real failures (unhealthy nodes, kubeadm errors, crashes) still exit
1 → `K8sUpgradeChainJobFailed`.
## Design
### `compat-gate.py`
- New exit codes: `0` safe · `2` actionable-block · `3` gate-error (fail-safe) · **`4` held**.
- Each stdout reason line is tagged `[ACTIONABLE]` / `[WAITING]` / `[PINNED]`.
- `check_addons`: when an addon blocks, decide its class:
  - `pinned: true` in its matrix entry → `[PINNED]`.
  - else a higher matrix version with `max_k8s >= target` exists → `[ACTIONABLE]` (`upgrade X to >= V`).
  - else → `[WAITING]` (`no released X version supports k8s T yet`).
  - unreadable image / below-matrix → `[ACTIONABLE]` (fail-safe — a human must look).
- `check_removed_apis`, `check_containerd`: tag `[ACTIONABLE]`.
- `exit_code(reasons)`: `0` if none; `4` if any `held_reason` (WAITING/PINNED); else `2`.
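The classification and exit-code rules above can be sketched roughly as follows. This is a minimal illustration, not the actual `compat-gate.py`: the matrix shape (`versions` mapping addon version to `max_k8s`), the helper names, and every version number in the sample data are assumptions made for the example.

```python
# Sketch only: the matrix shape and version numbers below are invented for
# illustration; the real addon-compat.json layout may differ.

def parse(version):
    """Turn '1.36.2' into (1, 36, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def supports(max_k8s, target):
    """True when an addon's max supported k8s minor covers the target minor."""
    return parse(max_k8s)[:2] >= parse(target)[:2]


def classify_addon(name, running, target, entry):
    """Return a tagged reason line, or None when the addon does not block."""
    versions = entry["versions"]  # hypothetical: {addon_version: max_k8s}
    if running not in versions:
        # Fail-safe: unknown or below-matrix versions need a human.
        return f"[ACTIONABLE] {name}: running version {running} is not in the matrix"
    if supports(versions[running], target):
        return None
    if entry.get("pinned"):
        return f"[PINNED] {name}: {entry.get('pin_reason', 'pinned in matrix')}"
    newer = [v for v, max_k8s in versions.items()
             if parse(v) > parse(running) and supports(max_k8s, target)]
    if newer:
        return f"[ACTIONABLE] upgrade {name} to >= {min(newer, key=parse)}"
    return f"[WAITING] no released {name} version supports k8s {target} yet"


def exit_code(reasons):
    """0 = safe, 4 = held (any WAITING/PINNED reason), 2 = actionable block."""
    if not reasons:
        return 0
    if any(r.startswith(("[WAITING]", "[PINNED]")) for r in reasons):
        return 4
    return 2


matrix = {
    "kyverno": {"versions": {"1.15": "1.35"}},
    "gpu-operator": {"pinned": True, "pin_reason": "needs newer driver image",
                     "versions": {"25.10": "1.35", "26.3": "1.36"}},
    "calico": {"versions": {"3.30": "1.35", "3.31": "1.36"}},
}
running = {"kyverno": "1.15", "gpu-operator": "25.10", "calico": "3.30"}
reasons = [r for name in matrix
           if (r := classify_addon(name, running[name], "1.36.2", matrix[name]))]
for line in reasons:
    print(line)
print("exit code:", exit_code(reasons))
```

With one waiting, one pinned and one actionable blocker, the sketch returns `4`, which matches the "held wins" rule: the actionable item is still listed, but it does not raise the alert on its own.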
### `upgrade-step.sh`
- New global `HALT_CHAIN=0`; `spawn_next()` returns early (no next Job) when set.
- Replace `block()` with `record_blocked()` / `record_held()` — push the gauge,
set `HALT_CHAIN=1`, **do not exit**.
- `phase_preflight` gate handling routes on the gate's exit code:
  - `0` → push `blocked=0`+`held=0`, proceed.
  - `2`/`3` → `record_blocked`, `return 0` (Job Completes, K8sUpgradeBlocked fires).
  - `4` → `record_held`, `return 0` (Job Completes, **no alert**).
- Push the gauge **definitively once** per run (remove the pre-reset `blocked=0`
at gate start) so a standing block doesn't flap 1→0→1 and re-notify.
- postflight also clears `held=0` alongside the existing gauge resets.
### detector (`main.tf`, the `k8s-version-check` CronJob)
- Consequence of the tidy change: refusals now **Complete** instead of Failing,
so the old "re-spawn only a *Failed* preflight" idempotency would skip a
refused-but-Complete preflight until its 7d TTL. Fix: re-spawn nightly when the
preflight is **Complete but no `k8s-upgrade-master-<target>` Job exists** (the
gate refused — chain never advanced) — **silently** (no Slack), so a standing
hold re-evaluates each night without noise.
- The per-night `slack "K8s upgrade available…"` becomes an `echo`; the spawn
Slack fires only for a genuinely new spawn or a Failed-respawn (`ANNOUNCE`
flag), not for silent re-evaluations — killing the last nightly-noise source.
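As a rough model of the detector's decision, the re-spawn and announce rules could look like the sketch below. The real logic lives in the CronJob script in `main.tf`; the function name, the status strings, and the handling of a still-running preflight are assumptions of this example.

```python
def detector_decision(preflight_status, master_job_exists):
    """Return (spawn, announce) for tonight's run.

    preflight_status: None (no Job yet), "Failed", "Complete" or "Running".
    Treating "Running" as "leave it alone" is an assumption of this sketch.
    """
    if preflight_status is None:
        return (True, True)    # genuinely new target: spawn and announce
    if preflight_status == "Failed":
        return (True, True)    # Failed-respawn: still announced
    if preflight_status == "Complete" and not master_job_exists:
        return (True, False)   # gate refused last night: silent re-evaluation
    return (False, False)      # chain advanced, or a run is in progress


for status, master in [(None, False), ("Failed", False),
                       ("Complete", False), ("Complete", True)]:
    print(status, master, detector_decision(status, master))
```

The third case is the one this change adds: a refused preflight now ends `Complete`, so the detector keys on the missing master Job rather than on a Failed status.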
### `addon-compat.json`
- Add `"pinned": true` + `"pin_reason"` to the gpu-operator entry (its
`26.3 → 1.36` row stays; `pinned` overrides classification to held). Document
the `pinned` flag in `_comment`. Unpinning later = delete two keys.
### `stacks/monitoring` alert rules (`prometheus_chart_values.tpl`)
- `K8sUpgradeBlocked` (`k8s_upgrade_blocked == 1`): unchanged trigger, now
actionable-only; reword annotation (reasons are in the nightly report, not a
per-run chain Slack).
- `K8sUpgradeChainJobFailed`: **drop** the `unless on() (k8s_upgrade_blocked == 1)`
clause — deliberate blocks no longer create Failed Jobs, so the alert again
means a genuine wedge.
- **No alert** for `k8s_upgrade_held` (intentional — nothing to action; the
nightly report surfaces it). Add a comment recording this.
### `nightly-report.py`
- Read `k8s_upgrade_held`. New `⏸️ HELD — <target> not yet upgradable` headline.
- Group reasons by tag: *Action needed* / *Waiting on upstream* / *Pinned (held by us)*
(fallback bullets for untagged lines, so older reason strings still render).
- Fetch reasons when avail AND (blocked OR held).
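The grouping step might be sketched like this. It is illustrative only: the function names and the `Other` bucket for untagged lines are hypothetical, and the actual `nightly-report.py` rendering may differ.

```python
import re

# Section titles taken from the design text; "Other" is a hypothetical
# fallback bucket for untagged (older) reason strings.
HEADINGS = {
    "ACTIONABLE": "Action needed",
    "WAITING": "Waiting on upstream",
    "PINNED": "Pinned (held by us)",
}
TAG_RE = re.compile(r"^\[(ACTIONABLE|WAITING|PINNED)\]\s*(.*)$")


def group_reasons(lines):
    """Bucket gate reason lines by their leading tag, preserving order."""
    groups = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = TAG_RE.match(line)
        if match:
            heading, text = HEADINGS[match.group(1)], match.group(2)
        else:
            heading, text = "Other", line
        groups.setdefault(heading, []).append(text)
    return groups


def render(lines):
    """Render grouped reasons as headed bullet lists."""
    out = []
    for heading, items in group_reasons(lines).items():
        out.append(f"*{heading}*")
        out.extend(f"• {item}" for item in items)
    return "\n".join(out)


print(render([
    "[WAITING] no released kyverno version supports k8s 1.36.2 yet",
    "[PINNED] gpu-operator: pinned in matrix",
    "some older untagged reason",
]))
```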
## Net effect on 1.36 today
**HELD, quiet** — waiting on kyverno + ESO (upstream) + gpu-operator (pinned);
Calico listed as the lone actionable piece. No nightly Failed Job, no alert —
just the nightly report's ⏸️ line. Flips to actionable (→ alert) only once
kyverno/ESO ship support **and** gpu-operator is unpinned.
## Tests (TDD)
- `compat-gate`: waiting / actionable / pinned-is-held / mixed-held-wins,
removed-API & containerd are actionable, exit_code mapping, + existing
patch/safe cases stay green.
- `nightly-report`: held headline + grouped reasons; existing tests stay green.
- `upgrade-step.sh`: shellcheck; manual review of the HALT_CHAIN + gauge flow
(bash, not unit-tested).
## Out of scope (separate follow-up)
Auto-refreshing the matrix when upstream ships 1.36 support (a periodic
addon-readiness probe). This change only *consumes* the matrix.

View file

@ -0,0 +1,128 @@
# Post-Mortem: MetalLB ServiceL2Status Stuck Immutable → PG LB VIP Flap → Woodpecker CI Tier 1 Applies Broken
| Field | Value |
|-------|-------|
| **Date** | 2026-05-16 (mitigated) / 2026-05-26 (closed) |
| **Duration** | ~25 days of degraded CI (2026-04-21 first observed → 2026-05-16 mitigated). Symptom-only; no human-visible service downtime. |
| **Severity** | SEV3 — Woodpecker CI default.yml apply step failed on Tier 1 (PG-backend) stacks. Drift-detection ran silently broken. Manual `scripts/tg apply` continued to work. No data loss, no app downtime. |
| **Affected Services** | Woodpecker CI pipelines applying any of the 28+ Tier 1 stacks (monitoring, crowdsec, authentik, headscale, etc.). PostgreSQL backend itself was healthy. |
| **Issue** | Beads `code-aoxk` (closed 2026-05-26). |
| **Status** | Closed |
## Summary
In Woodpecker CI the failure surfaced as `ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc` from `scripts/tg` whenever a pipeline tried to apply a Tier 1 stack. The error was misleading on two counts:
1. **Vault was healthy.** A direct `vault read database/static-creds/pg-terraform-state` from inside a Woodpecker pipeline pod (using K8s SA JWT → `auth/kubernetes/login role=ci`) succeeded every time when run in isolation.
2. **The "Cannot read PG credentials" message in `scripts/tg` was a catch-all** that fired for *any* Terraform/Terragrunt failure during PG state-lock acquire-release, including TCP RSTs against the PG LoadBalancer VIP.
Actual root cause: the MetalLB `ServiceL2Status` CR for the `postgresql-lb` service (`dbaas` namespace, VIP `10.0.20.200`) had a stuck `status.node` field that the controller treated as immutable. The L2 speaker kept failing to update it with `Invalid value: "k8s-nodeX": Value is immutable`, so the leader-elected announcer flapped between k8s-node3 and k8s-node4 every few seconds. Each flap dropped open TCP connections (RST). Terraform's state-lock acquire → operation → release sequence straddled flaps and failed mid-operation. `scripts/tg` surfaced this as the misleading "Cannot read PG credentials" message.
Manual `scripts/tg apply` from the DevVM kept working because the developer's session happened to land on whichever node currently held the VIP and complete fast enough to not straddle a flap. CI pipelines, being slower (full stack walk), reliably straddled at least one flap.
## Impact
- **CI degradation**: Tier 1 stack changes pushed to master were NOT auto-applied. Required manual `scripts/tg apply` from DevVM after every push touching one of 28+ stacks.
- **Drift-detection broken**: The daily `drift-detection.yml` Woodpecker pipeline silently failed on every Tier 1 stack — meaning unannounced manual changes to those stacks could have persisted undetected for the duration.
- **No user-facing outage**: PG cluster itself, all apps that use PG, and all in-cluster traffic to `10.0.20.200` worked normally. Only the very specific `acquire-state-lock → run operation → release-state-lock` round-trip pattern from CI was unreliable.
## Timeline (UTC)
| Time | Event |
|------|-------|
| 2026-04-21 | First broken CI pipelines (#411, #412, #413). Drift-detection failures noticed. `code-aoxk` filed. Initial hypothesis: Vault auth/role mismatch. |
| 2026-04-22 — 2026-05-15 | Multiple investigation attempts. Verified Vault K8s `auth/kubernetes/role/ci` has correct policies (`terraform-state`, `ci`). Verified `database/static-creds/pg-terraform-state` exists, rotates on schedule, credentials valid. Could not reproduce the failure in isolated `vault read` from Woodpecker pods. |
| 2026-05-16 (~12:14 UTC) | `pg-cluster-3` came up (third CNPG replica); endpoint set churn likely triggered MetalLB L2 announcer to attempt to update the existing `ServiceL2Status` CR (was `l2-rgt9d`). Update was rejected as immutable. Speaker kept retrying. VIP flapped. |
| 2026-05-16 | RCA breakthrough: noticed `kubectl logs -n metallb-system -l component=speaker` was full of `Invalid value: "k8s-node…": Value is immutable` on the postgresql-lb ServiceL2Status. Correlated with `kubectl get servicel2status` returning multiple stale entries for the same service. |
| 2026-05-16 | **Mitigation**: `kubectl delete servicel2status.metallb.io l2-rgt9d -n metallb-system`. Speaker recreated the CR cleanly (became `l2-zj9ss`). Flap stopped. PG connections stable. Manual CI re-runs of `monitoring` stack apply succeeded immediately. |
| 2026-05-17 | Audit: acceptance criteria 1 + 2 met implicitly. #3 (post-mortem) remained pending. Beads task reverted from `in_progress` → `open`. |
| 2026-05-25 | Node2 SCSI LUN remap → encrypted PVC emergency_ro → containerd boltdb corruption outage. Unrelated, but pulled Woodpecker server off node2. Subsequent server pod restart on k8s-node4. |
| 2026-05-26 | Verification: from a live Woodpecker pipeline pod (`wp-01kshph6pa0w6ch0zf5x9bfqgr`), `vault write auth/kubernetes/login role=ci jwt=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)` succeeded. `vault read database/static-creds/pg-terraform-state` returned valid creds (`username=terraform_state`, last_vault_rotation 2026-05-21, TTL 58h). Live `default.yml` pipeline confirmed applying Tier 1 stacks: dbaas, authentik, crowdsec, monitoring, nvidia, cloudflared, kyverno, metallb — all `OK`. `postgresql-lb` ServiceL2Status currently single allocation (`l2-sv9vv` on k8s-node3, no flap). Beads task closed. |
## Root Cause
`metallb-speaker` reconciler in the deployed MetalLB version treats `ServiceL2Status.status.node` as immutable after first set. When the L2 announcer's leader-election picks a different node to announce a given VIP (which happens on speaker pod restart, node loss, endpoint set churn, or pod-anti-affinity reshuffles), the reconciler fails to patch the existing CR and gets stuck in a retry loop. Without manual deletion, the reconciler will not progress.
Why it manifested as Vault credential errors:
1. CI's `scripts/tg` pre-flight runs `vault read database/static-creds/pg-terraform-state` (line 83 in current code) to get PG credentials. That call succeeds.
2. CI then runs `terragrunt apply` against the Tier 1 stack. Terragrunt connects to `10.0.20.200:5432` for state-lock acquire (via `pg_advisory_lock`). The TCP connection lands on whichever node MetalLB last announced the VIP from.
3. Mid-operation, MetalLB tries to re-announce from a different node, sends gratuitous ARPs, and the upstream switch updates its MAC table. Open TCP sessions on the previous announcer's node are immediately RST.
4. Terragrunt's state-lock release (or any subsequent PG operation) fails with broken pipe / connection refused.
5. `scripts/tg` interpreted the wrapper-level failure as "PG creds bad" because that's the most common failure mode it handles. The actual error from terragrunt was buried in `2>/dev/null` suppression (since fixed — see Fix #1 below).
## Detection
We did not have any of:
- A direct alert for "MetalLB ServiceL2Status reconciler errors".
- An alert for "PG LB VIP node changed N times in M minutes".
- An end-to-end probe for the CI state-lock pattern (terragrunt against `10.0.20.200`).
Detection mechanism was a human reading `kubectl logs -n metallb-system` for unrelated reasons. Took 25 days from first observed symptom to RCA.
## Fixes & Mitigations
### 1. Surface real error from `scripts/tg` (DONE)
The original `scripts/tg` swallowed the real `vault read` / terragrunt error behind `2>/dev/null` and printed a static "Cannot read PG credentials from Vault" message. Fixed in the script:
```sh
# scripts/tg lines 79-89 (current)
if ! command -v vault >/dev/null 2>&1; then
echo "ERROR: vault CLI not found on PATH. Install it or use an image that includes it (ci/Dockerfile)." >&2
exit 1
fi
VAULT_OUT=$(vault read -format=json database/static-creds/pg-terraform-state 2>&1) || {
echo "ERROR: Cannot read PG credentials from Vault. Vault output follows:" >&2
echo "$VAULT_OUT" >&2
echo "" >&2
echo "Hint: humans run 'vault login -method=oidc'; CI auths via K8s SA (role=ci)." >&2
exit 1
}
```
Comment in the code explicitly references this incident.
### 2. Stuck-CR cleanup procedure (DOCUMENTED)
Reproduction check for future sessions (also in `code-aoxk` beads notes):
```sh
kubectl logs -n metallb-system -l component=speaker --tail=200 | grep -iE 'Invalid value.*immutable'
# If matches found → same root cause. Delete the stuck CR:
kubectl get servicel2status -n metallb-system
kubectl delete servicel2status.metallb.io <name> -n metallb-system
```
Speaker recreates the CR cleanly within seconds.
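For a scripted version of the same check, a small log scanner could count the immutable-field rejections per node. This is a sketch under assumptions: the error wording is taken from the message quoted in this post-mortem, the sample lines are invented, and real speaker logs may wrap the message differently (for example with escaped quotes in JSON logs, which the pattern tolerates).

```python
import re
from collections import Counter

# Matches: Invalid value: "k8s-node3": Value is immutable
# and the JSON-escaped form: Invalid value: \"k8s-node3\": Value is immutable
IMMUTABLE_RE = re.compile(r'Invalid value: \\?"([^"\\]+)\\?": Value is immutable')


def immutable_errors(log_lines):
    """Count immutable-field rejections per node name found in speaker logs."""
    hits = Counter()
    for line in log_lines:
        match = IMMUTABLE_RE.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits


# Invented sample lines, shaped after the error quoted in this document.
sample = [
    'level=error msg="status update failed: Invalid value: \\"k8s-node3\\": Value is immutable"',
    'status.node: Invalid value: "k8s-node4": Value is immutable',
    'status.node: Invalid value: "k8s-node3": Value is immutable',
    "level=info msg=\"announcing VIP\"",
]
print(immutable_errors(sample))
```

Any non-empty result would point at the same stuck-CR condition; errors alternating between two node names are consistent with the announcer flap described above.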
### 3. Long-term MetalLB controller fix (DEFERRED)
The underlying bug — speaker not recreating the CR when the immutable field needs to change — is upstream MetalLB behaviour. Two paths possible:
- **Upgrade MetalLB** to a version where this is fixed (needs research — check changelogs).
- **File upstream issue / patch** with reproducer.
Not done as part of this post-mortem; tracked separately. Risk acceptance: until then, the manual `delete servicel2status` workaround is the playbook, and is fast (<10s).
### 4. Alerting (DEFERRED)
Suggested but not implemented:
- Prometheus alert on `metallb_speaker_reconcile_errors_total{kind="ServiceL2Status"}` rate.
- Synthetic probe: a CronJob that does `pg_advisory_lock` + release against the PG VIP every 5min from CI namespace, alert if it ever fails.
Tracked as future hardening (no beads task yet — only worth filing if recurrence happens).
## Lessons
1. **`2>/dev/null` is a time-bomb.** It hid the real error for weeks. Fix #1 already lands the principle; audit other places in `scripts/` for the same anti-pattern next time we touch them.
2. **CRD `status.*` immutability is a non-obvious failure mode.** When debugging weird LB / VIP / endpoint behaviour, always grep speaker logs for `immutable`, `cannot update`, and reconciler errors. Add to cluster-health checks.
3. **Misleading wrapper errors cost weeks.** `scripts/tg` claimed "Cannot read PG credentials" — that's what the operator believed. The actual `vault read` step worked. The real failure was three steps later in a completely different subsystem. When a wrapper script makes a definitive claim about which subsystem failed, distrust it; reproduce the subsystem in isolation before chasing the claim.
4. **CNPG primary changes / endpoint churn can trigger L2 announcer flap.** The trigger (within the timeline) was likely the `pg-cluster-3` pod coming up. Worth flagging for any future CNPG topology changes.
## References
- Beads: `code-aoxk` — closed 2026-05-26.
- `scripts/tg` lines 65-95 — current pre-flight with explicit error surfacing.
- `kubectl get servicel2status -A` — current state, single allocation per service.
- This file: `infra/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md`.

View file

@ -0,0 +1,97 @@
# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
> Filename kept for inbound links. The originally-suspected cause (kubeadm-config
> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
> drift was a real *separate* latent bug fixed in the same change.
**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
the master control-plane phase for the first time — preflight passed, etcd
snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the
kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
static-pod-hash window across all internal retries, then auto-rolled-back to
v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
No data loss; no user-facing outage (the master carries control-plane taints, so
no workloads were displaced).
**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
static pods, i.e. the first time the upgrade pushes real write-IO at etcd.
## Root cause — etcd IO starvation on the shared HDD
The new kube-apiserver could not establish/keep a working connection to etcd
during the upgrade because **etcd was IO-starved**. etcd's surviving container log
from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows:
- **1,180** `apply request took too long` warnings in 16 minutes;
- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms),
clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
to bring the new apiserver up.
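Counts like these can be pulled from the container log with a short script. The sketch below assumes etcd's structured JSON log lines carry `msg` and `took` fields and that the file uses the CRI prefix (`<timestamp> <stream> <flag> <json>`); both should be checked against the actual log before relying on it. The sample lines are invented apart from the two durations quoted above.

```python
import json
import re

UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0,
                "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}
# Longer unit names come first so "ms" is not read as "m".
DURATION_RE = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")


def go_duration_seconds(text):
    """Convert a Go-style duration such as '4.3s' or '1m2s' to seconds."""
    return sum(float(num) * UNIT_SECONDS[unit]
               for num, unit in DURATION_RE.findall(text))


def slow_applies(lines):
    """Return the 'took' value (seconds) of every slow-apply warning."""
    took = []
    for line in lines:
        start = line.find("{")  # skip the CRI prefix, if present
        if start < 0:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if entry.get("msg") == "apply request took too long":
            took.append(go_duration_seconds(entry.get("took", "")))
    return took


sample = [
    '2026-06-24T23:18:51Z stderr F {"level":"warn","msg":"apply request took too long","took":"4.3s","expected-duration":"100ms"}',
    '2026-06-24T23:18:52Z stderr F {"level":"warn","msg":"apply request took too long","took":"2.9s","expected-duration":"100ms"}',
    '2026-06-24T23:18:53Z stderr F {"level":"info","msg":"published local member to cluster"}',
]
durations = slow_applies(sample)
print(len(durations), "slow applies; worst:", max(durations), "s")
```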
A reproduced 1.35.6 apiserver with no etcd dies with
`F instance.go:233 Error creating leases: error creating storage factory: context
deadline exceeded` — the same failure mode a multi-second etcd produces. etcd
lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
that spindle:
1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
2. kubeadm dumping a full **~400MB etcd DB backup** to
`/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
cleans them up), pushing master root fs to **73%**, above the 70% kubelet
image-GC threshold, so image GC churned during the drain too;
3. master-drain pod evictions.
### Correction — it was NOT the OIDC flag swap
`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
`--authentication-config` (structured multi-issuer OIDC) back to legacy
single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test
etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
were also ruled out.
## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
apiserver auth is configured in three places that must agree:
(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
+ `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest
(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM —
which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
the manifest from (3), so it would have reverted structured auth → **dashboard +
kubectl SSO break after a successful upgrade** (recoverable: the chain's
post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.
## Resolution
1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%.
2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps.
3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).
## Prevention (landed in this change)
| Gap | Fix |
|-----|-----|
| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
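The prune step in the first row is implemented in bash inside `upgrade-step.sh`; as an illustration of the same idea, a Python model might look like the sketch below. The function name and the use of mtime as the age signal are assumptions, not a description of the real script.

```python
import shutil
import time
from pathlib import Path

# Glob patterns taken from the table above.
PATTERNS = ("kubeadm-backup-*", "kubeadm-upgraded-manifests*")


def prune_kubeadm_scratch(root, max_age_days=3, now=None):
    """Delete matching entries under root older than max_age_days (by mtime).

    Best-effort: any filesystem error on one entry is skipped, never raised.
    Returns the sorted names of the entries that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for pattern in PATTERNS:
        for path in Path(root).glob(pattern):
            try:
                if path.stat().st_mtime >= cutoff:
                    continue  # recent enough, keep it
                if path.is_dir():
                    shutil.rmtree(path)
                else:
                    path.unlink()
                removed.append(path.name)
            except OSError:
                continue  # best-effort: never abort the preflight
    return sorted(removed)


# Example (on the master): prune_kubeadm_scratch("/etc/kubernetes/tmp")
```

Swallowing `OSError` per entry mirrors the "best-effort, never aborts" behaviour stated in the table: a failed cleanup should not block an otherwise healthy upgrade.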
## Lessons
- **Capture the failing component's own logs before concluding.** The `kubeadm
upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is
"what config changes," not "why it crashed."
- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
backup copy + drain) onto that spindle. code-oflt is the real fix.
- **Tools that leave per-operation scratch must be reaped.** kubeadm's
`/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
GC'd; 28GB had silently accumulated.
- **Out-of-band control-plane edits must be written back to kubeadm-config** — else
`kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).

View file

@ -11,6 +11,11 @@ inference every six hours and backs up only the `claudeAiOauth` object to:
secret/workstation/claude-users/<os-user>
```
The backup **merges** into that path (`vault kv patch -method=rw`, falling back to
`kv put` only when the path does not exist yet), so keys that other tools
co-locate there — notably `homelab vault`'s `vaultwarden_*` credentials — survive.
A blind `kv put` here silently wiped them on every six-hourly run (fixed 2026-06-26).
The user's unrelated `mcpOAuth` credentials never leave their home directory.
Each renewal service has a distinct 32-day periodic Vault token, mode `0600`, at
`~/.config/claude-auth-sync/vault-token`. Its policy can access only that user's
@ -75,8 +80,64 @@ sudo --preserve-env=VAULT_ADDR,VAULT_TOKEN /usr/local/bin/t3-provision-users
```
Never copy another user's `.credentials.json` or scoped Vault token. Never restore
a **shared** `CLAUDE_CODE_OAUTH_TOKEN` across users; environment credentials
outrank per-user login and would silently collapse all users onto one identity.
(A **per-user**, non-rotating setup-token tied to the user's OWN Enterprise
identity is a different, sanctioned thing — see "Long-lived per-user token" below.)
## Long-lived per-user token (heavy concurrent-agent users)
The six-hourly renewal above assumes Claude owns refresh-token rotation in a
single `~/.claude/.credentials.json`. A user who runs **many concurrent Claude
sessions** (interactive tmux panes + their `t3-serve` instance + always-on
`start-claude.sh` agents) breaks that assumption: when the shared access token
expires, the processes refresh **simultaneously**, the OAuth server rotates the
refresh token, and the losing writer persists an **empty** refresh token —
logging the user out roughly every access-token lifetime (~8h). Re-issuing the
credential does not help; the race recurs.
The fix is a **per-user, long-lived setup-token** (`sk-ant-oat01-…`, ~1y,
**non-rotating**). With `CLAUDE_CODE_OAUTH_TOKEN` set, Claude uses it directly and
never touches `.credentials.json` — so there is nothing to race on. This is the
user's OWN Enterprise identity (scope `user:inference`; local MCP servers are
client-side and unaffected), stored only in their OWN Vault path — **NOT** the
forbidden shared token, and it never crosses OS users.
**Enable it (one-time, per user):**
1. The user mints their own token (interactive Enterprise SSO):
```bash
claude setup-token # opens an SSO URL; paste the code back -> prints sk-ant-oat01-…
```
2. An admin stores it in that user's Vault path (MERGE, never `kv put` — siblings
like `claude_ai_oauth_json` / `vaultwarden_*` must survive):
```bash
vault kv patch -method=rw secret/workstation/claude-users/<os-user> \
setup_token=sk-ant-oat01-…
```
3. Materialize + activate (or just wait ≤6h for the timer):
```bash
systemctl start claude-auth-sync@<os-user>.service
```
`claude-auth-sync` writes `~/.config/claude-auth-sync/claude-oauth.env`
(`CLAUDE_CODE_OAUTH_TOKEN=…`, mode 0600) and, while a token is present, **skips**
the rotating-credential validate/backup/restore (so no false
`WorkstationClaudeAuthInvalid`). `start-claude.sh` and `t3-serve@.service` load
that env file. **Sessions started before activation keep the old credential
until relaunched** — the user must restart their agents / `t3-serve` to cut over.
**Disable it:** clear the field (`vault kv patch -method=rw
secret/workstation/claude-users/<os-user> setup_token=""`) — the next sync removes
the env file and the user reverts to the per-user SSO credential flow.
**Rotate before expiry:** setup-tokens expire 1y after mint. Re-mint (step 1) and
re-store (step 2); the env file refreshes on the next sync.
## Verification

View file

@ -0,0 +1,346 @@
# Goldmane Flow Trail — east-west "who-talks-to-whom" observability
> As-built runbook for the Calico Goldmane + Whisker flow plane and the
> `goldmane-edge-aggregator` durable audit trail. Design + rationale:
> [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
> Glossary: `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
> Implements infra issues #57 (Whisker ingress), #58 (aggregator), #61
> (monitoring), #62 (egress allowlist queries), #63 (these docs).
## What the trail is
Three layers turn raw east-west traffic into a queryable, durable record of
which Service talks to which. **Service identity = the workload's namespace**
(primary), refined by a `service-identity` label in the few multi-Service
namespaces (`monitoring`, `kube-system`, `dbaas`) — see ADR-0014.
| Layer | Component | Lifetime | Where it lives |
|---|---|---|---|
| **Live map** | Calico **Goldmane** + **Whisker** | ~60-min in-memory ring buffer (lost on Goldmane restart) | `calico-system`; Whisker UI at `whisker.viktorbarzin.me` |
| **Durable trail** | `goldmane-edge-aggregator` (`aggregate` mode) | persistent | CNPG Postgres DB `goldmane_edges`, table `edge` |
| **Notification** | `goldmane-edges-digest` CronJob (`digest` mode) | daily | Slack `#alerts` |
**Goldmane** aggregates identity-stamped flows (namespace / pod / workload /
labels + allow-deny + policy-trace) streamed from Felix (the existing
`calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer —
**nothing is written to etcd or the K8s API** (the etcd-cost constraint that
drove the whole design). **Whisker** is its live web UI. Because the ring
buffer is *not* a trail (a Goldmane restart loses the window), the
`goldmane-edge-aggregator` consumes Goldmane's gRPC `Flows.Stream` API over
mTLS and upserts the unique **namespace-pair edge set** into Postgres; a daily
CronJob posts first-seen edges to Slack.
The edge set is deliberately **low-cardinality** — one row per
`(src_ns, dst_ns, action)`, *not* per-pod or per-port — so the table stays
small no matter how much traffic flows.
## Where the data lives
### Whisker UI — live, ~60 min
- `https://whisker.viktorbarzin.me` (Authentik-gated — Whisker ships no own
login; `auth = "required"`). Shows the live flow stream + a service graph for
roughly the last hour. Use it for "what is talking right now"; it is **not**
history.
- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
(HTTP), both in `calico-system`.
- **DNS fix + self-heal:** whisker's egress to the kube-dns ClusterIP is allowed
by `whisker-allow-dns-clusterip` (`stacks/calico`) — without it the UI goes
empty after any gRPC-stream break (see Troubleshooting → "Whisker UI empty").
The `whisker-watchdog` CronJob (every 10 min) is a backstop that restarts
whisker if its backend ever wedges for another reason.
### CNPG `goldmane_edges` — durable
- Postgres DB `goldmane_edges` on the CNPG cluster
(`pg-cluster-rw.dbaas.svc.cluster.local:5432`). One table:
```
edge(src_ns text, dst_ns text, action text,
first_seen timestamptz, last_seen timestamptz, flow_count bigint,
PRIMARY KEY (src_ns, dst_ns, action))
```
- `action` ∈ `allow` / `deny` / `pass` / `unspecified` (normalised Goldmane
action).
- **Self-edges (`src_ns == dst_ns`) and empty-namespace flows** (host-endpoint
/ public-internet) are **dropped** — the trail is about in-cluster service
relationships only. (Egress to the public internet is therefore NOT in this
table; it lives in the Wave-1 Calico flow-log path — see security.md.)
- A **"new edge"** = a row whose `first_seen` falls inside the digest window.
- Role `goldmane_edges` (Vault-rotated, 7-day) owns the DB. The `edge` table
is created idempotently by the aggregator at startup (canonical DDL also in
the repo at `migrations/0001_edge.sql`).
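To make the upsert and "new edge" semantics concrete, here is a minimal sketch that uses the standard-library `sqlite3` module as a stand-in for Postgres. It is not the aggregator's code: the helper names are hypothetical, timestamps are ISO strings rather than `timestamptz`, and incrementing `flow_count` by one per observed flow is an assumption.

```python
import sqlite3

# Same columns as the edge table above, with SQLite types as a stand-in.
DDL = """
CREATE TABLE IF NOT EXISTS edge(
    src_ns TEXT, dst_ns TEXT, action TEXT,
    first_seen TEXT, last_seen TEXT, flow_count INTEGER,
    PRIMARY KEY (src_ns, dst_ns, action))
"""
# On conflict, first_seen is left untouched so "new edge" stays meaningful.
UPSERT = """
INSERT INTO edge (src_ns, dst_ns, action, first_seen, last_seen, flow_count)
VALUES (?, ?, ?, ?, ?, 1)
ON CONFLICT (src_ns, dst_ns, action) DO UPDATE SET
    last_seen = excluded.last_seen,
    flow_count = flow_count + 1
"""


def record_flow(conn, src_ns, dst_ns, action, seen_at):
    """Upsert one flow; drop self-edges and empty-namespace flows."""
    if not src_ns or not dst_ns or src_ns == dst_ns:
        return False
    conn.execute(UPSERT, (src_ns, dst_ns, action, seen_at, seen_at))
    return True


def new_edges(conn, window_start):
    """Edges whose first_seen falls at or after window_start."""
    rows = conn.execute(
        "SELECT src_ns, dst_ns, action FROM edge "
        "WHERE first_seen >= ? ORDER BY src_ns, dst_ns, action",
        (window_start,))
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
record_flow(conn, "traefik", "authentik", "allow", "2026-06-27T10:00:00Z")
record_flow(conn, "traefik", "authentik", "allow", "2026-06-28T09:00:00Z")
record_flow(conn, "monitoring", "dbaas", "allow", "2026-06-28T09:30:00Z")
record_flow(conn, "dbaas", "dbaas", "allow", "2026-06-28T09:31:00Z")  # self-edge
record_flow(conn, "", "dbaas", "deny", "2026-06-28T09:32:00Z")        # empty ns
print(new_edges(conn, "2026-06-28T00:00:00Z"))
```

Because the primary key is the `(src_ns, dst_ns, action)` triple, repeated traffic only touches `last_seen` and `flow_count`, which is what keeps the table small regardless of flow volume.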
### Slack `#alerts` — daily digest
> **Channel note (2026-06-25):** posts to **`#alerts`**. The dedicated `#security` channel was abandoned — the shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of it, so a channel override there returns HTTP `404 channel_not_found`. Everything now posts to `#alerts` (this digest plus alertmanager's `slack-security` receiver, which keeps its `[SECURITY]` styling so security-lane alerts still stand out there).
- CronJob `goldmane-edges-digest` (08:00 Europe/London) posts edges first seen
in the last 24h. Quiet when there are none. Reuses the existing alert-digest
Slack incoming webhook (Vault `secret/viktor` → `alertmanager_slack_api_url`)
— no new webhook was created.
## How to enable / disable
### Goldmane + Whisker (the flow plane)
Operator CRs in **`stacks/calico/main.tf`** — NOT the Helm `goldmane`/`whisker`
flags (those stay `false`; the operator's own `installation`/`apiServer` are
operator-managed via the `goldmanes`/`whiskers.operator.tigera.io` CRDs):
- `kubectl_manifest.goldmane` (kind `Goldmane`) — creating it makes the operator
re-render `calico-node` with the `FELIX_FLOWLOGSGOLDMANESERVER` env (the
operator auto-wires Felix — **do NOT patch FelixConfiguration**), triggering a
supervised `calico-node` DaemonSet roll. Yields `Deployment` + `Service
goldmane:7443`.
- `kubectl_manifest.whisker` (kind `Whisker`, `depends_on` goldmane;
`notifications = Disabled`). Yields `Deployment` + `Service whisker:8081`.
**To disable:** delete those two CRs and re-apply `stacks/calico`. Reversible
toggle (Goldmane is tech-preview in OSS Calico 3.30 — the main standing risk per
ADR-0014).
### Whisker public ingress (infra #57)
Also in `stacks/calico/main.tf`:
- `module "ingress_whisker"` (`ingress_factory`, `auth = "required"`,
`dns_type = "proxied"`) → `whisker.viktorbarzin.me`.
- `kubernetes_network_policy_v1.whisker_allow_traefik` — **required alongside the
ingress**: the operator's own `whisker` NetworkPolicy (owned by the Whisker CR)
is `policyTypes: [Ingress]` with no rules = default-deny ingress to the pod.
This additive NP ORs in an allow for `namespaceSelector
kubernetes.io/metadata.name=traefik` on TCP 8081. Without it Traefik 502s.
### The aggregator + digest (the durable trail) — `stacks/goldmane-edge-aggregator`
A Tier-1 stack (PG state) mirroring the claude-memory pattern. `scripts/tg
apply` from `stacks/goldmane-edge-aggregator/`. It provisions: the namespace,
the mTLS client material, the Postgres DB-init Job, the `DATABASE_URL`
ExternalSecret (Vault static role `pg-goldmane-edges`), the Slack ExternalSecret,
the `aggregate` Deployment, and the `digest` CronJob. **To disable the trail
without touching the flow plane:** scale `deployment/goldmane-edge-aggregator` to
0 (transient) or remove the stack (permanent) — Goldmane/Whisker keep running.
Image: `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (PRIVATE) — the
`goldmane-edge-aggregator` namespace must be in the `ghcr-credentials` Kyverno
allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`,
`local.ghcr_private_namespaces`), otherwise image pulls fail with 401. Code repo:
`~/code/goldmane-edge-aggregator` (see its `README.md` + `DEPLOY.md`).
## mTLS cert — the REUSE decision (cert-reuse gotcha)
The aggregator dials `goldmane:7443` over **mutual TLS**. Goldmane requires the
client cert to chain to the **Tigera CA**, but it does **NOT authorize by client
identity** — any Tigera-CA-signed cert is accepted.
Rather than copy the Tigera CA **private key** into Terraform state to mint our
own cert (a needless CA-key exposure; the `hashicorp/tls` provider also clashes
with this repo's global generate-providers/lockfile pattern), the stack
**REUSES the operator-minted, Tigera-CA-signed `whisker-backend-key-pair`
Secret** (`calico-system`), copying its `tls.crt`/`tls.key` into the
`goldmane-client-tls` Secret in the aggregator namespace. The CA *bundle* that
verifies Goldmane's serving cert (`tigera-ca-bundle` ConfigMap, key
`tigera-ca-bundle.crt`) is likewise copied verbatim (a ConfigMap can't be
cross-namespace-mounted).
> **GOTCHA — if the operator rotates `whisker-backend-key-pair`, re-apply
> `stacks/goldmane-edge-aggregator`** to re-sync the copied cert. Symptom of a
> stale copy: the `aggregate` pod logs TLS handshake / `Flows.Stream` failures
> and no `last_seen` updates land in the `edge` table. Hardening follow-up
> (noted in the stack): mint an own-identity cert in-namespace if Whisker is ever
> removed (which would delete the reused source Secret).
The Deployment leaves `GOLDMANE_HOST=goldmane.calico-system.svc.cluster.local:7443`
and the default cert/CA paths; the default ServerName (host sans port) is a SAN
on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` /
`GOLDMANE_TLS_INSECURE` override is needed.
## How to query who-talks-to-whom
**Quickest — the `homelab edges` CLI** (the investigation helper; read-only
SELECT against the DB via the dbaas primary pod, no creds/SQL to remember):
```
homelab edges --ns <ns> # edges touching <ns> (either direction)
homelab edges --peers-of <ns> # <ns>'s distinct peer namespaces
homelab edges --src <ns> # <ns>'s egress peers (--dst <ns> for ingress)
homelab edges --new-since 24h # edges first seen in the last day (or a date)
homelab edges --denied # blocked / lateral-movement attempts
homelab edges --json [...] # machine-readable, for agents/pipelines
homelab edges --help # full flag list
```
For ad-hoc SQL, `psql` into the DB (creds: Vault static role
`static-creds/pg-goldmane-edges`, or exec a CNPG pod). All queries are against
the single `edge` table.
```sql
-- Everything talking to a namespace (inbound), most-active first
SELECT src_ns, action, flow_count, first_seen, last_seen
FROM edge WHERE dst_ns = '<ns>' ORDER BY flow_count DESC;
-- Everything a namespace talks TO (outbound)
SELECT dst_ns, action, flow_count, first_seen, last_seen
FROM edge WHERE src_ns = '<ns>' ORDER BY last_seen DESC;
-- New edges in the last 24h (what the digest reports)
SELECT src_ns, dst_ns, action, flow_count, first_seen
FROM edge WHERE first_seen > now() - interval '24 hours'
ORDER BY first_seen DESC;
-- Any DENIED edges (policy is dropping this pair)
SELECT src_ns, dst_ns, flow_count, last_seen
FROM edge WHERE action = 'deny' ORDER BY last_seen DESC;
-- Full edge set as a graph adjacency list
SELECT src_ns, dst_ns, action, flow_count FROM edge ORDER BY src_ns, dst_ns;
```
For the **live** (sub-hour) view including pod/port detail, use the Whisker UI —
the `edge` table intentionally aggregates that away.
## Deriving the Wave-1 egress allowlist from the edge table (infra #62)
The durable edge set is a faster, identity-stamped data source for the existing
**observe-then-enforce** egress effort (beads `code-8ywc`; snapshot
`docs/architecture/wave1-egress-observation-2026-05-22.md`) than the original
iptables-`LOG` → journald → Loki path (ADR-0014 consequence: "Enforcement gains
a better data source"). It replaces the *internal* (namespace-to-namespace) leg
of the allowlist; **external/public-internet egress is NOT in this table** (empty
dst namespace, dropped) — for those destinations keep using the Calico flow-log
path described in security.md.
**Per-namespace internal egress allowlist** — the set of in-cluster namespaces a
given source is *observed* talking to with `action='allow'`:
```sql
-- Internal egress allowlist for one namespace (feeds its NetworkPolicy)
SELECT DISTINCT dst_ns
FROM edge
WHERE src_ns = '<ns>' AND action = 'allow'
ORDER BY dst_ns;
```
```sql
-- Full internal egress matrix for all namespaces at once
SELECT src_ns, array_agg(DISTINCT dst_ns ORDER BY dst_ns) AS allowed_dst_ns
FROM edge
WHERE action = 'allow'
GROUP BY src_ns
ORDER BY src_ns;
```
```sql
-- Sanity: namespaces with a DENY edge already (policy is biting; investigate
-- before tightening further)
SELECT DISTINCT src_ns, dst_ns FROM edge WHERE action = 'deny';
```
**How this feeds enforcement (scope):** the derived `dst_ns` set is the
*internal* half of a namespace's egress allowlist — it tells you which
in-cluster namespaces to permit before flipping that namespace to default-deny.
The universal baseline (kube-dns :53, often dbaas :3306/:5432, redis :6379) and
the external destinations still come from the Wave-1 observation snapshot.
**Enforce-flips remain OUT OF SCOPE** here — this is observe-and-derive only;
the phased per-namespace default-deny rollout (starting `recruiter-responder`)
is tracked under `code-8ywc`. Cross-links:
[security.md → NetworkPolicy Default-Deny Egress](../architecture/security.md#networkpolicy-default-deny-egress-wave-1--observe-then-enforce-tier-34),
[wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md),
[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
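To make the "internal half" concrete, here is a hedged Python sketch that turns a derived `dst_ns` set into a NetworkPolicy manifest (as a plain dict). It is an illustration of the data shape only, not part of any stack: `internal_egress_policy` and the policy name are hypothetical, and the universal baseline (kube-dns, dbaas, redis) plus external destinations are deliberately left out, so applying this alone would break DNS.

```python
def internal_egress_policy(namespace, allowed_dst_ns):
    """Build an egress-only NetworkPolicy dict for the observed in-cluster peers."""
    peers = [
        # kubernetes.io/metadata.name is set automatically on every namespace.
        {"namespaceSelector": {"matchLabels": {"kubernetes.io/metadata.name": ns}}}
        for ns in sorted(set(allowed_dst_ns))
    ]
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": "allow-observed-internal-egress",  # hypothetical name
            "namespace": namespace,
        },
        "spec": {
            "podSelector": {},  # every pod in the namespace
            "policyTypes": ["Egress"],
            # An egress rule with an empty `to` would allow ALL destinations,
            # so emit no rule at all when nothing was observed.
            "egress": [{"to": peers}] if peers else [],
        },
    }
```

Feeding it the output of the per-namespace `SELECT DISTINCT dst_ns` query gives a reviewable starting point; the enforce-flip itself stays out of scope here.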
> **Caveat (same as the Wave-1 snapshot):** an edge only exists if it was
> *observed*. A weekly CronJob or a 7-day Vault rotation may not have fired yet —
> collect ≥7 days of edges before treating a namespace's `allow` set as
> complete. The `first_seen` column tells you how long an edge has been known;
> the digest surfaces brand-new ones daily.
## Monitoring & health (infra #61)
The aggregator pod has **no `/metrics` endpoint** — health is inferred from
kube-state-metrics. Three complementary signals (memory ids 6598, 6599;
see also [monitoring.md → Security Alerts](../architecture/monitoring.md#security-alerts-wave-1--planned-beads-code-8ywc)):
| Signal | What | Where |
|---|---|---|
| **`AggregatorDown`** | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` for 15m → warning | Prometheus alert group `Network Observability (Goldmane)` in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`; routes `slack-warning` → `#alerts` |
| **`DigestFailing`** | `kube_job_status_failed{...job_name=~"goldmane-edges-digest.*"} > 0` within 24h, for 30m → warning | same alert group → `#alerts` |
| **cluster-health #48** | `check_goldmane_aggregator` reads the Deployment's `Available` condition (missing or not-Available → FAIL) | `scripts/cluster_healthcheck.sh` (human / `--quiet` / `--json` modes; emits `goldmane_aggregator`) |
The two alert layers are deliberately complementary: `AggregatorDown` →
**no new edges land** in the DB; `DigestFailing` → **edges still land but nobody
is told**. A freshness probe (#61b) was intentionally skipped — `AggregatorDown`
is the agreed floor.
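The cluster-health check's rule (a missing Deployment or a non-`Available` condition means FAIL) is simple enough to sketch. The real check lives in the shell script `scripts/cluster_healthcheck.sh`; the Python below is only an assumed equivalent operating on parsed `kubectl get deploy -o json` output, and the `"PASS"` return value is a placeholder for whatever the script reports on success.

```python
def check_goldmane_aggregator(deployment):
    """deployment: parsed Deployment JSON as a dict, or None if it does not exist."""
    if deployment is None:
        return "FAIL"  # Deployment missing entirely
    conditions = deployment.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") == "Available":
            # Kubernetes reports condition status as the strings "True"/"False"/"Unknown".
            return "PASS" if cond.get("status") == "True" else "FAIL"
    return "FAIL"  # no Available condition reported yet
```

Treating an absent condition as a failure matches the table row above: "missing or not-Available" both count against the check.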
## Troubleshooting
**Whisker UI 502 / unreachable.** The additive
`kubernetes_network_policy_v1.whisker_allow_traefik` is missing or the
operator's default-deny `whisker` NP regenerated — re-apply `stacks/calico`. A
brand-new ingress host is also invisible to LAN split-horizon until the hourly
`technitium-ingress-dns-sync` runs (memory #5349); test meanwhile with
`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
(expect a 302 to Authentik — the gate working).
**Whisker UI empty (but reachable — 302s to Authentik fine).** ROOT CAUSE (the
2026-06-28 incident): the operator's own `whisker` NetworkPolicy is
policyTypes:[Ingress,**Egress**], and its egress allows DNS only to the kube-dns
*pods* (podSelector `k8s-app=kube-dns`). But whisker-backend resolves
`goldmane.calico-system.svc` via the kube-dns **ClusterIP** (10.96.0.10), and
**Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule**.
Verified: from the whisker pod's netns, ClusterIP DNS = 100% timeout while direct
kube-dns *pod-IP* DNS = OK, and a pod with no egress policy resolves fine.
whisker-backend resolves goldmane ONCE in the brief startup window before the
policy programs, holds its long-lived gRPC stream, and only re-resolves when that
stream breaks (e.g. a node-reboot blip) — at which point the blocked ClusterIP
DNS wedges its Go resolver (`failed to stream flows` / `code = Unavailable: dns
... i/o timeout` forever) and the UI goes blank. The durable **aggregator is a
SEPARATE pod in its own (unrestricted) namespace** and is unaffected.
FIX (applied 2026-06-28): `kubernetes_network_policy_v1.whisker_allow_dns_clusterip`
(`stacks/calico`) — an additive egress NP allowing whisker → the kube-dns
ClusterIP (`10.96.0.10/32`) on 53/UDP+TCP; k8s egress policies are additive so
the operator NP is untouched. Backstop: the `whisker-watchdog` CronJob restarts
the pod if it ever wedges for another reason. Immediate manual heal:
`kubectl -n calico-system delete pod -l k8s-app=whisker`. Diagnose by comparing,
from the whisker pod's netns, `nslookup goldmane.calico-system.svc.cluster.local
10.96.0.10` (the ClusterIP — times out if the NP fix is missing) against the same
query aimed at a kube-dns *pod IP* (always works).
**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).
Common causes, in order:
1. **Stale mTLS cert** — the operator rotated `whisker-backend-key-pair`; re-apply
`stacks/goldmane-edge-aggregator` (see cert-reuse gotcha above). Symptom: TLS
handshake / `Flows.Stream` errors.
2. **Stale DB password** — the 7-day Vault rotation bounced the credential but
the pod kept the old one. The Deployment carries
`secret.reloader.stakater.com/reload: goldmane-edges-db-creds`; if it's not
restarting on rotation, verify the Reloader annotation and the ExternalSecret.
3. **Goldmane restarted** — the in-memory window was lost (expected); the stream
reconnects automatically and resumes upserting. No data loss in the DB
(only the sub-hour live window in Whisker is gone).
**Digest never posts / `DigestFailing` firing.** Inspect the most recent
`goldmane-edges-digest-*` Job (`kubectl get jobs -n goldmane-edge-aggregator`;
`kubectl logs job/<name>`). The CronJob's `ttl_seconds_after_finished=86400` GCs
pods after a day, so check soon after a failed run. With `SLACK_WEBHOOK_URL`
empty the binary forces a dry-run (no post) — verify the `goldmane-edges-slack`
ExternalSecret resolved. A dry run / smoke test: run the image with `args:
["digest"]` + `DRY_RUN=1` to print the message instead of POSTing.
> Resolved (2026-06-28): the digest posts cleanly to `#alerts`
> (`lastSuccessfulTime` current, `DigestFailing` clear; e.g. the 2026-06-28 08:00
> London run reported "8 new edges in last 24h"). The 2026-06-25 failures were
> the `#security` channel override returning HTTP 404 — the shared
> `alertmanager_slack_api_url` webhook's Slack app isn't a member of `#security`;
> consolidating all Slack output to `#alerts` fixed it.
**No edges at all in the table.** Confirm Goldmane is enabled
(`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the
`FELIX_FLOWLOGSGOLDMANESERVER` env; confirm the `goldmane-edges-db-init` Job
completed; confirm the aggregator pod is `Running` and not `ImagePullBackOff`
(ghcr allowlist).
## Related
- [ADR-0014 — Service identity & east-west observability](../adr/0014-service-identity-and-east-west-observability.md)
- [security.md — NetworkPolicy Default-Deny Egress + east-west flow observability](../architecture/security.md)
- [monitoring.md — east-west flow observability + alerts](../architecture/monitoring.md)
- [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md)
- `CONTEXT.md` glossary — **Service identity**, **Goldmane / Whisker**
- Code: `~/code/goldmane-edge-aggregator` (`README.md`, `DEPLOY.md`); stacks
`stacks/goldmane-edge-aggregator`, `stacks/calico`


@ -0,0 +1,164 @@
# `homelab vault` onboarding (Vaultwarden access + `vault kv` infra secrets)
## Scope
`homelab vault` fronts **two unrelated secret stores** — the name collides, so
the command keeps them clearly separated:
- **Vaultwarden** — your personal *password manager* (logins/passwords/TOTP).
The verbs below give each devvm roster user no-HITL access to **their own**
Vaultwarden vault (and any Organization Collection shared with their account).
It shells out to the official `bw` CLI; the user's Vaultwarden credentials live
only in their isolated Vault path `secret/workstation/claude-users/<os-user>`
and are decrypted as that OS user — the admin never sees them.
- **HashiCorp Vault / OpenBao** — the homelab *infra* secrets store (the
`secret/…` KV mount at `vault.viktorbarzin.me`), under `homelab vault kv`.
These use the caller's **own** Vault token (`vault login -method=oidc` →
`~/.vault-token`), **not** the scoped Vaultwarden token (which only reads the
`claude-users/<user>` path); access is whatever your Vault policy grants.
```text
# Vaultwarden (password manager)
homelab vault setup one-time: store VW email + master password + API key
homelab vault status configured / unlocked / reachable (no secrets)
homelab vault list [--search Q] item names (no secrets)
homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
homelab vault get <name> --all all fields (incl. custom) as JSON; pipe it (| jq)
homelab vault code <name> current TOTP code
homelab vault lock lock / log out the local bw session
# HashiCorp Vault / OpenBao (infra secrets; uses your own OIDC token)
homelab vault kv get <path> [--field K] read an infra KV secret
homelab vault kv list <path> list sub-paths
homelab vault kv put <path> <key> write one key (value via stdin; merges)
```
## How auth works (why a non-admin can use it)
`homelab vault` runs `vault` as the calling user. It resolves a Vault token in
this order (`ensureVaultToken`, `cli/cmd_vault.go`):
1. an explicit `$VAULT_TOKEN` (a deliberate override), then
2. the per-user **scoped token** that `claude-auth-sync` maintains at
`~/.config/claude-auth-sync/vault-token` (policy `workstation-claude-<user>`), then
3. a native `~/.vault-token` (admins who carry one; non-admins usually don't).
**The scoped token deliberately beats `~/.vault-token`.** This tool only touches
your own `secret/workstation/claude-users/<user>` path, and a power-user who ran
`vault login -method=oidc` carries a read-only `~/.vault-token` (capability
`deny` on that path); letting it win would shadow the scoped token and fail every
op with `403 permission denied` (this is exactly what bit emo, 2026-06-28). The
CLI also **self-defaults `VAULT_ADDR`** to `https://vault.viktorbarzin.me` when
unset, so it works from non-login shells (tmux panes, AFK agent subprocesses)
that never sourced `/etc/environment` — otherwise every `vault` child hits the
`127.0.0.1:8200` default and fails `connection refused` (exit 2).
That scoped policy grants exactly `create`/`read`/`update` on the user's own
`secret/workstation/claude-users/<user>` path — no `patch` capability — so the
tool writes with `vault kv patch -method=rw` (read-modify-write), falling back to
`kv put` only when the path does not exist yet. This preserves the
`claude_ai_oauth_json` key that [claude-auth-sync](claude-auth-renew-workstation.md)
co-locates there. (The admin-only bugs were fixed 2026-06-27; the
`VAULT_ADDR`/token-precedence bugs above were fixed 2026-06-28.)
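The token precedence and the `VAULT_ADDR` default described above can be summarised as a small Python sketch. The real logic is the Go function `ensureVaultToken` in `cli/cmd_vault.go`; `resolve_vault_env` and its return shape are hypothetical and exist only to show the ordering.

```python
from pathlib import Path

DEFAULT_VAULT_ADDR = "https://vault.viktorbarzin.me"


def resolve_vault_env(env, home):
    """Return (vault_addr, token, source) following the documented precedence."""
    # Self-default the address so non-login shells do not fall back to 127.0.0.1:8200.
    addr = env.get("VAULT_ADDR") or DEFAULT_VAULT_ADDR
    if env.get("VAULT_TOKEN"):
        return addr, env["VAULT_TOKEN"], "env"  # 1. deliberate override
    candidates = [
        # 2. the scoped token beats the native one on purpose
        ("scoped", Path(home) / ".config" / "claude-auth-sync" / "vault-token"),
        # 3. native token, usually present only for admins
        ("native", Path(home) / ".vault-token"),
    ]
    for source, path in candidates:
        if path.is_file():
            token = path.read_text().strip()
            if token:
                return addr, token, source
    return addr, None, "none"
```

A call such as `resolve_vault_env(os.environ, Path.home())` would report which source won, which is a quick way to reason about a `403 permission denied` caused by the wrong token being picked up.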
## Prerequisites (per user)
- The user is in `scripts/workstation/roster.yaml` and the **vault** stack has
been applied → their `workstation-claude-<user>` policy exists.
- The user's workstation was provisioned (`setup-devvm.sh`) → their scoped Vault
token exists at `~/.config/claude-auth-sync/vault-token`.
- `bw` is installed **system-wide** at `/usr/bin/bw` (see below).
- The user has a Vaultwarden account at `https://vaultwarden.viktorbarzin.me`
(self-service signup is open; admin panel is disabled).
## One-time admin steps (devvm)
`bw` must be system-wide so every user resolves it (it is a Node script, and
`node` is already system-wide at `/usr/bin/node`). `setup-devvm.sh` installs it
to the npm `/usr` prefix; the guard checks the **system** path, not
`command -v bw` (an admin's own `~/.local/bin/bw` used to mask the system
install, leaving non-admins with no backend). To install on a running box:
```bash
sudo npm install -g --prefix /usr "@bitwarden/cli@^2024"
bw --version # confirm /usr/bin/bw resolves
```
After landing a `cli/` change, rebuild the binary so users pick it up:
```bash
# version is stamped from cli/VERSION, exactly as setup-devvm.sh does it
sudo bash -c 'cd /home/wizard/code/infra/cli && \
go build -ldflags "-X main.version=$(cat VERSION 2>/dev/null || echo dev)" \
-o /usr/local/bin/homelab .'
```
(or just re-run `scripts/workstation/setup-devvm.sh` as root, which rebuilds it.)
## User onboarding
The user runs these as themselves. The master password / API key are entered
interactively (never on the command line) and stored only in the user's Vault
path.
1. In the Vaultwarden web vault → **Settings → Security → Keys → View API key**,
copy the `client_id` (`user.xxxx`) and `client_secret`.
2. Configure:
```bash
homelab vault setup # prompts: VW email, API client_id/secret, master password
homelab vault status # → "vault: configured, unlocked, reachable ✓"
homelab vault list # item names (own vault + any shared Collections)
```
## Shared-Collection access (sharing passwords with a user)
`homelab vault` surfaces Organization Collection items automatically once the
user's Vaultwarden account is a confirmed member. These steps are done by the
vault owner in the **Vaultwarden web UI** (they need the owner's master
password — not an infra/Terraform operation):
1. Create or reuse an **Organization** and a **Collection** of shared logins.
2. **Invite** the user's Vaultwarden account to the Organization, granting
**"Can view"** on that Collection (least privilege).
3. The user accepts the email invite and confirms membership.
4. The user runs `homelab vault list` — the shared items now appear alongside
their own (a `homelab vault status` sync picks them up).
## Security model (the no-HITL trade)
Identity is the kernel UID. Anything running as the user can decrypt the user's
vault — this is the accepted trade for no-human-in-the-loop fetches. Secrets
never appear in `argv` (passed via env or stdin), core dumps are disabled, TOTP
fetches are logged to syslog/Loki, and on a TTY values go to the clipboard
(auto-clearing) rather than scrollback. The admin's Vault token is never used by
a non-admin: each user authenticates with their own scoped token.
## Verification
```bash
# the scoped token carries the right policy
VAULT_TOKEN="$(sudo cat /home/<user>/.config/claude-auth-sync/vault-token)" \
vault token lookup -format=json | jq '.data.display_name, .data.policies'
# → "token-devvm-claude-auth-<user>", [..., "workstation-claude-<user>"]
sudo -u <user> -i bw --version # /usr/bin/bw resolves for the user
sudo -u <user> -i homelab vault status
```
## Troubleshooting
**`homelab vault setup` (or any verb) fails with `exit status 2`** — older
binaries swallowed the underlying `vault` error; the message now includes it.
Two historical causes (both fixed in-CLI 2026-06-28, kept here for diagnosis):
- `... connection refused` to `127.0.0.1:8200` → `VAULT_ADDR` wasn't set in the
caller's shell. The CLI now self-defaults it, but if you see this on an old
binary: `export VAULT_ADDR=https://vault.viktorbarzin.me`.
- `403 permission denied` on `PUT .../secret/data/workstation/claude-users/<user>`
→ a stale read-only `~/.vault-token` (e.g. from `vault login -method=oidc`,
policy `default`, capability `deny` on that path) was shadowing the scoped
token. The CLI now prefers the scoped token; on an old binary, `rm
~/.vault-token` (or `unset VAULT_TOKEN`) and retry. Confirm with
`VAULT_TOKEN="$(sudo cat /home/<user>/.config/claude-auth-sync/vault-token)" vault token capabilities secret/data/workstation/claude-users/<user>`
→ must be `create, read, update`.


@ -36,11 +36,13 @@ envsubst on /template/job-template.yaml | kubectl apply -f -
Job 0 — preflight (pinned: k8s-node1)
├── compat-gate: addon/API/containerd support for target (else BLOCK+alert)
├── compat-gate: addon/API/containerd support for target (else BLOCK-actionable+alert / HOLD-quiet)
├── All nodes Ready + no Mem/Disk pressure
├── halt-on-alert (kured-style ignore-list)
├── 24h-quiet baseline (no Ready transitions <24h ago)
├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume)
├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block)
├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups)
├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
├── Trigger backup-etcd Job, wait, verify snapshot byte count
├── SSH master: containerd skew fix (if master < workers)
@ -112,18 +114,36 @@ inert for a patch (no API removal or containerd floor occurs inside a minor).
This is the **"auto-upgrade when we can, halt + alert when we can't"** contract.
**On a block**, the gate:
- pushes `k8s_upgrade_blocked=1` to Pushgateway (→ the `K8sUpgradeBlocked`
Prometheus alert),
- Slacks the **specific reasons** (which addon/API/node, current vs required), and
- **halts the chain** — it exits **non-fatal** (the upgrade simply isn't safe yet,
this is not a failure). Because the block happens **before any mutation, no
rollback is involved**; nothing was changed.
**The gate classifies each refusal** (2026-06-28) so it only cries wolf when
there's something to do — `compat-gate.py` exit code + a `[TAG]` on every reason:
**To clear a block**: upgrade the named addon (or migrate the API caller off the
deprecated group/version, or bump containerd on the named node) so the offending
condition no longer holds. The **next nightly run then proceeds automatically**
(no manual chain restart needed).
- **`[ACTIONABLE]`** (exit 2) — a newer version of the lagging addon **exists in
the compat matrix** and upgrading it would clear the block (or an in-use
deprecated API must be migrated / a node's containerd bumped).
- **`[WAITING]`** (exit 4 = held) — **no released addon version supports the
target yet** (e.g. kyverno/ESO behind a brand-new k8s minor). Only an upstream
release can clear it.
- **`[PINNED]`** (exit 4 = held) — a supporting version exists but the addon is
**deliberately pinned** in the matrix (`"pinned": true`, e.g. gpu-operator,
whose bump is coupled to a newer NVIDIA driver image + Ubuntu/kernel).
- **Held wins on a mix**: if any blocker is waiting/pinned the whole target is
held — acting on the actionable ones wouldn't unblock it yet.
**On any refusal** the preflight pushes the verdict gauge (`k8s_upgrade_blocked=1`
for actionable, `k8s_upgrade_held=1` for held), sets `HALT_CHAIN` so the chain
doesn't advance, and **exits 0 — the Job Completes cleanly** (a refusal is a
decision, not a failure: no Failed Job, no `K8sUpgradeChainJobFailed`). It's
before any mutation, so no rollback. Reasons (grouped by class) appear in the
**morning nightly report**, not a per-run Slack.
- **Actionable** → `K8sUpgradeBlocked` fires (once, via alert-on-change). Clear
it by doing the named upgrade/migration; the next nightly run proceeds.
- **Held** → **deliberately NO alert**: only the nightly report's `⏸️ HELD`
line, because it can't be actioned now (a nightly alert would cry wolf). It
clears itself once upstream ships support (refresh `addon-compat.json`) or the
pin is lifted (delete `pinned`+`pin_reason`). The detector re-evaluates every
night, silently re-spawning the refused-but-Complete preflight (so a cleared
block is picked up next run, not after the 7d Job TTL).
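The classification rules above ("held wins on a mix", exit 2 for actionable, exit 4 for held) reduce to a short decision function. This is a sketch under stated assumptions, not the contents of `compat-gate.py`: `gate_verdict` and `verdict_gauges` are hypothetical names, and exit code 0 for a safe target is assumed rather than documented here.

```python
ACTIONABLE, WAITING, PINNED = "ACTIONABLE", "WAITING", "PINNED"


def gate_verdict(blockers):
    """blockers: list of (tag, reason). Returns (exit_code, verdict, tagged_reasons)."""
    reasons = [f"[{tag}] {reason}" for tag, reason in blockers]
    tags = {tag for tag, _ in blockers}
    if not tags:
        return 0, "safe", reasons  # assumption: 0 means the target is safe
    if tags & {WAITING, PINNED}:
        # Held wins on a mix: fixing the actionable items would not unblock yet.
        return 4, "held", reasons
    return 2, "blocked", reasons


def verdict_gauges(verdict):
    """Both gauges are written on every run so a cleared refusal resets to 0."""
    return {
        "k8s_upgrade_blocked": int(verdict == "blocked"),
        "k8s_upgrade_held": int(verdict == "held"),
    }
```

Only the `blocked` verdict drives an alert (`K8sUpgradeBlocked`); the `held` gauge is read by the nightly report alone.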
The **compat matrix** lives in
`stacks/k8s-version-upgrade/scripts/addon-compat.json` — a map of `addon → highest
@ -163,6 +183,8 @@ Pushed by upgrade-step.sh during phase execution; observed by the
| `k8s_upgrade_in_flight` (1/0) | preflight Job (set to 1) | postflight Job (set to 0) |
| `k8s_upgrade_started_timestamp` (epoch s) | preflight Job | postflight Job (set to 0) |
| `k8s_upgrade_snapshot_taken` (1/0) | preflight Job (set to 1 after Job=`pre-upgrade-etcd-*` completes with `Backup done:` log of ≥1 KiB) | postflight Job (0) |
| `k8s_upgrade_blocked` (1/0) | preflight Job — set 1 on an **actionable** compat refusal (→ `K8sUpgradeBlocked`) | preflight (definitive each run; 0 when safe) / postflight (0) |
| `k8s_upgrade_held` (1/0) | preflight Job — set 1 on a **held** (waiting-upstream/pinned) refusal; **no alert** | preflight (definitive each run; 0 when safe) / postflight (0) |
| `k8s_upgrade_available{kind,running,target}` | detection CronJob | next detection run (overwrite) |
| `k8s_version_check_last_run_timestamp` | detection CronJob | (cumulative) |
@ -171,8 +193,8 @@ Pushed by upgrade-step.sh during phase execution; observed by the
- **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout.
- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
- **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
- **`K8sUpgradeChainJobFailed`** — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The `unless k8s_upgrade_blocked == 1` clause (added 2026-06-21) excludes a preflight that failed because the **compat gate deliberately refused** the target — that's owned by `K8sUpgradeBlocked` and was double-firing here; a genuine wedge exits without setting the blocked gauge, so it still fires.
- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). A k8s **auto-upgrade was refused** by the compat gate because a critical addon, an in-use deprecated API, or a node's containerd is too old for the detected target. The **specific reasons are in Slack**; clear it by upgrading the named addon / migrating the API caller / bumping containerd, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert.
- **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The old `unless on() (k8s_upgrade_blocked == 1)` clause was **dropped 2026-06-28**: compat-gate refusals now Complete cleanly (exit 0) instead of Failing, so a terminally-Failed chain Job again means a genuine wedge with nothing to exclude.
- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). An **ACTIONABLE** compat-gate refusal — a newer version of the lagging addon exists and upgrading it would clear the block (or an in-use deprecated API must be migrated / a node's containerd bumped). Reasons (grouped by class) are in the **morning nightly report**; clear it by doing the named upgrade/migration, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert. **There is deliberately NO companion alert for the held verdict** (`k8s_upgrade_held=1` — waiting-on-upstream / pinned): nothing can be actioned now, so it is surfaced only by the nightly report's `⏸️ HELD` line.
- The first four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
### Nightly upgrade report (Slack)
@ -181,8 +203,8 @@ CronJob `k8s-upgrade-nightly-report` (k8s-upgrade ns, `var.report_schedule`,
default `7 6 * * *` = 06:07 UTC — after the 23:00 chain, before the 08:00 London
alert-digest) posts ONE Slack summary each morning of the previous night's run:
running version, detector freshness, detected target + kind, the outcome
(⚪ no upgrade needed / 🔴 blocked + live blocker reasons / 🟢 upgraded /
🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads
(⚪ no upgrade needed / 🔴 blocked-actionable + reasons / ⏸️ held = waiting-upstream/pinned /
🟢 upgraded / 🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads
the Pushgateway gauges + live nodes/jobs and re-runs `compat-gate.py` for fresh
blocker reasons; reuses the chain's SA + `slack_webhook` + scripts ConfigMap.
Logic + unit tests: `scripts/nightly-report.py`, `scripts/test_nightly_report.py`.
@ -222,22 +244,34 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names
## Common Operations
### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19)
### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24)
`kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
and drops the `--authentication-config` flag**, silently disabling apiserver
OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get
401). This used to require a manual re-apply after **every** control-plane bump.
from kubeadm-config**. apiserver auth uses a structured multi-issuer
`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to
still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade
reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does
NOT crash on this — verified by isolated repro; it's recoverable via the restore
script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue —
etcd IO starvation**, not this drift; post-mortem:
`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`.
**Now automated:** the `rbac` stack publishes its OIDC restore script to the
`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's
`phase_master` re-runs it on master immediately after `kubeadm upgrade apply`
(while tigera-operator is still quiesced, so the flag-add apiserver restart can't
crashloop the operator). It's idempotent, health-gates `/livez` with
auto-rollback, and is **non-fatal** — a failure only lags SSO until the next rbac
apply (the version upgrade itself already succeeded). So a chain-driven
control-plane bump no longer breaks SSO. The master phase self-skips when master
is already at target, so this only runs when master was actually upgraded.
**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now
**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting
`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of
its remote script. So kubeadm regenerates a **correct** manifest and the apiserver
upgrades with a pure image bump — `kubeadm upgrade diff <target>` shows only the
image change. Zero live impact (the CM is read only during an upgrade).
**Backstops:**
- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does
NOT block — the drift only breaks SSO, which is recoverable) if
`--authentication-config` would still be dropped.
- The `rbac` stack still publishes its restore script to the
`kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on
master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with
auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also*
re-reconciles kubeadm-config. Self-skips when master is already at target.
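A rough sketch of the kind of test a check like Preflight 4b could apply, assuming `kubeadm upgrade diff` prints a unified diff in which removed manifest lines start with `-` and added lines with `+`. The helper name `auth_config_dropped` is hypothetical and is not taken from the chain's scripts.

```shell
# Hypothetical helper (not the chain's real code): reads unified-diff text on
# stdin and succeeds when --authentication-config appears on a removed line
# but on no added line, i.e. the flag would be dropped by the upgrade.
auth_config_dropped() {
  diff_text=$(cat)
  # "^-[^-]" skips the "--- a/file" header; "^\+[^+]" skips "+++ b/file".
  removed=$(printf '%s\n' "$diff_text" | grep -cE '^-[^-].*--authentication-config' || true)
  added=$(printf '%s\n' "$diff_text" | grep -cE '^\+[^+].*--authentication-config' || true)
  [ "$removed" -gt 0 ] && [ "$added" -eq 0 ]
}
```

One possible use is `kubeadm upgrade diff <target> | auth_config_dropped && echo "WARN: flag would be dropped"`, which warns without blocking, in line with the non-blocking behaviour described above.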
**Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the
chain logged `WARN: --authentication-config absent after re-apply`:


@ -0,0 +1,72 @@
# Runbook: pfSense WAN / egress outage
**Scope:** the cluster (and home) loses **internet egress** while pfSense is
otherwise alive — internal VLAN routing and DNS keep working. This is the
**2026-06-27 incident class**: pfSense (Proxmox **VMID 101**) stopped passing
IPv4 egress for ~20 min (00:02→00:23 UTC) while LAN/OPT1 routing + Unbound
stayed up; recovery required a manual reboot, and **nothing alerted** (no egress
probe existed; the cloudflared replica metric stayed green). The alerts +
probes below close that gap. Incident detail: memory ids #6715–#6723.
pfSense is a **single point of failure** (no HA): it is the k8s default gateway
(`10.0.20.1`), Kea DHCP, Unbound DNS, NAT, and the WireGuard hub. WAN is
**static** `192.168.1.2/24`, upstream gateway `WANGW = 192.168.1.1` (the TP-Link
Archer AX6000). It is the sole IPv4 default gateway; there is no gateway group or failover.
## Alerts (all in `stacks/monitoring/modules/monitoring/`)
| Alert | Signal | Means |
|-------|--------|-------|
| `WANGatewayUnreachable` (critical) | in-cluster ICMP to `192.168.1.1` fails >3m | pfSense's upstream gateway is unreachable from the cluster |
| `InternetEgressDown` (critical) | in-cluster ICMP to **both** `9.9.9.9` and `1.1.1.1` fails >2m | internet egress through pfSense NAT is black-holed |
| `ExternalDNSResolutionDown` (warning) | UDP/53 to both public resolvers fails >3m | egress or external-DNS path broken |
| `EgressOnlyDivergence` (critical) | t3-probe `cloudflare` leg down **while** `internal` leg up >3m | egress-specific failure, internal healthy (the exact 2026-06-27 signature) |
| `PfSenseVMDown` (critical) | `pve_up{id="qemu/101"}==0` while host up >2m | the pfSense VM stopped/crashed (host fine) |
| `CloudflaredTunnelConnLoss` (warning, Loki) | >20 cloudflared edge-conn failures/5m | tunnel/egress trouble (canary that fires first; replica metric is blind) |
Probes run **from inside the cluster** (blackbox-exporter, pod → node → pfSense
NAT), so they exercise the exact egress path that fails. `WANGatewayUnreachable`
/ `InternetEgressDown` **inhibit** the downstream egress symptoms so one root
alert pages, not a storm.
`PfSenseVMDown` **does not** catch a *guest-internal* reboot — `pve_up` tracks
the qemu process, which survives an in-guest reboot (this is why 2026-06-27 was
metric-invisible). `CloudflaredTunnelConnLoss` + the probe alerts cover that case.
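The `EgressOnlyDivergence` signature (egress leg down while the internal leg stays up) can be illustrated with a small classifier. This is only a sketch: `classify_outage` and its 0/1 arguments are hypothetical and are not part of the monitoring stack.

```shell
# Hypothetical helper: takes two flags (1 = up, anything else = down) for the
# internal leg and the egress leg, and prints a coarse outage class.
classify_outage() {
  internal_up=$1
  egress_up=$2
  if [ "$internal_up" = 1 ] && [ "$egress_up" = 1 ]; then
    echo healthy
  elif [ "$internal_up" = 1 ]; then
    echo egress-only      # the 2026-06-27 signature
  elif [ "$egress_up" = 1 ]; then
    echo internal-only
  else
    echo total
  fi
}
```

Keeping the two legs as separate inputs mirrors why the divergence alert exists: a single combined "is anything reachable" probe would not distinguish an egress black-hole from a full pfSense outage.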
## Diagnose (read-only first)
1. **Confirm scope** — is it egress-only or total?
- Grafana (`monitoring` namespace) → `probe_success{job=~"wan-gateway-icmp|internet-egress-icmp"}` and `t3probe_connected` by `leg`.
- Internal still up? `pve_up{id="qemu/101"}` should be `1`; internal k8s DNS (`10.0.20.1`) still resolving = pfSense alive, egress-only.
2. **Capture pfSense on-box logs BEFORE rebooting** (they persist on disk — no RAM-disk — and are the only source that proves the mechanism; they are NOT shipped to Loki):
```
ssh -i ~/.ssh/id_ed25519 admin@10.0.20.1 # devvm wizard key (id #6784)
clog /var/log/gateways.log | grep -iE 'WANGW|down|up|delay|loss' # dpinger gateway alarms
clog /var/log/routing.log | grep -iE 'default|route' # default-route add/delete
clog /var/log/system.log | tail -200
netstat -rn | head # is the default route present?
ls -la /var/crash/ # panic/textdump?
```
(If SSH is rejected post-reboot, the reboot regenerated `authorized_keys` from
config.xml — re-add the key via console or WebGUI; see id #6718.)
3. **Upstream check** — is the TP-Link / ISP up? It held the same public IP with
clean DHCP renewals through the 2026-06-27 event, so a *sustained* upstream
fault is unlikely; a reboot fixing it points at **pfSense-side state**.
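For the "is the default route present?" question in step 2, a minimal sketch follows. It assumes FreeBSD-style `netstat -rn` output with `Internet:` and `Internet6:` section headers; the helper name `has_ipv4_default_route` is hypothetical.

```shell
# Hypothetical helper: reads `netstat -rn` output on stdin and succeeds only
# if a "default" destination is listed in the IPv4 ("Internet:") section.
has_ipv4_default_route() {
  awk '
    /^Internet6:/ { v4 = 0 }          # entering the IPv6 table
    /^Internet:/  { v4 = 1 }          # entering the IPv4 table
    v4 && $1 == "default" { found = 1 }
    END { exit found ? 0 : 1 }
  '
}
```

On the pfSense shell this could be run as `netstat -rn | has_ipv4_default_route || echo "IPv4 default route missing"`. Restricting the match to the IPv4 table avoids a false "present" from an IPv6 default route.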
## Recover
- **Fast path (known fix):** reboot pfSense, which re-adds the default route, re-arms
dpinger, and flushes pf state. **Capture the logs and `netstat -rn` output above FIRST**:
the on-disk logs survive, but a reboot wipes the volatile state (routing table, pf
states) needed to find the real mechanism.
- Targeted (if logs show a dpinger gateway-down): System → Routing → Gateways →
WANGW; check the monitor IP + dpinger state; re-enable the gateway / let it
re-eval. Confirm `netstat -rn` shows the default route restored.
## Prevent / harden (deferred, needs a live-pfSense change)
Not done in this monitoring change — tracked for a follow-up with hands-on
pfSense access: point dpinger's monitor at the local gateway (`192.168.1.1`)
instead of an external IP + widen thresholds; disable `gw_down_kill_states` for
the single WAN; add a failover gateway group; a 60s auto-recovery watchdog;
ship pfSense system/gateway/routing syslog to the cluster so these logs become
centrally queryable.


@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
[[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
KUBECTL=""
JSON_RESULTS=()
TOTAL_CHECKS=47
TOTAL_CHECKS=48
# Parallel execution settings. Each check function is self-contained — it
# only reads cluster state and mutates the in-memory counters / JSON_RESULTS
@ -3156,6 +3156,44 @@ PYEOF
esac
}
# --- 48. Goldmane edge-aggregator availability ---
#
# The goldmane-edge-aggregator Deployment (ADR-0014 / infra #58) streams Calico
# Goldmane flows into the goldmane_edges CNPG DB — the durable who-talks-to-whom
# trail. The pod has NO /metrics endpoint, so its liveness can't be scraped;
# this check reads the Deployment's Available condition directly so the trail
# silently dying surfaces in the health board (mirrors the AggregatorDown
# Prometheus alert). Missing Deployment / not-Available -> FAIL.
check_goldmane_aggregator() {
section 48 "Goldmane Edge-Aggregator"
local ns="goldmane-edge-aggregator" dep="goldmane-edge-aggregator"
local avail desired ready
# Existence check first; an absent Deployment is a hard fail (the trail isn't deployed).
if ! $KUBECTL get deploy "$dep" -n "$ns" >/dev/null 2>&1; then
[[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
fail "Deployment $ns/$dep not found — who-talks-to-whom edge trail is not running"
json_add "goldmane_aggregator" "FAIL" "deployment missing"
return 0
fi
avail=$($KUBECTL get deploy "$dep" -n "$ns" \
-o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)
ready=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.status.readyReplicas}' 2>/dev/null)
desired=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.spec.replicas}' 2>/dev/null)
ready=${ready:-0}
desired=${desired:-0}
if [[ "$avail" == "True" ]]; then
pass "Edge-aggregator Available ($ready/$desired ready)"
json_add "goldmane_aggregator" "PASS" "${ready}/${desired} ready"
else
[[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
fail "Edge-aggregator NOT Available ($ready/$desired ready) — edge trail has stopped recording"
json_add "goldmane_aggregator" "FAIL" "${ready}/${desired} ready; Available=${avail:-unknown}"
fi
}
# --- Summary ---
print_summary() {
if [[ "$JSON" == true ]]; then
@ -3224,7 +3262,7 @@ main() {
check_monitoring_prom_am check_monitoring_vault check_monitoring_css
check_external_replicas check_external_divergence check_pve_thermals
check_pve_load check_external_traefik_5xx check_ha_status_dashboard
check_immich_search check_csi_ghost_drift
check_immich_search check_csi_ghost_drift check_goldmane_aggregator
)
# Auto-fix mutates cluster state inside individual checks — keep that


@ -240,6 +240,79 @@ EOF
log "wrote OIDC kubeconfig -> $user:~/.kube/config"
}
# Hands-off chrome-service browser credential. For a user who has a
# `<os_user>-browser` ServiceAccount in the chrome-service namespace (created in
# stacks/chrome-service/rbac.tf), install a DUAL-CONTEXT kubeconfig whose DEFAULT
# context authenticates with that SA's long-lived token — so `homelab browser`
# (which shells out to `kubectl port-forward -n chrome-service`) works
# non-interactively, even from a headless agent session (the user's interactive
# OIDC login can't authenticate a headless kubectl). The user's personal OIDC
# identity is retained as the `oidc@homelab` named context
# (`kubectl --context oidc@homelab`). TF (the SA's existence) is the source of
# truth for WHO gets this — there is no roster flag. Idempotent (cmp-guarded; SA
# tokens are stable) + best-effort (cluster/secret unreachable -> WARN, never aborts).
install_browser_kubeconfig() {
local user="$1" home kc sa secret token server ca tmp
home="$(getent passwd "$user" | cut -d: -f6)"
[[ -z "$home" ]] && return 0
sa="${user}-browser"
secret="${sa}-token"
[[ -r "$ADMIN_KUBECONFIG" ]] || return 0
# Gate: only users with a chrome-service browser SA (TF-driven). Best-effort read.
KUBECONFIG="$ADMIN_KUBECONFIG" kubectl --request-timeout=10s -n chrome-service get serviceaccount "$sa" >/dev/null 2>&1 || return 0
token="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl --request-timeout=10s -n chrome-service get secret "$secret" -o jsonpath='{.data.token}' 2>/dev/null | base64 -d 2>/dev/null || true)"
[[ -n "$token" ]] || { log "WARN: browser SA token not ready for $user (secret chrome-service/$secret) — skipped"; return 0; }
server="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl config view --raw --minify -o jsonpath='{.clusters[0].cluster.server}')"
ca="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl config view --raw --minify -o jsonpath='{.clusters[0].cluster.certificate-authority-data}')"
[[ -n "$server" && -n "$ca" ]] || { log "WARN: could not read cluster server/CA -> skip browser kubeconfig for $user"; return 0; }
kc="$home/.kube/config"
tmp="$(mktemp)"
cat > "$tmp" <<EOF
apiVersion: v1
kind: Config
clusters:
- name: homelab
cluster:
server: $server
certificate-authority-data: $ca
contexts:
- name: ${sa}@homelab
context:
cluster: homelab
user: $sa
- name: oidc@homelab
context:
cluster: homelab
user: oidc
current-context: ${sa}@homelab
users:
- name: $sa
user:
token: $token
- name: oidc
user:
exec:
apiVersion: client.authentication.k8s.io/v1beta1
command: kubectl
args:
- oidc-login
- get-token
- --oidc-issuer-url=$OIDC_ISSUER
- --oidc-client-id=kubernetes
- --oidc-extra-scope=email
- --oidc-extra-scope=profile
- --oidc-extra-scope=groups
interactiveMode: IfAvailable
EOF
if cmp -s "$tmp" "$kc" 2>/dev/null; then rm -f "$tmp"; return 0; fi # already current -> no churn
if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] dual-context (SA default + OIDC) browser kubeconfig -> $user:$kc"; rm -f "$tmp"; return 0; fi
install -d -o "$user" -g "$user" -m 0700 "$home/.kube"
install -o "$user" -g "$user" -m 0600 "$tmp" "$kc" || { log "WARN: failed to write browser kubeconfig for $user"; rm -f "$tmp"; return 0; }
rm -f "$tmp"
log "wrote dual-context browser kubeconfig (SA default + OIDC) -> $user:~/.kube/config"
return 0
}
# Idempotently set KEY=VALUE in a t3-serve env file, PRESERVING other lines — so writing
# T3_PORT never clobbers an injected CLAUDE_CODE_OAUTH_TOKEN, and vice-versa. Mode 0600.
env_set() {
@ -594,6 +667,7 @@ while IFS=$'\t' read -r os_user tier shell groups_csv code_layout repos_csv; do
refresh_user_clone "$os_user" code
fi
install_user_kubeconfig "$os_user"
install_browser_kubeconfig "$os_user" # hands-off chrome-service CLI cred (no-op unless the user has a browser SA)
deploy_user_launcher "$os_user" # keep ~/start-claude.sh current (skel only seeds new accounts)
fi
refresh_codex_mirror "$os_user" # all tiers — mirror of the managed claudeMd


@ -11,6 +11,12 @@ Environment=HOME=/home/%i
Environment=PATH=/usr/local/bin:/usr/bin:/bin:/home/%i/.local/bin
Environment=NODE_ENV=production
EnvironmentFile=/etc/t3-serve/%i.env
# Optional per-user long-lived CLAUDE_CODE_OAUTH_TOKEN, materialized by
# claude-auth-sync from the user's own Vault path. Non-rotating, so t3's
# concurrent agent sessions can't race on OAuth refresh-token rotation and wipe
# the shared ~/.claude/.credentials.json. Leading '-' = optional (absent for
# users on the normal per-user Enterprise-SSO credential flow).
EnvironmentFile=-/home/%i/.config/claude-auth-sync/claude-oauth.env
WorkingDirectory=/home/%i
ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3
Restart=on-failure


@ -28,5 +28,61 @@ ok "accept own scoped Vault token" cas_vault_identity_ok token-devvm-claude-auth
no "reject another user's token" cas_vault_identity_ok token-devvm-claude-auth-anca default,workstation-claude-anca
no "reject wrong policy" cas_vault_identity_ok token-devvm-claude-auth-emo default,workstation-claude-anca
# --- Regression: cas_backup must MERGE into the shared Vault path, preserving
# sibling keys that other tools co-locate there (e.g. `homelab vault`'s
# vaultwarden_* creds) — NOT overwrite the whole KV document. A blind `kv put`
# wiped them every 6h (claude-auth-sync clobber, 2026-06-26).
fakebin="$tmp/bin"; mkdir -p "$fakebin"
store="$tmp/vault-store.json"
cat > "$fakebin/vault" <<'FAKE'
#!/usr/bin/env bash
# Minimal KV-v2 fake backed by $VAULT_FAKE_STORE (a flat JSON object).
[[ "$1" == kv ]] || { echo '{}'; exit 0; } # token lookup etc. -> ignore
op="$2"; shift 2
store="$VAULT_FAKE_STORE"
case "$op" in
get)
for a in "$@"; do [[ "$a" == -field=* ]] && field="${a#-field=}"; done
if [[ "$*" == *-format=json* ]]; then
[[ -f "$store" ]] || { echo "No value found"; exit 2; }
jq -n --argjson d "$(cat "$store")" '{data:{data:$d}}'; exit 0
fi
[[ -f "$store" ]] || exit 2 # bare get == existence check
if [[ -n "${field:-}" ]]; then
v="$(jq -r --arg k "$field" '.[$k] // empty' "$store")"; [[ -n "$v" ]] || exit 1
printf '%s' "$v"; exit 0
fi
exit 0 ;;
put) echo '{}' > "$store" ;; # full replace
patch) [[ -f "$store" ]] || { echo "No value found"; exit 2; } ;; # merge (rw)
*) exit 1 ;;
esac
for a in "$@"; do
case "$a" in
-*|secret/*) continue ;; # flags + the path arg
*=*) k="${a%%=*}"; v="${a#*=}"
t="$(mktemp)"; jq --arg k "$k" --arg v "$v" '.[$k]=$v' "$store" > "$t" && mv "$t" "$store" ;;
esac
done
exit 0
FAKE
chmod +x "$fakebin/vault"
CAS_VAULT_PATH="secret/workstation/claude-users/test"
CAS_CREDENTIALS="$tmp/credentials.json"
CAS_STATE_DIR="$tmp/state"
_oldpath="$PATH"; PATH="$fakebin:$PATH"; export VAULT_FAKE_STORE="$store"
printf '{"vaultwarden_master_password":"keep-me"}\n' > "$store" # pretend `homelab vault setup` ran
ok "backup succeeds (existing doc)" cas_backup
eq "merge preserves sibling key" keep-me "$(jq -r '.vaultwarden_master_password' "$store")"
eq "merge writes claude oauth" access "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")"
rm -f "$store" # fresh user: no doc yet
ok "backup succeeds (creates doc)" cas_backup
eq "create writes claude oauth" access "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")"
PATH="$_oldpath"; unset VAULT_FAKE_STORE
printf '\n%d passed, %d failed\n' "$pass" "$fail"
(( fail == 0 ))


@ -13,6 +13,10 @@ CAS_VAULT_TOKEN_FILE="${CLAUDE_AUTH_VAULT_TOKEN_FILE:-$CAS_CONFIG_DIR/vault-toke
CAS_VAULT_PATH="${CLAUDE_AUTH_VAULT_PATH:-secret/workstation/claude-users/$CAS_USER}"
CAS_STATE_DIR="${CLAUDE_AUTH_STATE_DIR:-$CAS_HOME/.local/state/claude-auth-sync}"
CAS_LOG="$CAS_STATE_DIR/sync.log"
# Where a long-lived per-user setup-token is materialized as an env file
# (KEY=VALUE) for start-claude.sh + t3-serve@.service to load. Lives under the
# already-ReadWritePaths config dir so the sandboxed service may write it.
CAS_TOKEN_ENV_FILE="${CLAUDE_AUTH_TOKEN_ENV_FILE:-$CAS_CONFIG_DIR/claude-oauth.env}"
cas_log() {
mkdir -p "$CAS_STATE_DIR"
@ -82,7 +86,17 @@ cas_backup() {
return 1
}
expires="$(jq -r '.expiresAt' <<<"$oauth")"
vault kv put "$CAS_VAULT_PATH" \
# MERGE into the shared path so sibling keys other tools co-locate there
# (e.g. `homelab vault`'s vaultwarden_* creds) survive. `kv patch -method=rw`
# is read+update (needs no `patch` capability) but requires the secret to
# already exist, so create it with `kv put` on the very first backup only.
local -a write_cmd
if vault kv get "$CAS_VAULT_PATH" >/dev/null 2>&1; then
write_cmd=(vault kv patch -method=rw "$CAS_VAULT_PATH")
else
write_cmd=(vault kv put "$CAS_VAULT_PATH")
fi
"${write_cmd[@]}" \
claude_ai_oauth_json="$oauth" \
credential_expires_at_ms="$expires" \
backed_up_at="$(date -Is)" >/dev/null || {
@ -123,6 +137,41 @@ cas_restore() {
cas_log "RECOVERED restored Claude OAuth state from Vault"
}
# A user-scoped, long-lived setup-token (`sk-ant-oat01-…`, ~1y, NON-rotating) may
# be stored in this user's OWN Vault path (field `setup_token`). When present it
# is the authoritative credential: it bypasses the shared
# ~/.claude/.credentials.json OAuth refresh-token rotation entirely — the fix for
# users running many concurrent Claude sessions (interactive + t3-serve + always-on
# agents) that otherwise race on refresh and wipe each other's refresh token.
# We materialize it to a user-owned env file that start-claude.sh and
# t3-serve@.service load as CLAUDE_CODE_OAUTH_TOKEN. This is the user's OWN
# Enterprise identity, NOT the forbidden legacy SHARED token — it never crosses
# OS users. Returns 0 when a token is active, so the caller skips the
# rotating-credential validate/backup/restore (probing the now-vestigial
# credential would otherwise emit false WorkstationClaudeAuthInvalid alerts).
cas_sync_setup_token() {
local token desired tmp
token="$(vault kv get -field=setup_token "$CAS_VAULT_PATH" 2>/dev/null)" || token=""
if [[ "$token" != sk-ant-oat01-* ]]; then
if [[ -e "$CAS_TOKEN_ENV_FILE" ]]; then
rm -f "$CAS_TOKEN_ENV_FILE"
cas_log "removed stale CLAUDE_CODE_OAUTH_TOKEN env (no setup-token in Vault)"
fi
return 1
fi
desired="CLAUDE_CODE_OAUTH_TOKEN=$token"
if [[ -r "$CAS_TOKEN_ENV_FILE" && "$(<"$CAS_TOKEN_ENV_FILE")" == "$desired" ]]; then
cas_log "OK long-lived setup-token active (CLAUDE_CODE_OAUTH_TOKEN current); credential checks skipped"
return 0
fi
tmp="$(mktemp "${CAS_TOKEN_ENV_FILE}.XXXXXX")" || { cas_log "FAIL could not stage token env file"; return 1; }
printf '%s\n' "$desired" > "$tmp"
chmod 0600 "$tmp"
mv "$tmp" "$CAS_TOKEN_ENV_FILE"
cas_log "OK long-lived setup-token active; CLAUDE_CODE_OAUTH_TOKEN materialized; credential checks skipped"
return 0
}
cas_main() {
umask 077
for bin in jq vault claude timeout flock; do
@ -133,6 +182,11 @@ cas_main() {
flock -n 9 || { cas_log "SKIP another sync is already running"; return 0; }
cas_prepare_vault || return 1
# A long-lived per-user setup-token, if provisioned, is authoritative and
# non-rotating — materialize it and skip the rotating-credential dance.
if cas_sync_setup_token; then
return 0
fi
if cas_live_auth_ok; then
cas_backup
return


@ -45,9 +45,15 @@ def main() -> None:
try:
res = subprocess.run(
[homelab, "memory", "recall", prompt, "--limit", "5"],
capture_output=True, text=True, timeout=4, env=os.environ,
capture_output=True, text=True, errors="replace", timeout=4,
env=os.environ,
)
except (subprocess.TimeoutExpired, OSError):
except Exception:
# Best-effort: ANY failure — timeout, OSError, or a UnicodeDecodeError on
# truncated multibyte (Cyrillic) output — must silently skip recall this
# turn, exactly like the MCP being unavailable. errors="replace" above
# also keeps a mid-rune-truncated payload from raising here at all. Never
# let this hook surface a "UserPromptSubmit hook error".
return
out = (res.stdout or "").strip()


@ -19,13 +19,29 @@ unpinned-CLI dependencies out of the hourly **root** reconcile.
- `mattpocock/skills` (https://github.com/mattpocock/skills) — all except `find-skills`
- `vercel-labs/skills` (https://github.com/vercel-labs/skills) — `find-skills`
- **homelab-local, emo-PERSONALIZED**: `cluster-health` here is an
**emo-specific variant**, not a copy of the canonical skill. It started as a
copy of this repo's `.claude/skills/cluster-health/` but was rewritten on
2026-06-26 to focus on ha-sofia + emo's Sofia devices (emo is the only entry
in `SKILL_USERS`, a read-only power-user). The canonical admin skill
(`.claude/skills/cluster-health/`) is the full 47-check version and is left
untouched. **Do NOT `cp -a` the canonical copy over this one** — that would
clobber the personalization. Maintain the two independently.
## Refreshing
Re-snapshot from a current install and commit the diff:
Re-snapshot the upstream skills from a current install and commit the diff:
```sh
cp -a ~/.agents/skills/. scripts/workstation/claude-skills/
```
Snapshot taken 2026-06-23.
`cluster-health` is hand-maintained (emo variant) — it is **not** covered by the
`cp -a` above and must **not** be overwritten from `.claude/skills/`. Edit it in
place here when emo's needs change, then refresh his live copy (the provisioner's
`install_skills()` is if-absent, so it won't update an existing `~/.agents/skills`
copy — `cp` the new `SKILL.md` to `/home/emo/.agents/skills/cluster-health/` and
`chown emo:emo`, or remove emo's copy and re-run the reconcile).
Snapshot taken 2026-06-23 (upstream); `cluster-health` vendored 2026-06-26,
personalized for emo 2026-06-26.


@ -0,0 +1,146 @@
---
name: cluster-health
description: |
Personalized for emo. Check whether the homelab Kubernetes cluster is
affecting ha-sofia or the Sofia smart-home devices it runs (Tuya devices,
the MPPT ATS, lights, climate, security, irrigation). Use when:
(1) "is ha-sofia ok", "are my devices / the ATS / the lights down",
(2) "is the cluster affecting Sofia / my devices",
(3) "check the cluster", "cluster health", "is everything running",
(4) a device on the Барзини → Статус dashboard looks offline.
Runs the cluster-wide healthcheck read-only and triages it by what
ha-sofia actually depends on; the rest of the cluster is the admin's area.
author: Claude Code
version: 3.0.0-emo
date: 2026-06-26
---
# Cluster Health — personalized for emo (ha-sofia focus)
## What you actually care about
You care about **ha-sofia** and the **Sofia smart-home devices** it runs —
the Tuya devices, the **MPPT ATS**, and the lights / climate / security /
irrigation on your **Барзини → Статус** dashboard. The wider Kubernetes
cluster matters to you **only when it's breaking something ha-sofia or your
devices depend on.** Anything else is the admin's (wizard's) area — note it in
one line and move on; don't chase it.
You have **read-only** cluster access. You can SEE everything but change
nothing — so when something on your chain is broken, the job is to confirm it
and hand it off, not to repair it.
## How ha-sofia depends on the cluster
ha-sofia itself runs at the house (HAOS at https://ha-sofia.viktorbarzin.me) —
**not** in the cluster. The cluster reaches it through exactly two things:
1. **tuya-bridge** (namespace `tuya-bridge`) — the REST API ha-sofia calls for
every Tuya device **and the MPPT ATS**. If it's unhealthy, your Tuya devices
+ ATS stop responding. **This is the #1 thing to check.**
2. **The path that carries ha-sofia ⇄ tuya-bridge and keeps ha-sofia
reachable**: cloudflared (tunnel) → Traefik (LB) → the ingress + TLS cert
for `tuya-bridge.viktorbarzin.me` and `ha-sofia.viktorbarzin.me`, plus
Technitium DNS. If any of these break, ha-sofia can't reach tuya-bridge and
you can't reach ha-sofia remotely.
Everything else in the cluster is unrelated to you unless it's hosting one of
those pods.
## Step 1 — run the healthcheck (read-only, with your HA token)
Your account can't read Vault, so load your own ha-sofia token first (it was
minted for you and lives at `~/.config/cluster-health/haos_token`). Then run
the script from YOUR clone, read-only:
```bash
cd /home/emo/code
export HOME_ASSISTANT_SOFIA_TOKEN="$(cat ~/.config/cluster-health/haos_token)"
bash scripts/cluster_healthcheck.sh --no-fix --quiet
# machine-readable instead:
# bash scripts/cluster_healthcheck.sh --no-fix --quiet --json | tee /tmp/cluster-health.json
```
- **Never pass `--fix`** — it deletes pods (a write); you're read-only and it
will fail.
- Exit codes: `0` healthy, `1` warnings, `2` failures.
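The exit-code convention listed above could be turned into a one-word summary with a helper along these lines. The name `describe_health_exit` is hypothetical; only the `0`/`1`/`2` meanings come from the text.

```shell
# Hypothetical helper: maps the healthcheck's exit code to a short label.
describe_health_exit() {
  case "$1" in
    0) echo healthy ;;
    1) echo warnings ;;
    2) echo failures ;;
    *) echo "unknown ($1)" ;;   # anything else is outside the documented set
  esac
}
```

A possible use is to run the healthcheck and then call `describe_health_exit "$?"` on the next line.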
With the token exported, the **ha-sofia checks run for you**:
26 Entity Availability · 27 Integration Health · 28 Automation Status ·
29 System Resources · **45 Status Dashboard** — your Барзини → Статус view,
classifying every device tile as OK / ⚠️ / Offline across Сигурност, Мрежа &
IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна. Check 30 also
covers the **tuya** exporter.
## Step 2 — triage the output by relevance to YOU
Read the PASS/WARN/FAIL summary, then split the WARN/FAIL items in two:
- **On your chain → this is what matters.** Anything touching: `tuya-bridge`,
`cloudflared`, `traefik`, DNS (check 21), the TLS cert / ingress for your two
hosts (checks 12, 22, 31, 32), or a **node** hosting those pods — plus all the
**ha-sofia** checks (26–29, 45) and the **tuya** exporter (30).
- **Not on your chain → one line, then drop it.** Summarise as "N unrelated
cluster issues (admin's area)" and don't investigate.
## Step 3 — read-only checks for your chain
All of these work with your read-only access:
```bash
# tuya-bridge — your devices + the ATS
kubectl get pods -n tuya-bridge
kubectl rollout status deploy/tuya-bridge -n tuya-bridge
kubectl logs -n tuya-bridge deploy/tuya-bridge --tail=50
# the reachability path ha-sofia uses
kubectl get pods -n cloudflared
kubectl get pods -n traefik
kubectl get ingress -A | grep -Ei 'tuya-bridge|ha-sofia'
# whole external path in one shot (DNS + tunnel + Traefik + cert):
curl -sI --max-time 10 https://tuya-bridge.viktorbarzin.me | head -1
# reachable -> HTTP/2 200 / 401 / 403 (any HTTP response = path is up)
# broken -> curl: timeout / could not resolve host
```
The fastest **device-level** signal is your own dashboard: open
**https://ha-sofia.viktorbarzin.me → Барзини → Статус**. If devices show
Offline / Разкачен / ⚠️ **but tuya-bridge is healthy**, the problem is at the
house (device power / Wi-Fi / the Sofia TP-Link network) — **not** the cluster.
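The "any HTTP response = path is up" rule from the `curl` check above can be sketched as a tiny predicate. `path_is_up` is a hypothetical name, and the sketch only inspects the first line of the response headers.

```shell
# Hypothetical helper: succeeds when the given status line looks like an HTTP
# response at all (200, 401, 403, ...); an empty line means curl got nothing.
path_is_up() {
  case "$1" in
    HTTP/*) return 0 ;;
    *)      return 1 ;;
  esac
}
```

For example, `status=$(curl -sI --max-time 10 https://tuya-bridge.viktorbarzin.me | head -1)` followed by `path_is_up "$status" || echo "external path broken"` treats a 401 or 403 as healthy, since the request still travelled through DNS, the tunnel, Traefik, and the certificate.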
## Step 4 — if something on your chain is broken
You can't fix the cluster (read-only), so **capture + hand off**:
```bash
kubectl describe pod -n tuya-bridge <pod>
kubectl logs -n tuya-bridge <pod> --previous --tail=200
```
Then file it for the admin with the **`/file-issue`** skill — e.g. *"ha-sofia
Tuya devices + ATS unresponsive; tuya-bridge pod CrashLooping"* with the output
above. cloudflared / Traefik / DNS outages are cluster-wide — the admin's
alerting is already firing, but file it so it's tracked from your side too.
## What will skip for you (expected — not failures)
A few checks need access your account doesn't have. They warn/skip — that's
normal, and **none of them are on your ha-sofia chain**:
- **Uptime Kuma (14)** — needs an admin password from Vault.
- **PVE host checks** — 36 (LVM snapshots), 43 (host thermals), 44 (host load),
and the Proxmox CSI ghost-disk check — all need root SSH to the Proxmox host.
- **`--fix`** — pod deletion (a write); not available to you.
(The ha-sofia checks are **not** in this list — your token makes them work.)
## Your ha-sofia token
- Stored at `~/.config/cluster-health/haos_token` (yours, mode 600).
- It's a **dedicated** long-lived token, named `emo-cluster-health` under
ha-sofia → your profile → **Long-Lived Access Tokens**. Revoking it there
affects only you.
- It currently carries admin-level HA scope (Home Assistant only lets a token
be minted for the account that created it, and it was minted via the admin
account). If it ever stops working, tell wizard and a fresh one can be minted.
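To confirm the token file really is mode 600, a portable check could look like the sketch below. `is_mode_600` is a hypothetical helper; it compares the permission string printed by `ls -ld` rather than relying on a platform-specific `stat` flag.

```shell
# Hypothetical helper: succeeds when the path is a regular file whose
# permission string is exactly -rw------- (owner read/write only).
is_mode_600() {
  [ -f "$1" ] || return 1
  perms=$(ls -ld "$1" | cut -c1-10)   # first 10 chars: type + rwx bits
  [ "$perms" = "-rw-------" ]
}
```

A possible use: `is_mode_600 ~/.config/cluster-health/haos_token || chmod 600 ~/.config/cluster-health/haos_token`.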


@ -1,4 +1,4 @@
{
"claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner). Do not attempt to escalate privileges or reach another user's resources.\n- Secrets are per-user. Never read another user's home directory, credentials, tokens, or ~/.claude secrets. Your own secrets live in your home at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. 
Org red-lines on top:\n - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n - Push rejected by branch protection (user not whitelisted) → fall back to a <os-user>/<topic> branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n - Keep every clone on a clean master when done; tell the user in plain words what happened.\n - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.\n",
"claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier for cluster/infra. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner) — don't try to exceed it. At the OS level, authorization follows the OS: where your user holds `sudo`, using it is permitted and audited.\n- OS access follows OS permissions, not a separate rule here. You may read whatever your OS user can read — directly or via `sudo` where you hold it — including other users' home directories, credentials, tokens, and `~/.claude`. Do not impose restrictions stricter than the OS; no extra prompt is required for OS-authorized access. Protect your own secrets at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. 
Org red-lines on top:\n - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n - Push rejected by branch protection (user not whitelisted) → fall back to a <os-user>/<topic> branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n - Keep every clone on a clean master when done; tell the user in plain words what happened.\n - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.\n",
"model": "claude-opus-4-8"
}

View file

@ -72,11 +72,14 @@ if [[ -n "$want_t3" && "$(t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/
fi
# 2c) Bitwarden CLI — backs `homelab vault` (per-user no-HITL Vaultwarden access).
# npm-global so every user's PATH resolves it. Pinned major; best-effort (a
# failure only disables `homelab vault`, nothing else on the box).
if ! command -v bw >/dev/null; then
log "npm: installing @bitwarden/cli (homelab vault backend)"
npm install -g "@bitwarden/cli@^2024" >/dev/null 2>&1 || log "WARN: @bitwarden/cli install failed; homelab vault unavailable"
# Install SYSTEM-WIDE (npm prefix /usr → /usr/bin/bw) so EVERY user's PATH
# resolves it. The guard tests the SYSTEM path, NOT `command -v bw`: the
# latter is satisfied by an admin's own ~/.local/bin/bw and would skip the
# system install, leaving non-admins (emo, anca, …) with no backend. Pinned
# major; best-effort (a failure only disables `homelab vault`).
if [ ! -x /usr/bin/bw ] && [ ! -x /usr/local/bin/bw ]; then
log "npm: installing @bitwarden/cli system-wide (homelab vault backend)"
npm install -g --prefix /usr "@bitwarden/cli@^2024" >/dev/null 2>&1 || log "WARN: @bitwarden/cli install failed; homelab vault unavailable"
fi
# 3) kubelogin (kubectl oidc-login) system-wide — NOT the apt 'kubelogin' (= Azure tool).

View file

@ -93,6 +93,15 @@ ensure_onboarding() {
}
ensure_onboarding
# Load a per-user long-lived CLAUDE_CODE_OAUTH_TOKEN if claude-auth-sync has
# materialized one from this user's own Vault path. A non-rotating setup-token
# sidesteps the shared ~/.claude/.credentials.json OAuth refresh-token race that
# logs out users running many concurrent agents (interactive + t3 + always-on).
# Absent file -> no-op (normal per-user Enterprise-SSO flow). The user's OWN
# token; never shared between OS users.
_oauth_env="$HOME/.config/claude-auth-sync/claude-oauth.env"
if [ -r "$_oauth_env" ]; then set -a; . "$_oauth_env"; set +a; fi
# Deliberately not `exec` so we can branch on the exit code: clean quit ends the
# pane (ttyd closes the terminal); a crash drops to a shell so the tmux session
# isn't destroyed-and-recreated in a ttyd auto-reconnect loop.

View file

@ -5,6 +5,9 @@ variable "tls_secret_name" {
variable "nfs_server" { type = string }
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"

View file

@ -5,6 +5,9 @@ variable "tls_secret_name" {
variable "nfs_server" { type = string }
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
@ -42,6 +45,9 @@ data "kubernetes_secret" "eso_secrets" {
# DB credentials from Vault database engine (rotated automatically)
# Provides DATABASE_URL that auto-updates when password rotates
resource "kubernetes_manifest" "db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"

View file

@ -0,0 +1,46 @@
# SLOW-1a overlay over the official authentik server image.
#
# The login flow's identification stage renders each enabled source's UI login
# button. Upstream authentik/stages/identification/stage.py does:
# current_stage.sources.filter(enabled=True).order_by("name").select_subclasses()
# The bare no-arg select_subclasses() (django-model-utils InheritanceManager)
# LEFT-JOINs EVERY Source subtype table; on the cold-login hot path that is ~1.5s
# (verified live on 2026.2.4: 1527ms vs 14ms). Passing only the subtypes that
# actually render a UI login button — every concrete Source type that overrides
# ui_login_button: oauth/saml/plex/telegram/kerberos, NOT the sync-only ldap/scim —
# is ~100x faster and BYTE-IDENTICAL output (verified: concrete types + rendered
# buttons match). django-model-utils accepts the lowercase subclass *accessor
# names* as strings, so no new import is needed (no circular-import risk) — the
# patch is a single, reviewable line edit.
#
# RE-VERIFY ON EVERY AUTHENTIK BUMP: bump the FROM tag below AND the image tag in
# modules/authentik/values.yaml together. The grep guards fail the build LOUDLY if
# the upstream target line moved. If a future authentik version adds a NEW
# login-capable source type, add its lowercase accessor to the list below.
# Upstream: the bare select_subclasses() is still present in main (no fix/PR as of
# 2026-06-28) — drop this overlay once upstream narrows the query.
FROM ghcr.io/goauthentik/server:2026.2.4
USER root
RUN set -eux; \
F=/authentik/stages/identification/stage.py; \
grep -q 'order_by("name").select_subclasses()' "$F"; \
sed -i 's/order_by("name")\.select_subclasses()/order_by("name").select_subclasses("oauthsource", "samlsource", "plexsource", "telegramsource", "kerberossource")/' "$F"; \
grep -q 'select_subclasses("oauthsource", "samlsource", "plexsource", "telegramsource", "kerberossource")' "$F"; \
PY="$(command -v python || command -v python3)"; "$PY" -c "import ast,sys; ast.parse(open('$F').read())"; \
rm -f /authentik/stages/identification/__pycache__/stage.*.pyc
# PATCH #2 — old-browser BLANK LOGIN. authentik's modern flow SPA is ES2022 and
# hard-fails (blank login) on Safari<=16.3 (e.g. iPadOS<=16.3). authentik already
# ships a no-JS Simplified Flow Executor (SFE, ES5) but only serves it to
# IE/old-Edge/PKeyAuth. patch-compat-sfe.py (a) extends compat_needs_sfe() to
# serve the SFE to old Safari AND any iOS browser (Chrome/CriOS, Firefox/FxiOS —
# all share the system WebKit) on iOS<=16.3, and (b) injects static social-login
# <a> links into the SFE shell (the SFE can't render Identification-stage sources;
# needed for password-less Google-only accounts). Clients get the REAL authentik
# login (password + MFA + reputation, NO auth downgrade) instead of a blank page.
# The script is guarded (asserts both upstream anchors + ast-parses) so the build
# fails loudly if upstream moves — re-verify on every authentik bump.
COPY patch-compat-sfe.py /tmp/patch-compat-sfe.py
RUN python3 /tmp/patch-compat-sfe.py && rm -f /tmp/patch-compat-sfe.py
USER authentik
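
The grep / sed / grep / ast.parse sequence in the first RUN step is a guarded in-place patch: confirm the upstream anchor exists, rewrite it, confirm the rewrite landed, then confirm the file still parses. A minimal Python sketch of the same idea follows. The function name `apply_guarded_patch` is hypothetical, and it assumes the anchor should occur exactly once, which is stricter than the Dockerfile's `grep -q` (at least once).

```python
import ast

# Anchor and replacement copied from the Dockerfile's sed expression.
OLD = 'order_by("name").select_subclasses()'
NEW = (
    'order_by("name").select_subclasses("oauthsource", "samlsource", '
    '"plexsource", "telegramsource", "kerberossource")'
)


def apply_guarded_patch(source: str, old: str = OLD, new: str = NEW) -> str:
    """Return the patched source, or raise if the anchor moved.

    Hypothetical helper, not part of the overlay. Assumption: the anchor
    must appear exactly once (the Dockerfile only checks presence).
    """
    found = source.count(old)
    if found != 1:
        raise ValueError(f"anchor found {found} times, expected exactly 1")
    patched = source.replace(old, new)
    if new not in patched:
        raise ValueError("replacement text missing after patch")
    # Same role as the Dockerfile's `python -c "import ast; ast.parse(...)"`.
    ast.parse(patched)
    return patched
```

Raising on a missing anchor is what makes an upstream change fail the build loudly instead of silently shipping an unpatched image.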

View file

@ -49,14 +49,15 @@ resource "authentik_policy_expression" "admin_services_restriction" {
host = request.context.get("host", "")
# chrome-service noVNC (chrome.viktorbarzin.me) exposes Viktor's LIVE
# logged-in browser sessions, so lock it to Viktor's own accounts ONLY.
# "Home Server Admins" is NOT sufficient: emo (emil.barzin@gmail.com) is a
# member. akadmin kept as break-glass. The homelab-browser CDP path is
# already RBAC-gated (emo = oidc-power-user-readonly, no pods/portforward),
# so this closes the only remaining, human, noVNC path. Match username OR
# email so neither attribute alone can lock Viktor out.
CHROME_ALLOWED = {"akadmin", "akadmin@viktorbarzin.me", "vbarzin@gmail.com"}
# chrome-service noVNC (chrome.viktorbarzin.me) exposes LIVE logged-in browser
# sessions from the SHARED persistent profile. Originally Viktor-only.
# 2026-06-28 (Viktor's explicit decision): emo SHARES Viktor's browser, so emo
# (emil.barzin / emil.barzin@gmail.com) is allowed in for noVNC form-filling +
# captcha solving. Trade-off accepted: emo can therefore reach Viktor's warmed
# sessions (the CLI half is the emo-browser ServiceAccount in
# stacks/chrome-service/rbac.tf). akadmin kept as break-glass. Match username OR
# email so neither attribute alone can lock anyone out.
CHROME_ALLOWED = {"akadmin", "akadmin@viktorbarzin.me", "vbarzin@gmail.com", "emil.barzin", "emil.barzin@gmail.com"}
if host == "chrome.viktorbarzin.me":
return request.user.username in CHROME_ALLOWED or request.user.email in CHROME_ALLOWED
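
The host gate above can be exercised outside authentik as a plain function. This is a sketch only: `chrome_access_allowed` is a hypothetical name, and returning `None` for other hosts is an assumption standing in for whatever the rest of the policy expression decides (not shown in this hunk).

```python
from typing import Optional

# Allowlist copied from the new CHROME_ALLOWED line in the policy.
CHROME_ALLOWED = frozenset({
    "akadmin",
    "akadmin@viktorbarzin.me",
    "vbarzin@gmail.com",
    "emil.barzin",
    "emil.barzin@gmail.com",
})


def chrome_access_allowed(host: str, username: str, email: str) -> Optional[bool]:
    """Mirror the chrome.viktorbarzin.me branch of the policy.

    Returns None for any other host (assumption: the remaining policy
    logic handles those hosts).
    """
    if host != "chrome.viktorbarzin.me":
        return None
    # Match username OR email, so either attribute alone is sufficient.
    return username in CHROME_ALLOWED or email in CHROME_ALLOWED
```

Matching on either attribute means a renamed username or a changed email does not lock an allowed account out by itself.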

View file

@ -6,6 +6,9 @@
# are non-secret and live in values.yaml. The reloader annotation rolls the
# authentik pods if the password ever changes.
resource "kubernetes_manifest" "authentik_email_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"

View file

@ -29,7 +29,12 @@ resource "kubernetes_namespace" "authentik" {
labels = {
tier = var.tier
"resource-governance/custom-quota" = "true"
"keel.sh/enrolled" = "true"
# Keel intentionally NOT enrolled: server+worker run our custom overlay image
# (ghcr.io/viktorbarzin/authentik-server; see values.yaml global.image +
# stacks/authentik/Dockerfile). The tag is pinned explicitly and bumped
# manually (rebuild the overlay FROM the new authentik version + repoint), so
# a Keel auto-bump would only risk re-introducing the upstream tag / the
# 2026-06-10 downgrade-boot-storm class. Re-enroll only if the overlay is dropped.
}
}
lifecycle {
@ -82,6 +87,11 @@ module "ingress" {
service_name = "goauthentik-server"
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false
# Swap the shared 10/50 default limiter for a dedicated 100/1000 carve-out:
# the login SPA + flow-executor API burst on a cold load otherwise 429s into
# a blank screen (see traefik middleware "authentik-rate-limit").
skip_default_rate_limit = true
extra_middlewares = ["traefik-authentik-rate-limit@kubernetescrd"]
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Authentik"
@ -149,5 +159,12 @@ module "ingress-static" {
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false
homepage_enabled = false
extra_middlewares = ["authentik-static-cache-headers@kubernetescrd"]
# /static serves ALL the SPA JS/CSS chunks; the default 10/50 limiter 429s the
# cold-load fan-out into a blank screen. Dedicated 100/1000 carve-out (note the two
# namespaces: cache-headers is in ns authentik, rate-limit is in ns traefik).
skip_default_rate_limit = true
extra_middlewares = [
"authentik-static-cache-headers@kubernetescrd",
"traefik-authentik-rate-limit@kubernetescrd",
]
}

View file

@ -39,6 +39,16 @@ server:
value: "3"
- name: AUTHENTIK_WEB__THREADS
value: "4"
# Gunicorn worker recycle hardening (defaults max_requests=1000/jitter=50).
# A worker recycle that coincides with a transient PG/pgbouncer blip stalls
# in-flight requests (sessions+cache are on PostgreSQL since Redis was removed
# in 2026.2), and with 9 workers recycling on a tight 50-jitter window the
# recycles cluster — feeding the episodic all-pods-NotReady 502/504 cascade.
# 10x rarer recycles + 20x wider jitter (1000) decorrelate them from DB blips.
- name: AUTHENTIK_WEB__MAX_REQUESTS
value: "10000"
- name: AUTHENTIK_WEB__MAX_REQUESTS_JITTER
value: "1000"
# Cache flow plans for 30m and policy evaluations for 15m (defaults 300s).
# Authentik 2026.2 stores cache in Postgres, so a TTL hit is still a
# SELECT — but a single indexed lookup beats re-planning the flow
@ -87,11 +97,28 @@ server:
livenessProbe:
failureThreshold: 6
timeoutSeconds: 5
strategy:
# Readiness widened from the chart default (3x10s/3s ~= 30s) to ~80s. The
# readiness probe (/-/health/ready/) queries the DB, so a sub-~60s PG/pgbouncer
# transient otherwise returns 503 and drops ALL 3 server pods from the Service
# at once -> Traefik has no healthy backend -> 502/504 (the episodic blank
# screen + 30s hang). 80s absorbs a full CNPG failover reconnect; liveness
# still reaps a truly hung pod. Partial override — the chart deep-merges the
# httpGet path /-/health/ready/ (same as the livenessProbe override above).
readinessProbe:
failureThreshold: 8
periodSeconds: 10
timeoutSeconds: 5
# RollingUpdate strategy. The chart key is `deploymentStrategy`, NOT `strategy`
# (authentik.server reads .Values.server.deploymentStrategy) — the old
# `strategy:` key was silently ignored, so live ran the chart default 25%/25%
# and every rolling event dropped a server pod out of rotation, amplifying the
# NotReady cascade. maxSurge:1 + maxUnavailable:0 keeps all 3 ready throughout
# a roll (PDB minAvailable:2 + ResourceQuota headroom allow the transient pod).
deploymentStrategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 0
maxUnavailable: 1
maxSurge: 1
maxUnavailable: 0
resources:
requests:
cpu: 100m
@ -118,15 +145,23 @@ server:
global:
addPrometheusAnnotations: true
image:
# Pin to the Keel-managed live tag. Keel (diun-annotated, keel.sh/enrolled
# namespace) bumps the IMAGE between chart releases, while helm defaults
# the tag to the chart appVersion — so any helm upgrade silently
# DOWNGRADES the running pods to the chart pin (2026-06-10: a values-only
# apply rolled live 2026.2.4 back to 2026.2.2 against a 2026.2.4-migrated
# DB → boot storm, see docs/post-mortems/2026-06-10-authentik-downgrade-
# boot-storm.md). Keep this tag in sync with what Keel has deployed when
# touching this chart; clear it only when bumping the chart version itself.
tag: "2026.2.4"
# CUSTOM OVERLAY: two thin patches over the official authentik server image
# (see stacks/authentik/Dockerfile): (1) SLOW-1a — narrows the login-flow
# select_subclasses() query, ~1.5s -> ~14ms; (2) serve authentik's no-JS SFE
# login to old Safari/WebKit AND any iOS browser (Chrome/Firefox = WebKit) on
# iOS<=16.3 so old devices (e.g. iPadOS<=15) get a working login instead of a
# blank page, and injects social-login links into the SFE (it can't render
# sources; needed for password-less Google-only accounts). Built by
# .github/workflows/build-authentik.yml to ghcr.io/viktorbarzin/authentik-server
# (public package, anonymous pull — no imagePullSecret needed, like the
# upstream goauthentik image). Keel is NO LONGER enrolled for this namespace
# (see main.tf) so it can't bump/downgrade the tag; helm also defaults the tag
# to the chart appVersion (2026.2.2) — so BOTH repository AND tag are pinned
# explicitly here to prevent the 2026-06-10 downgrade-boot-storm class.
# UPGRADE = bump the Dockerfile FROM tag + this tag together (e.g. ->
# 2026.3.0-patch1), let GHA rebuild, then apply.
repository: ghcr.io/viktorbarzin/authentik-server
tag: "2026.2.4-patch3"
worker:
# 2 replicas: workers handle background tasks (LDAP sync, email,
@ -166,7 +201,10 @@ worker:
secretKeyRef:
name: authentik-email
key: AUTHENTIK_EMAIL__PASSWORD
strategy:
# Chart key is `deploymentStrategy`, not `strategy` (see server above). Workers
# serve no user traffic, so maxSurge:0/maxUnavailable:1 is fine — this is just
# the dead-key cleanup so the declared intent actually takes effect.
deploymentStrategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 0
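
The probe and rollout numbers in this file reduce to simple arithmetic, sketched below. The helper names are hypothetical, and the functions handle absolute pod counts only (percentage values such as the chart default are deliberately not modeled).

```python
def probe_window_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate time a pod may keep failing a probe before it is acted on."""
    return failure_threshold * period_seconds


def ready_floor(replicas: int, max_unavailable: int) -> int:
    """Minimum pods kept available during a rolling update (absolute counts)."""
    return replicas - max_unavailable


def pod_ceiling(replicas: int, max_surge: int) -> int:
    """Maximum pods that may exist at once during a rolling update."""
    return replicas + max_surge


# Chart default readiness: 3 failures x 10s period.
default_window = probe_window_seconds(3, 10)
# Widened readiness from this change: 8 failures x 10s period.
widened_window = probe_window_seconds(8, 10)
# Server rollout with maxSurge: 1 / maxUnavailable: 0 on 3 replicas.
server_floor = ready_floor(3, 0)
server_ceiling = pod_ceiling(3, 1)
```

With these inputs the widened window is 80 seconds against the default 30, and the server rollout keeps 3 pods available while briefly running 4, which is why the comment calls out ResourceQuota headroom for the transient pod.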

View file

@ -0,0 +1,96 @@
#!/usr/bin/env python3
"""Overlay patch — make authentik usable on OLD browsers (no modern-JS SPA).
authentik's modern flow SPA is ES2022 (static{} init blocks), which hard-fails on
Safari/WebKit <= 16.3 (e.g. iPadOS <= 16.3) and renders a COMPLETELY BLANK login.
authentik ships a no-JS Simplified Flow Executor (SFE, ES5) but only serves it to
IE / old-Edge / PKeyAuth, and the SFE itself canNOT render Identification-stage
sources (social-login buttons); authentik docs list "Sources" as unsupported.
This patch does TWO things, both guarded (assert the upstream anchor + verify the
result) so the image build fails LOUDLY if upstream moves. RE-VERIFY on every
authentik upgrade.
1. flows/views/interface.py::compat_needs_sfe() -> also return True for old
Safari/WebKit: (a) Safari/Mobile Safari Version <= 16.3 (covers desktop-mode
iPadOS which reports as Mac Safari), and (b) ANY iOS browser (Chrome/CriOS,
Firefox/FxiOS, Edge: all share the system WebKit) on iOS <= 16.3. So old
iPads get the SFE on EVERY browser, not just Safari.
2. flows/templates/if/flow-sfe.html -> inject static social-login <a> links
(plain redirects to /source/oauth/login/<slug>/, work on ANY browser) so SFE
users (who otherwise see only username/password) can use social login,
which is required for accounts with no password (e.g. Google-only users like emo).
"""
import ast
import glob
import os
# --- Patch 1: compat_needs_sfe() UA gate -------------------------------------
INTERFACE = "/authentik/flows/views/interface.py"
ANCHOR = (
' if "PKeyAuth" in ua["string"]:\n'
" return True\n"
" return False"
)
REPLACEMENT = (
' if "PKeyAuth" in ua["string"]:\n'
" return True\n"
" # OVERLAY: old WebKit can't parse the modern ES2022 flow SPA (blank\n"
" # login) -> serve the SFE (real authentik login). (a) desktop-mode\n"
" # Safari/iPadOS reports as Mac Safari with Version<=16.3:\n"
' if ua["user_agent"]["family"] in ("Safari", "Mobile Safari"):\n'
" try:\n"
' _maj = int(ua["user_agent"]["major"] or 0)\n'
' _min = int(ua["user_agent"]["minor"] or 0)\n'
" except (TypeError, ValueError):\n"
" _maj = _min = 0\n"
" if _maj and (_maj < 16 or (_maj == 16 and _min <= 3)):\n"
" return True\n"
" # (b) ANY iOS browser (Chrome/CriOS, Firefox/FxiOS, Edge) shares the\n"
" # system WebKit, so iOS<=16.3 fails regardless of the browser family:\n"
' if ua["os"]["family"] == "iOS":\n'
" try:\n"
' _omaj = int(ua["os"]["major"] or 0)\n'
' _omin = int(ua["os"]["minor"] or 0)\n'
" except (TypeError, ValueError):\n"
" _omaj = _omin = 0\n"
" if _omaj and (_omaj < 16 or (_omaj == 16 and _omin <= 3)):\n"
" return True\n"
" return False"
)
src = open(INTERFACE).read()
assert "def compat_needs_sfe" in src, "compat_needs_sfe() not found — upstream changed"
assert src.count(ANCHOR) == 1, f"anchor not found exactly once in {INTERFACE}"
src = src.replace(ANCHOR, REPLACEMENT)
open(INTERFACE, "w").write(src)
ast.parse(src)
assert 'ua["os"]["family"] == "iOS"' in open(INTERFACE).read()
for pyc in glob.glob("/authentik/flows/views/__pycache__/interface.*.pyc"):
os.remove(pyc)
# --- Patch 2: social-login links on the SFE shell ----------------------------
SFE_HTML = "/authentik/flows/templates/if/flow-sfe.html"
HTML_ANCHOR = (
" </main>\n"
" <span class=\"mt-3 mb-0 text-muted text-center\">{% trans 'Powered by authentik' %}</span>"
)
HTML_REPLACEMENT = (
" </main>\n"
" <!-- OVERLAY: the SFE can't render Identification-stage sources, so add\n"
" static social-login links (plain redirects, work on any browser).\n"
" Re-verify slugs on source changes; shown on all SFE flows. -->\n"
' <div class="form-signin w-100 m-auto pt-2 mt-2 border-top">\n'
' <a class="btn btn-outline-secondary w-100 mb-2" href="/source/oauth/login/google/">Continue with Google</a>\n'
' <a class="btn btn-outline-secondary w-100 mb-2" href="/source/oauth/login/github/">Continue with GitHub</a>\n'
' <a class="btn btn-outline-secondary w-100 mb-2" href="/source/oauth/login/facebook/">Continue with Facebook</a>\n'
" </div>\n"
" <span class=\"mt-3 mb-0 text-muted text-center\">{% trans 'Powered by authentik' %}</span>"
)
html = open(SFE_HTML).read()
assert html.count(HTML_ANCHOR) == 1, f"SFE html anchor not found exactly once in {SFE_HTML}"
html = html.replace(HTML_ANCHOR, HTML_REPLACEMENT)
open(SFE_HTML, "w").write(html)
assert "Continue with Google" in open(SFE_HTML).read()
print("patch-compat-sfe: SFE for old Safari + all iOS<=16.3; social-login links added to SFE")
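
The user-agent gate that Patch 1 injects can be read more easily as a standalone function. The sketch below restates the injected logic for illustration; `needs_sfe_overlay` and `_parse_pair` are hypothetical names, and the `ua` dict shape (`user_agent` and `os` sub-dicts with `family`, `major`, `minor`) is taken from the keys the patch itself reads. It covers only the overlay's additions, not authentik's existing IE / old-Edge / PKeyAuth checks.

```python
def _parse_pair(major, minor):
    """Parse (major, minor) together; any bad value zeroes both, as in the patch."""
    try:
        return int(major or 0), int(minor or 0)
    except (TypeError, ValueError):
        return 0, 0


def _at_or_below_16_3(major: int, minor: int) -> bool:
    # A zero major means "unknown version" and never matches.
    return bool(major) and (major, minor) <= (16, 3)


def needs_sfe_overlay(ua: dict) -> bool:
    """Return True when the overlay would serve the SFE for this user agent."""
    browser = ua["user_agent"]
    # (a) Safari / Mobile Safari with Version <= 16.3.
    if browser["family"] in ("Safari", "Mobile Safari"):
        if _at_or_below_16_3(*_parse_pair(browser["major"], browser["minor"])):
            return True
    # (b) Any browser on iOS <= 16.3, since all of them use the system WebKit.
    system = ua["os"]
    if system["family"] == "iOS":
        if _at_or_below_16_3(*_parse_pair(system["major"], system["minor"])):
            return True
    return False
```

The tuple comparison `(major, minor) <= (16, 3)` is equivalent to the patch's `major < 16 or (major == 16 and minor <= 3)`.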

View file

@ -601,6 +601,9 @@ resource "kubernetes_config_map" "beadboard_config" {
# Pulls the claude-agent-service bearer token from Vault so BeadBoard can
# dispatch agent jobs via the in-cluster HTTP API.
resource "kubernetes_manifest" "beadboard_agent_service_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"

View file

@ -28,6 +28,9 @@ resource "kubernetes_namespace" "broker_sync" {
# trading212_api_keys: JSON array of {account_id, account_type, api_key, name, currency}
# imap_host, imap_user, imap_password, imap_directory: for InvestEngine + Schwab email ingest
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"

View file

@ -212,3 +212,229 @@ resource "kubectl_manifest" "whisker" {
spec = { notifications = "Disabled" }
})
}
# ---------------------------------------------------------------------------
# Gated public ingress for the Whisker UI (infra #57 / ADR-0014).
#
# whisker.viktorbarzin.me -> whisker:8081, Authentik-gated (auth="required":
# Whisker ships NO login of its own; it's an admin observability UI, so Authentik
# forward-auth is the only gate between strangers and the flow view). The
# operator replicated `tls-secret` into calico-system already.
#
# TWO coupled pieces are required because the operator's own `whisker`
# NetworkPolicy (owned by the Whisker CR above) sets policyTypes:[Ingress]
# with NO ingress rules => default-deny on ingress to the whisker pod. The
# additive NP below ORs in a Traefik allow (k8s NetworkPolicies are additive
# across policies selecting the same pod), so we never edit the operator NP.
module "ingress_whisker" {
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = "calico-system"
name = "whisker"
service_name = "whisker"
port = 8081
auth = "required"
tls_secret_name = "tls-secret"
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Whisker"
"gethomepage.dev/description" = "Calico flow observability (who-talks-to-whom)"
"gethomepage.dev/icon" = "calico.png"
"gethomepage.dev/group" = "Infrastructure"
}
}
# Additive NetworkPolicy: permit Traefik -> whisker:8081. ORs with the
# operator's default-deny `whisker` NP (selecting the same pod) so Traefik
# can reach the UI without touching the operator-owned policy.
resource "kubernetes_network_policy_v1" "whisker_allow_traefik" {
metadata {
name = "whisker-allow-traefik"
namespace = "calico-system"
}
spec {
pod_selector {
match_labels = {
"app.kubernetes.io/name" = "whisker"
}
}
policy_types = ["Ingress"]
ingress {
from {
namespace_selector {
match_labels = {
"kubernetes.io/metadata.name" = "traefik"
}
}
}
ports {
port = "8081"
protocol = "TCP"
}
}
}
}
# Additive egress NetworkPolicy: permit whisker -> the kube-dns ClusterIP for DNS.
#
# ROOT CAUSE of the 2026-06-28 "Whisker UI empty" incident: the operator's own
# `whisker` NetworkPolicy is policyTypes:[Ingress,Egress] and its egress allows
# DNS only to the kube-dns *pods* (podSelector k8s-app=kube-dns). But
# whisker-backend resolves `goldmane...svc` via the kube-dns *ClusterIP*
# (10.96.0.10), and Calico drops UDP DNS to a ClusterIP under a podSelector-only
# egress rule (verified: from whisker's netns, ClusterIP DNS = 100% timeout
# while direct kube-dns pod-IP DNS = OK; a pod with no egress policy resolves
# fine). whisker-backend resolves once in the brief startup window before the
# policy programs, establishes its long-lived gRPC stream, and only re-resolves
# when that stream breaks, at which point the blocked ClusterIP DNS wedges its
# Go resolver and the UI goes empty (the durable aggregator, in its own
# unrestricted namespace, is unaffected). k8s egress policies are additive, so
# this ORs in an allow for the ClusterIP; the operator NP is left untouched.
# (Empirically: adding this ipBlock rule flips ClusterIP DNS from 100% fail to
# 100% ok.) See docs/runbooks/goldmane-flow-trail.md.
resource "kubernetes_network_policy_v1" "whisker_allow_dns_clusterip" {
metadata {
name = "whisker-allow-dns-clusterip"
namespace = "calico-system"
}
spec {
pod_selector {
match_labels = {
"app.kubernetes.io/name" = "whisker"
}
}
policy_types = ["Egress"]
egress {
# 10.96.0.10 is the kube-dns ClusterIP (cluster invariant: service CIDR
# 10.96.0.0/12, DNS always .10; the same IP CoreDNS/Technitium configs pin).
to {
ip_block {
cidr = "10.96.0.10/32"
}
}
ports {
port = "53"
protocol = "UDP"
}
ports {
port = "53"
protocol = "TCP"
}
}
}
}
# ---------------------------------------------------------------------------
# Whisker self-heal watchdog (ADR-0014; added 2026-06-28 after a live incident).
#
# BACKSTOP. The REAL fix is kubernetes_network_policy_v1.whisker_allow_dns_clusterip
# above (it unblocks the root-cause ClusterIP DNS). This watchdog stays as
# defense-in-depth: whisker-backend has NO operator liveness probe, so if its
# long-lived goldmane gRPC stream ever wedges for any OTHER reason (the Go
# resolver spams `failed to stream flows` / `code = Unavailable` and never
# reconnects -> empty UI, while the durable aggregator in its own namespace is
# unaffected), nothing else would restart it. Whisker is operator-managed
# (Whisker CR) so we can't inject a probe; this is the supported-pattern
# alternative. With the DNS fix in place it should rarely, if ever, fire.
#
# It restarts the pod ONLY when the wedged signature is present AND Goldmane is
# Ready (so a real Goldmane outage doesn't cause restart-thrash). A fresh pod
# reconnects cleanly. See docs/runbooks/goldmane-flow-trail.md.
resource "kubernetes_service_account" "whisker_watchdog" {
metadata {
name = "whisker-watchdog"
namespace = kubernetes_namespace.calico_system.metadata[0].name
}
}
# Namespaced Role (least privilege, calico-system only): read pod logs to
# detect the wedge, delete the whisker pod to heal it.
resource "kubernetes_role" "whisker_watchdog" {
metadata {
name = "whisker-watchdog"
namespace = kubernetes_namespace.calico_system.metadata[0].name
}
rule {
api_groups = [""]
resources = ["pods"]
verbs = ["get", "list", "delete"]
}
rule {
api_groups = [""]
resources = ["pods/log"]
verbs = ["get"]
}
}
resource "kubernetes_role_binding" "whisker_watchdog" {
metadata {
name = "whisker-watchdog"
namespace = kubernetes_namespace.calico_system.metadata[0].name
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "Role"
name = kubernetes_role.whisker_watchdog.metadata[0].name
}
subject {
kind = "ServiceAccount"
name = kubernetes_service_account.whisker_watchdog.metadata[0].name
namespace = kubernetes_namespace.calico_system.metadata[0].name
}
}
resource "kubernetes_cron_job_v1" "whisker_watchdog" {
metadata {
name = "whisker-watchdog"
namespace = kubernetes_namespace.calico_system.metadata[0].name
}
spec {
schedule = "*/10 * * * *"
successful_jobs_history_limit = 1
failed_jobs_history_limit = 1
concurrency_policy = "Forbid"
job_template {
metadata {
name = "whisker-watchdog"
}
spec {
template {
metadata {
name = "whisker-watchdog"
}
spec {
service_account_name = kubernetes_service_account.whisker_watchdog.metadata[0].name
container {
name = "watchdog"
image = "bitnami/kubectl:latest"
command = ["/bin/sh", "-c", <<-EOT
set -eu
NS=calico-system
# Don't thrash if Goldmane itself is down; that's not a whisker bug.
if ! kubectl -n "$NS" get pod -l k8s-app=goldmane \
-o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}' 2>/dev/null | grep -q True; then
echo "goldmane not Ready — skipping (not a whisker problem)"; exit 0
fi
ERRS=$(kubectl -n "$NS" logs -l k8s-app=whisker -c whisker-backend --since=11m --tail=500 2>/dev/null \
| grep -cE 'failed to stream flows|failed to list filter hints|code = Unavailable|i/o timeout' || true)
ERRS=$${ERRS:-0}
if [ "$ERRS" -ge 10 ]; then
echo "whisker-backend WEDGED: $ERRS goldmane-connection errors in 11m — restarting whisker pod"
kubectl -n "$NS" delete pod -l k8s-app=whisker --ignore-not-found
else
echo "whisker-backend healthy: $ERRS goldmane-connection errors in 11m"
fi
EOT
]
}
restart_policy = "Never"
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
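
The watchdog's restart decision is a pure function of two inputs: whether Goldmane is Ready, and how many recent log lines match the wedge signature. The Python sketch below mirrors that decision for illustration; `count_wedge_lines` and `should_restart` are hypothetical names, while the regex alternatives and the threshold of 10 are copied from the CronJob script.

```python
import re
from typing import Iterable

# Same alternatives as the `grep -cE` pattern in the CronJob script.
WEDGE_SIGNATURE = re.compile(
    r"failed to stream flows|failed to list filter hints|code = Unavailable|i/o timeout"
)


def count_wedge_lines(log_lines: Iterable[str]) -> int:
    """Count lines matching the signature (grep -c counts lines, not matches)."""
    return sum(1 for line in log_lines if WEDGE_SIGNATURE.search(line))


def should_restart(goldmane_ready: bool, log_lines: Iterable[str], threshold: int = 10) -> bool:
    """Restart only when Goldmane is Ready AND the error count reaches the threshold."""
    if not goldmane_ready:
        # A Goldmane outage is not a whisker problem; skip to avoid restart-thrash.
        return False
    return count_wedge_lines(log_lines) >= threshold
```

Checking Goldmane readiness first is what keeps a real upstream outage from turning into a restart loop on the whisker pod.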

View file

@ -19,6 +19,9 @@ resource "kubernetes_namespace" "changedetection" {
}
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"

View file

@ -19,14 +19,14 @@ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do
sleep 2
done
# websockify runs as PID 1; x11vnc is a child so its logs land on container stdout
# `-noshm` skips MIT-SHM probes that fail across container boundaries (each
# container has its own /dev/shm); `-noxdamage` skips XDAMAGE which Xvfb
# doesn't expose; `-quiet` keeps the polling chatter out of pod logs.
# Both x11vnc and websockify run as supervised children of this entrypoint (PID
# 1) so their logs land on container stdout and the `wait -n` at the end can catch
# either one dying. `-noshm` skips MIT-SHM probes that fail across container
# boundaries (each container has its own /dev/shm); `-noxdamage` skips XDAMAGE
# which Xvfb doesn't expose; `-quiet` keeps the polling chatter out of pod logs.
echo "starting x11vnc -> :5900"
x11vnc -display localhost:99 -nopw -listen 0.0.0.0 -rfbport 5900 \
-forever -shared -noshm -noxdamage -quiet 2>&1 &
X11VNC_PID=$!
for i in 1 2 3 4 5 6 7 8 9 10; do
if echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
@ -43,4 +43,18 @@ if ! echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
fi
echo "starting websockify -> :6080"
exec websockify --web=/usr/share/novnc 6080 localhost:5900
# Run websockify in the background (it was `exec`ed before) so BOTH it and x11vnc
# are supervised. x11vnc attaches to the chrome-service container's Xvfb over
# localhost:6099 (shared pod network); when that container restarts, x11vnc loses
# its X connection and exits. Previously websockify was PID 1 and x11vnc was an
# unsupervised child, so a dead x11vnc was never relaunched: :5900 stayed dead and
# the noVNC view went black until a manual pod restart. Now if EITHER process
# exits, `wait -n` returns and we exit non-zero so the kubelet restarts this
# container, which re-waits for Xvfb and relaunches x11vnc — the bridge self-heals
# across browser-container restarts. (Same supervision pattern as the
# android-emulator stack's entrypoint.)
websockify --web=/usr/share/novnc 6080 localhost:5900 &
wait -n || true
echo "novnc: a supervised process (x11vnc or websockify) exited; exiting so the kubelet restarts this container." >&2
exit 1
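
The supervision pattern described in the comments above can be tried in isolation. This is a minimal sketch, not the shipped entrypoint: `supervise_pair` is a hypothetical helper, and two `sleep` processes stand in for x11vnc and websockify. It assumes bash 4.3 or newer, since `wait -n` is a bash builtin feature.

```shell
#!/usr/bin/env bash
# Sketch only: `sleep` stands in for x11vnc and websockify (assumption).

supervise_pair() {
  # Start both stand-in children in the background.
  sleep "$1" &
  local first_pid=$!
  sleep "$2" &
  local second_pid=$!

  # `wait -n` returns as soon as ANY one child exits, whichever it is.
  wait -n || true

  # Demo-only cleanup: stop the survivor so no process is leaked.
  kill "$first_pid" "$second_pid" 2>/dev/null || true
  wait 2>/dev/null || true

  echo "a supervised process exited; exiting so the container restarts" >&2
  return 1
}
```

Calling `supervise_pair 1 30` returns non-zero after about one second, when the short-lived child exits, rather than after 30 seconds. In the container the same non-zero exit is what prompts the kubelet restart.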


@ -41,6 +41,9 @@ resource "kubernetes_namespace" "chrome_service" {
# --- Secrets (single-key extract: api_bearer_token) ---
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
@ -330,15 +333,23 @@ resource "kubernetes_deployment" "chrome_service" {
container {
name = "novnc"
# Phase 3 cutover 2026-05-07 Forgejo registry consolidation.
image = "ghcr.io/viktorbarzin/chrome-service-novnc:latest"
# SHA-pinned (not :latest): Keel is OFF for this deployment
# (keel.sh/policy=never, below) and :latest/IfNotPresent won't re-pull a
# rebuilt image, so a new noVNC entrypoint only deploys when this digest
# is bumped here. Bump after build-chrome-service-novnc.yml pushes a new
# SHA tag, then WAIT for that apply pipeline to finish before pushing
# anything else: Woodpecker cancel-previous SIGKILLs an in-flight apply
# mid-run (memory id=1957), which is exactly how the 2026-06-27 apply got
# killed. 2026-06-27: bumped to land the x11vnc-supervision self-heal fix
# (noVNC went black after a browser-container restart; see
# docs/architecture/chrome-service.md "x11vnc supervision").
image = "ghcr.io/viktorbarzin/chrome-service-novnc:19d0f0933a8ec75be6cfa077db88e0f8c3760f40"
image_pull_policy = "IfNotPresent"
# Cap RLIMIT_NOFILE before the entrypoint runs. Containerd grants pods
# nofile=2^31; x11vnc sweeps the whole fd table on each client connect,
# so every VNC connection hangs on "Connecting" until it times out
# (fd-sweep bug, same as android-emulator). entrypoint.sh now also sets
# this, but the image is :latest/IfNotPresent so a rebuilt entrypoint
# isn't guaranteed to be pulled; this wrapper applies the cap
# deterministically on every rollout off the cached image.
# (fd-sweep bug, same as android-emulator). entrypoint.sh also sets this;
# the wrapper keeps the cap deterministic even off a cached image.
command = ["bash", "-c", "ulimit -n 65536; exec /entrypoint.sh"]
port {
name = "http"
@ -348,9 +359,13 @@ resource "kubernetes_deployment" "chrome_service" {
# x11vnc connects to the chrome-service container's Xvfb over
# localhost TCP (shared pod network). Same uid 1000 as chrome
# container so we can read MIT-MAGIC-COOKIE if Xvfb adds one.
# 256Mi (was 96Mi): the 96Mi cap OOMKilled (exit 137) the sidecar under
# ACTIVE VNC use: x11vnc + websockify framebuffer/encode buffers spike
# well past idle (~37Mi) when a client streams the 1280x720 screen, so the
# noVNC view froze/hung on connect. Bumped 2026-06-28.
resources {
requests = { cpu = "10m", memory = "32Mi" }
limits = { memory = "96Mi" }
requests = { cpu = "10m", memory = "64Mi" }
limits = { memory = "256Mi" }
}
}
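
The `ulimit -n 65536; exec /entrypoint.sh` wrapper in the hunk above relies on resource limits being inherited across `exec`. A minimal sketch of that behavior follows; `capped_nofile` is a hypothetical helper, and the value 256 is chosen only so the demo works on hosts with a low default limit.

```shell
# Hypothetical helper: lower the nofile cap inside a subshell, then exec a
# child shell that reports the limit it inherited.
capped_nofile() {
  (
    ulimit -n "$1"          # set the cap before the exec, as the wrapper does
    exec sh -c 'ulimit -n'  # the exec'd process keeps the capped limit
  )
}

capped_nofile 256
```

On a host whose current limit is at least 256 this should print `256`, which is why the wrapper can cap x11vnc's fd table without changing the image.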


@ -0,0 +1,95 @@
# emo's hands-off "homelab browser" credential + chrome-service port-forward RBAC.
#
# Access decision (2026-06-28, Viktor's explicit call): emo SHARES Viktor's single
# chrome-service browser rather than getting an isolated instance. The noVNC half of
# that grant is the Authentik allowlist in
# stacks/authentik/admin-services-restriction.tf (CHROME_ALLOWED); THIS file is the
# CLI half: it lets emo's `homelab browser` reach the headed Chrome over CDP.
#
# `homelab browser` shells out to `kubectl port-forward -n chrome-service svc/chrome-service`
# (cli/browser.go). emo's normal kubeconfig is interactive-OIDC-only (kubelogin) and
# can't authenticate a headless agent session, and his power-user tier has no
# pods/portforward. So we mint a dedicated ServiceAccount with a long-lived token
# (the dashboard-sa.tf pattern) that the devvm provisioner installs as emo's DEFAULT
# kubeconfig context (scripts/t3-provision-users.sh install_browser_kubeconfig); his
# personal OIDC login stays available as the `oidc@homelab` named context.
#
# TRADE-OFF (accepted): CDP access == full control of the shared browser, including
# the persistent profile (browser.contexts[0]) where Viktor's warmed logins live.
# CDP has no per-context auth, so this SA can reach Viktor's sessions. That is inherent
# to sharing one browser (the isolated per-user instance was declined).
# See docs/architecture/chrome-service.md "Multi-user access".
resource "kubernetes_service_account" "emo_browser" {
metadata {
name = "emo-browser"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
}
}
# Long-lived (non-expiring) token for the SA; the devvm provisioner reads this and
# writes it into emo's kubeconfig. Same pattern as stacks/rbac/.../dashboard-sa.tf.
resource "kubernetes_secret" "emo_browser_token" {
metadata {
name = "emo-browser-token"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
annotations = {
"kubernetes.io/service-account.name" = kubernetes_service_account.emo_browser.metadata[0].name
}
}
type = "kubernetes.io/service-account-token"
wait_for_service_account_token = true
}
# The ONLY verb emo's SA lacks for `kubectl port-forward svc/chrome-service`: the
# port-forward subresource. (get/list of pods + services + endpoints comes from the
# cluster-read binding below.) Namespace-scoped to chrome-service.
resource "kubernetes_role" "browser_portforward" {
metadata {
name = "chrome-service-portforward"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
}
rule {
api_groups = [""]
resources = ["pods/portforward"]
verbs = ["create"]
}
}
resource "kubernetes_role_binding" "emo_browser_portforward" {
metadata {
name = "emo-browser-portforward"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "Role"
name = kubernetes_role.browser_portforward.metadata[0].name
}
subject {
kind = "ServiceAccount"
name = kubernetes_service_account.emo_browser.metadata[0].name
namespace = kubernetes_namespace.chrome_service.metadata[0].name
}
}
# Cluster-wide read-only (NO secrets), mirroring emo's power-user OIDC access, bound
# to the SA. Needed because the SA becomes emo's DEFAULT kubectl context, so without
# this his everyday `kubectl get ...` would regress, AND port-forward itself needs
# get/list on services + pods + endpoints (all covered by oidc-power-user-readonly).
# That ClusterRole is defined in stacks/rbac (modules/rbac/main.tf); referenced by name.
resource "kubernetes_cluster_role_binding" "emo_browser_readonly" {
metadata {
name = "emo-browser-readonly"
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "ClusterRole"
name = "oidc-power-user-readonly"
}
subject {
kind = "ServiceAccount"
name = kubernetes_service_account.emo_browser.metadata[0].name
namespace = kubernetes_namespace.chrome_service.metadata[0].name
}
}
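
One way to sanity-check the bindings in this file after an apply is `kubectl auth can-i` with impersonation. The sketch below only prints the commands so it can be read without a cluster; `rbac_checks` is a hypothetical helper, and the namespace name `chrome-service` is assumed from the `kubectl port-forward -n chrome-service` call quoted in the header comment.

```shell
# Hypothetical helper: print the impersonated permission checks for the SA.
rbac_checks() {
  sa="system:serviceaccount:chrome-service:emo-browser"
  # Granted by the namespace-scoped Role (pods/portforward, verb create).
  echo "kubectl auth can-i create pods --subresource=portforward -n chrome-service --as=$sa"
  # The read-only ClusterRole is described as excluding secrets.
  echo "kubectl auth can-i get secrets -n chrome-service --as=$sa"
}

rbac_checks
```

If the bindings behave as the comments describe, running the first printed command should answer `yes` and the second `no`. Impersonation with `--as` requires the caller to hold impersonate rights.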


@ -49,6 +49,9 @@ resource "kubernetes_namespace" "ci_pipeline_health" {
# billing on PRIVATE mirrors, which a future scoped read:packages rotation of
# the alias could not do. Blast radius = this single-CronJob namespace.
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -38,6 +38,9 @@ resource "kubernetes_namespace" "claude_agent" {
# --- Secrets ---
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -57,6 +57,9 @@ resource "kubernetes_service_account" "breakglass" {
# DENIED this path (see stacks/vault/main.tf) so the shared, prompt-injectable
# pod can never read it.
resource "kubernetes_manifest" "external_secret_ssh" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
@ -82,6 +85,9 @@ resource "kubernetes_manifest" "external_secret_ssh" {
# Env secrets: the Anthropic OAuth token (shared with claude-agent-service,
# same account) and the app bearer token (in-cluster/CLI fallback caller auth).
resource "kubernetes_manifest" "external_secret_env" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -29,6 +29,9 @@ resource "kubernetes_namespace" "claude-memory" {
}
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
@ -57,6 +60,9 @@ resource "kubernetes_manifest" "external_secret" {
# DB credentials from Vault database engine (rotated every 24h)
resource "kubernetes_manifest" "db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -5,6 +5,9 @@ variable "tls_secret_name" {
variable "public_ip" { type = string }
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -23,6 +23,9 @@ resource "kubernetes_namespace" "dawarich" {
}
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -745,7 +745,10 @@ resource "kubernetes_deployment" "phpmyadmin" {
labels = {
"app" = "phpmyadmin"
tier = var.tier
# ADR-0014 service identity: dbaas is a multi-Service namespace, so the
# namespace alone can't attribute Goldmane flows. Value = the fronting
# Service name (kubernetes_service.phpmyadmin is named "pma").
"service-identity" = "pma"
}
annotations = {
"reloader.stakater.com/search" = "true"
@ -762,6 +765,10 @@ resource "kubernetes_deployment" "phpmyadmin" {
metadata {
labels = {
"app" = "phpmyadmin"
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
# disambiguating identity must live on the pod template (not just
# the Deployment metadata above). Not in selector, so no replace.
"service-identity" = "pma"
}
}
spec {
@ -812,8 +819,19 @@ resource "kubernetes_deployment" "phpmyadmin" {
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
# This Deployment is Keel-enrolled (keel.sh/policy=patch). Ignore the
# attributes Keel/Kyverno mutate at runtime so `terragrunt apply` (incl.
# the daily drift plan) doesn't fight them or revert the live image;
# canonical KEEL/KYVERNO lifecycle guard, matches linkwarden/chrome-service.
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/match-tag"],
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE: Keel manages tag updates
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
]
}
}
@ -1499,6 +1517,10 @@ resource "kubernetes_deployment" "pgadmin" {
}
labels = {
tier = var.tier
# ADR-0014 service identity: dbaas is a multi-Service namespace, so the
# namespace alone can't attribute Goldmane flows. Value = the fronting
# Service name (kubernetes_service.pgadmin is named "pgadmin").
"service-identity" = "pgadmin"
}
}
spec {
@ -1514,6 +1536,10 @@ resource "kubernetes_deployment" "pgadmin" {
metadata {
labels = {
app = "pgadmin"
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
# disambiguating identity must live on the pod template (not just
# the Deployment metadata above). Not in selector, so no replace.
"service-identity" = "pgadmin"
}
}
spec {
@ -1568,8 +1594,20 @@ resource "kubernetes_deployment" "pgadmin" {
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
# This Deployment is Keel-enrolled (keel.sh/policy=patch) and Keel has
# bumped the live image (dpage/pgadmin4:9.16). Ignore the Keel/Kyverno
# runtime-mutated attributes so `terragrunt apply` (incl. the daily drift
# plan) doesn't revert the image to bare `dpage/pgadmin4` or strip Keel's
# annotations; canonical guard, matches linkwarden/chrome-service.
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/match-tag"],
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE: Keel manages tag updates
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
]
}
}
resource "kubernetes_service" "pgadmin" {


@ -20,6 +20,9 @@ resource "kubernetes_namespace" "diun" {
}
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -20,6 +20,9 @@ resource "kubernetes_namespace" "ebooks" {
# ExternalSecrets for all three sources
resource "kubernetes_manifest" "calibre_external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
@ -47,6 +50,9 @@ resource "kubernetes_manifest" "calibre_external_secret" {
}
resource "kubernetes_manifest" "audiobookshelf_external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
@ -74,6 +80,9 @@ resource "kubernetes_manifest" "audiobookshelf_external_secret" {
}
resource "kubernetes_manifest" "servarr_external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -33,6 +33,9 @@ resource "kubernetes_namespace" "f1-stream" {
}
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
@ -62,6 +65,9 @@ resource "kubernetes_manifest" "external_secret" {
# Pull the chrome-service bearer token into this namespace as a separate
# Secret so the verifier can reach the in-cluster Playwright pool.
resource "kubernetes_manifest" "chrome_service_client_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -53,6 +53,9 @@ resource "kubernetes_namespace" "fire_planner" {
# Seed before applying:
# secret/fire-planner -> property `recompute_bearer_token`
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
@ -115,6 +118,9 @@ resource "kubernetes_manifest" "external_secret" {
# Template builds the asyncpg DSN consumed by the FastAPI app + CronJob
# as DB_CONNECTION_STRING.
resource "kubernetes_manifest" "db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
@ -159,6 +165,9 @@ resource "kubernetes_manifest" "db_external_secret" {
# pg-sync sidecar populates `daily_account_valuation` etc. hourly; the
# fire-planner ingest reads those tables via this role.
resource "kubernetes_manifest" "wealthfolio_sync_db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
@ -450,6 +459,90 @@ resource "kubernetes_cron_job_v1" "fire_planner_recompute" {
]
}
# Monthly FIRE-countdown target solve on the 2nd at 10:00 UTC (an hour after
# recompute-all, so account_snapshot is fresh). Binary-searches each Case's FIRE
# number per country at the 99% Guyton-Klinger bar and upserts fire_target, which
# the wealth Grafana dashboard's "FIRE Countdown" section reads.
resource "kubernetes_cron_job_v1" "fire_planner_fire_targets" {
metadata {
name = "fire-planner-fire-targets"
namespace = kubernetes_namespace.fire_planner.metadata[0].name
}
spec {
schedule = "0 10 2 * *"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 5
starting_deadline_seconds = 600
job_template {
metadata {
labels = local.labels
}
spec {
backoff_limit = 1
ttl_seconds_after_finished = 86400
# The full country sweep is CPU-bound (binary search × ~22 cities ×
# 3 cases). Give it room rather than letting it run forever.
active_deadline_seconds = 3600
template {
metadata {
labels = local.labels
}
spec {
restart_policy = "OnFailure"
image_pull_secrets {
name = "registry-credentials"
}
image_pull_secrets {
name = "ghcr-credentials"
}
container {
name = "fire-targets"
image = local.image
# --horizon 72: Viktor retires ~age 28 and plans to live to 100, so
# the portfolio must last 72 years (was the 60y default to age 88).
command = ["python", "-m", "fire_planner", "recompute-fire-targets",
"--countries", "all", "--horizon", "72"]
env_from {
secret_ref {
name = "fire-planner-secrets"
}
}
env_from {
secret_ref {
name = "fire-planner-db-creds"
}
}
resources {
requests = {
cpu = "500m"
memory = "1Gi"
}
limits = {
memory = "2Gi"
}
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
depends_on = [
kubernetes_manifest.external_secret,
kubernetes_manifest.db_external_secret,
]
}
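
The `--horizon 72` argument in the CronJob above follows from simple arithmetic on the ages given in its comment. A minimal sketch, where `horizon_years` is a hypothetical helper and the ages are the ones quoted in the comment:

```shell
# Hypothetical helper: years the portfolio must last = target age - retire age.
horizon_years() {
  retire_age=$1
  target_age=$2
  echo $(( target_age - retire_age ))
}

horizon_years 28 100   # new horizon passed as --horizon
horizon_years 28 88    # previous default horizon
```

The first call yields 72 and the second 60, matching the new argument and the old default respectively.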
# Weekly refresh of the COL cache: walks col_snapshot for rows
# expiring within 7 days, re-scrapes Numbeo + Expatistan, upserts. With
# the user-chosen 1-year TTL, a healthy cache has 0 stale rows on most
@ -569,16 +662,53 @@ module "ingress_api" {
auth = "none"
}
# Plan-time read of the ESO-created K8s Secret for Grafana datasource
# password. First-apply gotcha: must
# `terragrunt apply -target=kubernetes_manifest.db_external_secret` so
# the Secret exists before this data source plans.
data "kubernetes_secret" "fire_planner_db_creds" {
metadata {
name = "fire-planner-db-creds"
namespace = kubernetes_namespace.fire_planner.metadata[0].name
# ExternalSecret in the monitoring namespace mirroring the rotating
# fire_planner DB password. Grafana mounts this via envFromSecrets in
# monitoring/grafana_chart_values.yaml; the datasource ConfigMap below
# references it as $__env{FIRE_PLANNER_PG_PASSWORD}. Reloader restarts
# Grafana whenever ESO updates this secret (on the 7d static-role
# rotation), so the provisioned datasource never goes stale; this replaces
# the old plan-time `data.kubernetes_secret` bake that broke weekly.
# Mirrors the wealth-pg / payslips-pg pattern.
resource "kubernetes_manifest" "grafana_fire_planner_pg_creds" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
metadata = {
name = "grafana-fire-planner-pg-creds"
namespace = "monitoring"
}
spec = {
refreshInterval = "15m"
secretStoreRef = {
name = "vault-database"
kind = "ClusterSecretStore"
}
target = {
name = "grafana-fire-planner-pg-creds"
template = {
metadata = {
annotations = {
"reloader.stakater.com/match" = "true"
}
}
data = {
FIRE_PLANNER_PG_PASSWORD = "{{ .password }}"
}
}
}
data = [{
secretKey = "password"
remoteRef = {
key = "static-creds/pg-fire-planner"
property = "password"
}
}]
}
}
depends_on = [kubernetes_manifest.db_external_secret]
}
# Grafana datasource for fire_planner PostgreSQL DB.
@ -615,12 +745,15 @@ resource "kubernetes_config_map" "grafana_fire_planner_datasource" {
timescaledb = false
}
secureJsonData = {
password = data.kubernetes_secret.fire_planner_db_creds.data["DB_PASSWORD"]
# Live env from grafana-fire-planner-pg-creds (above), injected into
# Grafana via envFromSecrets; reloader refreshes it on rotation.
password = "$__env{FIRE_PLANNER_PG_PASSWORD}"
}
editable = true
}]
})
}
depends_on = [kubernetes_manifest.grafana_fire_planner_pg_creds]
}
# CI retrigger 2026-05-16T13:42:57+00:00 bulk enrollment apply (pipeline #689 killed)
@ -661,6 +794,9 @@ variable "run_examples_bulk_ingest" {
# Reddit OAuth creds pulled from Vault secret/viktor.
resource "kubernetes_manifest" "external_secret_examples_reddit" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
@ -701,6 +837,9 @@ resource "kubernetes_manifest" "external_secret_examples_reddit" {
# claude-agent-service bearer pulled separately so its rotation cadence
# is decoupled from the Reddit creds.
resource "kubernetes_manifest" "external_secret_examples_claude" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -6,6 +6,9 @@
# (stacks/authentik/email-secret.tf): one credential, one rotation point. The
# reloader annotation rolls the Forgejo pod if the password is ever rotated.
resource "kubernetes_manifest" "forgejo_email_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -3,6 +3,9 @@ variable "tls_secret_name" {
sensitive = true
}
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -18,6 +18,9 @@ resource "kubernetes_namespace" "immich" {
}
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -0,0 +1,499 @@
# =============================================================================
# goldmane-edge-aggregator: durable who-talks-to-whom audit trail (ADR-0014 / #58)
# =============================================================================
# A small Go service that streams Calico Goldmane's gRPC Flows API (mTLS) and
# upserts the unique service-to-service edge set into Postgres, plus a daily
# Slack digest CronJob of first-seen edges. Code lives in the standalone
# `goldmane-edge-aggregator` repo; the authoritative deploy spec is its
# DEPLOY.md. This stack is the infra side of that spec.
#
# Goldmane runs as `Service goldmane:7443` (gRPC/mTLS) in calico-system, enabled
# via the operator CR in stacks/calico/main.tf. The durable Loki path is NOT
# the operator CRs; this service IS the durable trail.
#
# Structure mirrors stacks/claude-memory (the canonical Tier-1 pattern): a
# per-service namespace, a CNPG Postgres DB + role + Vault 7-day rotation +
# ExternalSecret -> DATABASE_URL, the Reloader annotation, and the
# Terragrunt-generated backend.tf/providers.tf/tiers.tf layout. The novel bit is
# minting an mTLS client cert from the Tigera CA (hashicorp/tls; see versions.tf).
#
# IMAGE: ghcr.io/viktorbarzin/goldmane-edge-aggregator is PRIVATE. Onboarding
# MUST add the "goldmane-edge-aggregator" namespace to the ghcr-credentials
# Kyverno allowlist (stacks/kyverno/modules/kyverno/ghcr-credentials.tf,
# local.ghcr_private_namespaces) so the Kyverno-synced `ghcr-credentials` secret
# is cloned into this namespace; otherwise the pulls 401. The imagePullSecrets
# reference below assumes that entry exists.
# =============================================================================
variable "postgresql_host" { type = string }
# Plan-time root creds for the idempotent DB-init Job (mirrors claude-memory).
data "vault_kv_secret_v2" "secrets" {
mount = "secret"
name = "goldmane-edge-aggregator"
}
# -----------------------------------------------------------------------------
# 1. Namespace
# -----------------------------------------------------------------------------
resource "kubernetes_namespace" "goldmane_edge_aggregator" {
metadata {
name = "goldmane-edge-aggregator"
labels = {
name = "goldmane-edge-aggregator"
# Tier 4-aux: a small off-path consumer service, like claude-memory.
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# -----------------------------------------------------------------------------
# 2. Goldmane mTLS client certificate (minted from the Tigera CA)
# -----------------------------------------------------------------------------
# The aggregator dials goldmane:7443 over mutual TLS. Goldmane requires mutual
# TLS on :7443 and verifies that the client cert chains to the Tigera CA (the
# same CA that issues Goldmane's serving cert); it does NOT authorize by client
# identity, so ANY Tigera-CA-signed cert is accepted.
# Rather than copy the Tigera CA PRIVATE KEY into TF
# state to mint our own (a needless CA-key exposure; the hashicorp/tls provider
# is also incompatible with this repo's global generate-providers/lockfile
# pattern), we REUSE the operator-minted, Tigera-CA-signed client cert
# `whisker-backend-key-pair` (calico-system). We never touch the CA key.
# Trade-off: if the operator rotates that cert, re-apply to re-sync (hardening
# follow-up: mint an own-identity cert in-namespace if Whisker is ever removed).
data "kubernetes_secret" "whisker_backend" {
metadata {
name = "whisker-backend-key-pair"
namespace = "calico-system"
}
}
# The CA bundle that verifies Goldmane's serving cert. It lives ONLY in
# calico-system (verified: ConfigMap `tigera-ca-bundle`, 2 keys present:
# `ca-bundle.crt` AND `tigera-ca-bundle.crt`, both the trusted bundle). We read
# it and recreate it as a ConfigMap in this namespace so the pod can mount it
# (a ConfigMap cannot be cross-namespace-mounted).
data "kubernetes_config_map" "tigera_ca_bundle" {
metadata {
name = "tigera-ca-bundle"
namespace = "calico-system"
}
}
resource "kubernetes_config_map" "tigera_ca_bundle" {
metadata {
name = "tigera-ca-bundle"
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
}
# Copy the upstream bundle verbatim. We mount the `tigera-ca-bundle.crt` key
# at /etc/tigera-ca/tigera-ca-bundle.crt so the service's default
# CA_CERT_PATH (/etc/tigera-ca/tigera-ca-bundle.crt) resolves with no override.
data = data.kubernetes_config_map.tigera_ca_bundle.data
}
# Client cert + key for mTLS to goldmane:7443, mounted at TLS_CERT_PATH /
# TLS_KEY_PATH defaults (/etc/goldmane-client-tls/tls.crt and .../tls.key).
# Sourced verbatim from the operator's whisker-backend client key-pair (read
# above), already Tigera-CA-signed, which is all Goldmane verifies. No CA key
# is touched and no cross-namespace CA RBAC is needed.
resource "kubernetes_secret" "goldmane_client_tls" {
metadata {
name = "goldmane-client-tls"
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
}
type = "Opaque"
data = {
"tls.crt" = data.kubernetes_secret.whisker_backend.data["tls.crt"]
"tls.key" = data.kubernetes_secret.whisker_backend.data["tls.key"]
}
}
# -----------------------------------------------------------------------------
# 3. Postgres: DB + role `goldmane_edges`, Vault 7-day rotation, DATABASE_URL
# -----------------------------------------------------------------------------
# Idempotent create of the role + DB using the CNPG root creds from Vault
# (dbaas_root_password), exactly mirroring claude-memory's db_init Job. The
# service creates the `edge` table itself at startup (migrations/0001_edge.sql),
# so no migration Job is needed.
resource "kubernetes_job" "db_init" {
metadata {
name = "goldmane-edges-db-init"
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
}
spec {
template {
metadata {}
spec {
container {
name = "db-init"
image = "postgres:16-alpine"
command = [
"sh", "-c",
<<-EOT
set -e
# -d postgres: psql defaults the database name to the username;
# the root user has no root-named database, so be explicit.
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_roles WHERE rolname='goldmane_edges'" | grep -q 1 || \
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE ROLE goldmane_edges WITH LOGIN PASSWORD '${data.vault_kv_secret_v2.secrets.data["db_password"]}'"
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_database WHERE datname='goldmane_edges'" | grep -q 1 || \
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE DATABASE goldmane_edges OWNER goldmane_edges"
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "GRANT ALL PRIVILEGES ON DATABASE goldmane_edges TO goldmane_edges"
echo "Database init complete"
EOT
]
}
restart_policy = "Never"
}
}
backoff_limit = 3
}
wait_for_completion = true
timeouts {
create = "2m"
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno injects dns_config (ndots=2); ignore it so
# this idempotent Job isn't replaced (Jobs are immutable) on every apply.
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
# ExternalSecret projecting the Vault-rotated (7-day) credential into a K8s
# Secret as DATABASE_URL. The Vault DB static role `pg-goldmane-edges` and its
# place in the CNPG connection allowlist are added in stacks/vault/main.tf
# (see this stack's terragrunt.hcl note). remoteRef key: static-creds/pg-goldmane-edges.
resource "kubernetes_manifest" "db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
metadata = {
name = "goldmane-edges-db-creds"
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
}
spec = {
refreshInterval = "15m"
secretStoreRef = {
name = "vault-database"
kind = "ClusterSecretStore"
}
target = {
name = "goldmane-edges-db-creds"
template = {
data = {
DATABASE_URL = "postgresql://goldmane_edges:{{ .password }}@${var.postgresql_host}:5432/goldmane_edges"
}
}
}
data = [{
secretKey = "password"
remoteRef = {
key = "static-creds/pg-goldmane-edges"
property = "password"
}
}]
}
}
depends_on = [kubernetes_namespace.goldmane_edge_aggregator]
}
# -----------------------------------------------------------------------------
# 4. Slack webhook (reuse the alert-digest incoming webhook)
# -----------------------------------------------------------------------------
# The monitoring alert-digest CronJob posts with the Slack incoming webhook at
# Vault secret/monitoring -> key `alertmanager_slack_api_url`
# (stacks/monitoring/modules/monitoring/alert_digest.tf). Project that same URL
# into this namespace as SLACK_WEBHOOK_URL via an ExternalSecret (no new
# webhook). The digest CronJob defaults to #security.
resource "kubernetes_manifest" "slack_external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
metadata = {
name = "goldmane-edges-slack"
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
}
spec = {
refreshInterval = "1h"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = {
name = "goldmane-edges-slack"
}
data = [{
secretKey = "SLACK_WEBHOOK_URL"
remoteRef = {
key = "viktor"
property = "alertmanager_slack_api_url"
}
}]
}
}
depends_on = [kubernetes_namespace.goldmane_edge_aggregator]
}
# -----------------------------------------------------------------------------
# 5. aggregate Deployment (long-running gRPC stream -> Postgres upserts)
# -----------------------------------------------------------------------------
resource "kubernetes_deployment" "aggregate" {
depends_on = [
kubernetes_job.db_init,
kubernetes_manifest.db_external_secret,
]
metadata {
name = "goldmane-edge-aggregator"
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
labels = {
app = "goldmane-edge-aggregator"
tier = local.tiers.aux
}
annotations = {
# Credential is env-injected and read only at startup; the 7-day rotation
# must bounce the pod or it keeps the stale password and silently fails
# DB auth (infra CLAUDE.md Reloader rule).
"secret.reloader.stakater.com/reload" = "goldmane-edges-db-creds"
}
}
spec {
# 1 replica: the edge set is a global upsert keyed on (src_ns, dst_ns,
# action); a second replica only doubles writes for no benefit (Goldmane
# streams per-flow). Stateless (no PVC) so RollingUpdate is fine.
replicas = 1
selector {
match_labels = {
app = "goldmane-edge-aggregator"
}
}
template {
metadata {
labels = {
app = "goldmane-edge-aggregator"
}
}
spec {
# Private ghcr image: the ghcr-credentials pull secret is cloned into this
# namespace by the Kyverno sync-ghcr-credentials allowlist policy (add this
# ns to that list).
image_pull_secrets {
name = "ghcr-credentials"
}
container {
name = "aggregate"
# CI (GHA -> ghcr) overwrites this to :<sha8> via `kubectl set image`;
# the image tag is in ignore_changes below so the SHA sticks across
# `terragrunt apply` (fleet image-pin convention). Placeholder :latest
# until the deploy pipeline runs.
image = "ghcr.io/viktorbarzin/goldmane-edge-aggregator:latest"
args = ["aggregate"]
# Goldmane mTLS. By default the TLS ServerName is the GOLDMANE_HOST host
# without the port, "goldmane.calico-system.svc.cluster.local", which is a SAN
# on the live Goldmane serving cert (verified 2026-06-24:
# DNS:goldmane{,.calico-system{,.svc{,.cluster.local}}}). So no
# GOLDMANE_SERVER_NAME override and no GOLDMANE_TLS_INSECURE needed.
env {
name = "GOLDMANE_HOST"
value = "goldmane.calico-system.svc.cluster.local:7443"
}
# TLS_CERT_PATH / TLS_KEY_PATH / CA_CERT_PATH are left at their image
# defaults (/etc/goldmane-client-tls/tls.{crt,key} and
# /etc/tigera-ca/tigera-ca-bundle.crt); the mounts below match them.
env {
name = "DATABASE_URL"
value_from {
secret_key_ref {
name = "goldmane-edges-db-creds"
key = "DATABASE_URL"
}
}
}
volume_mount {
name = "goldmane-client-tls"
mount_path = "/etc/goldmane-client-tls"
read_only = true
}
volume_mount {
name = "tigera-ca"
mount_path = "/etc/tigera-ca"
read_only = true
}
resources {
# Idles low: a single gRPC stream + periodic upserts. requests=limits
# per the repo memory rule; no CPU limit (CFS throttling). Right-size
# later with krr.
requests = {
cpu = "10m"
memory = "64Mi"
}
limits = {
memory = "64Mi"
}
}
}
volume {
name = "goldmane-client-tls"
secret {
secret_name = kubernetes_secret.goldmane_client_tls.metadata[0].name
}
}
volume {
name = "tigera-ca"
config_map {
name = kubernetes_config_map.tigera_ca_bundle.metadata[0].name
}
}
}
}
}
lifecycle {
ignore_changes = [
# CI pipeline owns the image tag (kubectl set image from GHA/Woodpecker).
spec[0].template[0].spec[0].container[0].image,
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/match-tag"],
metadata[0].annotations["kubernetes.io/change-cause"],
metadata[0].annotations["deployment.kubernetes.io/revision"],
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
]
}
}
# -----------------------------------------------------------------------------
# 6. digest daily CronJob (first-seen edges -> Slack)
# -----------------------------------------------------------------------------
resource "kubernetes_cron_job_v1" "digest" {
depends_on = [
kubernetes_job.db_init,
kubernetes_manifest.db_external_secret,
kubernetes_manifest.slack_external_secret,
]
metadata {
name = "goldmane-edges-digest"
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
labels = {
app = "goldmane-edge-aggregator"
tier = local.tiers.aux
}
}
spec {
# Daily at 08:00 Europe/London, aligned with the alert-digest cadence.
schedule = "0 8 * * *"
timezone = "Europe/London"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 3
starting_deadline_seconds = 600
job_template {
metadata {
labels = {
app = "goldmane-edge-aggregator"
}
annotations = {
# 7-day DB rotation: bounce the Job pod's stale env (Reloader rule).
"secret.reloader.stakater.com/reload" = "goldmane-edges-db-creds"
}
}
spec {
backoff_limit = 2
active_deadline_seconds = 300
ttl_seconds_after_finished = 86400
template {
metadata {
labels = {
app = "goldmane-edge-aggregator"
}
}
spec {
restart_policy = "OnFailure"
image_pull_secrets {
name = "ghcr-credentials"
}
container {
name = "digest"
# CronJobs track :latest + imagePullPolicy: Always (fleet
# convention) so the daily run picks up the current image.
image = "ghcr.io/viktorbarzin/goldmane-edge-aggregator:latest"
image_pull_policy = "Always"
args = ["digest"]
env {
name = "DATABASE_URL"
value_from {
secret_key_ref {
name = "goldmane-edges-db-creds"
key = "DATABASE_URL"
}
}
}
env {
name = "SLACK_WEBHOOK_URL"
value_from {
secret_key_ref {
name = "goldmane-edges-slack"
key = "SLACK_WEBHOOK_URL"
}
}
}
env {
name = "SLACK_CHANNEL"
# Posts to #alerts. The dedicated #security channel was abandoned on
# 2026-06-25: the shared alertmanager_slack_api_url webhook's Slack app
# isn't a member of it (channel override 404s), so all Slack traffic
# (incl. alertmanager's security-lane alerts) was consolidated to
# #alerts. See docs/runbooks/goldmane-flow-trail.md.
value = "#alerts"
}
resources {
requests = {
cpu = "10m"
memory = "64Mi"
}
limits = {
memory = "64Mi"
}
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1 (CronJob path): Kyverno mutates dns_config with ndots=2.
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# -----------------------------------------------------------------------------
# 7. Egress (default-deny consideration)
# -----------------------------------------------------------------------------
# Goldmane's own NetworkPolicy already allows INGRESS on 7443 from anywhere, so
# nothing is needed on the Goldmane side. No egress policy is declared here:
# this namespace is default-allow egress today. IF/WHEN it is brought under the
# wave-1 default-deny egress enforcement (per-namespace allowlists), add
# (Global)NetworkPolicy egress rules permitting:
# - goldmane.calico-system.svc.cluster.local:7443 (the flow stream)
# - pg-cluster-rw.dbaas.svc.cluster.local:5432 (Postgres)
# - hooks.slack.com:443 (digest -> Slack, internet)
# - kube-dns / CoreDNS :53 (DNS, every namespace)
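#
# A minimal, commented-out sketch of such an allowlist, assuming the plain
# kubernetes_network_policy resource rather than a Calico GlobalNetworkPolicy.
# The resource name, the namespace labels (kube-dns living in kube-system) and
# the CIDR ranges are illustrative assumptions, not values taken from this
# repo. Standard NetworkPolicy cannot match on DNS names, so hooks.slack.com
# is approximated here as "any public address on 443".
#
# resource "kubernetes_network_policy" "egress_allowlist" {
#   metadata {
#     name      = "egress-allowlist"
#     namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
#   }
#   spec {
#     pod_selector {} # empty selector: applies to every pod in the namespace
#     policy_types = ["Egress"]
#
#     # Goldmane flow stream
#     egress {
#       to {
#         namespace_selector {
#           match_labels = {
#             "kubernetes.io/metadata.name" = "calico-system"
#           }
#         }
#       }
#       ports {
#         port     = "7443"
#         protocol = "TCP"
#       }
#     }
#
#     # Postgres
#     egress {
#       to {
#         namespace_selector {
#           match_labels = {
#             "kubernetes.io/metadata.name" = "dbaas"
#           }
#         }
#       }
#       ports {
#         port     = "5432"
#         protocol = "TCP"
#       }
#     }
#
#     # DNS (UDP and TCP)
#     egress {
#       to {
#         namespace_selector {
#           match_labels = {
#             "kubernetes.io/metadata.name" = "kube-system"
#           }
#         }
#       }
#       ports {
#         port     = "53"
#         protocol = "UDP"
#       }
#       ports {
#         port     = "53"
#         protocol = "TCP"
#       }
#     }
#
#     # Slack webhook (internet only, private ranges excluded)
#     egress {
#       to {
#         ip_block {
#           cidr   = "0.0.0.0/0"
#           except = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]
#         }
#       }
#       ports {
#         port     = "443"
#         protocol = "TCP"
#       }
#     }
#   }
# }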


@ -0,0 +1,24 @@
include "root" {
path = find_in_parent_folders()
}
# Tier-1 stack (PG state backend). The root terragrunt.hcl generates backend.tf
# (pg backend, schema_name = "goldmane-edge-aggregator"), providers.tf,
# cloudflare_provider.tf and tiers.tf automatically; do NOT hand-write those.
# This stack adds the hashicorp/tls provider via a local versions.tf (merged
# into the generated required_providers).
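# A minimal sketch of what that local versions.tf could look like. No version
# constraint is shown because the pin used by this stack is not visible here;
# the real file may well carry one:
#
#   terraform {
#     required_providers {
#       tls = {
#         source = "hashicorp/tls"
#       }
#     }
#   }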
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}
dependency "vault" {
config_path = "../vault"
skip_outputs = true
}
# The Vault DB static role pg-goldmane-edges (7-day rotation) and the CNPG
# connection allowlist entry live in the vault stack (stacks/vault/main.tf).
# The vault dependency above orders this stack after it so the ExternalSecret
# can materialize the rotated credential on first apply.
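#
# For orientation, a hedged sketch of how such a static role is commonly
# declared with the Vault provider. Only the role name and the 7-day period
# come from this stack; the resource label, mount path, db_name and username
# below are assumed placeholders, and the real definition in
# stacks/vault/main.tf may differ:
#
#   resource "vault_database_secret_backend_static_role" "pg_goldmane_edges" {
#     backend         = "database"          # assumed mount path
#     name            = "pg-goldmane-edges"
#     db_name         = "postgresql"        # assumed connection name
#     username        = "goldmane_edges"    # assumed Postgres role
#     rotation_period = 604800              # 7 days, in seconds
#   }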


@ -5,6 +5,9 @@ variable "tls_secret_name" {
variable "nfs_server" { type = string }
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -208,6 +208,9 @@ module "ingress" {
}
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -250,6 +250,9 @@ module "ingress_test" {
}
resource "kubernetes_manifest" "external_secret_db" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
@ -284,6 +287,9 @@ resource "kubernetes_manifest" "external_secret_db" {
}
resource "kubernetes_manifest" "external_secret_kv" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -37,6 +37,9 @@ module "tls_secret" {
# --- Secrets (ESO from Vault) ---
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"

stacks/immich/frame-emo.tf (new file)

@ -0,0 +1,155 @@
# Immich photo-frame for Emo (emil.barzin@gmail.com): a second instance cloned
# from the London frame in frame.tf, scoped to Emo's Immich account + Sofia
# weather. Served at highlights-immich-emo.viktorbarzin.me and shown on Emo's
# Portal Mini (Sofia) via the portal-immich-frame app.
# API key: Vault secret/immich -> frame_api_key_emo (minted on Emo's account).
resource "kubernetes_config_map" "frame_config_emo" {
metadata {
name = "config-emo"
namespace = "immich"
labels = {
app = "frame-config-emo"
}
annotations = {
"reloader.stakater.com/match" = "true"
}
}
data = {
"Settings.yml" = <<-EOF
General:
Layout: single
Interval: 45
ImageZoom: true
ShowAlbumName: false
ShowProgressBar: false
ClockFormat: "HH:mm"
PhotoDateFormat: "dd/MM/yyyy"
WeatherApiKey: ${data.vault_kv_secret_v2.secrets.data["frame_weather_api_key"]}
UnitSystem: metric
WeatherLatLong: "42.6977,23.3219"
Language: en
Accounts:
- ImmichServerUrl: http://immich.viktorbarzin.me
ApiKey: ${data.vault_kv_secret_v2.secrets.data["frame_api_key_emo"]}
ImagesFromDays: 730
EOF
}
}
resource "kubernetes_deployment" "immich-frame-emo" {
metadata {
name = "immich-frame-emo"
namespace = "immich"
annotations = {
"reloader.stakater.com/search" = "true"
}
labels = {
tier = local.tiers.gpu
}
}
spec {
replicas = 1
selector {
match_labels = {
app = "immich-frame-emo"
}
}
strategy {
type = "RollingUpdate"
}
template {
metadata {
labels = {
app = "immich-frame-emo"
}
annotations = {
"dependency.kyverno.io/wait-for" = "immich-server.immich:2283"
}
}
spec {
container {
image = "ghcr.io/immichframe/immichframe:v1.0.32.0"
name = "immich-frame-emo"
resources {
requests = {
cpu = "10m"
memory = "64Mi"
}
limits = {
memory = "128Mi"
}
}
port {
container_port = 8080
protocol = "TCP"
name = "http"
}
volume_mount {
name = "config"
mount_path = "/app/Config"
read_only = true
}
}
volume {
name = "config"
config_map {
name = "config-emo"
}
}
}
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/match-tag"],
metadata[0].annotations["kubernetes.io/change-cause"],
metadata[0].annotations["deployment.kubernetes.io/revision"],
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE
]
}
}
resource "kubernetes_service" "immich-frame-emo" {
metadata {
name = "immich-frame-emo"
namespace = "immich"
labels = {
"app" = "immich-frame-emo"
}
}
spec {
selector = {
app = "immich-frame-emo"
}
port {
port = 80
target_port = 8080
}
}
}
module "ingress_emo" {
source = "../../modules/kubernetes/ingress_factory"
# Photo-frame kiosk display on Emo's Portal: a headless browser pulling images
# via an Immich API key (no user login). Forward-auth would 302 the device to
# Authentik with no way to complete login.
# auth = "none": photo-frame kiosk; headless browser with API key; no user login.
auth = "none"
dns_type = "proxied"
namespace = "immich"
name = "highlights-immich-emo"
tls_secret_name = var.tls_secret_name
service_name = "immich-frame-emo"
}


@ -162,6 +162,9 @@ resource "kubernetes_resource_quota" "immich" {
}
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -20,6 +20,9 @@ resource "kubernetes_namespace" "insta2spotify" {
}
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -35,6 +35,14 @@ resource "kubernetes_namespace" "instagram_poster" {
# - immich_tag_instagram (optional; auto-resolved if missing)
# - immich_tag_posted (optional; auto-resolved if missing)
resource "kubernetes_manifest" "external_secret" {
# The external-secrets controller takes server-side-apply ownership of
# .spec.refreshInterval, so a plain TF apply conflicts. force_conflicts lets
# TF win (values match, so it's stable); same pattern as grafana/woodpecker/
# traefik/k8s-version-upgrade. Surfaced 2026-06-24 by the first IG apply since
# the ESO v1 migration (the scale-to-0 push).
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
@ -139,6 +147,11 @@ resource "kubernetes_manifest" "external_secret" {
# ESO refreshes the K8s Secret every 15m. `reloader.stakater.com/match`
# bounces the pod when the password changes.
resource "kubernetes_manifest" "benchmark_db_external_secret" {
# See external_secret above: ESO owns .spec.refreshInterval; force_conflicts
# lets the TF apply win instead of erroring on the field-manager conflict.
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
@ -227,7 +240,11 @@ resource "kubernetes_deployment" "instagram_poster" {
}
spec {
replicas = 1
# Scaled to 0 (2026-06-24): Instagram Graph integration is unused and its
# ExternalSecret is dead (missing ig_graph_long_lived_token /
# ig_business_account_id in Vault secret/instagram-poster). Set back to 1
# after minting a Meta long-lived token and populating those keys.
replicas = 0
# RWO PVC cannot rolling-update.
strategy {
type = "Recreate"


@ -41,6 +41,9 @@ resource "kubernetes_namespace" "job_hunter" {
# digest_to_address where the weekly digest goes
# digest_from_address From: header for the digest
resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
@ -105,6 +108,9 @@ resource "kubernetes_manifest" "external_secret" {
# DB credentials from Vault database engine (7-day rotation).
# Template builds the asyncpg DSN consumed by the FastAPI app as DB_CONNECTION_STRING.
resource "kubernetes_manifest" "db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
@ -325,6 +331,9 @@ resource "kubernetes_service" "job_hunter" {
# references it as $__env{JOB_HUNTER_PG_PASSWORD}. Reloader restarts
# Grafana whenever ESO updates this secret (every 7d on rotation).
resource "kubernetes_manifest" "grafana_job_hunter_db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -5,6 +5,9 @@
# -----------------------------------------------------------------------------
resource "kubernetes_manifest" "oauth2_proxy_externalsecret" {
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"


@ -5,9 +5,11 @@
<main>
<h1>Kubernetes Access Portal</h1>
<div class="callout warning">
<strong>VPN Required</strong> — The cluster is on a private network. You need Headscale VPN access before kubectl will work.
<a href="/onboarding">See the Getting Started guide</a> for VPN setup instructions.
<div class="callout info">
<strong>Fastest way in:</strong> open the <a href="https://t3.viktorbarzin.me">web terminal</a> or the
<a href="https://k8s.viktorbarzin.me">dashboard</a> and sign in — no install, no VPN needed. Prefer your
own machine? The <a href="/onboarding#path-laptop">local-setup guide</a> covers VPN + kubectl, and the
<a href="/onboarding">Getting Started page</a> compares all three access paths.
</div>
<section>
@ -26,6 +28,7 @@
<p><strong>Assigned namespaces:</strong> {data.namespaces.join(', ')}</p>
<h3>Quick Commands</h3>
<p>Run these as-is in the <a href="https://t3.viktorbarzin.me">web terminal</a> — it's already signed in as you.</p>
<pre>
# Check your pods
kubectl get pods -n {data.namespaces[0]}
@ -47,16 +50,23 @@ vault write kubernetes/creds/{data.namespaces[0]}-deployer \
<section>
<h2>Get Started</h2>
<h3>No setup — start now</h3>
<ol>
<li><a href="https://t3.viktorbarzin.me">Open the web terminal</a> — a ready shell with kubectl, Vault and your repos already set up</li>
<li><a href="https://k8s.viktorbarzin.me">Open the dashboard</a> — point-and-click view of your workloads</li>
</ol>
<h3>On your own machine</h3>
<ol>
{#if data.role === 'namespace-owner'}
<li><a href="/onboarding?role=namespace-owner">Complete the namespace-owner onboarding guide</a></li>
<li><a href="/onboarding?role=namespace-owner#path-laptop">Follow the namespace-owner setup</a> (VPN, kubectl, Vault, encrypted state)</li>
{:else}
<li><a href="/onboarding">Complete the onboarding guide</a> (VPN, kubectl, git)</li>
<li><a href="/onboarding#path-laptop">Follow the local setup</a> (VPN, kubectl, git)</li>
{/if}
<li><a href="/setup">Install kubectl and kubelogin</a></li>
<li><a href="/download">Download your kubeconfig</a></li>
<li>Run <code>kubectl get namespaces</code> to verify access</li>
</ol>
<p><a href="/onboarding">Compare all three access paths →</a></p>
</section>
<section>
@ -91,12 +101,12 @@ vault write kubernetes/creds/{data.namespaces[0]}-deployer \
border-radius: 6px;
margin: 1rem 0;
}
.callout.warning {
background: #fff3cd;
border-left: 4px solid #ffc107;
.callout.info {
background: #e8f4fd;
border-left: 4px solid #2196f3;
}
.callout a {
color: #856404;
color: #0d47a1;
font-weight: 600;
}
</style>


@ -5,22 +5,123 @@
<main class="content">
<h1>Getting Started</h1>
<p>Welcome! Follow these steps to get access to the home Kubernetes cluster.</p>
<div class="role-tabs">
<a href="/onboarding" class:active={!showNamespaceOwner}>General User</a>
<a href="/onboarding?role=namespace-owner" class:active={showNamespaceOwner}>Namespace Owner</a>
</div>
<p>
Welcome! There are three ways to reach the home Kubernetes cluster. Pick the one that fits —
the first two need <strong>zero setup</strong> and open right in your browser.
</p>
<section>
<h2>Step 0 — Join the VPN</h2>
<p>The cluster is on a private network (<code>10.0.20.0/24</code>). You need VPN access first.</p>
<h2>Three ways in</h2>
<table>
<thead><tr><th>Path</th><th>Best for</th><th>Setup</th></tr></thead>
<tbody>
<tr>
<td><a href="#path-terminal"><strong>A — Web terminal</strong></a></td>
<td>Just want to start working now</td>
<td>None — opens in your browser</td>
</tr>
<tr>
<td><a href="#path-dashboard"><strong>B — Web dashboard</strong></a></td>
<td>Click around, watch your app, read logs</td>
<td>None — opens in your browser</td>
</tr>
<tr>
<td><a href="#path-laptop"><strong>C — Your own machine</strong></a></td>
<td>kubectl / Terraform locally, full control</td>
<td>VPN + one-line installer</td>
</tr>
</tbody>
</table>
<div class="callout info">
<strong>Not sure?</strong> Start with the <a href="#path-terminal">web terminal (Path A)</a>.
Everything is already installed and your repos are already cloned — you can run your first
<code>kubectl</code> command within a minute, from any device.
</div>
</section>
<section id="path-terminal" class="path">
<h2>Path A — Web terminal <span class="badge rec">Recommended</span> <span class="badge none">No setup</span></h2>
<p>
A full terminal that runs in your browser — nothing to install, works from any device
(even a tablet). It drops you into your own account on the shared workstation, with every
tool already set up.
</p>
<ol>
<li>Open <a href="https://t3.viktorbarzin.me" target="_blank">t3.viktorbarzin.me</a></li>
<li>Sign in with your Authentik account (the same SSO login as this portal)</li>
<li>You land in a ready-to-use shell. Try it:
<pre>kubectl get pods -n YOUR_NAMESPACE</pre>
</li>
</ol>
<div class="callout info">
<strong>Already done for you</strong> on the workstation:
<ul>
<li><code>kubectl</code> + your kubeconfig, scoped to your namespaces (no login dance)</li>
<li><code>vault</code>, <code>terragrunt</code>, <code>terraform</code>, <code>sops</code>, <code>kubeseal</code></li>
<li>Your repos cloned under <code>~/code</code> — the <code>infra</code> repo plus your own project repos</li>
<li>Claude Code, ready to pair with you on changes</li>
</ul>
</div>
<div class="callout warning">
<strong>No access yet?</strong> The workstation is provisioned per person. If
<code>t3.viktorbarzin.me</code> says you're not authorized, ask Viktor to add you
(<a href="mailto:vbarzin@gmail.com">vbarzin@gmail.com</a> or Slack).
</div>
</section>
<section id="path-dashboard" class="path">
<h2>Path B — Web dashboard <span class="badge none">No setup</span></h2>
<p>
A point-and-click view of the cluster — browse your pods, read logs, restart a deployment,
check events. Nothing to install.
</p>
<ol>
<li>Open <a href="https://k8s.viktorbarzin.me" target="_blank">k8s.viktorbarzin.me</a></li>
<li>Sign in with your Authentik account</li>
<li>
You're dropped straight into the Kubernetes Dashboard, already authenticated as you —
<strong>no token to paste</strong>. The portal injects your personal access token for you.
</li>
</ol>
<div class="callout info">
Scoped to your namespace(s): you can see and manage your own workloads, but not other
tenants'. This path uses a per-user token that does <em>not</em> depend on CLI login, so it
keeps working even if <code>kubectl</code> OIDC login is having a bad day — making it the
reliable fallback for Path C.
</div>
</section>
<section id="path-laptop" class="path c">
<h2>Path C — From your own machine</h2>
<p>
For running <code>kubectl</code>, <code>vault</code> and Terraform locally. This is the most
powerful path and the one to use for infrastructure changes — it just needs a bit more setup
because the cluster API lives on a private network.
</p>
<div class="role-tabs">
<a href="/onboarding?role=general#path-laptop" class:active={!showNamespaceOwner}>General User</a>
<a href="/onboarding?role=namespace-owner#path-laptop" class:active={showNamespaceOwner}>Namespace Owner</a>
</div>
<p class="prereq">
{#if showNamespaceOwner}
Namespace owner — you'll also set up Vault and encrypted Terraform state so you can deploy
your own app stacks.
{:else}
General user — VPN, kubectl and git access. (Managing your own app stack? Switch to the
<strong>Namespace Owner</strong> tab above.)
{/if}
</p>
<section>
<h3>Step 1 — Join the VPN</h3>
<p>The cluster API is on a private network (<code>10.0.20.0/24</code>), so you need VPN access first.</p>
<ol>
<li>Install <a href="https://tailscale.com/download" target="_blank">Tailscale</a> for your OS</li>
<li>Run this in your terminal:
<pre>tailscale login --login-server https://headscale.viktorbarzin.me</pre>
</li>
<li>A browser window will open with a registration URL</li>
<li>A browser window opens with a registration URL</li>
<li>Send that URL to Viktor via email (<a href="mailto:vbarzin@gmail.com">vbarzin@gmail.com</a>) or Slack</li>
<li>Wait for approval (usually within a few hours)</li>
<li>Once approved, test: <pre>ping 10.0.20.100</pre></li>
@ -28,62 +129,49 @@
</section>
<section>
<h2>Step 1 — Log in to the portal</h2>
<p>Visit <a href="https://k8s-portal.viktorbarzin.me">k8s-portal.viktorbarzin.me</a> and sign in with your Authentik account.</p>
<p>If you don't have an account yet, ask Viktor to create one.</p>
<h3>Step 2 — Install the tools</h3>
<p>Run one of these to install everything automatically (kubectl, kubelogin, vault, terragrunt, terraform, kubeseal) and write your kubeconfig to <code>~/.kube/config-home</code>:</p>
<h4>macOS</h4>
<p class="prereq">Requires <a href="https://brew.sh" target="_blank">Homebrew</a>. Install it first if you don't have it.</p>
<pre>bash &lt;(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=mac)</pre>
<h4>Linux</h4>
<pre>bash &lt;(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=linux)</pre>
<h4>Windows</h4>
<p>Use <a href="https://learn.microsoft.com/en-us/windows/wsl/install" target="_blank">WSL2</a> and follow the Linux instructions.</p>
</section>
<section>
<h2>Step 2 — Set up kubectl</h2>
<p>Run one of these commands in your terminal to install everything automatically:</p>
<h3>macOS</h3>
<p class="prereq">Requires <a href="https://brew.sh" target="_blank">Homebrew</a>. Install it first if you don't have it.</p>
<pre>bash &lt;(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=mac)</pre>
<h3>Linux</h3>
<pre>bash &lt;(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=linux)</pre>
<h3>Windows</h3>
<p>Use <a href="https://learn.microsoft.com/en-us/windows/wsl/install" target="_blank">WSL2</a> and follow the Linux instructions.</p>
<h3>Step 3 — Verify access</h3>
<p>Run this. The first time, it opens your browser for SSO login:</p>
<pre>kubectl get {showNamespaceOwner ? 'pods -n YOUR_NAMESPACE' : 'namespaces'}</pre>
<p>You should see your resources (or an empty list if you haven't deployed anything yet).</p>
<div class="callout warning">
<strong>Browser login loops, or kubectl says "Unauthorized"?</strong> Command-line SSO
(OIDC) can occasionally be unavailable. When that happens, use the
<a href="#path-dashboard">web dashboard (Path B)</a> or the
<a href="#path-terminal">web terminal (Path A)</a> — both authenticate a different way and
keep working — and let Viktor know.
</div>
<p class="prereq">Connection error instead? Make sure the VPN is up: <code>tailscale status</code>.</p>
</section>
{#if showNamespaceOwner}
<section>
<h2>Step 3 — Log into Vault</h2>
<h3>Step 4 — Log into Vault</h3>
<p>Vault manages your secrets and issues dynamic Kubernetes credentials.</p>
<pre>vault login -method=oidc</pre>
<p>This opens your browser for Authentik SSO. After login, your token is saved to <code>~/.vault-token</code>.</p>
</section>
<section>
<h2>Step 4 — Verify kubectl access</h2>
<p>Run this command. It will open your browser for OIDC login the first time:</p>
<pre>kubectl get pods -n YOUR_NAMESPACE</pre>
<p>You should see an empty list (no resources) or your running pods.</p>
</section>
<section>
<h2>Step 5 — Clone the infra repo</h2>
<h3>Step 5 — Clone the infra repo</h3>
<pre>git clone https://github.com/ViktorBarzin/infra.git
cd infra</pre>
<p>This is where all the infrastructure configuration lives. Terraform state is committed as encrypted files.</p>
</section>
<section>
<h2>Step 6 — Install tools</h2>
<p>You need <code>sops</code> and <code>terragrunt</code> to work with infrastructure state:</p>
<h3>macOS</h3>
<pre>brew install sops terragrunt</pre>
<h3>Linux</h3>
<pre># sops
curl -LO https://github.com/getsops/sops/releases/latest/download/sops-v3.9.4.linux.amd64
sudo mv sops-*.linux.amd64 /usr/local/bin/sops && sudo chmod +x /usr/local/bin/sops
# terragrunt
curl -LO https://github.com/gruntwork-io/terragrunt/releases/latest/download/terragrunt_linux_amd64
sudo mv terragrunt_linux_amd64 /usr/local/bin/terragrunt && sudo chmod +x /usr/local/bin/terragrunt</pre>
</section>
<section>
<h2>Step 7 — Decrypt your state</h2>
<h3>Step 6 — Decrypt your state</h3>
<p>Terraform state is encrypted with SOPS. Your Vault login gives you access to <strong>only your stacks</strong>.</p>
<pre># Make sure you're logged into Vault
vault login -method=oidc
@ -132,7 +220,7 @@ cd stacks/YOUR_NAMESPACE
</section>
<section>
<h2>Step 8 — Create your first app stack</h2>
<h3>Step 7 — Create your first app stack</h3>
<ol>
<li>Copy the template: <pre>cp -r stacks/_template stacks/myapp
mv stacks/myapp/main.tf.example stacks/myapp/main.tf</pre></li>
@ -153,7 +241,7 @@ git push</pre>
</section>
<section>
<h2>Architecture Overview</h2>
<h3>Architecture Overview</h3>
<p>Here's how your changes flow through the system:</p>
<div class="diagram">
@ -204,31 +292,18 @@ git push</pre>
</section>
{:else}
<section>
<h2>Step 3 — Verify access</h2>
<p>Run this command. It will open your browser for login the first time:</p>
<pre>kubectl get namespaces</pre>
<p>You should see output like:</p>
<pre class="output">NAME STATUS AGE
default Active 200d
kube-system Active 200d
monitoring Active 200d
...</pre>
<p>If you get a connection error, make sure your VPN is connected (<code>tailscale status</code>).</p>
</section>
<section>
<h2>Step 4 — Clone the repo</h2>
<h3>Step 4 — Clone the repo</h3>
<pre>git clone https://github.com/ViktorBarzin/infra.git
cd infra</pre>
<p>This is where all the infrastructure configuration lives.</p>
</section>
<section>
<h2>Step 5 — Your first change</h2>
<h3>Step 5 — Your first change</h3>
<ol>
<li>Create a branch: <pre>git checkout -b my-first-change</pre></li>
<li>Edit a service file (e.g., change an image tag in <code>stacks/echo/main.tf</code>)</li>
<li>Commit and push: <pre>git add . && git commit -m "my first change" && git push -u origin my-first-change</pre></li>
<li>Commit and push: <pre>git add . &amp;&amp; git commit -m "my first change" &amp;&amp; git push -u origin my-first-change</pre></li>
<li>Open a Pull Request on GitHub</li>
<li>Viktor reviews and merges</li>
<li>Woodpecker CI automatically applies the change to the cluster</li>
@ -236,19 +311,29 @@ cd infra</pre>
</ol>
</section>
{/if}
</section>
</main>
<style>
.content { max-width: 768px; margin: 2rem auto; padding: 0 1rem; font-family: system-ui, -apple-system, sans-serif; line-height: 1.6; }
.content h1 { border-bottom: 1px solid #e0e0e0; padding-bottom: 0.5rem; }
.content h2 { margin-top: 2rem; color: #333; }
.content h3 { color: #666; margin: 1rem 0 0.25rem; }
.content h3 { color: #444; margin: 1.25rem 0 0.25rem; }
.content h4 { color: #666; margin: 0.75rem 0 0.25rem; }
.content pre { background: #1e1e1e; color: #d4d4d4; padding: 1rem; border-radius: 6px; overflow-x: auto; }
.content pre.output { background: #f5f5f5; color: #333; }
.content code { background: #f0f0f0; padding: 2px 6px; border-radius: 3px; }
.content .prereq { font-size: 0.9rem; color: #666; font-style: italic; }
section { margin: 2rem 0; }
.role-tabs { display: flex; gap: 0; margin: 1.5rem 0; border-bottom: 2px solid #e0e0e0; }
section section { margin: 1.25rem 0; }
.path { border-left: 4px solid #4fc3f7; padding-left: 1.25rem; scroll-margin-top: 4rem; }
.path.c { border-left-color: #bbb; }
.badge { display: inline-block; font-size: 0.65rem; font-weight: 700; text-transform: uppercase; letter-spacing: 0.5px; padding: 0.15rem 0.5rem; border-radius: 4px; vertical-align: middle; margin-left: 0.4rem; }
.badge.rec { background: #d4f8d4; color: #1b5e20; }
.badge.none { background: #e3f2fd; color: #0d47a1; }
.role-tabs { display: flex; gap: 0; margin: 1.5rem 0 0.5rem; border-bottom: 2px solid #e0e0e0; }
.role-tabs a { padding: 0.5rem 1.5rem; text-decoration: none; color: #666; border-bottom: 2px solid transparent; margin-bottom: -2px; }
.role-tabs a.active { color: #333; border-bottom-color: #333; font-weight: 600; }
table { border-collapse: collapse; width: 100%; margin: 0.5rem 0; }
@ -258,6 +343,7 @@ cd infra</pre>
.callout { padding: 1rem; border-radius: 6px; margin: 1rem 0; }
.callout.info { background: #e8f4fd; border-left: 4px solid #2196f3; }
.callout.warning { background: #fff3cd; border-left: 4px solid #ffc107; }
.callout ul { margin: 0.5rem 0 0; padding-left: 1.25rem; }
.diagram { background: #fafafa; border: 1px solid #e0e0e0; border-radius: 8px; padding: 1.5rem; margin: 1.5rem 0; }
.diagram h3 { margin: 0 0 1rem 0; color: #333; font-size: 0.95rem; text-transform: uppercase; letter-spacing: 0.5px; }


@ -2,6 +2,19 @@
<h1>Service Catalog</h1>
<p>70+ services running on the cluster. Here are the most commonly used:</p>
<section>
<h2>Cluster Access</h2>
<table>
<thead><tr><th>Service</th><th>URL</th><th>Description</th></tr></thead>
<tbody>
<tr><td>Web Terminal</td><td><a href="https://t3.viktorbarzin.me">t3.viktorbarzin.me</a></td><td>Browser shell on the shared workstation — kubectl, Vault &amp; your repos preinstalled (zero setup)</td></tr>
<tr><td>Kubernetes Dashboard</td><td><a href="https://k8s.viktorbarzin.me">k8s.viktorbarzin.me</a></td><td>Point-and-click view of your workloads, auto-authenticated (zero setup)</td></tr>
<tr><td>Access Portal</td><td><a href="https://k8s-portal.viktorbarzin.me">k8s-portal.viktorbarzin.me</a></td><td>This portal — onboarding, kubeconfig download, setup script</td></tr>
<tr><td>Vault</td><td><a href="https://vault.viktorbarzin.me">vault.viktorbarzin.me</a></td><td>Secrets &amp; dynamic credentials — <code>vault login -method=oidc</code></td></tr>
</tbody>
</table>
</section>
<section>
<h2>Core Services</h2>
<table>
@ -22,7 +35,7 @@
<tbody>
<tr><td>Nextcloud</td><td><a href="https://nextcloud.viktorbarzin.me">nextcloud.viktorbarzin.me</a></td><td>File storage, calendar, contacts</td></tr>
<tr><td>Immich</td><td><a href="https://immich.viktorbarzin.me">immich.viktorbarzin.me</a></td><td>Photo library (Google Photos alternative)</td></tr>
<tr><td>Vaultwarden</td><td><a href="https://vault.viktorbarzin.me">vault.viktorbarzin.me</a></td><td>Password manager</td></tr>
<tr><td>Vaultwarden</td><td><a href="https://vaultwarden.viktorbarzin.me">vaultwarden.viktorbarzin.me</a></td><td>Password manager</td></tr>
<tr><td>Paperless-ngx</td><td><a href="https://pdf.viktorbarzin.me">pdf.viktorbarzin.me</a></td><td>Document management</td></tr>
<tr><td>Navidrome</td><td><a href="https://music.viktorbarzin.me">music.viktorbarzin.me</a></td><td>Music streaming</td></tr>
<tr><td>Tandoor</td><td><a href="https://recipes.viktorbarzin.me">recipes.viktorbarzin.me</a></td><td>Recipe manager</td></tr>

View file

@ -11,6 +11,26 @@
</ol>
</section>
<section>
<h2>Browser login loops, or kubectl says "Unauthorized"</h2>
<p>Command-line SSO (OIDC) login can occasionally be unavailable. You don't have to wait for it — these authenticate a different way and keep working:</p>
<ul>
<li><a href="https://k8s.viktorbarzin.me">Web dashboard</a> — auto-authenticated, no token to paste</li>
<li><a href="https://t3.viktorbarzin.me">Web terminal</a> — its kubectl is already wired up</li>
</ul>
<p>Let Viktor know so the CLI login path gets fixed.</p>
</section>
<section>
<h2>Don't want to set up a local machine at all?</h2>
<p>Skip the VPN and CLI install entirely:</p>
<ul>
<li><a href="https://t3.viktorbarzin.me">t3.viktorbarzin.me</a> — a browser shell with everything preinstalled</li>
<li><a href="https://k8s.viktorbarzin.me">k8s.viktorbarzin.me</a> — a point-and-click dashboard</li>
</ul>
<p>Both just need your Authentik login. See the <a href="/onboarding">Getting Started</a> guide.</p>
</section>
<section>
<h2>"Forbidden" or "Permission denied"</h2>
<p>You may not have access to that namespace. Your access is scoped to specific namespaces.</p>

View file

@ -483,31 +483,49 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
exit 0
fi
slack "K8s upgrade available: v$RUNNING → v$TARGET ($KIND)"
echo "K8s upgrade available: v$RUNNING -> v$TARGET ($KIND)"
if [ "$DRY_RUN" = "true" ]; then
slack "DRY_RUN — not spawning preflight Job"
slack "DRY_RUN — target v$TARGET detected, not spawning preflight Job"
exit 0
fi
# 7. Spawn Job 0 (preflight) via envsubst on the job-template
# Idempotency: deterministic name reconciles via `apply`.
JOB_NAME="k8s-upgrade-preflight-$${TARGET//./-}"
MASTER_JOB="k8s-upgrade-master-$${TARGET//./-}"
ANNOUNCE=yes # Slack the spawn? Suppressed for silent nightly re-evaluations of a standing gate refusal.
# Retry-on-failure idempotency: skip only if an existing preflight
# Job is Active/Complete. A *Failed* preflight (aborted on a
# transient gate, e.g. a spurious critical alert) is deleted and
# re-spawned; otherwise its deterministic name + 7d TTL wedges
# the entire pipeline until it ages out. (Stuck-pipeline fix
# 2026-06-17: a transient critical alert wedged 1.34.9 for 5 days.)
# Idempotency + nightly re-evaluation:
# - FAILED preflight (transient gate abort, e.g. a spurious
# critical alert / unhealthy node) -> delete + re-spawn, announced.
# - COMPLETE preflight but NO master Job spawned -> the compat
# gate REFUSED the target (blocked/held now Complete cleanly
# rather than Failing). Re-spawn SILENTLY so the gate re-checks
# nightly (the refusal may have cleared: addon upgraded / matrix
# updated / upstream shipped) WITHOUT nightly Slack noise for a
# standing refusal; the morning report (+ K8sUpgradeBlocked for
# actionable) is the signal.
# - Otherwise (Active, or Complete with the chain advanced) -> skip.
# The old "Failed-only re-spawn" left a refused-but-Complete preflight
# skipped until its 7d TTL, which is too slow now that refusals Complete
# instead of Failing (2026-06-28). Deterministic names; `apply`
# reconciles. (Stuck-pipeline history: a transient critical alert
# wedged 1.34.9 for 5 days, 2026-06-17; hence Failed always re-spawns.)
if /usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" >/dev/null 2>&1; then
JOB_FAILED=$(/usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" \
-o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' 2>/dev/null || true)
JOB_COMPLETE=$(/usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" \
-o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' 2>/dev/null || true)
if [ "$JOB_FAILED" = "True" ]; then
slack "Preflight Job $JOB_NAME exists but FAILED — deleting and re-spawning"
/usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true
elif [ "$JOB_COMPLETE" = "True" ] && ! /usr/local/bin/kubectl -n k8s-upgrade get job "$MASTER_JOB" >/dev/null 2>&1; then
echo "Preflight $JOB_NAME Complete + no master Job (gate refused) — silent nightly re-evaluate"
/usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true
ANNOUNCE=no
else
slack "Preflight Job $JOB_NAME already exists (active/complete) — skipping"
echo "Preflight Job $JOB_NAME already exists (active / chain advanced) — skipping"
exit 0
fi
fi
@ -521,7 +539,9 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
< /template/job-template.yaml \
| /usr/local/bin/kubectl apply -f -
if [ "$ANNOUNCE" = "yes" ]; then
slack "Spawned $JOB_NAME (target=v$TARGET kind=$KIND)"
fi
EOT
]
env {
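The re-spawn rules described in the comment block of this hunk reduce to a small decision table. A minimal sketch of that table in Python, for illustration only: the function name `respawn_decision` and its boolean inputs are hypothetical, and the real logic is the shell in the hunk above.

```python
def respawn_decision(preflight_exists, failed, complete, master_exists):
    """Return (action, announce) for one nightly detector run.

    action is "spawn", "respawn" or "skip"; announce says whether the
    spawn would be posted to Slack. All names here are illustrative.
    """
    if not preflight_exists:
        return "spawn", True      # first sighting of this target
    if failed:
        return "respawn", True    # transient gate abort: delete and retry, announced
    if complete and not master_exists:
        return "respawn", False   # gate refused the target: re-check quietly
    return "skip", False          # still active, or the chain already advanced


# A refused-but-Complete preflight is re-run without Slack noise.
print(respawn_decision(True, False, True, False))
```

Keeping the decision pure (no kubectl calls) makes each branch easy to test separately from the cluster.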

View file

@ -1,5 +1,5 @@
{
"_comment": "Addon -> highest k8s minor each addon version supports. The preflight compat-gate (compat-gate.py) reads the RUNNING version of each addon and blocks a k8s upgrade whose target minor exceeds what that running version supports — so the chain auto-halts + alerts instead of breaking on an unsupported addon. Keep current; sources are the addons' own k8s compat matrices (last refreshed 2026-06-19 for the 1.34->1.36 catch-up). max_k8s keys are addon-version floors (major.minor); value is the highest k8s minor that floor supports.",
"_comment": "Addon -> highest k8s minor each addon version supports. The preflight compat-gate (compat-gate.py) reads the RUNNING version of each addon and blocks a k8s upgrade whose target minor exceeds what that running version supports — so the chain auto-halts + alerts instead of breaking on an unsupported addon. Keep current; sources are the addons' own k8s compat matrices (last refreshed 2026-06-19 for the 1.34->1.36 catch-up). max_k8s keys are addon-version floors (major.minor); value is the highest k8s minor that floor supports. An addon entry may also set \"pinned\": true (+ \"pin_reason\") to mark it deliberately held: the gate classifies its block as PINNED/held (quiet — no alert, nightly report only) even if a supporting version exists, for upgrades coupled to other work we're not ready for (e.g. gpu-operator's NVIDIA-driver/Ubuntu coupling). A block with NO supporting version in the matrix is WAITING (also quiet); a block a newer matrix version would clear is ACTIONABLE (alerts).",
"addons": [
{
"name": "calico",
@ -48,7 +48,9 @@
"max_k8s": {
"25.10": "1.35",
"26.3": "1.36"
}
},
"pinned": true,
"pin_reason": "26.3 needs a newer NVIDIA driver image + Ubuntu/kernel; held until the driver/OS path is ready. Unpin = delete pinned + pin_reason."
}
],
"containerd_min": {

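As a rough illustration of how a `max_k8s` map is read: the keys are addon-version floors, and the running version takes the ceiling of the highest floor it meets. The sketch below assumes versions compare as (major, minor) integer tuples; `parse_minor` and `ceiling_for` are hypothetical helper names, not the gate's own functions.

```python
def parse_minor(version):
    """'v25.10.0' -> (25, 10); returns None if the string does not parse."""
    parts = version.lstrip("v").split(".")
    try:
        return int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        return None


def ceiling_for(max_k8s, running_ver):
    """Highest k8s minor the running addon version supports, or None
    when the version is unreadable or below every floor in the map."""
    run = parse_minor(running_ver)
    if run is None:
        return None
    # Walk floors from newest to oldest; the first one we meet applies.
    for floor in sorted(max_k8s, key=parse_minor, reverse=True):
        if run >= parse_minor(floor):
            return max_k8s[floor]
    return None


gpu = {"25.10": "1.35", "26.3": "1.36"}  # values taken from the matrix hunk above
print(ceiling_for(gpu, "25.10.0"))
```

Comparing tuples rather than strings matters here: as text, "25.10" sorts below "25.9", which would pick the wrong floor.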
View file

@ -14,9 +14,20 @@ classes of blocker:
3. containerd: every node's containerd >= the target's floor, if the matrix
declares one (e.g. the 1.7.x -> k8s 1.37 cliff)
Each reason line is tagged with its class so the caller can act differently:
[ACTIONABLE] a newer addon version (present in the matrix) supports the
target; upgrading it clears the block. Also covers removed-API
/ containerd blocks and the unreadable-version fail-safe.
[WAITING] no released addon version supports the target yet; only an
upstream release can clear it (e.g. kyverno/ESO behind a new k8s).
[PINNED] a supporting version exists but the addon is deliberately held
(matrix `pinned: true`, e.g. gpu-operator's driver/OS coupling).
Exit 0 = safe, proceed.
Exit 2 = BLOCKED: prints one human reason per line (caller pushes
k8s_upgrade_blocked=1, Slacks the reasons, and halts the chain).
Exit 2 = BLOCKED, actionable: >=1 blocker, none held. Caller pushes
k8s_upgrade_blocked=1 (-> K8sUpgradeBlocked alert) and halts.
Exit 4 = HELD: >=1 waiting-upstream/pinned blocker (held wins over actionable).
Caller pushes k8s_upgrade_held=1 (no alert; nightly report only) and halts.
Exit 3 = the gate itself errored; caller treats as a block (fail safe).
Read-only: kubectl get + one Prometheus query. No mutations. PROM is overridable
@ -62,6 +73,20 @@ def running_minor():
return min(minors) if minors else None
def _addon_resolution(a, tgt, running_ver):
    """For a BLOCKING addon, decide whether a newer matrix version would clear
    the block. Returns ("actionable", hint) when some version key has
    max_k8s >= target AND is newer than the running version (upgrading it clears
    the block); otherwise ("waiting", hint): nothing released supports the
    target yet, so only an upstream release can clear it."""
    sufficient = [floor for floor, mk in a["max_k8s"].items()
                  if minor(mk) and minor(mk) >= tgt and minor(floor) > minor(running_ver)]
    if sufficient:
        best = min(sufficient, key=minor)  # smallest sufficient upgrade
        return "actionable", f"upgrade {a['name']} to >= {best}"
    return "waiting", f"no released {a['name']} version supports k8s {tgt[0]}.{tgt[1]} yet"
def check_addons(matrix, tgt, running):
# A target at or below the RUNNING minor (a patch, or a same/lower minor)
# crosses into no new k8s minor, so every installed addon is already
@ -77,25 +102,36 @@ def check_addons(matrix, tgt, running):
"-o", "jsonpath={.spec.template.spec.containers[*].image}"])
m = re.search(a["image_re"], img or "")
if not m:
# Fail safe: if we can't read the running version, don't upgrade blind.
reasons.append(f"addon {a['name']}: could not read running version "
f"(img='{img or 'not found'}') — refusing to upgrade blind")
# Fail safe: can't read the running version → block; a human must
# look (ACTIONABLE), never upgrade blind.
reasons.append(f"[ACTIONABLE] addon {a['name']}: could not read running "
f"version (img='{img or 'not found'}') — refusing to upgrade blind")
continue
running = m.group(1) # e.g. "3.26"
running_ver = m.group(1) # e.g. "3.26"
# max_k8s maps an addon-version floor -> highest supported k8s minor.
# Pick the highest floor that is <= the running version.
max_k8s = None
for floor, mk in sorted(a["max_k8s"].items(), key=lambda kv: minor(kv[0]), reverse=True):
if minor(running) >= minor(floor):
if minor(running_ver) >= minor(floor):
max_k8s = mk
break
if max_k8s is None:
reasons.append(f"addon {a['name']} v{running}: below the lowest version "
f"in the compat matrix — unknown k8s support")
reasons.append(f"[ACTIONABLE] addon {a['name']} v{running_ver}: below the lowest "
f"version in the compat matrix — unknown k8s support")
continue
if tgt > minor(max_k8s):
reasons.append(f"addon {a['name']} v{running} supports k8s <= {max_k8s}; "
f"target {tgt[0]}.{tgt[1]} exceeds it — upgrade {a['name']} first")
base = (f"addon {a['name']} v{running_ver} supports k8s <= {max_k8s}; "
f"target {tgt[0]}.{tgt[1]} exceeds it")
# A deliberately-pinned addon is HELD even if a newer version exists
# (e.g. gpu-operator 26.3 supports 1.36 but its driver/OS coupling
# means we don't take it yet) — the pin overrides actionable.
if a.get("pinned"):
why = a.get("pin_reason", "deliberately pinned")
reasons.append(f"[PINNED] {base} — pinned ({why}); holding")
else:
kind, hint = _addon_resolution(a, tgt, running_ver)
tag = "ACTIONABLE" if kind == "actionable" else "WAITING"
reasons.append(f"[{tag}] {base} — {hint}")
return reasons
@ -109,11 +145,11 @@ def check_removed_apis(tgt):
rr = lbl.get("removed_release", "")
if rr and minor(rr) and tgt >= minor(rr):
g = lbl.get("group") or "core"
reasons.append(f"deprecated API {g}/{lbl.get('version')} "
reasons.append(f"[ACTIONABLE] deprecated API {g}/{lbl.get('version')} "
f"{lbl.get('resource')} is in use and is removed in "
f"k8s {rr} (target {tgt[0]}.{tgt[1]}) — migrate callers first")
except Exception as e:
reasons.append(f"removed-API check could not query Prometheus ({e}) — "
reasons.append(f"[ACTIONABLE] removed-API check could not query Prometheus ({e}) — "
f"refusing to upgrade blind")
return reasons
@ -132,11 +168,28 @@ def check_containerd(matrix, tgt):
name, _, ver = line.partition(" ")
cv = ver.replace("containerd://", "")
if minor(cv) and minor(cv) < minor(floor):
reasons.append(f"node {name} containerd {cv} < required {floor} "
reasons.append(f"[ACTIONABLE] node {name} containerd {cv} < required {floor} "
f"for k8s {tgt[0]}.{tgt[1]} — bump containerd first")
return reasons
def held_reason(r):
    """True for a blocker the cluster cannot act on now: no released version
    supports the target (WAITING) or the addon is deliberately pinned (PINNED).
    These are quiet (no alert): only an upstream release / a manual unpin clears
    them, so a nightly 'needs attention' alert would be crying wolf."""
    return r.startswith("[WAITING]") or r.startswith("[PINNED]")

def exit_code(reasons):
    """Map reasons to the gate verdict: 0 safe · 2 actionable block · 4 held.
    Held WINS over actionable on a mix: if anything is waiting/pinned, the target
    can't proceed yet, so acting on the actionable blockers would be premature."""
    if not reasons:
        return 0
    return 4 if any(held_reason(r) for r in reasons) else 2
def main():
if len(sys.argv) < 2:
print("usage: compat-gate.py <target-k8s-version> (matrix JSON on stdin)")
@ -158,9 +211,9 @@ def main():
if reasons:
for r in reasons:
print(r)
sys.exit(2)
else:
print(f"compat-gate OK: cluster is safe to upgrade to {sys.argv[1]}")
sys.exit(0)
sys.exit(exit_code(reasons))
if __name__ == "__main__":
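The exit codes documented in the module docstring imply a small dispatch on the caller's side. The caller itself is not shown in this diff, so the following is only a sketch under assumptions: `verdict_for` is a hypothetical name, and mapping exit 3 (or any unexpected status) to the blocked metric is an assumed fail-safe, not something the source confirms.

```python
def verdict_for(exit_status):
    """Map a compat-gate exit status to (proceed, metric, alerts).

    metric is the gauge the caller would set to 1, or None when the
    upgrade may proceed. Illustrative only; the real caller is not
    part of this diff.
    """
    if exit_status == 0:
        return True, None, False                # safe, proceed
    if exit_status == 4:
        return False, "k8s_upgrade_held", False  # quiet: nightly report only
    # 2 = actionable block. 3 (gate error) and anything unexpected are
    # ASSUMED to fail safe as a block here.
    return False, "k8s_upgrade_blocked", True


for status in (0, 2, 3, 4):
    print(status, verdict_for(status))
```

Treating every non-zero, non-4 status as a block keeps an unexpected failure from being mistaken for permission to upgrade.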

View file

@ -69,6 +69,29 @@ def fmt_age(seconds):
return f"{seconds / 86400:.1f}d ago"
def _render_reasons(blocker_reasons):
    """Group compat-gate reason lines by their [ACTIONABLE]/[WAITING]/[PINNED]
    tag into labelled sections, stripping the tag from each bullet. Untagged
    lines (older reason format) fall back to a generic 'Blockers' list. PURE.
    Returns a list of message lines."""
    lines = [r.strip() for r in (blocker_reasons or "").splitlines() if r.strip()]
    out, shown = [], set()
    for title, tag in (("Action needed", "[ACTIONABLE]"),
                       ("Waiting on upstream", "[WAITING]"),
                       ("Pinned (held by us)", "[PINNED]")):
        sub = [l for l in lines if l.startswith(tag)]
        if sub:
            out.append(f"{title}:")
            for l in sub:
                shown.add(l)
                out.append(f"{l[len(tag):].strip()}")
    rest = [l for l in lines if l not in shown]
    if rest:
        out.append("Blockers:")
        out.extend(f"{l}" for l in rest)
    return out
def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
"""Build the Slack message text from gathered facts. PURE.
@ -98,6 +121,7 @@ def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
held = any(val == 1 for _, val in select(metrics, "k8s_upgrade_held"))
if avail:
lbl = avail[0][0]
@ -105,7 +129,12 @@ def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
kind = lbl.get("kind", "?")
tgt_line = f"Detected target: *{target}* ({kind})"
if blocked:
headline = f"🔴 BLOCKED — compat gate refused {target}"
# actionable block — an addon upgrade would clear it (K8sUpgradeBlocked fired)
headline = f"🔴 BLOCKED (action needed) — {target}"
elif held:
# waiting on upstream and/or a pinned addon — nothing to do but wait;
# intentionally NO alert, this nightly line is the only signal
headline = f"⏸️ HELD — {target} not yet upgradable"
elif len(versions) == 1 and target == versions[0]:
headline = f"🟢 UPGRADED — all nodes now on {target}"
else:
@ -120,12 +149,8 @@ def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
msg = [f"*[k8s-upgrade nightly]* {headline}", node_line, run_line, tgt_line]
if blocked and blocker_reasons:
msg.append("Blockers (live):")
for r in blocker_reasons.splitlines():
r = r.strip()
if r:
msg.append(f"{r}")
if (blocked or held) and blocker_reasons:
msg.extend(_render_reasons(blocker_reasons))
if jobs:
msg.append("Chain jobs (recent):")
@ -213,7 +238,8 @@ def main():
avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
reasons = get_blocker_reasons(avail[0][0].get("target", "")) if (avail and blocked) else None
held = any(val == 1 for _, val in select(metrics, "k8s_upgrade_held"))
reasons = get_blocker_reasons(avail[0][0].get("target", "")) if (avail and (blocked or held)) else None
msg = compose_report(now_ts, nodes, metrics, reasons, jobs)
post_slack(msg)

View file

@ -95,3 +95,121 @@ def test_running_minor_from_kubectl(monkeypatch):
# oldest kubelet wins (mirrors the detector): node2 on 1.33 is the floor.
monkeypatch.setattr(cg, "kget", lambda args: "v1.34.9\nv1.33.5\nv1.34.9")
assert cg.running_minor() == (1, 33)
# --- block classification: actionable / waiting-upstream / pinned ----------
# A block is ACTIONABLE if a newer addon version in the matrix supports the
# target (we can upgrade to clear it), WAITING if no released version supports
# the target yet (only upstream can clear it), or PINNED if a version exists but
# we deliberately hold the addon. Held (waiting|pinned) is quiet; actionable
# alerts.
KYVERNO_MATRIX = {
"addons": [{
"name": "kyverno",
"namespace": "kyverno",
"kind": "deployment",
"resource": "kyverno-admission-controller",
"image_re": r"kyverno:v(\d+\.\d+)",
"max_k8s": {"1.16": "1.34", "1.18": "1.35"},
}]
}
GPU_MATRIX = {
"addons": [{
"name": "gpu-operator",
"namespace": "nvidia",
"kind": "deployment",
"resource": "gpu-operator",
"image_re": r"gpu-operator:v(\d+\.\d+)",
"max_k8s": {"25.10": "1.35", "26.3": "1.36"},
"pinned": True,
"pin_reason": "needs newer NVIDIA driver + Ubuntu release",
}]
}
def test_actionable_when_higher_version_supports_target(monkeypatch):
# calico 3.30 (ceiling 1.35), target 1.36, matrix has 3.32 -> 1.36:
# upgrading calico WOULD clear it -> ACTIONABLE, with a remediation hint.
_img(monkeypatch, "quay.io/calico/node:v3.30.7")
reasons = cg.check_addons(CALICO_MATRIX, (1, 36), (1, 35))
assert len(reasons) == 1, reasons
assert reasons[0].startswith("[ACTIONABLE]"), reasons
assert "3.32" in reasons[0] and "calico" in reasons[0]
def test_waiting_when_no_version_supports_target(monkeypatch):
# kyverno 1.18 is the matrix ceiling (k8s 1.35); target 1.36 has NO
# supporting version -> WAITING on upstream (nothing to upgrade to).
_img(monkeypatch, "kyverno/kyverno:v1.18.1")
reasons = cg.check_addons(KYVERNO_MATRIX, (1, 36), (1, 35))
assert len(reasons) == 1, reasons
assert reasons[0].startswith("[WAITING]"), reasons
assert "kyverno" in reasons[0]
def test_pinned_addon_is_held_not_actionable(monkeypatch):
# gpu-operator 25.10, target 1.36; 26.3 supports 1.36 BUT the entry is
# pinned -> classified PINNED (held), never ACTIONABLE.
_img(monkeypatch, "nvcr.io/nvidia/gpu-operator:v25.10.0")
reasons = cg.check_addons(GPU_MATRIX, (1, 36), (1, 35))
assert len(reasons) == 1, reasons
assert reasons[0].startswith("[PINNED]"), reasons
assert "gpu-operator" in reasons[0]
def test_unreadable_addon_tagged_actionable(monkeypatch):
# fail-safe block on an unreadable image is ACTIONABLE (a human must look).
_img(monkeypatch, "")
reasons = cg.check_addons(ESO_MATRIX, (1, 35), (1, 34))
assert reasons and reasons[0].startswith("[ACTIONABLE]"), reasons
def test_existing_reasons_are_tagged(monkeypatch):
# the legacy "ceiling below target, newer version exists" case is ACTIONABLE.
_img(monkeypatch, "external-secrets/external-secrets:v0.12.1")
reasons = cg.check_addons(ESO_MATRIX, (1, 35), (1, 34))
assert reasons[0].startswith("[ACTIONABLE]"), reasons
def test_held_reason_classifier():
assert cg.held_reason("[WAITING] x")
assert cg.held_reason("[PINNED] x")
assert not cg.held_reason("[ACTIONABLE] x")
assert not cg.held_reason("untagged")
def test_exit_code_mapping():
assert cg.exit_code([]) == 0
assert cg.exit_code(["[ACTIONABLE] x"]) == 2
assert cg.exit_code(["[WAITING] x"]) == 4
assert cg.exit_code(["[PINNED] x"]) == 4
# held wins on a mix: an upstream/pinned wait can't be cleared by acting now
assert cg.exit_code(["[ACTIONABLE] x", "[WAITING] y"]) == 4
def test_real_matrix_136_is_held(monkeypatch):
"""Regression guard on the SHIPPED addon-compat.json: at today's running
versions a 1.36 jump must be HELD (exit 4): calico ACTIONABLE (3.32 in the
matrix), ESO+kyverno WAITING (no 1.36 release), gpu-operator PINNED. Catches
a matrix edit that silently turns the quiet held state into a nightly alert."""
import json as _json
matrix = _json.loads((HERE / "addon-compat.json").read_text())
running_imgs = {
"calico-system": "quay.io/calico/node:v3.30.7",
"external-secrets": "ghcr.io/external-secrets/external-secrets:v2.6.0",
"kyverno": "ghcr.io/kyverno/kyverno:v1.18.1",
"nvidia": "nvcr.io/nvidia/gpu-operator:v25.10.0",
}
def fake_kget(args):
ns = args[args.index("-n") + 1] if "-n" in args else ""
return running_imgs.get(ns, "")
monkeypatch.setattr(cg, "kget", fake_kget)
reasons = cg.check_addons(matrix, (1, 36), (1, 35))
pick = lambda name: next(r for r in reasons if name in r)
assert pick("calico").startswith("[ACTIONABLE]"), reasons
assert pick("external-secrets").startswith("[WAITING]"), reasons
assert pick("kyverno").startswith("[WAITING]"), reasons
assert pick("gpu-operator").startswith("[PINNED]"), reasons
assert cg.exit_code(reasons) == 4 # held wins

View file

@ -79,3 +79,41 @@ def test_compose_includes_recent_jobs():
jobs = [{"name": "k8s-upgrade-preflight-1-35-6", "status": "Failed", "age_s": 3600}]
out = nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, "x", jobs)
assert "k8s-upgrade-preflight-1-35-6: Failed" in out
# --- held (waiting-upstream / pinned) vs actionable-blocked rendering -------
METRICS_HELD = f"""# TYPE k8s_upgrade_available gauge
k8s_upgrade_available{{instance="",job="k8s-version-check",kind="minor",running="1.35.6",target="1.36.2"}} 1
k8s_upgrade_held{{instance="",job="k8s-version-upgrade"}} 1
k8s_upgrade_blocked{{instance="",job="k8s-version-upgrade"}} 0
k8s_version_check_last_run_timestamp{{instance="",job="k8s-version-check"}} {LAST_RUN}
"""
NODES_135 = [(f"k8s-node{i}", "v1.35.6") for i in range(7)]
def test_compose_held_headline_and_grouped_reasons():
m = nr.parse_metrics(METRICS_HELD)
reasons = (
"[WAITING] addon kyverno v1.18 supports k8s <= 1.35; target 1.36 exceeds it — no released kyverno version supports k8s 1.36 yet\n"
"[PINNED] addon gpu-operator v25.10 supports k8s <= 1.35; target 1.36 exceeds it — pinned (driver/OS); holding\n"
"[ACTIONABLE] addon calico v3.30 supports k8s <= 1.35; target 1.36 exceeds it — upgrade calico to >= 3.32"
)
out = nr.compose_report(LAST_RUN + 30000, NODES_135, m, reasons, [])
# held headline, NOT a red actionable block
assert "⏸️ HELD" in out and "1.36.2" in out
assert "🔴 BLOCKED" not in out
# grouped by class
assert "Waiting on upstream" in out and "kyverno" in out
assert "Pinned" in out and "gpu-operator" in out
# the lone actionable piece is still listed so eventual scope is visible
assert "calico" in out
# tags are stripped from the rendered bullets (no raw "[WAITING]")
assert "[WAITING]" not in out
def test_compose_blocked_groups_actionable():
m = nr.parse_metrics(METRICS_BLOCKED) # blocked=1
reasons = "[ACTIONABLE] addon calico v3.30 supports k8s <= 1.35; target 1.36 exceeds it — upgrade calico to >= 3.32"
out = nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, reasons, [])
assert "🔴 BLOCKED" in out
assert "Action needed" in out and "calico" in out

Some files were not shown because too many files have changed in this diff