diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index 9c873a07..4cd12d6c 100755
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -16,6 +16,7 @@
 **ALL infrastructure changes MUST go through Terraform/Terragrunt.** Never use `kubectl apply/edit/patch/set`, `helm install/upgrade`, or any manual cluster mutation as the final state.
 
 - **No exceptions for "quick fixes"** — even one-line changes must be in `.tf` files and applied via `scripts/tg apply`
+- **Apply locally OR let CI do it — but ALWAYS commit.** You don't have to wait for CI: with apply access you MAY run the apply yourself (`scripts/tg apply ` / `homelab tf apply `), but **from the main checkout, never a worktree** (git-crypt'd `*.tfvars` come through as ciphertext under the worktree filter-bypass, so a worktree apply reads garbage). **Every applied change MUST be committed and pushed to `master` the same session** — the repo is the source of truth, so applied-but-uncommitted HCL is drift that the next CI apply / daily drift-detection will try to revert. Order either way: apply locally then commit + push (CI's changed-stack apply then no-ops), or commit + push and let CI apply. Never apply an uncommitted edit; never leave a committed change unapplied.
 - **kubectl is for read-only operations and temporary debugging only** (get, describe, logs, exec, port-forward)
 - **If a resource isn't in Terraform yet**, evaluate whether it can be added before making manual changes. If manual change is unavoidable (e.g., emergency), document it immediately and create the Terraform resource in the same session
 - **kubectl scale/patch during migrations is acceptable** as a transient step, but the final state must be in Terraform and applied via `scripts/tg apply`
@@ -233,7 +234,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
 - **External monitoring**: `[External] ` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
 - Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).
 - **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
-- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → `#security` Slack). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI ''` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
+- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → posts to `#alerts` via the `slack-security` receiver, which keeps its `[SECURITY]` styling; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI ''` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
 
 ## Security Posture (Wave 1 — locked 2026-05-18)
 
@@ -241,9 +242,10 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se
 
 - **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming.
 - **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.)
-- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
+- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts` (it keeps its `[SECURITY/]` title styling so security-lane alerts stand out). Severity labels carried in the alert (critical/warning/info). No paging. The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`.
 - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
-- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
+- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
+- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
 - **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
 
 ## Storage & Backup Architecture
diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md
index cd7b5274..ca1ee262 100644
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@@ -13,6 +13,8 @@
 | authentik | Identity provider (SSO) | authentik |
 | cloudflared | Cloudflare tunnel | cloudflared |
 | authelia | Auth middleware (may be merged into ebooks or removed) | platform |
+| goldmane | Calico 3.30 OSS flow aggregator (`goldmane.calico-system.svc:7443`, gRPC/mTLS). Stamps identity (ns/pod/workload/labels + allow-deny) on every flow from Felix into a ~60-min in-memory ring buffer — no etcd/API writes. East-west "who-talks-to-whom" source (ADR-0014). Enabled via operator CR (`kubectl_manifest.goldmane`). | calico |
+| whisker | Calico 3.30 OSS live flow-observability UI (`whisker.calico-system.svc:8081`) at `whisker.viktorbarzin.me` (Authentik-gated, `auth=required` — no own login; additive NP ORs Traefik past the operator default-deny). ~60-min live view of Goldmane flows, NOT history. Enabled via operator CR (`kubectl_manifest.whisker`). | calico |
 | monitoring | Prometheus/Grafana/Loki stack | monitoring |
 
 ## Storage & Security (Tier: cluster)
@@ -37,6 +39,7 @@
 ## Active Use
 | Service | Description | Stack |
 |---------|-------------|-------|
+| goldmane-edge-aggregator | Durable who-talks-to-whom audit trail (ADR-0014 / #58). Go service: `aggregate` Deployment streams Goldmane's gRPC `Flows.Stream` (mTLS) and upserts the low-cardinality namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`) into CNPG DB `goldmane_edges`; `goldmane-edges-digest` CronJob posts first-seen edges daily to `#alerts` (the `#security` channel was abandoned 2026-06-25 — shared webhook's app isn't a member of it). mTLS client cert REUSES the operator's `whisker-backend-key-pair` (re-apply if rotated). Tier-4-aux. Image `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (private). Runbook: [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md). | goldmane-edge-aggregator |
 | mailserver | Email (docker-mailserver) | mailserver |
 | shadowsocks | Proxy | shadowsocks |
 | webhook_handler | Webhook processing | webhook_handler |
@@ -161,3 +164,4 @@ procedures) are documented in `infra/docs/runbooks/`:
 | pfSense + Unbound DNS | [pfsense-unbound.md](../../docs/runbooks/pfsense-unbound.md) |
 | Mailserver PROXY-protocol / HAProxy | [mailserver-pfsense-haproxy.md](../../docs/runbooks/mailserver-pfsense-haproxy.md) |
 | Technitium apply flow | [technitium-apply.md](../../docs/runbooks/technitium-apply.md) |
+| Goldmane flow trail (east-west who-talks-to-whom) | [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md) |
diff --git a/.claude/skills/home-assistant/SKILL.md b/.claude/skills/home-assistant/SKILL.md
index 61aaa6af..ab07a27f 100644
--- a/.claude/skills/home-assistant/SKILL.md
+++ b/.claude/skills/home-assistant/SKILL.md
@@ -11,8 +11,8 @@ description: |
   There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
   Always use Home Assistant for smart home control.
 author: Claude Code
-version: 2.0.0
-date: 2026-02-07
+version: 2.1.0
+date: 2026-06-24
 ---
 
 # Home Assistant Control
@@ -395,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
 ## ha-london Knowledge Map
 
 ### Overview
-- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
+- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
 - **Location**: London, UK
-- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
-- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
-- **Config path**: `/config/` (requires `sudo` for file access)
+- **Platform**: Raspberry Pi 4, HA OS
+- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
+- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
+- **Config path**: `/config/`
 - **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
 - **Zone**: London (home)
 
+### Dashboards (redesigned 2026-06-24)
+**Glossary** (HA terms — keep distinct):
+- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
+- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
+- **Card** = a widget inside a view.
+
+- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
+  - **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
+  - **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
+- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.)
+- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
+
 ### Key Systems
 
 #### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@@ -424,10 +437,15 @@ Named plugs with power/energy tracking:
 - PM1.0/2.5/4.0/10 particulate sensors
 - VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
 
-#### 3. Cowboy E-Bike
-- `sensor.bike_state_of_charge`: Battery %
-- `sensor.bike_total_distance`: Total km
-- `sensor.bike_total_co2_saved`: CO2 saved (grams)
+#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
+Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
+- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
+- `sensor.classic_performance_remaining_range`: Range km
+- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
+- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
+- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
+- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
+- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
 
 #### 4. Uptime Monitoring (UptimeRobot)
 - `sensor.blog`: blog uptime
@@ -446,12 +464,17 @@ Named plugs with power/energy tracking:
 - Scripts: `script.start_netflix`, `script.start_stremio`
 - Scene: `scene.night` (turns off Livia + Michelle plugs)
 
-### Custom Components
-- **cowboy**: Cowboy e-bike integration (HACS)
-- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
+### Custom Components (HACS integrations)
+- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
+- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
+
+### HACS frontend cards (plugins)
+- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
 
 ### Integrations
-ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
+ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
+- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
+- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.

 ### AI / Voice Assistants
 - 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@@ -466,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL
 - Anca arrival/departure notifications
 - Night scene: turns off Livia + Michelle
 
-### Docker Setup
-```bash
-docker run -d --name homeassistant --privileged \
-  -e TZ=Europe/London \
-  -v /home/pi/docker/homeAssistant:/config \
-  -v /run/dbus:/run/dbus:ro \
-  --network=host --restart=unless-stopped \
-  homeassistant/home-assistant:2025.9
-```
+### Platform (HAOS — ignore any legacy `docker run` snippet)
+ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker).
 
 ### SSH Access
 ```bash
diff --git a/.woodpecker/default.yml b/.woodpecker/default.yml
index ef94ccee..7ec915a3 100644
--- a/.woodpecker/default.yml
+++ b/.woodpecker/default.yml
@@ -213,6 +213,12 @@ steps:
         if [ -s .platform_apply ]; then
           echo "=== Applying platform stacks (serial, locked) ==="
           while read -r stack; do
+            # Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role
+            # lacks Vault-admin perms (sys/mounts + sys/policies/acl), so a CI
+            # apply always 403s and fails the pipeline. Kept in PLATFORM_STACKS
+            # (so the app-stack detector still excludes it) but skipped here.
+            # (2026-06-27 — see docs/architecture/ci-cd.md)
+            if [ "$stack" = "vault" ]; then echo "[vault] SKIPPED (Tier-0, human-applied via OIDC)"; continue; fi
             echo "[$stack] Starting apply..."
             set +e
             OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
diff --git a/.woodpecker/drift-detection.yml b/.woodpecker/drift-detection.yml
index b2e303ff..b2a552f4 100644
--- a/.woodpecker/drift-detection.yml
+++ b/.woodpecker/drift-detection.yml
@@ -85,6 +85,13 @@ steps:
           stack=$(basename "$stack_dir")
           [ -f "$stack_dir/terragrunt.hcl" ] || continue
 
+          # Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role lacks
+          # Vault-admin perms (sys/mounts + sys/policies/acl), so `terragrunt plan`
+          # on it ERRORs (detailed-exitcode 1) and fails the whole nightly drift
+          # run. Skip it — drift on Tier-0 vault is caught at human apply time.
+          # (2026-06-27)
+          [ "$stack" = "vault" ] && continue
+
           echo -n "[$stack] planning... "
           OUTPUT=$(cd "$stack_dir" && terragrunt plan -detailed-exitcode -input=false 2>&1)
           EXIT=$?
diff --git a/AGENTS.md b/AGENTS.md
index 7fbc838d..4e3ea2de 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -273,8 +273,11 @@ To land a finished change from such a clone:
    Slack audit feed; a no-op CI apply on a docs-only commit is harmless.
 4. Leave the clone on clean `master` so auto-refresh keeps working.
 5. Tell the user in plain language what happened. Stack changes are
-   auto-applied by CI — verify the live result with the user's read-only
-   kubectl before saying "it's live".
+   auto-applied by CI on push — or, with apply access, applied locally yourself
+   (`scripts/tg apply`, from the main checkout, not a worktree); either path is
+   fine, but the change must always be committed here, never applied
+   uncommitted. Verify the live result with the user's read-only kubectl before
+   saying "it's live".
 
 If a push to `master` is rejected by branch protection (user not on the
 whitelist — e.g. new users before Viktor grants it), fall back to a
diff --git a/CONTEXT.md b/CONTEXT.md
index 2b9bb8b3..548fa40d 100644
--- a/CONTEXT.md
+++ b/CONTEXT.md
@@ -125,7 +125,7 @@ How a **Service** is named in flow/audit data — its **namespace** is the prima
 _Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
 
 **Goldmane / Whisker**:
-Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). Durable history requires emitting Goldmane flows to **Loki**; the in-memory buffer alone is not an audit trail.
+Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#alerts` digest (the `#security` channel was abandoned 2026-06-25). As-built: `docs/runbooks/goldmane-flow-trail.md`.
 _Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
 
 ### Storage
diff --git a/cli/cmd_vault.go b/cli/cmd_vault.go
index bf270886..6d35ba76 100644
--- a/cli/cmd_vault.go
+++ b/cli/cmd_vault.go
@@ -15,7 +15,7 @@ import (
 // Identity is the kernel UID; per-user creds live in that user's isolated Vault
 // path (secret/workstation/claude-users/) read via their scoped token, and
 // decryption is done by the official `bw` CLI. See
-// docs/superpowers/specs/2026-06-24-homelab-vault-design.md.
+// docs/runbooks/homelab-vault-onboarding.md.
 func vaultCommands() []Command {
 	return []Command{
 		{Path: []string{"vault", "setup"}, Tier: TierWrite,
@@ -51,7 +51,7 @@ func vaultHelp() string {
   homelab vault lock    lock / log out the local bw session
 
 Creds live only in your own Vault path; the admin never sees them. Identity is
-your unix UID. Security model: docs/superpowers/specs/2026-06-24-homelab-vault-design.md
+your unix UID. Security model: docs/runbooks/homelab-vault-onboarding.md
 (note: anything running as your user can decrypt your vault — the accepted no-HITL trade).
 `
 }
@@ -128,6 +128,53 @@ func loadCreds(run cmdRunner, user string) (vwCreds, error) {
 var vaultCurrentUser = func() string { return os.Getenv("USER") }
 var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) }
 
+// scopedTokenPath is where claude-auth-sync keeps the user's scoped Vault token.
+// MUST match CAS_VAULT_TOKEN_FILE in scripts/workstation/claude-auth-sync.sh.
+func scopedTokenPath(home string) string {
+	return home + "/.config/claude-auth-sync/vault-token"
+}
+
+// vaultTokenSource decides which Vault token the `vault` child processes should
+// use. Precedence: an explicit $VAULT_TOKEN, then a native ~/.vault-token (what
+// admins carry), then the per-user scoped token claude-auth-sync maintains at
+// scopedTokenPath(HOME) (policy workstation-claude-, which grants exactly
+// the create/read/update this tool needs on the user's own path). Returns the
+// token to export — "" when nothing must be exported because the vault CLI reads
+// the ambient credential natively — plus a source tag for tests/logging.
+func vaultTokenSource(envToken string, haveVaultTokenFile bool, scopedToken string) (token, source string) {
+	switch {
+	case envToken != "":
+		return "", "env"
+	case haveVaultTokenFile:
+		return "", "file"
+	default:
+		if t := strings.TrimSpace(scopedToken); t != "" {
+			return t, "scoped"
+		}
+		return "", "none"
+	}
+}
+
+// fileNonEmpty reports whether path exists and has content.
+func fileNonEmpty(path string) bool {
+	fi, err := os.Stat(path)
+	return err == nil && fi.Size() > 0
+}
+
+// ensureVaultToken wires vaultTokenSource to the real environment: when the user
+// has no ambient Vault credential, it exports the claude-auth-sync scoped token
+// so the `vault` child processes authenticate as workstation-claude-. It
+// is idempotent and safe for admins, whose explicit $VAULT_TOKEN / ~/.vault-token
+// take precedence and are left untouched.
+func ensureVaultToken() {
+	home := os.Getenv("HOME")
+	scoped, _ := os.ReadFile(scopedTokenPath(home))
+	tok, src := vaultTokenSource(os.Getenv("VAULT_TOKEN"), home != "" && fileNonEmpty(home+"/.vault-token"), string(scoped))
+	if src == "scoped" {
+		os.Setenv("VAULT_TOKEN", tok)
+	}
+}
+
 // bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately
 // do NOT inherit the full parent env (keeps stray secrets out of the child).
 func bwBaseEnv(appdata string) []string {
@@ -157,10 +204,10 @@ func bwSecretEnv(appdata string, c vwCreds, session string) []string {
 	return env
 }
 
-func bwLoginArgs() []string { return []string{"login", "--apikey"} }
-func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} }
+func bwLoginArgs() []string { return []string{"login", "--apikey"} }
+func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} }
 func bwGetArgs(field, name string) []string { return []string{"get", field, name} }
-func bwStatusArgs() []string { return []string{"status"} }
+func bwStatusArgs() []string { return []string{"status"} }
 
 // bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is
 // required. Unparseable/empty output → true (safer to attempt login).
@@ -443,6 +490,7 @@ func runList(run cmdRunner, user, uid, search string) ([]string, error) {
 
 func vaultList(args []string) error {
 	hardenProcess()
+	ensureVaultToken()
 	search := ""
 	for i := 0; i < len(args); i++ {
 		if args[i] == "--search" && i+1 < len(args) {
@@ -477,6 +525,7 @@ func vaultSearch(args []string) error {
 
 func vaultCode(args []string) error {
 	hardenProcess()
+	ensureVaultToken()
 	if len(args) == 0 {
 		return fmt.Errorf("usage: homelab vault code ")
 	}
@@ -516,6 +565,7 @@ func statusSummary(run cmdRunner, user, uid string) string {
 
 func vaultStatus(args []string) error {
 	hardenProcess()
+	ensureVaultToken()
 	uid := vaultCurrentUID()
 	unlock, err := withUserLock(uid)
 	if err != nil {
@@ -542,32 +592,61 @@ func vaultLock(args []string) error {
 	return nil // lock/logout best-effort; never error the caller
 }
 
-// vaultPatchPublicArgs writes the non-secret identifiers via argv. Neither the
+// kvWriteVerb selects the KV write semantics. merge=true → `kv patch -method=rw`
+// (read-modify-write: needs only read+update, NOT the `patch` capability the
+// scoped workstation-claude- policy lacks, and preserves co-located keys
+// such as claude-auth-sync's claude_ai_oauth_json). merge=false → `kv put`
+// (creates the path on first use, before any sibling keys exist).
+func kvWriteVerb(merge bool) []string {
+	if merge {
+		return []string{"kv", "patch", "-method=rw"}
+	}
+	return []string{"kv", "put"}
+}
+
+// vaultWritePublicArgs writes the non-secret identifiers via argv. Neither the
 // email nor the API client_id is a usable credential on its own.
-func vaultPatchPublicArgs(user, email, clientID string) []string {
-	return []string{"kv", "patch", vwCredsPath(user),
-		"vaultwarden_email=" + email,
-		"vaultwarden_client_id=" + clientID,
-	}
+func vaultWritePublicArgs(merge bool, user, email, clientID string) []string {
+	return append(kvWriteVerb(merge), vwCredsPath(user),
+		"vaultwarden_email="+email,
+		"vaultwarden_client_id="+clientID,
+	)
 }
 
-// vaultPatchSecretArgs writes ONE secret value via the `key=-` stdin form, so
-// the value never appears in argv (ps / /proc//cmdline). The value is fed
-// on stdin by realRunnerStdin.
-func vaultPatchSecretArgs(user, key string) []string {
-	return []string{"kv", "patch", vwCredsPath(user), key + "=-"}
+// vaultWriteSecretArgs writes ONE secret value via the `key=-` stdin form, so the
+// value never appears in argv (ps / /proc//cmdline). Fed on stdin by
+// realRunnerStdin.
+func vaultWriteSecretArgs(merge bool, user, key string) []string {
+	return append(kvWriteVerb(merge), vwCredsPath(user), key+"=-")
 }
 
-// writeCreds stores all four fields in the user's Vault path. The two real
-// secrets (master password, API client_secret) go via stdin — never argv.
-func writeCreds(user string, c vwCreds) error {
-	if _, err := realRunner("vault", vaultPatchPublicArgs(user, c.Email, c.ClientID), nil); err != nil {
+// credsPathExists reports whether the user's KV path already holds data. Used to
+// pick create (`kv put`) vs merge (`kv patch -method=rw`) for the first write:
+// claude-auth-sync usually creates the path first (Claude OAuth backup), but a
+// user could run `homelab vault setup` before that ever happens.
+func credsPathExists(run cmdRunner, user string) bool {
+	_, err := run("vault", []string{"kv", "get", "-format=json", vwCredsPath(user)}, nil)
+	return err == nil
+}
+
+// cmdRunnerStdin is realRunnerStdin's shape, injected so writeCreds is testable.
+type cmdRunnerStdin func(name string, argv, envv []string, stdin string) (string, error)
+
+// writeCreds stores all four fields in the user's Vault path using only the
+// capabilities the scoped policy grants (create/read/update — NOT `patch`). The
+// first (public) write creates the path when absent; the two real secrets then
+// merge in via read-modify-write so the public keys — and any claude-auth-sync
+// keys already present — survive. Secret values travel on stdin, never argv.
+func writeCreds(run cmdRunner, runStdin cmdRunnerStdin, user string, c vwCreds) error {
+	merge := credsPathExists(run, user)
+	if _, err := run("vault", vaultWritePublicArgs(merge, user, c.Email, c.ClientID), nil); err != nil {
 		return err
 	}
-	if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
+	// The path now exists regardless of the branch above → merge the secrets in.
+	if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
 		return err
 	}
-	if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
+	if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
 		return err
 	}
 	return nil
@@ -593,6 +672,7 @@ func promptLine(prompt string) (string, error) {

 func vaultSetup(args []string) error {
 	hardenProcess()
+	ensureVaultToken()
 	fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.")
 	fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.")
 	email, err := promptLine("Vaultwarden email: ")
@@ -615,7 +695,7 @@ func vaultSetup(args []string) error {
 		return fmt.Errorf("all fields are required")
 	}
 	c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret}
-	if err := writeCreds(vaultCurrentUser(), c); err != nil {
+	if err := writeCreds(realRunner, realRunnerStdin, vaultCurrentUser(), c); err != nil {
 		return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err)
 	}
 	fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…")
@@ -634,6 +714,7 @@ func vaultSetup(args []string) error {

 func vaultGet(args []string) error {
 	hardenProcess()
+	ensureVaultToken()
 	o, err := parseGetArgs(args)
 	if err != nil {
 		return err
@@ -660,4 +741,3 @@ func vaultGet(args []string) error {
 	emitSecret(val)
 	return nil
 }
-
diff --git a/cli/cmd_vault_test.go b/cli/cmd_vault_test.go
index 36aab1f4..4f583b95 100644
--- a/cli/cmd_vault_test.go
+++ b/cli/cmd_vault_test.go
@@ -70,7 +70,7 @@ func (f *fakeRunner) run(name string, argv, envv []string) (string, error) {

 func TestLoadCredsReadsFourFields(t *testing.T) {
 	f := &fakeRunner{out: map[string]string{
-		"vault kv get -field=vaultwarden_email secret/workstation/claude-users/emo": "emo@x.me",
+		"vault kv get -field=vaultwarden_email secret/workstation/claude-users/emo": "emo@x.me",
 		"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "hunter2",
 		"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.abc",
 		"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "sek",
@@ -233,12 +233,96 @@ func TestStatusSummaryUnconfigured(t *testing.T) {
 	}
 }

-func TestVaultPatchPublicArgs(t *testing.T) {
-	got := vaultPatchPublicArgs("emo", "e@x.me", "user.ci")
-	want := []string{"kv", "patch", "secret/workstation/claude-users/emo",
+func TestEnsureVaultTokenSetsScopedFallback(t *testing.T) {
+	dir := t.TempDir()
+	cfg := dir + "/.config/claude-auth-sync"
+	if err := os.MkdirAll(cfg, 0o700); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK\n"), 0o600); err != nil {
+		t.Fatal(err)
+	}
+	t.Setenv("HOME", dir)
+	t.Setenv("VAULT_TOKEN", "") // no ambient token
+
+	ensureVaultToken()
+	if got := os.Getenv("VAULT_TOKEN"); got != "SCOPED-TOK" {
+		t.Fatalf("VAULT_TOKEN = %q, want scoped fallback to be exported", got)
+	}
+}
+
+func TestEnsureVaultTokenKeepsExplicitEnv(t *testing.T) {
+	dir := t.TempDir()
+	cfg := dir + "/.config/claude-auth-sync"
+	if err := os.MkdirAll(cfg, 0o700); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK"), 0o600); err != nil {
+		t.Fatal(err)
+	}
+	t.Setenv("HOME", dir)
+	t.Setenv("VAULT_TOKEN", "ADMIN-TOK")
+
+	ensureVaultToken()
+	if got := os.Getenv("VAULT_TOKEN"); got != "ADMIN-TOK" {
+		t.Fatalf("VAULT_TOKEN = %q, must not override an explicit token", got)
+	}
+}
+
+func TestScopedTokenPath(t *testing.T) {
+	if got := scopedTokenPath("/home/emo"); got != "/home/emo/.config/claude-auth-sync/vault-token" {
+		t.Fatalf("scopedTokenPath = %q", got)
+	}
+}
+
+func TestVaultTokenSource(t *testing.T) {
+	// Precedence: explicit $VAULT_TOKEN > ~/.vault-token (vault CLI native) >
+	// the claude-auth-sync per-user scoped token. This is what lets a non-admin
+	// workstation user (no ambient token) reach their own Vault path.
+	cases := []struct {
+		name             string
+		env              string
+		haveVaultToken   bool
+		scoped           string
+		wantTok, wantSrc string
+	}{
+		{"explicit env wins", "abc", true, "S", "", "env"},
+		{"vault-token file used natively", "", true, "S", "", "file"},
+		{"scoped fallback for non-admin", "", false, "S-TOK", "S-TOK", "scoped"},
+		{"scoped value is trimmed", "", false, " S-TOK\n", "S-TOK", "scoped"},
+		{"whitespace-only scoped is no token", "", false, " \n", "", "none"},
+		{"nothing configured", "", false, "", "", "none"},
+	}
+	for _, c := range cases {
+		tok, src := vaultTokenSource(c.env, c.haveVaultToken, c.scoped)
+		if tok != c.wantTok || src != c.wantSrc {
+			t.Errorf("%s: vaultTokenSource(%q,%v,%q) = (%q,%q), want (%q,%q)",
+				c.name, c.env, c.haveVaultToken, c.scoped, tok, src, c.wantTok, c.wantSrc)
+		}
+	}
+}
+
+func TestKvWriteVerb(t *testing.T) {
+	// merge=true → read-modify-write patch (needs only read+update, NOT the
+	// `patch` capability the scoped workstation policy lacks).
+	if got := kvWriteVerb(true); !reflect.DeepEqual(got, []string{"kv", "patch", "-method=rw"}) {
+		t.Fatalf("kvWriteVerb(true) = %v", got)
+	}
+	// merge=false → put (creates the path on first use)
+	if got := kvWriteVerb(false); !reflect.DeepEqual(got, []string{"kv", "put"}) {
+		t.Fatalf("kvWriteVerb(false) = %v", got)
+	}
+}
+
+func TestVaultWritePublicArgs(t *testing.T) {
+	got := vaultWritePublicArgs(true, "emo", "e@x.me", "user.ci")
+	want := []string{"kv", "patch", "-method=rw", "secret/workstation/claude-users/emo",
 		"vaultwarden_email=e@x.me", "vaultwarden_client_id=user.ci"}
 	if !reflect.DeepEqual(got, want) {
-		t.Fatalf("vaultPatchPublicArgs = %v", got)
+		t.Fatalf("vaultWritePublicArgs(merge) = %v", got)
+	}
+	if got := vaultWritePublicArgs(false, "emo", "e@x.me", "user.ci"); got[0] != "kv" || got[1] != "put" {
+		t.Fatalf("vaultWritePublicArgs(create) must use `kv put`, got %v", got)
 	}
 	for _, a := range got {
 		if strings.Contains(a, "master_password") || strings.Contains(a, "client_secret") {
@@ -247,12 +331,12 @@ func TestVaultPatchPublicArgs(t *testing.T) {
 	}
 }

-func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) {
+func TestVaultWriteSecretArgsNoValueInArgv(t *testing.T) {
 	for _, key := range []string{"vaultwarden_master_password", "vaultwarden_client_secret"} {
-		got := vaultPatchSecretArgs("emo", key)
-		want := []string{"kv", "patch", "secret/workstation/claude-users/emo", key + "=-"}
+		got := vaultWriteSecretArgs(true, "emo", key)
+		want := []string{"kv", "patch", "-method=rw", "secret/workstation/claude-users/emo", key + "=-"}
 		if !reflect.DeepEqual(got, want) {
-			t.Fatalf("vaultPatchSecretArgs(%q) = %v", key, got)
+			t.Fatalf("vaultWriteSecretArgs(%q) = %v", key, got)
 		}
 		if got[len(got)-1] != key+"=-" {
 			t.Fatalf("secret value must be read from stdin (`%s=-`), got %v", key, got)
@@ -260,6 +344,90 @@ func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) {
 	}
 }

+// recStdin records a stdin-bearing call for assertions.
+type recStdin struct {
+	argv  []string
+	stdin string
+}
+
+// TestWriteCredsCreatesThenMerges: when the path is ABSENT the first (public)
+// write must `kv put` (create), and the two secrets must merge via patch -rw
+// with values on stdin only — never the buggy plain `kv patch` (needs `patch`).
+func TestWriteCredsCreatesThenMerges(t *testing.T) {
+	var calls [][]string
+	var stdinCalls []recStdin
+	run := func(name string, argv, envv []string) (string, error) {
+		calls = append(calls, append([]string{name}, argv...))
+		if len(argv) >= 2 && argv[0] == "kv" && argv[1] == "get" {
+			return "", fmt.Errorf("no value found") // path absent
+		}
+		return "", nil
+	}
+	runStdin := func(name string, argv, envv []string, stdin string) (string, error) {
+		stdinCalls = append(stdinCalls, recStdin{append([]string{name}, argv...), stdin})
+		return "", nil
+	}
+	c := vwCreds{Email: "e@x.me", MasterPassword: "PW", ClientID: "user.ci", ClientSecret: "CS"}
+	if err := writeCreds(run, runStdin, "emo", c); err != nil {
+		t.Fatalf("writeCreds: %v", err)
+	}
+	var sawPut, sawPlainPatch bool
+	for _, cl := range calls {
+		j := strings.Join(cl, " ")
+		if strings.Contains(j, "kv put") {
+			sawPut = true
+		}
+		if strings.Contains(j, "kv patch") && !strings.Contains(j, "-method=rw") {
+			sawPlainPatch = true
+		}
+	}
+	if !sawPut {
+		t.Fatalf("path absent → public write must be `kv put`; calls=%v", calls)
+	}
+	if sawPlainPatch {
+		t.Fatalf("must never use plain `kv patch` (needs `patch` capability); calls=%v", calls)
+	}
+	if len(stdinCalls) != 2 {
+		t.Fatalf("want 2 stdin secret writes, got %d", len(stdinCalls))
+	}
+	for _, sc := range stdinCalls {
+		if !strings.Contains(strings.Join(sc.argv, " "), "kv patch -method=rw") {
+			t.Errorf("secret write must use patch -method=rw: %v", sc.argv)
+		}
+		for _, a := range sc.argv {
+			if strings.Contains(a, "PW") || strings.Contains(a, "CS") {
+				t.Errorf("secret leaked into argv: %v", sc.argv)
+			}
+		}
+	}
+	if stdinCalls[0].stdin != "PW" || stdinCalls[1].stdin != "CS" {
+		t.Errorf("stdin values wrong: %q,%q", stdinCalls[0].stdin, stdinCalls[1].stdin)
+	}
+}
+
+// TestWriteCredsMergesWhenPresent: when the path EXISTS, every write must merge
+// (patch -rw) — a `kv put` would wipe sibling keys (e.g. claude_ai_oauth_json).
+func TestWriteCredsMergesWhenPresent(t *testing.T) {
+	var calls [][]string
+	run := func(name string, argv, envv []string) (string, error) {
+		calls = append(calls, append([]string{name}, argv...))
+		return "{}", nil // get succeeds → path exists
+	}
+	runStdin := func(name string, argv, envv []string, stdin string) (string, error) {
+		calls = append(calls, append([]string{name}, argv...))
+		return "", nil
+	}
+	c := vwCreds{Email: "e@x.me", MasterPassword: "PW", ClientID: "user.ci", ClientSecret: "CS"}
+	if err := writeCreds(run, runStdin, "emo", c); err != nil {
+		t.Fatalf("writeCreds: %v", err)
+	}
+	for _, cl := range calls {
+		if strings.Contains(strings.Join(cl, " "), "kv put") {
+			t.Fatalf("path exists → must NOT `kv put` (wipes siblings): %v", cl)
+		}
+	}
+}
+
 // TestNoSecretInArgvAcrossFlow is the load-bearing security test: across the
 // whole get flow (vault reads, bw config/status/login/unlock/get) NO secret
 // value may appear in any command's argv — secrets travel via env/stdin only.
@@ -267,8 +435,8 @@ func TestNoSecretInArgvAcrossFlow(t *testing.T) {
 	uid := fmt.Sprintf("%d", os.Getuid())
 	f := &fakeRunner{out: map[string]string{
 		"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "SUPERSECRETPW",
-		"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x",
-		"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "CLIENTSEKRET",
+		"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x",
+		"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "CLIENTSEKRET",
 		"bw status": `{"status":"locked"}`,
 		"bw unlock": "SESSIONXYZ",
 		"bw get password github": "p@ss",
@@ -353,8 +521,8 @@ func TestVaultBareGroupRegistered(t *testing.T) {
 func TestGetValueFlow(t *testing.T) {
 	f := &fakeRunner{out: map[string]string{
 		"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
-		"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x",
-		"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs",
+		"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x",
+		"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs",
 		"bw status": `{"status":"locked"}`,
 		"bw unlock": "SESS",
 		"bw get password github": "p@ss",
diff --git a/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md b/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md
index 9e0e2192..67022732 100644
--- a/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md
+++ b/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md
@@ -13,7 +13,7 @@ The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling
 Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`:

 - Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.**
-- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.
+- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`. `Plotting-Your-Dream-Book` (owned by Anca, dev in her org) keeps its GHA build in-place and pushes the image to **its own org's ghcr** (`ghcr.io/passionprojectsanca/book-plotter`, private) via the workflow's built-in `GITHUB_TOKEN` — no Forgejo mirror, no `viktorbarzin`-namespace push, no shared PAT in her repo (2026-06-27, migrated off DockerHub).
 - `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge.
 - Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror."

diff --git a/docs/adr/0011-homelab-usage-telemetry.md b/docs/adr/0011-homelab-usage-telemetry.md
index c383211b..fc0c4e76 100644
--- a/docs/adr/0011-homelab-usage-telemetry.md
+++ b/docs/adr/0011-homelab-usage-telemetry.md
@@ -5,6 +5,14 @@ exists to answer the question that drove the whole CLI — *which verbs are wort
 adding next* — with data instead of one maintainer's habits (the earlier mining
 covered a single user's ~51k commands, so the surface is shaped to that user).

+> **Update (2026-06-26) — the cross-user privacy *norm* below is superseded by
+> [ADR-0015](0015-os-is-the-authorization-boundary.md).** The prohibition this
+> ADR leaned on ("reading another user's `~/.claude` is off-limits even for an
+> owner in-session") no longer holds: the managed-settings policy now **defers
+> to OS/sudo authorization**. The `usage top` telemetry design itself is
+> unchanged and still current — only the "never read homes" framing in the
+> third decision below is overtaken.
+
 ## Decisions

 - **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows
diff --git a/docs/adr/0014-service-identity-and-east-west-observability.md b/docs/adr/0014-service-identity-and-east-west-observability.md
index 5eb1c83a..cdccac4f 100644
--- a/docs/adr/0014-service-identity-and-east-west-observability.md
+++ b/docs/adr/0014-service-identity-and-east-west-observability.md
@@ -27,3 +27,9 @@ As the Service count grows we want an audit-grade record of which Service talks
 - **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
 - **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
 - **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
+
+## As-built (2026-06-25)
+
+Implemented across infra issues #57–#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (all Slack consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
+
+Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.
diff --git a/docs/adr/0015-os-is-the-authorization-boundary.md b/docs/adr/0015-os-is-the-authorization-boundary.md
new file mode 100644
index 00000000..8999682b
--- /dev/null
+++ b/docs/adr/0015-os-is-the-authorization-boundary.md
@@ -0,0 +1,57 @@
+# OS is the authorization boundary: agents defer to Unix/sudo, not a stricter in-policy rule
+
+Supersedes the cross-user privacy *norm* that the devvm managed-settings policy
+carried and that ADR-0011 leaned on ("never read another user's home /
+`~/.claude`, off-limits even for an owner in-session"). ADR-0011's actual
+subject — `usage top` telemetry and its emit design — is unchanged and still
+current; only the privacy prohibition it referenced is superseded here.
+
+## Context
+
+The devvm managed-settings policy (`/etc/claude-code/managed-settings.json`,
+`claudeMd`) carried two rules that were, in practice, *stricter than the OS*:
+"you are not the admin, do not escalate privileges" and "never read another
+user's home directory, credentials, tokens, or `~/.claude`." The OS told a
+different story: `wizard` holds `(ALL) NOPASSWD: ALL` — full passwordless root.
+The kernel had already granted total read access; the policy was layering an
+artificial refusal on top of an authorization the OS already permits, and the
+"not the admin" framing was factually wrong for a NOPASSWD-root user.
+
+Two honest ways to resolve the inconsistency: tighten sudo to match the policy,
+or loosen the policy to match the OS. The owner chose the latter on 2026-06-26,
+for analytics/debugging across the shared box.
+
+## Decision
+
+- **Authorization follows the OS, not this policy.** Agents may access whatever
+  their OS user can access — directly or via `sudo` where they hold sudo rights
+  — and must not impose restrictions stricter than the OS. On this box that
+  includes other users' home directories and `~/.claude` for users who hold
+  broad sudo.
+- **No separate prompt or carve-out** for OS-authorized access. The Unix
+  permission model + sudoers is the single source of truth for who may read
+  what. Other homes are `0750`-owned, so a cross-home read necessarily transits
+  `sudo` and is therefore captured in the sudo/auth audit log.
+- **Cluster/infra RBAC tiering is unchanged.** kubectl / Vault / infra access
+  stays scoped to each user's RBAC tier; "defer to the OS" is about OS-level
+  file access, not a licence to exceed cluster RBAC.
+- **Scope is symmetric and multi-user.** The rule lives in the *shared*
+  managed-settings, so every user's agents defer to that user's own sudo grant.
+  Any user with broad sudo gets the same cross-home read capability over other
+  users' files. Accepted by the owner with that understanding; emo's and
+  ancamilea's `~/.claude` is now agent-readable by sudo-holders.
+- **Takes effect in a fresh session.** managed-settings loads at session start;
+  the session that made the change keeps running under the old policy.
+
+## Consequences
+
+- The privacy-preserving telemetry rationale in ADR-0011 (`usage top` as the
+  "cross-user analytics without reading homes" answer) remains useful but is no
+  longer the *only* sanctioned path; direct reads via `sudo` are now permitted.
+- Larger blast radius: if an agent session running as a sudo-holder is
+  prompt-injected or otherwise compromised, it can now read every user's secrets
+  with no in-agent friction (sudo here is passwordless). The sudo/auth audit log
+  is the remaining accountability control.
+- Reversible: restore the prior `claudeMd` bullets (backup kept at
+  `/etc/claude-code/managed-settings.json.bak-2026-06-26`) and start a fresh
+  session.
diff --git a/docs/architecture/chrome-service.md b/docs/architecture/chrome-service.md
index 6f9c1ee4..37cb6edc 100644
--- a/docs/architecture/chrome-service.md
+++ b/docs/architecture/chrome-service.md
@@ -205,6 +205,43 @@ healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts*
 wrapper in `main.tf` (so it applies deterministically even though the image is
 `:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix
 as the android-emulator stack.
+
+### noVNC black after a browser-container restart (x11vnc supervision)
+
+A **distinct** failure from the fd-sweep gotcha above: the noVNC client *connects*
+but the view is **black**, and the novnc container logs spew
+`connecting to: localhost:5900` → `Failed to connect ... [Errno 111] Connection
+refused` (x11vnc is **down**, not slow). Cause: `x11vnc` and `websockify` both run
+in the **novnc** container, but x11vnc attaches to the **chrome-service** (browser)
+container's Xvfb over `localhost:6099` (shared pod network). When the browser
+container restarts — Chrome exits cleanly (exit 0, "Completed") or crashes — its
+Xvfb vanishes and x11vnc loses its X connection and exits.
+
+`entrypoint.sh` **supervises** x11vnc: it launches x11vnc and websockify as
+background children and `wait -n`s on them, exiting non-zero if **either** dies, so
+the kubelet restarts the novnc container, which re-waits for Xvfb on `:6099` and
+relaunches x11vnc — the bridge **self-heals** across browser-container restarts.
+(Before 2026-06-27, x11vnc was an unsupervised background child of an `exec`ed
+websockify; a dead x11vnc was never relaunched, leaving `:5900` dead — a
+`` zombie — and the view black until a manual pod restart. Same
+supervision pattern as the android-emulator stack's entrypoint.)
+
+**Diagnose:** `kubectl exec -c novnc -- ps aux | grep x11vnc` (a ``/Z
+entry = the bug); or the RFB-banner probe from a sibling container (`python3 -c
+"import socket;s=socket.socket();s.settimeout(2);s.connect(('127.0.0.1',5900));print(s.recv(12))"`
+— healthy returns `b'RFB 003.008\n'`, broken = `ConnectionRefused`). **Immediate
+recovery** (no image change): restart just the novnc container with `kubectl exec
+-n chrome-service deploy/chrome-service -c novnc -- kill 1` — re-runs its entrypoint
+and relaunches x11vnc **without** touching the browser session/in-flight CDP jobs.
+
+> **Deploying a rebuilt novnc entrypoint:** Keel is **off** for this deployment
+> (`keel.sh/policy=never`, because the browser container's playwright image is
+> version-pinned to f1-stream) and the image is `:latest`/`IfNotPresent`, so a
+> rebuilt `:latest` will **not** redeploy on its own. After the
+> `build-chrome-service-novnc.yml` GHA build pushes `:latest` + `:`,
+> **SHA-pin** the novnc `image` in `main.tf` to the new `:` to force the pull
+> and rollout (the novnc image is TF-managed — not in the deployment's
+> `lifecycle.ignore_changes`).
 - **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
   serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
   bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088
diff --git a/docs/architecture/ci-cd.md b/docs/architecture/ci-cd.md
index 35e041e6..6810499b 100644
--- a/docs/architecture/ci-cd.md
+++ b/docs/architecture/ci-cd.md
@@ -115,9 +115,67 @@ claude-agent-service, claude-memory-mcp, kms-website, Freedify,
 instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`),
 fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original
 pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website,
-k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify,
+k8s-portal, apple-health-data, audiblez-web, insta2spotify,
 audiobook-search) now also land on ghcr.

+**plotting-book** is a special case (a GitHub-first repo owned by Anca,
+ADR-0003): the build runs in *her* GitHub repo
+(`PassionProjectsAnca/Plotting-Your-Dream-Book`) and pushes to **private
+`ghcr.io/passionprojectsanca/book-plotter`** — under her org's ghcr namespace,
+not `viktorbarzin`, using the workflow's built-in `GITHUB_TOKEN` (no shared
+PAT). The cluster pulls it via the Kyverno-synced `ghcr-credentials` secret (the
+`plotting-book` namespace is on the allowlist; the shared `ghcr_pull_token` has
+read access). Migrated off public DockerHub (`viktorbarzin/book-plotter`) on
+2026-06-27. The Woodpecker deploy hook (repo 43, registered to Anca's repo) is
+unchanged. Flow:
+
+```text
+ DEVELOP ───────────────────────────────────────────────────────────────────────
+ Anca (Codex / t3 web agent)
+ │ git push → main
+ ▼
+ ┌──────────────────────────────────────────────────────────────┐
+ │ GitHub: PassionProjectsAnca/Plotting-Your-Dream-Book (private)│ ← canonical
+ │ .github/workflows/build-and-deploy.yml on: push → main │
+ └───────────────────────────┬──────────────────────────────────┘
+ │ GitHub Actions runner (off-infra build · ADR-0002)
+ ┌────────────────────┴─────────────────────────────────┐
+ ▼ ▼
+ ┌─────────────────────────────────────────────┐ ╔═══════════════════════════════════════╗
+ │ build job │ push ║ GHCR · PRIVATE package ║
+ │ • svu next --always → tag vX.Y.Z (→ repo) │═════▶║ ghcr.io/passionprojectsanca/ ║
+ │ • buildx linux/amd64, provenance:false │ tags ║ book-plotter :vX.Y.Z :latest ║
+ │ • login ghcr (GITHUB_TOKEN, packages:write)│ ╚═══════════════════╤═══════════════════╝
+ │ • delete-package-versions (keep newest 10) │ │
+ └───────────────────────┬─────────────────────┘ │ pull (private,
+ ▼ deploy job [gate: repo var DEPLOY_ENABLED ≠ "false"] via secret)
+ POST ci.viktorbarzin.me/api/repos/43/pipelines {IMAGE_TAG, IMAGE_NAME} │
+ ▼ │
+ ┌─────────────────────────────────────────────────────────────┐ │
+ │ Woodpecker repo 43 · .woodpecker/deploy.yml (event: manual) │ │
+ │ kubectl set image deployment/plotting-book = :vX.Y.Z │ │
+ │ kubectl rollout status │ │
+ └───────────────────────────┬─────────────────────────────────┘ │
+ ▼ │
+ ═══════════════ Kubernetes · ns: plotting-book ════════════════════════════ │
+ ┌─────────────────────────────────────────────────────────────┐ │
+ │ Deployment plotting-book (Recreate · image = ignore_changes)│ │
+ │ imagePullSecrets: ghcr-credentials ────────pull───────────┼─────────────────┘
+ │ Pod → Express :3001 + SQLite on PVC (proxmox-lvm) │
+ └─────────────────────────────────────────────────────────────┘
+ guards / supporting:
+ • Kyverno require-trusted-registries [Enforce] → ghcr.io/* ALLOWED (admission)
+ • Keel policy=patch @1h → watches GHCR via ghcr-credentials (backstop)
+ • ghcr-credentials ⇐ Kyverno generate-clone ⇐ Vault secret/viktor/ghcr_pull_token
+
+ ═══════════════ Serving path (unchanged) ══════════════════════════════════
+ Browser ─▶ plotting-book.viktorbarzin.me (non-proxied DNS → Traefik .203)
+ ─▶ Authentik forward-auth (gate) ─▶ Service :80 ─▶ Pod :3001
+```
+
+Governance: the Deployment + Kyverno allowlist are Terraform (`stacks/plotting-book`,
+`stacks/kyverno`); the live image *tag* is CI-owned (`ignore_changes`).
+
 ### Infra-owned images (issues #29 / #30)

 Images owned by the infra repo build on GHA workflows **in the infra repo's own
@@ -163,9 +221,9 @@ Woodpecker is **deploy + cluster-touching steps only**:
 | Pipeline | File | Purpose |
 |----------|------|---------|
 | per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) |
-| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) |
+| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`). **Skips Tier-0 `vault`** — it's human-applied via OIDC; the CI `ci` role lacks Vault-admin perms (`sys/mounts`, `sys/policies/acl`) so a CI apply 403s |
 | certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron |
-| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) |
+| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`). **Skips Tier-0 `vault`** (its `plan` 403s under the `ci` role and would fail the whole run) |
 | provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
 | registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change |
 | pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE |
diff --git a/docs/architecture/monitoring.md b/docs/architecture/monitoring.md
index 3c75a345..06ee943f 100644
--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@@ -286,7 +286,7 @@ Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (por

 #### Security Alerts (Wave 1 — planned, beads `code-8ywc`)

-Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
+Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/]` title styling so security-lane alerts stand out there). Same handling path as infra alerts; severity labels carried in the alert (critical/warning/info). The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).

 | # | Source | Event | Severity |
 |---|---|---|---|
@@ -318,9 +318,20 @@ IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-poli
 Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out.

 - **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m).
-- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
+- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → posts to **`#alerts`** via the `slack-security` receiver, which keeps its `[SECURITY]` styling (Slack-only, no paging; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
 - **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' ''`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)

+#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
+
+Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both route the default `slack-warning` receiver → **`#alerts`**.
+ +| Alert | Expr (abridged) | For | Severity | +|---|---|---|---| +| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning | +| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning | + +The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`). + #### Backup Alerts - **PostgreSQLBackupStale**: >36h since last backup - **MySQLBackupStale**: >36h since last backup diff --git a/docs/architecture/multi-tenancy.md b/docs/architecture/multi-tenancy.md index c64a146c..2cabf9e7 100644 --- a/docs/architecture/multi-tenancy.md +++ b/docs/architecture/multi-tenancy.md @@ -541,7 +541,7 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1 **RBAC tiers:** `admin` (Viktor — cluster-admin, unlocked tree, secrets) · `power-user` (cluster-wide read-only, NO Secrets, via a dedicated `oidc-power-user-readonly` ClusterRole) · `namespace-owner` (admin in own namespace only). Each session acts as the user's **own** OIDC identity (kubelogin), never the admin's. -**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). 
Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour. +**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **(2026-06-26: the managed `claudeMd` now defers OS-level file access to the OS/sudo — a user holding broad `sudo` may read other users' files incl. `~/.claude`; the mode-600 / no-symlink posture is unchanged but is no longer reinforced by an agent "never read other homes" rule. 
See [ADR-0015](../adr/0015-os-is-the-authorization-boundary.md).)** **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour. **Memory — homelab CLI hooks (rolled out 2026-06-21, deploy-fixed 2026-06-22):** the per-user `claude_memory` MCP was retired for the **homelab-memory hooks** — the reconcile's `install_memory` (re)installs four scripts into `~/.claude/hooks/` each run (`homelab-memory-recall.py` UserPromptSubmit recall, `auto-learn.py` Stop-hook extraction, `pre-compact-backup.sh`/`post-compact-recovery.sh`), wires them into `settings.json` if-absent + additive, and removes the old `claude_memory` MCP. **The provisioner binary itself now self-deploys from the repo** (step 0: `bash -n`-gated `install` + re-exec when `scripts/t3-provision-users.sh` differs from `/usr/local/bin/t3-provision-users`, guarded against re-exec loops / DRY_RUN mutation) — added after this very rollout sat committed-but-undeployed for a day (only the manual `setup-devvm.sh` had ever deployed the binary), so the hourly reconcile kept running the pre-memory version and emo/anca silently lost memory (recall + auto-learn never wired). A latent `set -e` abort in `install_memory` (a bare `[[ -d plugin-dir ]] && …` returning non-zero) was also fixed; it had killed the reconcile after the first user the first time it actually ran. 
The hooks need a `MEMORY_API_KEY` (or `CLAUDE_MEMORY_API_KEY`) in the user's `settings.json` env — the `homelab` CLI defaults the API URL, so **the key is the only hard requirement**; `install_memory` reuses an existing key and only WARNs if absent (it does NOT mint one — that's an admin Vault step, see Remaining). wizard + emo carry a key from their original MCP setup; **ancamilea is keyless → her memory no-ops until a key is minted.** (`auto-learn.py`'s passive store calls the API directly, so it additionally needs `*_API_URL` in env to avoid its local-SQLite fallback; recall + manual `homelab memory store` go through the URL-defaulting CLI and need only the key.) diff --git a/docs/architecture/security.md b/docs/architecture/security.md index 7d3043ea..1cec0de6 100644 --- a/docs/architecture/security.md +++ b/docs/architecture/security.md @@ -132,6 +132,13 @@ for the supersession history — there is no longer an inline Traefik bouncer.) account hard-limits to **one** list), and CAPI is already covered in-kernel on direct hosts and by Cloudflare's own managed protections on proxied hosts. Registered bouncer key: **`kvsync`**. +- **Rate-limit resilient (2026-06-27):** Cloudflare's Lists-API *write* endpoint + is throttled (~per-60s; `429 retry-after`). The CronJob runs `backoff_limit=0` + (one POST per cycle — the `*/2` schedule IS the retry cadence) and treats a CF + `429` as a soft-skip (exit 0, retry next cycle), the same fail-safe pattern it + uses for LAPI. An earlier `backoff_limit=2` fired 3 rapid POSTs/cycle and + escalated the throttle into a stuck state that left the list empty — a + self-inflicted DoS that this change prevents. - **Block-only**: the single-list limit precludes a separate captcha/managed-challenge list, so both ban and captcha decisions are enforced as a plain block at the edge. @@ -272,7 +279,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.** The block below documents the locked design. 
-Response model: **(I) Slack-only, daily skim.** All security alerts land in a new `#security` Slack channel via Alertmanager. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection. +Response model: **(I) Slack-only, daily skim.** All security alerts post to **`#alerts`** via Alertmanager (the `slack-security` receiver keeps its distinct `[SECURITY/]` title styling so security-lane alerts still stand out). The dedicated `#security` channel was abandoned (2026-06-25) — the shared `alertmanager_slack_api_url` incoming webhook's Slack app isn't a member of it, so a channel override there returns HTTP `404 channel_not_found`; everything consolidated to `#alerts`. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection. #### Detection sources @@ -285,7 +292,7 @@ Response model: **(I) Slack-only, daily skim.** All security alerts land in a ne #### Alert rules (16 total) -Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) inside the single `#security` channel. +Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/]` title styling so security-lane alerts stand out there; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it). Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) carried in the alert. **K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):** @@ -364,6 +371,69 @@ Beads: `code-8ywc` W1.6 + W1.7. 
**Status: planned.** - Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs. - Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972). +#### Deriving the per-namespace egress allowlist from the edge trail (Wave 1 W1.7) + +The durable **east-west flow trail** (below) is now the preferred data source for +the *internal* (namespace-to-namespace) half of each Wave-1 egress allowlist — +faster and identity-stamped vs the original iptables-`LOG`→journald→Loki path +(ADR-0014: "Enforcement gains a better data source"). The unique observed +namespace pairs live in CNPG DB `goldmane_edges`, table `edge`. To derive the +namespaces a source is observed talking to (the `allow` set that seeds its +NetworkPolicy): + +```sql +SELECT DISTINCT dst_ns FROM edge WHERE src_ns='' AND action='allow' ORDER BY dst_ns; +``` + +The full SQL recipe (whole-cluster matrix, deny sanity-checks, the ≥7-day +observation caveat) is in +[runbooks/goldmane-flow-trail.md → Deriving the Wave-1 egress allowlist](../runbooks/goldmane-flow-trail.md#deriving-the-wave-1-egress-allowlist-from-the-edge-table-infra-62). +**External / public-internet egress is NOT in this table** (empty-namespace flows +are dropped) — for those destinations keep using the Calico flow-log observation +(the W1.6 snapshot, `wave1-egress-observation-2026-05-22.md`). This feeds the +existing observe-then-enforce effort (beads `code-8ywc`); **enforce-flips remain +out of scope** of the trail — it is observe-and-derive only. + +### East-west flow observability (Goldmane / Whisker + edge trail) (ADR-0014) + +The "who-talks-to-whom" data plane that succeeds raw iptables-`LOG` lines (which +carried no identity). 
**Service identity = the workload's namespace** (primary), +refined by a `service-identity` label in the few multi-Service namespaces +(`monitoring`, `kube-system`, `dbaas`). End-to-end trail, three layers: + +1. **Calico Goldmane + Whisker** (`calico-system`) — Goldmane aggregates + identity-stamped flows (ns/pod/workload/labels + allow-deny + policy-trace) + streamed from Felix over gRPC into a **~60-min in-memory ring buffer** (no + etcd/API writes — the etcd-cost constraint that drove the design). **Whisker** + is its live web UI at `whisker.viktorbarzin.me` (Authentik-gated, + `auth = "required"` — Whisker has no own login; an additive NetworkPolicy ORs + Traefik past the operator's default-deny `whisker` NP). The ring buffer is + **not** a trail (lost on Goldmane restart). Enabled via operator CRs in + `stacks/calico/main.tf`; reversible toggle (Goldmane is OSS tech-preview). +2. **`goldmane-edge-aggregator`** (`stacks/goldmane-edge-aggregator`) — streams + Goldmane's gRPC `Flows.Stream` over **mTLS** and upserts the low-cardinality + namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen, + flow_count)`) into CNPG DB `goldmane_edges`. Self-edges and empty-namespace + (public-internet) flows are dropped — in-cluster relationships only. The mTLS + client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** + (Goldmane verifies CA-chain only, not identity) rather than copying the CA + private key into TF state — **re-apply the stack if the operator rotates that + Secret**. +3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to + **`#alerts`** (reuses the alert-digest webhook). All Slack now consolidates to + `#alerts`; the `#security` channel was abandoned 2026-06-25 because that + webhook's Slack app isn't a member of it (a `#security` override 404s). See + runbook. 
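
The aggregation step in layer 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the aggregator's actual code: it uses Python's built-in `sqlite3` in place of the CNPG Postgres database, and the `record_flow` helper and its argument shape are hypothetical. Only the `edge` columns and the self-edge / empty-namespace drops come from the description above.

```python
import sqlite3

# Columns mirror the edge(...) tuple described above. SQLite types stand in
# for the Postgres ones (assumption, to keep the sketch self-contained).
DDL = """
CREATE TABLE IF NOT EXISTS edge (
    src_ns     TEXT NOT NULL,
    dst_ns     TEXT NOT NULL,
    action     TEXT NOT NULL,
    first_seen TEXT NOT NULL,
    last_seen  TEXT NOT NULL,
    flow_count INTEGER NOT NULL,
    PRIMARY KEY (src_ns, dst_ns, action)
)
"""

# One row per (src_ns, dst_ns, action): the first flow inserts the row,
# later flows only advance last_seen and bump the counter.
UPSERT = """
INSERT INTO edge (src_ns, dst_ns, action, first_seen, last_seen, flow_count)
VALUES (?, ?, ?, ?, ?, 1)
ON CONFLICT (src_ns, dst_ns, action) DO UPDATE SET
    last_seen  = excluded.last_seen,
    flow_count = edge.flow_count + 1
"""


def record_flow(conn, src_ns, dst_ns, action, seen_at):
    """Hypothetical helper: fold one observed flow into the edge set.

    Returns True if the flow was stored, False if it was dropped.
    """
    if not src_ns or not dst_ns:
        return False  # empty namespace: host-endpoint / public-internet flow
    if src_ns == dst_ns:
        return False  # self-edge: traffic inside a single namespace
    conn.execute(UPSERT, (src_ns, dst_ns, action, seen_at, seen_at))
    return True
```

Because the primary key is the namespace pair plus action, repeated flows update one row instead of adding new ones, which is what keeps the table low-cardinality regardless of traffic volume.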
+ +The trail is **attribution-grade, not cryptographic** (reconstructs events in a +trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model +limit; east-west stays plaintext, no mTLS between app pods). Health is covered by +the **`AggregatorDown`** + **`DigestFailing`** alerts and cluster-health check #48 +(see monitoring.md). Full as-built, query recipes, and troubleshooting: +[runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md). Decision: +[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md); glossary +`CONTEXT.md` → **Service identity**, **Goldmane / Whisker**. + ### TLS & HTTP/3 **Traefik** handles TLS termination: diff --git a/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md new file mode 100644 index 00000000..e6b11816 --- /dev/null +++ b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md @@ -0,0 +1,97 @@ +# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24) + +> Filename kept for inbound links. The originally-suspected cause (kubeadm-config +> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC +> drift was a real *separate* latent bug fixed in the same change. + +**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached +the master control-plane phase for the first time — preflight passed, etcd +snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the +kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute +static-pod-hash window across all internal retries, then auto-rolled-back to +v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but +the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**. 
+No data loss; no user-facing outage (the master carries control-plane taints, so +no workloads were displaced). + +**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the +first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane +static pods, i.e. the first time the upgrade pushes real write-IO at etcd. + +## Root cause — etcd IO starvation on the shared HDD + +The new kube-apiserver could not establish/keep a working connection to etcd +during the upgrade because **etcd was IO-starved**. etcd's surviving container log +from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows: + +- **1,180** `apply request took too long` warnings in 16 minutes; +- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms), + clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying + to bring the new apiserver up. + +A reproduced 1.35.6 apiserver with no etcd dies with +`F instance.go:233 Error creating leases: error creating storage factory: context +deadline exceeded` — the same failure mode a multi-second etcd produces. etcd +lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on +shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto +that spindle: + +1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected); +2. kubeadm dumping a full **~400MB etcd DB backup** to + `/etc/kubernetes/tmp/kubeadm-backup-etcd-/` (on the same HDD) before the + etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never + cleans them up), pushing master root fs to **73%**, above the 70% kubelet + image-GC threshold, so image GC churned during the drain too; +3. master-drain pod evictions. 
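
The log evidence above (warning count and worst-case apply latency) could be gathered with a small script along these lines. It is a sketch only: it assumes etcd writes JSON log entries carrying `msg` and `took` fields, and that each container-log line has a runtime prefix before the JSON object. The function names are hypothetical and this is not the tooling used during the incident.

```python
import json
import re

SLOW_APPLY_MSG = "apply request took too long"

# Go-style duration parts such as "4.3s", "150ms" or "1m2.5s".
# "ms" is listed before "m" so milliseconds are not misread as minutes.
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(h|ms|µs|us|ns|m|s)")
_UNIT_SECONDS = {
    "h": 3600.0, "m": 60.0, "s": 1.0,
    "ms": 1e-3, "µs": 1e-6, "us": 1e-6, "ns": 1e-9,
}


def parse_duration(text):
    """Convert a Go-style duration string to seconds."""
    return sum(float(v) * _UNIT_SECONDS[u] for v, u in _DURATION_PART.findall(text))


def slow_applies(lines):
    """Return the 'took' duration (seconds) of every slow-apply warning."""
    durations = []
    for line in lines:
        start = line.find("{")  # skip any container-runtime prefix (assumption)
        if start < 0:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a JSON log entry; ignore it
        if entry.get("msg") == SLOW_APPLY_MSG and "took" in entry:
            durations.append(parse_duration(entry["took"]))
    return durations
```

Feeding the file's lines to `slow_applies` and taking `len(...)` and `max(...)` of the result gives the two headline figures; comparing each duration against the healthy bound quoted above (under 100ms) shows how far the disk had fallen behind.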
+ +### Correction — it was NOT the OIDC flag swap + +`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps +`--authentication-config` (structured multi-issuer OIDC) back to legacy +single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That +was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with +those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly +(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test +etcd. So the auth swap does **not** crash the apiserver; it was a red herring for +the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full +were also ruled out. + +## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift + +apiserver auth is configured in three places that must agree: +(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes` ++ `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest +(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM — +which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates +the manifest from (3), so it would have reverted structured auth → **dashboard + +kubectl SSO break after a successful upgrade** (recoverable: the chain's +post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash. + +## Resolution + +1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%. +2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps. +3. 
**Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run). + +## Prevention (landed in this change) + +| Gap | Fix | +|-----|-----| +| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. | +| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. | +| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. | + +## Lessons + +- **Capture the failing component's own logs before concluding.** The `kubeadm + upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second + applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is + "what config changes," not "why it crashed." +- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm + 2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB + backup copy + drain) onto that spindle. code-oflt is the real fix. 
+- **Tools that leave per-operation scratch must be reaped.** kubeadm's + `/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never + GC'd; 28GB had silently accumulated. +- **Out-of-band control-plane edits must be written back to kubeadm-config** — else + `kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags). diff --git a/docs/runbooks/claude-auth-renew-workstation.md b/docs/runbooks/claude-auth-renew-workstation.md index f5ce6625..727b0da4 100644 --- a/docs/runbooks/claude-auth-renew-workstation.md +++ b/docs/runbooks/claude-auth-renew-workstation.md @@ -11,6 +11,11 @@ inference every six hours and backs up only the `claudeAiOauth` object to: secret/workstation/claude-users/ ``` +The backup **merges** into that path (`vault kv patch -method=rw`, falling back to +`kv put` only when the path does not exist yet), so keys that other tools +co-locate there — notably `homelab vault`'s `vaultwarden_*` credentials — survive. +A blind `kv put` here silently wiped them on every six-hourly run (fixed 2026-06-26). + The user's unrelated `mcpOAuth` credentials never leave their home directory. Each renewal service has a distinct 32-day periodic Vault token, mode `0600`, at `~/.config/claude-auth-sync/vault-token`. Its policy can access only that user's diff --git a/docs/runbooks/goldmane-flow-trail.md b/docs/runbooks/goldmane-flow-trail.md new file mode 100644 index 00000000..51adaa8f --- /dev/null +++ b/docs/runbooks/goldmane-flow-trail.md @@ -0,0 +1,301 @@ +# Goldmane Flow Trail — east-west "who-talks-to-whom" observability + +> As-built runbook for the Calico Goldmane + Whisker flow plane and the +> `goldmane-edge-aggregator` durable audit trail. Design + rationale: +> [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md). +> Glossary: `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**. 
+> Implements infra issues #57 (Whisker ingress), #58 (aggregator), #61 +> (monitoring), #62 (egress allowlist queries), #63 (these docs). + +## What the trail is + +Three layers turn raw east-west traffic into a queryable, durable record of +which Service talks to which. **Service identity = the workload's namespace** +(primary), refined by a `service-identity` label in the few multi-Service +namespaces (`monitoring`, `kube-system`, `dbaas`) — see ADR-0014. + +| Layer | Component | Lifetime | Where it lives | +|---|---|---|---| +| **Live map** | Calico **Goldmane** + **Whisker** | ~60-min in-memory ring buffer (lost on Goldmane restart) | `calico-system`; Whisker UI at `whisker.viktorbarzin.me` | +| **Durable trail** | `goldmane-edge-aggregator` (`aggregate` mode) | persistent | CNPG Postgres DB `goldmane_edges`, table `edge` | +| **Notification** | `goldmane-edges-digest` CronJob (`digest` mode) | daily | Slack `#alerts` | + +**Goldmane** aggregates identity-stamped flows (namespace / pod / workload / +labels + allow-deny + policy-trace) streamed from Felix (the existing +`calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer — +**nothing is written to etcd or the K8s API** (the etcd-cost constraint that +drove the whole design). **Whisker** is its live web UI. Because the ring +buffer is *not* a trail (a Goldmane restart loses the window), the +`goldmane-edge-aggregator` consumes Goldmane's gRPC `Flows.Stream` API over +mTLS and upserts the unique **namespace-pair edge set** into Postgres; a daily +CronJob posts first-seen edges to Slack. + +The edge set is deliberately **low-cardinality** — one row per +`(src_ns, dst_ns, action)`, *not* per-pod or per-port — so the table stays +small no matter how much traffic flows. + +## Where the data lives + +### Whisker UI — live, ~60 min +- `https://whisker.viktorbarzin.me` (Authentik-gated — Whisker ships no own + login; `auth = "required"`). 
Shows the live flow stream + a service graph for + roughly the last hour. Use it for "what is talking right now"; it is **not** + history. +- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081` + (HTTP), both in `calico-system`. + +### CNPG `goldmane_edges` — durable +- Postgres DB `goldmane_edges` on the CNPG cluster + (`pg-cluster-rw.dbaas.svc.cluster.local:5432`). One table: + + ``` + edge(src_ns text, dst_ns text, action text, + first_seen timestamptz, last_seen timestamptz, flow_count bigint, + PRIMARY KEY (src_ns, dst_ns, action)) + ``` + + - `action` ∈ `allow` / `deny` / `pass` / `unspecified` (normalised Goldmane + action). + - **Self-edges (`src_ns == dst_ns`) and empty-namespace flows** (host-endpoint + / public-internet) are **dropped** — the trail is about in-cluster service + relationships only. (Egress to the public internet is therefore NOT in this + table; it lives in the Wave-1 Calico flow-log path — see security.md.) + - A **"new edge"** = a row whose `first_seen` falls inside the digest window. + - Role `goldmane_edges` (Vault-rotated, 7-day) owns the DB. The `edge` table + is created idempotently by the aggregator at startup (canonical DDL also in + the repo at `migrations/0001_edge.sql`). + +### Slack `#alerts` — daily digest + +> **Channel note (2026-06-25):** posts to **`#alerts`**. The dedicated `#security` channel was abandoned — the shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of it, so a channel override there returns HTTP `404 channel_not_found`. Everything now posts to `#alerts` (this digest plus alertmanager's `slack-security` receiver, which keeps its `[SECURITY]` styling so security-lane alerts still stand out there). + +- CronJob `goldmane-edges-digest` (08:00 Europe/London) posts edges first seen + in the last 24h. Quiet when there are none. 
Reuses the existing alert-digest + Slack incoming webhook (Vault `secret/viktor` → `alertmanager_slack_api_url`) + — no new webhook was created. + +## How to enable / disable + +### Goldmane + Whisker (the flow plane) +Operator CRs in **`stacks/calico/main.tf`** — NOT the Helm `goldmane`/`whisker` +flags (those stay `false`; the operator's own `installation`/`apiServer` are +operator-managed via the `goldmanes`/`whiskers.operator.tigera.io` CRDs): + +- `kubectl_manifest.goldmane` (kind `Goldmane`) — creating it makes the operator + re-render `calico-node` with the `FELIX_FLOWLOGSGOLDMANESERVER` env (the + operator auto-wires Felix — **do NOT patch FelixConfiguration**), triggering a + supervised `calico-node` DaemonSet roll. Yields `Deployment` + `Service + goldmane:7443`. +- `kubectl_manifest.whisker` (kind `Whisker`, `depends_on` goldmane; + `notifications = Disabled`). Yields `Deployment` + `Service whisker:8081`. + +**To disable:** delete those two CRs and re-apply `stacks/calico`. Reversible +toggle (Goldmane is tech-preview in OSS Calico 3.30 — the main standing risk per +ADR-0014). + +### Whisker public ingress (infra #57) +Also in `stacks/calico/main.tf`: +- `module "ingress_whisker"` (`ingress_factory`, `auth = "required"`, + `dns_type = "proxied"`) → `whisker.viktorbarzin.me`. +- `kubernetes_network_policy_v1.whisker_allow_traefik` — **required alongside the + ingress**: the operator's own `whisker` NetworkPolicy (owned by the Whisker CR) + is `policyTypes: [Ingress]` with no rules = default-deny ingress to the pod. + This additive NP ORs in an allow for `namespaceSelector + kubernetes.io/metadata.name=traefik` on TCP 8081. Without it Traefik 502s. + +### The aggregator + digest (the durable trail) — `stacks/goldmane-edge-aggregator` +A Tier-1 stack (PG state) mirroring the claude-memory pattern. `scripts/tg +apply` from `stacks/goldmane-edge-aggregator/`. 
It provisions: the namespace, +the mTLS client material, the Postgres DB-init Job, the `DATABASE_URL` +ExternalSecret (Vault static role `pg-goldmane-edges`), the Slack ExternalSecret, +the `aggregate` Deployment, and the `digest` CronJob. **To disable the trail +without touching the flow plane:** scale `deployment/goldmane-edge-aggregator` to +0 (transient) or remove the stack (permanent) — Goldmane/Whisker keep running. + +Image: `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (PRIVATE) — the +`goldmane-edge-aggregator` namespace must be in the `ghcr-credentials` Kyverno +allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`, +`local.ghcr_private_namespaces`) or pulls 401. Code repo: +`~/code/goldmane-edge-aggregator` (see its `README.md` + `DEPLOY.md`). + +## mTLS cert — the REUSE decision (cert-reuse gotcha) + +The aggregator dials `goldmane:7443` over **mutual TLS**. Goldmane requires the +client cert to chain to the **Tigera CA**, but it does **NOT authorize by client +identity** — any Tigera-CA-signed cert is accepted. + +Rather than copy the Tigera CA **private key** into Terraform state to mint our +own cert (a needless CA-key exposure; the `hashicorp/tls` provider also clashes +with this repo's global generate-providers/lockfile pattern), the stack +**REUSES the operator-minted, Tigera-CA-signed `whisker-backend-key-pair` +Secret** (`calico-system`), copying its `tls.crt`/`tls.key` into the +`goldmane-client-tls` Secret in the aggregator namespace. The CA *bundle* that +verifies Goldmane's serving cert (`tigera-ca-bundle` ConfigMap, key +`tigera-ca-bundle.crt`) is likewise copied verbatim (a ConfigMap can't be +cross-namespace-mounted). + +> **GOTCHA — if the operator rotates `whisker-backend-key-pair`, re-apply +> `stacks/goldmane-edge-aggregator`** to re-sync the copied cert. Symptom of a +> stale copy: the `aggregate` pod logs TLS handshake / `Flows.Stream` failures +> and no `last_seen` updates land in the `edge` table. 
Hardening follow-up +> (noted in the stack): mint an own-identity cert in-namespace if Whisker is ever +> removed (which would delete the reused source Secret). + +The Deployment leaves `GOLDMANE_HOST=goldmane.calico-system.svc.cluster.local:7443` +and the default cert/CA paths; the default ServerName (host sans port) is a SAN +on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` / +`GOLDMANE_TLS_INSECURE` override is needed. + +## How to query who-talks-to-whom + +`psql` into the DB (creds: Vault static role `static-creds/pg-goldmane-edges`, or +exec a CNPG pod). All queries are against the single `edge` table. + +```sql +-- Everything talking to a namespace (inbound), most-active first +SELECT src_ns, action, flow_count, first_seen, last_seen +FROM edge WHERE dst_ns = '' ORDER BY flow_count DESC; + +-- Everything a namespace talks TO (outbound) +SELECT dst_ns, action, flow_count, first_seen, last_seen +FROM edge WHERE src_ns = '' ORDER BY last_seen DESC; + +-- New edges in the last 24h (what the digest reports) +SELECT src_ns, dst_ns, action, flow_count, first_seen +FROM edge WHERE first_seen > now() - interval '24 hours' +ORDER BY first_seen DESC; + +-- Any DENIED edges (policy is dropping this pair) +SELECT src_ns, dst_ns, flow_count, last_seen +FROM edge WHERE action = 'deny' ORDER BY last_seen DESC; + +-- Full edge set as a graph adjacency list +SELECT src_ns, dst_ns, action, flow_count FROM edge ORDER BY src_ns, dst_ns; +``` + +For the **live** (sub-hour) view including pod/port detail, use the Whisker UI — +the `edge` table intentionally aggregates that away. 
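
As a rough sketch of scripting the first query above, the helper below prints the inbound-edges SQL for one namespace. The function name `edge_inbound_sql` is hypothetical (it is not part of the aggregator repo), and the validation assumes namespace names are plain DNS labels. It only reuses the `edge` columns already shown in the queries above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustration only): print the "everything talking to a
# namespace" query for one destination namespace.
edge_inbound_sql() {
  local ns="$1"
  # Assumption: namespaces are DNS labels, so anything else is rejected
  # instead of being quoted into the SQL string.
  local re='^[a-z0-9]([a-z0-9-]*[a-z0-9])?$'
  if [[ ! "$ns" =~ $re ]]; then
    echo "invalid namespace: $ns" >&2
    return 1
  fi
  printf "SELECT src_ns, action, flow_count, first_seen, last_seen FROM edge WHERE dst_ns = '%s' ORDER BY flow_count DESC;\n" "$ns"
}
```

One possible use, assuming a connection string for the edges DB is exported as `DATABASE_URL` in the shell: `edge_inbound_sql traefik | psql "$DATABASE_URL"`. Printing the SQL rather than running it keeps the helper usable with whichever `psql` access path is at hand.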
+ +## Deriving the Wave-1 egress allowlist from the edge table (infra #62) + +The durable edge set is a faster, identity-stamped data source for the existing +**observe-then-enforce** egress effort (beads `code-8ywc`; snapshot +`docs/architecture/wave1-egress-observation-2026-05-22.md`) than the original +iptables-`LOG` → journald → Loki path (ADR-0014 consequence: "Enforcement gains +a better data source"). It replaces the *internal* (namespace-to-namespace) leg +of the allowlist; **external/public-internet egress is NOT in this table** (empty +dst namespace, dropped) — for those destinations keep using the Calico flow-log +path described in security.md. + +**Per-namespace internal egress allowlist** — the set of in-cluster namespaces a +given source is *observed* talking to with `action='allow'`: + +```sql +-- Internal egress allowlist for one namespace (feeds its NetworkPolicy) +SELECT DISTINCT dst_ns +FROM edge +WHERE src_ns = '' AND action = 'allow' +ORDER BY dst_ns; +``` + +```sql +-- Full internal egress matrix for all namespaces at once +SELECT src_ns, array_agg(DISTINCT dst_ns ORDER BY dst_ns) AS allowed_dst_ns +FROM edge +WHERE action = 'allow' +GROUP BY src_ns +ORDER BY src_ns; +``` + +```sql +-- Sanity: namespaces with a DENY edge already (policy is biting; investigate +-- before tightening further) +SELECT DISTINCT src_ns, dst_ns FROM edge WHERE action = 'deny'; +``` + +**How this feeds enforcement (scope):** the derived `dst_ns` set is the +*internal* half of a namespace's egress allowlist — it tells you which +in-cluster namespaces to permit before flipping that namespace to default-deny. +The universal baseline (kube-dns :53, often dbaas :3306/:5432, redis :6379) and +the external destinations still come from the Wave-1 observation snapshot. +**Enforce-flips remain OUT OF SCOPE** here — this is observe-and-derive only; +the phased per-namespace default-deny rollout (starting `recruiter-responder`) +is tracked under `code-8ywc`. 
Cross-links: +[security.md → NetworkPolicy Default-Deny Egress](../architecture/security.md#networkpolicy-default-deny-egress-wave-1--observe-then-enforce-tier-34), +[wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md), +[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md). + +> **Caveat (same as the Wave-1 snapshot):** an edge only exists if it was +> *observed*. A weekly CronJob or a 7-day Vault rotation may not have fired yet — +> collect ≥7 days of edges before treating a namespace's `allow` set as +> complete. The `first_seen` column tells you how long an edge has been known; +> the digest surfaces brand-new ones daily. + +## Monitoring & health (infra #61) + +The aggregator pod has **no `/metrics` endpoint** — health is inferred from +kube-state-metrics. Three complementary signals (memory ids 6598, 6599; +see also [monitoring.md → Security Alerts](../architecture/monitoring.md#security-alerts-wave-1--planned-beads-code-8ywc)): + +| Signal | What | Where | +|---|---|---| +| **`AggregatorDown`** | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` for 15m → warning | Prometheus alert group `Network Observability (Goldmane)` in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`; routes `slack-warning` → `#alerts` | +| **`DigestFailing`** | `kube_job_status_failed{...job_name=~"goldmane-edges-digest.*"} > 0` within 24h, for 30m → warning | same alert group → `#alerts` | +| **cluster-health #48** | `check_goldmane_aggregator` reads the Deployment's `Available` condition (missing or not-Available → FAIL) | `scripts/cluster_healthcheck.sh` (human / `--quiet` / `--json` modes; emits `goldmane_aggregator`) | + +The two alert layers are deliberately complementary: `AggregatorDown` → +**no new edges land** in the DB; `DigestFailing` → **edges still land but nobody +is told**. 
A freshness probe (#61b) was intentionally skipped — `AggregatorDown` +is the agreed floor. + +## Troubleshooting + +**Whisker UI 502 / unreachable.** The additive +`kubernetes_network_policy_v1.whisker_allow_traefik` is missing or the +operator's default-deny `whisker` NP regenerated — re-apply `stacks/calico`. A +brand-new ingress host is also invisible to LAN split-horizon until the hourly +`technitium-ingress-dns-sync` runs (memory #5349); test meanwhile with +`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me` +(expect a 302 to Authentik — the gate working). + +**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate` +pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`). +Common causes, in order: +1. **Stale mTLS cert** — the operator rotated `whisker-backend-key-pair`; re-apply + `stacks/goldmane-edge-aggregator` (see cert-reuse gotcha above). Symptom: TLS + handshake / `Flows.Stream` errors. +2. **Stale DB password** — the 7-day Vault rotation bounced the credential but + the pod kept the old one. The Deployment carries + `secret.reloader.stakater.com/reload: goldmane-edges-db-creds`; if it's not + restarting on rotation, verify the Reloader annotation and the ExternalSecret. +3. **Goldmane restarted** — the in-memory window was lost (expected); the stream + reconnects automatically and resumes upserting. No data loss in the DB + (only the sub-hour live window in Whisker is gone). + +**Digest never posts / `DigestFailing` firing.** Inspect the most recent +`goldmane-edges-digest-*` Job (`kubectl get jobs -n goldmane-edge-aggregator`; +`kubectl logs job/`). The CronJob's `ttl_seconds_after_finished=86400` GCs +pods after a day, so check soon after a failed run. With `SLACK_WEBHOOK_URL` +empty the binary forces a dry-run (no post) — verify the `goldmane-edges-slack` +ExternalSecret resolved. 
A dry run / smoke test: run the image with `args: +["digest"]` + `DRY_RUN=1` to print the message instead of POSTing. +> Known state (2026-06-25): the digest CronJob's first Job **failed** and it has +> never successfully posted (`lastSuccessfulTime` empty) — the digest leg is the +> live gap; `DigestFailing` is catching it. Edges still land in the DB via the +> `aggregate` Deployment; only the `#alerts` digest notification is affected. +> Investigation/fix belongs to the aggregator slice (#58/#60), not monitoring. + +**No edges at all in the table.** Confirm Goldmane is enabled +(`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the +`FELIX_FLOWLOGSGOLDMANESERVER` env; confirm the `goldmane-edges-db-init` Job +completed; confirm the aggregator pod is `Running` and not `ImagePullBackOff` +(ghcr allowlist). + +## Related +- [ADR-0014 — Service identity & east-west observability](../adr/0014-service-identity-and-east-west-observability.md) +- [security.md — NetworkPolicy Default-Deny Egress + east-west flow observability](../architecture/security.md) +- [monitoring.md — east-west flow observability + alerts](../architecture/monitoring.md) +- [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md) +- `CONTEXT.md` glossary — **Service identity**, **Goldmane / Whisker** +- Code: `~/code/goldmane-edge-aggregator` (`README.md`, `DEPLOY.md`); stacks + `stacks/goldmane-edge-aggregator`, `stacks/calico` diff --git a/docs/runbooks/homelab-vault-onboarding.md b/docs/runbooks/homelab-vault-onboarding.md new file mode 100644 index 00000000..61d323ab --- /dev/null +++ b/docs/runbooks/homelab-vault-onboarding.md @@ -0,0 +1,121 @@ +# `homelab vault` onboarding (per-user Vaultwarden access) + +## Scope + +`homelab vault` gives each devvm roster user no-HITL access to **their own** +Vaultwarden vault (and any Organization Collection shared with their account) +from the command line. 
It shells out to the official `bw` CLI; the user's +Vaultwarden credentials live only in their isolated Vault path +`secret/workstation/claude-users/` and are decrypted as that OS user — +the admin never sees them. + +```text +homelab vault setup one-time: store VW email + master password + API key +homelab vault status configured / unlocked / reachable (no secrets) +homelab vault list [--search Q] item names (no secrets) +homelab vault get [--field password|username|uri|notes|totp] [--json] +homelab vault code current TOTP code +homelab vault lock lock / log out the local bw session +``` + +## How auth works (why a non-admin can use it) + +`homelab vault` runs `vault` as the calling user. It resolves a Vault token in +this order (`ensureVaultToken`, `cli/cmd_vault.go`): + +1. an explicit `$VAULT_TOKEN`, then +2. a native `~/.vault-token` (what admins carry), then +3. the per-user **scoped token** that `claude-auth-sync` maintains at + `~/.config/claude-auth-sync/vault-token` (policy `workstation-claude-`). + +That scoped policy grants exactly `create`/`read`/`update` on the user's own +`secret/workstation/claude-users/` path — no `patch` capability — so the +tool writes with `vault kv patch -method=rw` (read-modify-write), falling back to +`kv put` only when the path does not exist yet. This preserves the +`claude_ai_oauth_json` key that [claude-auth-sync](claude-auth-renew-workstation.md) +co-locates there. (Both bugs that previously made this admin-only were fixed +2026-06-27.) + +## Prerequisites (per user) + +- The user is in `scripts/workstation/roster.yaml` and the **vault** stack has + been applied → their `workstation-claude-` policy exists. +- The user's workstation was provisioned (`setup-devvm.sh`) → their scoped Vault + token exists at `~/.config/claude-auth-sync/vault-token`. +- `bw` is installed **system-wide** at `/usr/bin/bw` (see below). 
+- The user has a Vaultwarden account at `https://vaultwarden.viktorbarzin.me` + (self-service signup is open; admin panel is disabled). + +## One-time admin steps (devvm) + +`bw` must be system-wide so every user resolves it (it is a Node script, and +`node` is already system-wide at `/usr/bin/node`). `setup-devvm.sh` installs it +to the npm `/usr` prefix; the guard checks the **system** path, not +`command -v bw` (an admin's own `~/.local/bin/bw` used to mask the system +install, leaving non-admins with no backend). To install on a running box: + +```bash +sudo npm install -g --prefix /usr "@bitwarden/cli@^2024" +bw --version # confirm /usr/bin/bw resolves +``` + +After landing a `cli/` change, rebuild the binary so users pick it up: + +```bash +sudo bash -c 'cd /home/wizard/code/infra/cli && \ + go build -ldflags "-X main.version=$(git -C /home/wizard/code/infra describe --tags --always 2>/dev/null || echo dev)" \ + -o /usr/local/bin/homelab .' +``` + +(or just re-run `scripts/workstation/setup-devvm.sh` as root, which rebuilds it.) + +## User onboarding + +The user runs these as themselves. The master password / API key are entered +interactively (never on the command line) and stored only in the user's Vault +path. + +1. In the Vaultwarden web vault → **Settings → Security → Keys → View API key**, + copy the `client_id` (`user.xxxx`) and `client_secret`. +2. Configure: + + ```bash + homelab vault setup # prompts: VW email, API client_id/secret, master password + homelab vault status # → "vault: configured, unlocked, reachable ✓" + homelab vault list # item names (own vault + any shared Collections) + ``` + +## Shared-Collection access (sharing passwords with a user) + +`homelab vault` surfaces Organization Collection items automatically once the +user's Vaultwarden account is a confirmed member. These steps are done by the +vault owner in the **Vaultwarden web UI** (they need the owner's master +password — not an infra/Terraform operation): + +1. 
Create or reuse an **Organization** and a **Collection** of shared logins. +2. **Invite** the user's Vaultwarden account to the Organization, granting + **"Can view"** on that Collection (least privilege). +3. The user accepts the email invite and confirms membership. +4. The user runs `homelab vault list` — the shared items now appear alongside + their own (a `homelab vault status` sync picks them up). + +## Security model (the no-HITL trade) + +Identity is the kernel UID. Anything running as the user can decrypt the user's +vault — this is the accepted trade for no-human-in-the-loop fetches. Secrets +never appear in `argv` (passed via env or stdin), core dumps are disabled, TOTP +fetches are logged to syslog/Loki, and on a TTY values go to the clipboard +(auto-clearing) rather than scrollback. The admin's Vault token is never used by +a non-admin: each user authenticates with their own scoped token. + +## Verification + +```bash +# the scoped token carries the right policy +VAULT_TOKEN="$(sudo cat /home//.config/claude-auth-sync/vault-token)" \ + vault token lookup -format=json | jq '.data.display_name, .data.policies' +# → "token-devvm-claude-auth-", [..., "workstation-claude-"] + +sudo -u -i bw --version # /usr/bin/bw resolves for the user +sudo -u -i homelab vault status +``` diff --git a/docs/runbooks/k8s-version-upgrade.md b/docs/runbooks/k8s-version-upgrade.md index 08d43926..021c588f 100644 --- a/docs/runbooks/k8s-version-upgrade.md +++ b/docs/runbooks/k8s-version-upgrade.md @@ -41,6 +41,8 @@ Job 0 — preflight (pinned: k8s-node1) ├── halt-on-alert (kured-style ignore-list) ├── 24h-quiet baseline (no Ready transitions <24h ago) ├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume) + ├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? 
→ Slack WARN (recoverable; not a block) + ├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups) ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s) ├── Trigger backup-etcd Job, wait, verify snapshot byte count ├── SSH master: containerd skew fix (if master < workers) @@ -222,22 +224,34 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names ## Common Operations -### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19) +### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24) `kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml` -and drops the `--authentication-config` flag**, silently disabling apiserver -OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get -401). This used to require a manual re-apply after **every** control-plane bump. +from kubeadm-config**. apiserver auth uses a structured multi-issuer +`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to +still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade +reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does +NOT crash on this — verified by isolated repro; it's recoverable via the restore +script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue — +etcd IO starvation**, not this drift; post-mortem: +`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`. -**Now automated:** the `rbac` stack publishes its OIDC restore script to the -`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's -`phase_master` re-runs it on master immediately after `kubeadm upgrade apply` -(while tigera-operator is still quiesced, so the flag-add apiserver restart can't -crashloop the operator). 
It's idempotent, health-gates `/livez` with -auto-rollback, and is **non-fatal** — a failure only lags SSO until the next rbac -apply (the version upgrade itself already succeeded). So a chain-driven -control-plane bump no longer breaks SSO. The master phase self-skips when master -is already at target, so this only runs when master was actually upgraded. +**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now +**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting +`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of +its remote script. So kubeadm regenerates a **correct** manifest and the apiserver +upgrades with a pure image bump — `kubeadm upgrade diff ` shows only the +image change. Zero live impact (the CM is read only during an upgrade). + +**Backstops:** +- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does + NOT block — the drift only breaks SSO, which is recoverable) if + `--authentication-config` would still be dropped. +- The `rbac` stack still publishes its restore script to the + `kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on + master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with + auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also* + re-reconciles kubeadm-config. Self-skips when master is already at target. **Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the chain logged `WARN: --authentication-config absent after re-apply`: diff --git a/scripts/cluster_healthcheck.sh b/scripts/cluster_healthcheck.sh index 51a13b5d..a5088137 100755 --- a/scripts/cluster_healthcheck.sh +++ b/scripts/cluster_healthcheck.sh @@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}" [[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config" KUBECTL="" JSON_RESULTS=() -TOTAL_CHECKS=47 +TOTAL_CHECKS=48 # Parallel execution settings. 
Each check function is self-contained — it # only reads cluster state and mutates the in-memory counters / JSON_RESULTS @@ -3156,6 +3156,44 @@ PYEOF esac } +# --- 48. Goldmane edge-aggregator availability --- +# +# The goldmane-edge-aggregator Deployment (ADR-0014 / infra #58) streams Calico +# Goldmane flows into the goldmane_edges CNPG DB — the durable who-talks-to-whom +# trail. The pod has NO /metrics endpoint, so its liveness can't be scraped; +# this check reads the Deployment's Available condition directly so the trail +# silently dying surfaces in the health board (mirrors the AggregatorDown +# Prometheus alert). Missing Deployment / not-Available -> FAIL. +check_goldmane_aggregator() { + section 48 "Goldmane Edge-Aggregator" + local ns="goldmane-edge-aggregator" dep="goldmane-edge-aggregator" + local avail desired ready + + # One get; absent Deployment is a hard fail (the trail isn't deployed). + if ! $KUBECTL get deploy "$dep" -n "$ns" >/dev/null 2>&1; then + [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator" + fail "Deployment $ns/$dep not found — who-talks-to-whom edge trail is not running" + json_add "goldmane_aggregator" "FAIL" "deployment missing" + return 0 + fi + + avail=$($KUBECTL get deploy "$dep" -n "$ns" \ + -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null) + ready=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.status.readyReplicas}' 2>/dev/null) + desired=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.spec.replicas}' 2>/dev/null) + ready=${ready:-0} + desired=${desired:-0} + + if [[ "$avail" == "True" ]]; then + pass "Edge-aggregator Available ($ready/$desired ready)" + json_add "goldmane_aggregator" "PASS" "${ready}/${desired} ready" + else + [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator" + fail "Edge-aggregator NOT Available ($ready/$desired ready) — edge trail has stopped recording" + json_add "goldmane_aggregator" "FAIL" "${ready}/${desired} ready; 
Available=${avail:-unknown}" + fi +} + # --- Summary --- print_summary() { if [[ "$JSON" == true ]]; then @@ -3224,7 +3262,7 @@ main() { check_monitoring_prom_am check_monitoring_vault check_monitoring_css check_external_replicas check_external_divergence check_pve_thermals check_pve_load check_external_traefik_5xx check_ha_status_dashboard - check_immich_search check_csi_ghost_drift + check_immich_search check_csi_ghost_drift check_goldmane_aggregator ) # Auto-fix mutates cluster state inside individual checks — keep that diff --git a/scripts/test-claude-auth-sync.sh b/scripts/test-claude-auth-sync.sh index 10f07746..62c54e8b 100755 --- a/scripts/test-claude-auth-sync.sh +++ b/scripts/test-claude-auth-sync.sh @@ -28,5 +28,61 @@ ok "accept own scoped Vault token" cas_vault_identity_ok token-devvm-claude-auth no "reject another user's token" cas_vault_identity_ok token-devvm-claude-auth-anca default,workstation-claude-anca no "reject wrong policy" cas_vault_identity_ok token-devvm-claude-auth-emo default,workstation-claude-anca +# --- Regression: cas_backup must MERGE into the shared Vault path, preserving +# sibling keys that other tools co-locate there (e.g. `homelab vault`'s +# vaultwarden_* creds) — NOT overwrite the whole KV document. A blind `kv put` +# wiped them every 6h (claude-auth-sync clobber, 2026-06-26). +fakebin="$tmp/bin"; mkdir -p "$fakebin" +store="$tmp/vault-store.json" +cat > "$fakebin/vault" <<'FAKE' +#!/usr/bin/env bash +# Minimal KV-v2 fake backed by $VAULT_FAKE_STORE (a flat JSON object). +[[ "$1" == kv ]] || { echo '{}'; exit 0; } # token lookup etc. 
-> ignore +op="$2"; shift 2 +store="$VAULT_FAKE_STORE" +case "$op" in + get) + for a in "$@"; do [[ "$a" == -field=* ]] && field="${a#-field=}"; done + if [[ "$*" == *-format=json* ]]; then + [[ -f "$store" ]] || { echo "No value found"; exit 2; } + jq -n --argjson d "$(cat "$store")" '{data:{data:$d}}'; exit 0 + fi + [[ -f "$store" ]] || exit 2 # bare get == existence check + if [[ -n "${field:-}" ]]; then + v="$(jq -r --arg k "$field" '.[$k] // empty' "$store")"; [[ -n "$v" ]] || exit 1 + printf '%s' "$v"; exit 0 + fi + exit 0 ;; + put) echo '{}' > "$store" ;; # full replace + patch) [[ -f "$store" ]] || { echo "No value found"; exit 2; } ;; # merge (rw) + *) exit 1 ;; +esac +for a in "$@"; do + case "$a" in + -*|secret/*) continue ;; # flags + the path arg + *=*) k="${a%%=*}"; v="${a#*=}" + t="$(mktemp)"; jq --arg k "$k" --arg v "$v" '.[$k]=$v' "$store" > "$t" && mv "$t" "$store" ;; + esac +done +exit 0 +FAKE +chmod +x "$fakebin/vault" + +CAS_VAULT_PATH="secret/workstation/claude-users/test" +CAS_CREDENTIALS="$tmp/credentials.json" +CAS_STATE_DIR="$tmp/state" +_oldpath="$PATH"; PATH="$fakebin:$PATH"; export VAULT_FAKE_STORE="$store" + +printf '{"vaultwarden_master_password":"keep-me"}\n' > "$store" # pretend `homelab vault setup` ran +ok "backup succeeds (existing doc)" cas_backup +eq "merge preserves sibling key" keep-me "$(jq -r '.vaultwarden_master_password' "$store")" +eq "merge writes claude oauth" access "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")" + +rm -f "$store" # fresh user: no doc yet +ok "backup succeeds (creates doc)" cas_backup +eq "create writes claude oauth" access "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")" + +PATH="$_oldpath"; unset VAULT_FAKE_STORE + printf '\n%d passed, %d failed\n' "$pass" "$fail" (( fail == 0 )) diff --git a/scripts/workstation/claude-auth-sync.sh b/scripts/workstation/claude-auth-sync.sh index dc3d780d..0ea94f48 100755 --- a/scripts/workstation/claude-auth-sync.sh +++ 
b/scripts/workstation/claude-auth-sync.sh @@ -82,7 +82,17 @@ cas_backup() { return 1 } expires="$(jq -r '.expiresAt' <<<"$oauth")" - vault kv put "$CAS_VAULT_PATH" \ + # MERGE into the shared path so sibling keys other tools co-locate there + # (e.g. `homelab vault`'s vaultwarden_* creds) survive. `kv patch -method=rw` + # is read+update (needs no `patch` capability) but requires the secret to + # already exist, so create it with `kv put` on the very first backup only. + local -a write_cmd + if vault kv get "$CAS_VAULT_PATH" >/dev/null 2>&1; then + write_cmd=(vault kv patch -method=rw "$CAS_VAULT_PATH") + else + write_cmd=(vault kv put "$CAS_VAULT_PATH") + fi + "${write_cmd[@]}" \ claude_ai_oauth_json="$oauth" \ credential_expires_at_ms="$expires" \ backed_up_at="$(date -Is)" >/dev/null || { diff --git a/scripts/workstation/claude-skills/README.md b/scripts/workstation/claude-skills/README.md index 816cbcb7..1fa06d94 100644 --- a/scripts/workstation/claude-skills/README.md +++ b/scripts/workstation/claude-skills/README.md @@ -19,13 +19,29 @@ unpinned-CLI dependencies out of the hourly **root** reconcile. - `mattpocock/skills` (https://github.com/mattpocock/skills) — all except `find-skills` - `vercel-labs/skills` (https://github.com/vercel-labs/skills) — `find-skills` +- **homelab-local, emo-PERSONALIZED** — `cluster-health` here is an + **emo-specific variant**, not a copy of the canonical skill. It started as a + copy of this repo's `.claude/skills/cluster-health/` but was rewritten on + 2026-06-26 to focus on ha-sofia + emo's Sofia devices (emo is the only entry + in `SKILL_USERS`, a read-only power-user). The canonical admin skill + (`.claude/skills/cluster-health/`) is the full 47-check version and is left + untouched. **Do NOT `cp -a` the canonical copy over this one** — that would + clobber the personalization. Maintain the two independently. 
## Refreshing -Re-snapshot from a current install and commit the diff: +Re-snapshot the upstream skills from a current install and commit the diff: ```sh cp -a ~/.agents/skills/. scripts/workstation/claude-skills/ ``` -Snapshot taken 2026-06-23. +`cluster-health` is hand-maintained (emo variant) — it is **not** covered by the +`cp -a` above and must **not** be overwritten from `.claude/skills/`. Edit it in +place here when emo's needs change, then refresh his live copy (the provisioner's +`install_skills()` is if-absent, so it won't update an existing `~/.agents/skills` +copy — `cp` the new `SKILL.md` to `/home/emo/.agents/skills/cluster-health/` and +`chown emo:emo`, or remove emo's copy and re-run the reconcile). + +Snapshot taken 2026-06-23 (upstream); `cluster-health` vendored 2026-06-26, +personalized for emo 2026-06-26. diff --git a/scripts/workstation/claude-skills/cluster-health/SKILL.md b/scripts/workstation/claude-skills/cluster-health/SKILL.md new file mode 100644 index 00000000..20d13211 --- /dev/null +++ b/scripts/workstation/claude-skills/cluster-health/SKILL.md @@ -0,0 +1,146 @@ +--- +name: cluster-health +description: | + Personalized for emo. Check whether the homelab Kubernetes cluster is + affecting ha-sofia or the Sofia smart-home devices it runs (Tuya devices, + the MPPT ATS, lights, climate, security, irrigation). Use when: + (1) "is ha-sofia ok", "are my devices / the ATS / the lights down", + (2) "is the cluster affecting Sofia / my devices", + (3) "check the cluster", "cluster health", "is everything running", + (4) a device on the Барзини → Статус dashboard looks offline. + Runs the cluster-wide healthcheck read-only and triages it by what + ha-sofia actually depends on; the rest of the cluster is the admin's area. 
+author: Claude Code +version: 3.0.0-emo +date: 2026-06-26 +--- + +# Cluster Health — personalized for emo (ha-sofia focus) + +## What you actually care about + +You care about **ha-sofia** and the **Sofia smart-home devices** it runs — +the Tuya devices, the **MPPT ATS**, and the lights / climate / security / +irrigation on your **Барзини → Статус** dashboard. The wider Kubernetes +cluster matters to you **only when it's breaking something ha-sofia or your +devices depend on.** Anything else is the admin's (wizard's) area — note it in +one line and move on; don't chase it. + +You have **read-only** cluster access. You can SEE everything but change +nothing — so when something on your chain is broken, the job is to confirm it +and hand it off, not to repair it. + +## How ha-sofia depends on the cluster + +ha-sofia itself runs at the house (HAOS at https://ha-sofia.viktorbarzin.me) — +**not** in the cluster. The cluster reaches it through exactly two things: + +1. **tuya-bridge** (namespace `tuya-bridge`) — the REST API ha-sofia calls for + every Tuya device **and the MPPT ATS**. If it's unhealthy, your Tuya devices + + ATS stop responding. **This is the #1 thing to check.** +2. **The path that carries ha-sofia ⇄ tuya-bridge and keeps ha-sofia + reachable**: cloudflared (tunnel) → Traefik (LB) → the ingress + TLS cert + for `tuya-bridge.viktorbarzin.me` and `ha-sofia.viktorbarzin.me`, plus + Technitium DNS. If any of these break, ha-sofia can't reach tuya-bridge and + you can't reach ha-sofia remotely. + +Everything else in the cluster is unrelated to you unless it's hosting one of +those pods. + +## Step 1 — run the healthcheck (read-only, with your HA token) + +Your account can't read Vault, so load your own ha-sofia token first (it was +minted for you and lives at `~/.config/cluster-health/haos_token`). 
Then run +the script from YOUR clone, read-only: + +```bash +cd /home/emo/code +export HOME_ASSISTANT_SOFIA_TOKEN="$(cat ~/.config/cluster-health/haos_token)" +bash scripts/cluster_healthcheck.sh --no-fix --quiet +# machine-readable instead: +# bash scripts/cluster_healthcheck.sh --no-fix --quiet --json | tee /tmp/cluster-health.json +``` + +- **Never pass `--fix`** — it deletes pods (a write); you're read-only and it + will fail. +- Exit codes: `0` healthy, `1` warnings, `2` failures. + +With the token exported, the **ha-sofia checks run for you**: +26 Entity Availability · 27 Integration Health · 28 Automation Status · +29 System Resources · **45 Status Dashboard** — your Барзини → Статус view, +classifying every device tile as OK / ⚠️ / Offline across Сигурност, Мрежа & +IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна. Check 30 also +covers the **tuya** exporter. + +## Step 2 — triage the output by relevance to YOU + +Read the PASS/WARN/FAIL summary, then split the WARN/FAIL items in two: + +- **On your chain → this is what matters.** Anything touching: `tuya-bridge`, + `cloudflared`, `traefik`, DNS (check 21), the TLS cert / ingress for your two + hosts (checks 12, 22, 31, 32), or a **node** hosting those pods — plus all the + **ha-sofia** checks (26–29, 45) and the **tuya** exporter (30). +- **Not on your chain → one line, then drop it.** Summarise as "N unrelated + cluster issues (admin's area)" and don't investigate. 
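
The split in Step 2 can be sketched as a small lookup over the check numbers named above. The function name `on_chain_check` is hypothetical, and the sketch covers only the numbered checks. A WARN/FAIL that mentions `tuya-bridge`, `cloudflared`, `traefik`, or a node hosting those pods still counts as on-chain regardless of its number.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustration only): succeed when a healthcheck number
# is one of the checks Step 2 lists as being on the ha-sofia chain.
on_chain_check() {
  case "$1" in
    21)             return 0 ;;  # DNS
    12|22|31|32)    return 0 ;;  # TLS cert / ingress for the two hosts
    26|27|28|29|45) return 0 ;;  # ha-sofia checks
    30)             return 0 ;;  # tuya exporter
    *)              return 1 ;;  # everything else: admin's area
  esac
}
```

For example, `on_chain_check 45 && echo "on chain"` prints `on chain`, while `on_chain_check 14` (Uptime Kuma) returns non-zero and would be counted among the unrelated issues.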
+ +## Step 3 — read-only checks for your chain + +All of these work with your read-only access: + +```bash +# tuya-bridge — your devices + the ATS +kubectl get pods -n tuya-bridge +kubectl rollout status deploy/tuya-bridge -n tuya-bridge +kubectl logs -n tuya-bridge deploy/tuya-bridge --tail=50 + +# the reachability path ha-sofia uses +kubectl get pods -n cloudflared +kubectl get pods -n traefik +kubectl get ingress -A | grep -Ei 'tuya-bridge|ha-sofia' + +# whole external path in one shot (DNS + tunnel + Traefik + cert): +curl -sI --max-time 10 https://tuya-bridge.viktorbarzin.me | head -1 +# reachable -> HTTP/2 200 / 401 / 403 (any HTTP response = path is up) +# broken -> curl: timeout / could not resolve host +``` + +The fastest **device-level** signal is your own dashboard: open +**https://ha-sofia.viktorbarzin.me → Барзини → Статус**. If devices show +Offline / Разкачен / ⚠️ **but tuya-bridge is healthy**, the problem is at the +house (device power / Wi-Fi / the Sofia TP-Link network) — **not** the cluster. + +## Step 4 — if something on your chain is broken + +You can't fix the cluster (read-only), so **capture + hand off**: + +```bash +kubectl describe pod -n tuya-bridge +kubectl logs -n tuya-bridge --previous --tail=200 +``` + +Then file it for the admin with the **`/file-issue`** skill — e.g. *"ha-sofia +Tuya devices + ATS unresponsive; tuya-bridge pod CrashLooping"* with the output +above. cloudflared / Traefik / DNS outages are cluster-wide — the admin's +alerting is already firing, but file it so it's tracked from your side too. + +## What will skip for you (expected — not failures) + +A few checks need access your account doesn't have. They warn/skip — that's +normal, and **none of them are on your ha-sofia chain**: + +- **Uptime Kuma (14)** — needs an admin password from Vault. +- **PVE host checks** — 36 (LVM snapshots), 43 (host thermals), 44 (host load), + and the Proxmox CSI ghost-disk check — all need root SSH to the Proxmox host. 
+- **`--fix`** — pod deletion (a write); not available to you. + +(The ha-sofia checks are **not** in this list — your token makes them work.) + +## Your ha-sofia token + +- Stored at `~/.config/cluster-health/haos_token` (yours, mode 600). +- It's a **dedicated** long-lived token, named `emo-cluster-health` under + ha-sofia → your profile → **Long-Lived Access Tokens**. Revoking it there + affects only you. +- It currently carries admin-level HA scope (Home Assistant only lets a token + be minted for the account that created it, and it was minted via the admin + account). If it ever stops working, tell wizard and a fresh one can be minted. diff --git a/scripts/workstation/managed-settings.json b/scripts/workstation/managed-settings.json index de214a1b..6e8a13a5 100644 --- a/scripts/workstation/managed-settings.json +++ b/scripts/workstation/managed-settings.json @@ -1,4 +1,4 @@ { - "claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner). Do not attempt to escalate privileges or reach another user's resources.\n- Secrets are per-user. Never read another user's home directory, credentials, tokens, or ~/.claude secrets. Your own secrets live in your home at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. 
Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. Org red-lines on top:\n - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n - Push rejected by branch protection (user not whitelisted) → fall back to a / branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n - Keep every clone on a clean master when done; tell the user in plain words what happened.\n - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.\n", + "claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier for cluster/infra. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner) — don't try to exceed it. At the OS level, authorization follows the OS: where your user holds `sudo`, using it is permitted and audited.\n- OS access follows OS permissions, not a separate rule here. You may read whatever your OS user can read — directly or via `sudo` where you hold it — including other users' home directories, credentials, tokens, and `~/.claude`. 
Do not impose restrictions stricter than the OS; no extra prompt is required for OS-authorized access. Protect your own secrets at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. Org red-lines on top:\n - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n - Push rejected by branch protection (user not whitelisted) → fall back to a / branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n - Keep every clone on a clean master when done; tell the user in plain words what happened.\n - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. 
In locked infra clones secret files read as ciphertext — that is expected, not an error.\n", "model": "claude-opus-4-8" } diff --git a/scripts/workstation/setup-devvm.sh b/scripts/workstation/setup-devvm.sh index 2969b803..02bd9257 100755 --- a/scripts/workstation/setup-devvm.sh +++ b/scripts/workstation/setup-devvm.sh @@ -72,11 +72,14 @@ if [[ -n "$want_t3" && "$(t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/ fi # 2c) Bitwarden CLI — backs `homelab vault` (per-user no-HITL Vaultwarden access). -# npm-global so every user's PATH resolves it. Pinned major; best-effort (a -# failure only disables `homelab vault`, nothing else on the box). -if ! command -v bw >/dev/null; then - log "npm: installing @bitwarden/cli (homelab vault backend)" - npm install -g "@bitwarden/cli@^2024" >/dev/null 2>&1 || log "WARN: @bitwarden/cli install failed; homelab vault unavailable" +# Install SYSTEM-WIDE (npm prefix /usr → /usr/bin/bw) so EVERY user's PATH +# resolves it. The guard tests the SYSTEM path, NOT `command -v bw`: the +# latter is satisfied by an admin's own ~/.local/bin/bw and would skip the +# system install, leaving non-admins (emo, anca, …) with no backend. Pinned +# major; best-effort (a failure only disables `homelab vault`). +if [ ! -x /usr/bin/bw ] && [ ! -x /usr/local/bin/bw ]; then + log "npm: installing @bitwarden/cli system-wide (homelab vault backend)" + npm install -g --prefix /usr "@bitwarden/cli@^2024" >/dev/null 2>&1 || log "WARN: @bitwarden/cli install failed; homelab vault unavailable" fi # 3) kubelogin (kubectl oidc-login) system-wide — NOT the apt 'kubelogin' (= Azure tool). 
diff --git a/stacks/actualbudget/main.tf b/stacks/actualbudget/main.tf index 33012033..13da68a8 100644 --- a/stacks/actualbudget/main.tf +++ b/stacks/actualbudget/main.tf @@ -5,6 +5,9 @@ variable "tls_secret_name" { variable "nfs_server" { type = string } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/affine/main.tf b/stacks/affine/main.tf index bc63381c..10a94ad7 100644 --- a/stacks/affine/main.tf +++ b/stacks/affine/main.tf @@ -5,6 +5,9 @@ variable "tls_secret_name" { variable "nfs_server" { type = string } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -42,6 +45,9 @@ data "kubernetes_secret" "eso_secrets" { # DB credentials from Vault database engine (rotated automatically) # Provides DATABASE_URL that auto-updates when password rotates resource "kubernetes_manifest" "db_external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/authentik/email-secret.tf b/stacks/authentik/email-secret.tf index b3a7f201..87be65d4 100644 --- a/stacks/authentik/email-secret.tf +++ b/stacks/authentik/email-secret.tf @@ -6,6 +6,9 @@ # are non-secret and live in values.yaml. The reloader annotation rolls the # authentik pods if the password ever changes. 
resource "kubernetes_manifest" "authentik_email_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/beads-server/main.tf b/stacks/beads-server/main.tf index 5b71373e..eebed876 100644 --- a/stacks/beads-server/main.tf +++ b/stacks/beads-server/main.tf @@ -601,6 +601,9 @@ resource "kubernetes_config_map" "beadboard_config" { # Pulls the claude-agent-service bearer token from Vault so BeadBoard can # dispatch agent jobs via the in-cluster HTTP API. resource "kubernetes_manifest" "beadboard_agent_service_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/broker-sync/main.tf b/stacks/broker-sync/main.tf index 2de168a1..76d822d8 100644 --- a/stacks/broker-sync/main.tf +++ b/stacks/broker-sync/main.tf @@ -28,6 +28,9 @@ resource "kubernetes_namespace" "broker_sync" { # trading212_api_keys — JSON array of {account_id, account_type, api_key, name, currency} # imap_host, imap_user, imap_password, imap_directory — for InvestEngine + Schwab email ingest resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/calico/main.tf b/stacks/calico/main.tf index 39550024..1354190e 100644 --- a/stacks/calico/main.tf +++ b/stacks/calico/main.tf @@ -212,3 +212,65 @@ resource "kubectl_manifest" "whisker" { spec = { notifications = "Disabled" } }) } + +# --------------------------------------------------------------------------- +# Gated public ingress for the Whisker UI (infra #57 / ADR-0014). +# +# whisker.viktorbarzin.me -> whisker:8081, Authentik-gated (auth="required": +# Whisker ships NO own login — it's an admin observability UI, so Authentik +# forward-auth is the only gate between strangers and the flow view). 
The +# operator replicated `tls-secret` into calico-system already. +# +# TWO coupled pieces are required because the operator's own `whisker` +# NetworkPolicy (owned by the Whisker CR above) sets policyTypes:[Ingress] +# with NO ingress rules => default-deny on ingress to the whisker pod. The +# additive NP below ORs in a Traefik allow (k8s NetworkPolicies are additive +# across policies selecting the same pod), so we never edit the operator NP. +module "ingress_whisker" { + source = "../../modules/kubernetes/ingress_factory" + dns_type = "proxied" + namespace = "calico-system" + name = "whisker" + service_name = "whisker" + port = 8081 + auth = "required" + tls_secret_name = "tls-secret" + extra_annotations = { + "gethomepage.dev/enabled" = "true" + "gethomepage.dev/name" = "Whisker" + "gethomepage.dev/description" = "Calico flow observability (who-talks-to-whom)" + "gethomepage.dev/icon" = "calico.png" + "gethomepage.dev/group" = "Infrastructure" + } +} + +# Additive NetworkPolicy: permit Traefik -> whisker:8081. ORs with the +# operator's default-deny `whisker` NP (selecting the same pod) so Traefik +# can reach the UI without touching the operator-owned policy. 
+resource "kubernetes_network_policy_v1" "whisker_allow_traefik" { + metadata { + name = "whisker-allow-traefik" + namespace = "calico-system" + } + spec { + pod_selector { + match_labels = { + "app.kubernetes.io/name" = "whisker" + } + } + policy_types = ["Ingress"] + ingress { + from { + namespace_selector { + match_labels = { + "kubernetes.io/metadata.name" = "traefik" + } + } + } + ports { + port = "8081" + protocol = "TCP" + } + } + } +} diff --git a/stacks/changedetection/main.tf b/stacks/changedetection/main.tf index ee203e7b..319ebcf1 100644 --- a/stacks/changedetection/main.tf +++ b/stacks/changedetection/main.tf @@ -19,6 +19,9 @@ resource "kubernetes_namespace" "changedetection" { } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/chrome-service/files/novnc/entrypoint.sh b/stacks/chrome-service/files/novnc/entrypoint.sh index fae5c641..aeff9408 100644 --- a/stacks/chrome-service/files/novnc/entrypoint.sh +++ b/stacks/chrome-service/files/novnc/entrypoint.sh @@ -19,14 +19,14 @@ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do sleep 2 done -# websockify runs as PID 1; x11vnc is a child so its logs land on container stdout -# `-noshm` skips MIT-SHM probes that fail across container boundaries (each -# container has its own /dev/shm); `-noxdamage` skips XDAMAGE which Xvfb -# doesn't expose; `-quiet` keeps the polling chatter out of pod logs. +# Both x11vnc and websockify run as supervised children of this entrypoint (PID +# 1) so their logs land on container stdout and the `wait -n` at the end can catch +# either one dying. `-noshm` skips MIT-SHM probes that fail across container +# boundaries (each container has its own /dev/shm); `-noxdamage` skips XDAMAGE +# which Xvfb doesn't expose; `-quiet` keeps the polling chatter out of pod logs. 
echo "starting x11vnc -> :5900" x11vnc -display localhost:99 -nopw -listen 0.0.0.0 -rfbport 5900 \ -forever -shared -noshm -noxdamage -quiet 2>&1 & -X11VNC_PID=$! for i in 1 2 3 4 5 6 7 8 9 10; do if echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then @@ -43,4 +43,18 @@ if ! echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then fi echo "starting websockify -> :6080" -exec websockify --web=/usr/share/novnc 6080 localhost:5900 +# Run websockify in the background (it was `exec`ed before) so BOTH it and x11vnc +# are supervised. x11vnc attaches to the chrome-service container's Xvfb over +# localhost:6099 (shared pod network); when that container restarts, x11vnc loses +# its X connection and exits. Previously websockify was PID 1 and x11vnc was an +# unsupervised child, so a dead x11vnc was never relaunched: :5900 stayed dead and +# the noVNC view went black until a manual pod restart. Now if EITHER process +# exits, `wait -n` returns and we exit non-zero so the kubelet restarts this +# container, which re-waits for Xvfb and relaunches x11vnc — the bridge self-heals +# across browser-container restarts. (Same supervision pattern as the +# android-emulator stack's entrypoint.) +websockify --web=/usr/share/novnc 6080 localhost:5900 & + +wait -n || true +echo "novnc: a supervised process (x11vnc or websockify) exited; exiting so the kubelet restarts this container." 
>&2 +exit 1 diff --git a/stacks/chrome-service/main.tf b/stacks/chrome-service/main.tf index 2f679c00..382cc0c7 100644 --- a/stacks/chrome-service/main.tf +++ b/stacks/chrome-service/main.tf @@ -41,6 +41,9 @@ resource "kubernetes_namespace" "chrome_service" { # --- Secrets (single-key extract: api_bearer_token) --- resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -330,15 +333,23 @@ resource "kubernetes_deployment" "chrome_service" { container { name = "novnc" # Phase 3 cutover 2026-05-07 — Forgejo registry consolidation. - image = "ghcr.io/viktorbarzin/chrome-service-novnc:latest" + # SHA-pinned (not :latest): Keel is OFF for this deployment + # (keel.sh/policy=never, below) and :latest/IfNotPresent won't re-pull a + # rebuilt image, so a new noVNC entrypoint only deploys when this digest + # is bumped here. Bump after build-chrome-service-novnc.yml pushes a new + # SHA tag — then WAIT for that apply pipeline to finish before pushing + # anything else: Woodpecker cancel-previous SIGKILLs an in-flight apply + # mid-run (memory id=1957), which is exactly how the 2026-06-27 apply got + # killed. 2026-06-27: bumped to land the x11vnc-supervision self-heal fix + # (noVNC went black after a browser-container restart; see + # docs/architecture/chrome-service.md "x11vnc supervision"). + image = "ghcr.io/viktorbarzin/chrome-service-novnc:19d0f0933a8ec75be6cfa077db88e0f8c3760f40" image_pull_policy = "IfNotPresent" # Cap RLIMIT_NOFILE before the entrypoint runs. Containerd grants pods # nofile=2^31; x11vnc sweeps the whole fd table on each client connect, # so every VNC connection hangs on "Connecting" until it times out - # (fd-sweep bug, same as android-emulator). 
entrypoint.sh now also sets - # this, but the image is :latest/IfNotPresent so a rebuilt entrypoint - # isn't guaranteed to be pulled — this wrapper applies the cap - # deterministically on every rollout off the cached image. + # (fd-sweep bug, same as android-emulator). entrypoint.sh also sets this; + # the wrapper keeps the cap deterministic even off a cached image. command = ["bash", "-c", "ulimit -n 65536; exec /entrypoint.sh"] port { name = "http" diff --git a/stacks/ci-pipeline-health/main.tf b/stacks/ci-pipeline-health/main.tf index 17378f84..44aacbec 100644 --- a/stacks/ci-pipeline-health/main.tf +++ b/stacks/ci-pipeline-health/main.tf @@ -49,6 +49,9 @@ resource "kubernetes_namespace" "ci_pipeline_health" { # billing on PRIVATE mirrors, which a future scoped read:packages rotation of # the alias could not do. Blast radius = this single-CronJob namespace. resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/claude-agent-service/main.tf b/stacks/claude-agent-service/main.tf index 9f8b6478..a039f699 100644 --- a/stacks/claude-agent-service/main.tf +++ b/stacks/claude-agent-service/main.tf @@ -38,6 +38,9 @@ resource "kubernetes_namespace" "claude_agent" { # --- Secrets --- resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/claude-breakglass/main.tf b/stacks/claude-breakglass/main.tf index 6b996b9e..ca700945 100644 --- a/stacks/claude-breakglass/main.tf +++ b/stacks/claude-breakglass/main.tf @@ -57,6 +57,9 @@ resource "kubernetes_service_account" "breakglass" { # DENIED this path (see stacks/vault/main.tf) so the shared, prompt-injectable # pod can never read it. 
resource "kubernetes_manifest" "external_secret_ssh" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -82,6 +85,9 @@ resource "kubernetes_manifest" "external_secret_ssh" { # Env secrets: the Anthropic OAuth token (shared with claude-agent-service — # same account) and the app bearer token (in-cluster/CLI fallback caller auth). resource "kubernetes_manifest" "external_secret_env" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/claude-memory/main.tf b/stacks/claude-memory/main.tf index 18c21fe5..fad08b42 100644 --- a/stacks/claude-memory/main.tf +++ b/stacks/claude-memory/main.tf @@ -29,6 +29,9 @@ resource "kubernetes_namespace" "claude-memory" { } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -57,6 +60,9 @@ resource "kubernetes_manifest" "external_secret" { # DB credentials from Vault database engine (rotated every 24h) resource "kubernetes_manifest" "db_external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/coturn/main.tf b/stacks/coturn/main.tf index caeb9a66..9ab23e5d 100644 --- a/stacks/coturn/main.tf +++ b/stacks/coturn/main.tf @@ -5,6 +5,9 @@ variable "tls_secret_name" { variable "public_ip" { type = string } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/dawarich/main.tf b/stacks/dawarich/main.tf index 2432e9c3..3eeb1540 100644 --- a/stacks/dawarich/main.tf +++ b/stacks/dawarich/main.tf @@ -23,6 +23,9 @@ resource "kubernetes_namespace" "dawarich" { } resource "kubernetes_manifest" "external_secret" { + 
field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/dbaas/modules/dbaas/main.tf b/stacks/dbaas/modules/dbaas/main.tf index 479263ed..d940f642 100644 --- a/stacks/dbaas/modules/dbaas/main.tf +++ b/stacks/dbaas/modules/dbaas/main.tf @@ -745,7 +745,10 @@ resource "kubernetes_deployment" "phpmyadmin" { labels = { "app" = "phpmyadmin" tier = var.tier - + # ADR-0014 service identity: dbaas is a multi-Service namespace, so the + # namespace alone can't attribute Goldmane flows. Value = the fronting + # Service name (kubernetes_service.phpmyadmin is named "pma"). + "service-identity" = "pma" } annotations = { "reloader.stakater.com/search" = "true" @@ -762,6 +765,10 @@ resource "kubernetes_deployment" "phpmyadmin" { metadata { labels = { "app" = "phpmyadmin" + # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the + # disambiguating identity must live on the pod template (not just + # the Deployment metadata above). Not in selector → no replace. + "service-identity" = "pma" } } spec { @@ -812,8 +819,19 @@ resource "kubernetes_deployment" "phpmyadmin" { } } lifecycle { - # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 - ignore_changes = [spec[0].template[0].spec[0].dns_config] + ignore_changes = [ + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 + # This Deployment is Keel-enrolled (keel.sh/policy=patch). Ignore the + # attributes Keel/Kyverno mutate at runtime so `terragrunt apply` (incl. + # the daily drift plan) doesn't fight them or revert the live image — + # canonical KEEL/KYVERNO lifecycle guard, matches linkwarden/chrome-service. 
+ metadata[0].annotations["keel.sh/policy"], + metadata[0].annotations["keel.sh/trigger"], + metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2 + metadata[0].annotations["keel.sh/match-tag"], + spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates + spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1 + ] } } @@ -1499,6 +1517,10 @@ resource "kubernetes_deployment" "pgadmin" { } labels = { tier = var.tier + # ADR-0014 service identity: dbaas is a multi-Service namespace, so the + # namespace alone can't attribute Goldmane flows. Value = the fronting + # Service name (kubernetes_service.pgadmin is named "pgadmin"). + "service-identity" = "pgadmin" } } spec { @@ -1514,6 +1536,10 @@ resource "kubernetes_deployment" "pgadmin" { metadata { labels = { app = "pgadmin" + # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the + # disambiguating identity must live on the pod template (not just + # the Deployment metadata above). Not in selector → no replace. + "service-identity" = "pgadmin" } } spec { @@ -1568,8 +1594,20 @@ resource "kubernetes_deployment" "pgadmin" { } } lifecycle { - # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 - ignore_changes = [spec[0].template[0].spec[0].dns_config] + ignore_changes = [ + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 + # This Deployment is Keel-enrolled (keel.sh/policy=patch) and Keel has + # bumped the live image (dpage/pgadmin4:9.16). Ignore the Keel/Kyverno + # runtime-mutated attributes so `terragrunt apply` (incl. the daily drift + # plan) doesn't revert the image to bare `dpage/pgadmin4` or strip Keel's + # annotations — canonical guard, matches linkwarden/chrome-service. 
+ metadata[0].annotations["keel.sh/policy"], + metadata[0].annotations["keel.sh/trigger"], + metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2 + metadata[0].annotations["keel.sh/match-tag"], + spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates + spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1 + ] } } resource "kubernetes_service" "pgadmin" { diff --git a/stacks/diun/main.tf b/stacks/diun/main.tf index 9933f064..81294806 100644 --- a/stacks/diun/main.tf +++ b/stacks/diun/main.tf @@ -20,6 +20,9 @@ resource "kubernetes_namespace" "diun" { } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/ebooks/main.tf b/stacks/ebooks/main.tf index a5754590..0813b45a 100644 --- a/stacks/ebooks/main.tf +++ b/stacks/ebooks/main.tf @@ -20,6 +20,9 @@ resource "kubernetes_namespace" "ebooks" { # ExternalSecrets for all three sources resource "kubernetes_manifest" "calibre_external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -47,6 +50,9 @@ resource "kubernetes_manifest" "calibre_external_secret" { } resource "kubernetes_manifest" "audiobookshelf_external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -74,6 +80,9 @@ resource "kubernetes_manifest" "audiobookshelf_external_secret" { } resource "kubernetes_manifest" "servarr_external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/f1-stream/main.tf b/stacks/f1-stream/main.tf index a62ad01a..bcd66c7f 100644 --- a/stacks/f1-stream/main.tf +++ b/stacks/f1-stream/main.tf @@ -33,6 +33,9 @@ resource "kubernetes_namespace" 
"f1-stream" { } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -62,6 +65,9 @@ resource "kubernetes_manifest" "external_secret" { # Pull the chrome-service bearer token into this namespace as a separate # Secret so the verifier can reach the in-cluster Playwright pool. resource "kubernetes_manifest" "chrome_service_client_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/fire-planner/main.tf b/stacks/fire-planner/main.tf index 21503a37..0cab541e 100644 --- a/stacks/fire-planner/main.tf +++ b/stacks/fire-planner/main.tf @@ -53,6 +53,9 @@ resource "kubernetes_namespace" "fire_planner" { # Seed before applying: # secret/fire-planner -> property `recompute_bearer_token` resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -115,6 +118,9 @@ resource "kubernetes_manifest" "external_secret" { # Template builds the asyncpg DSN consumed by the FastAPI app + CronJob # as DB_CONNECTION_STRING. resource "kubernetes_manifest" "db_external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -159,6 +165,9 @@ resource "kubernetes_manifest" "db_external_secret" { # pg-sync sidecar populates `daily_account_valuation` etc. hourly; the # fire-planner ingest reads those tables via this role. resource "kubernetes_manifest" "wealthfolio_sync_db_external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -661,6 +670,9 @@ variable "run_examples_bulk_ingest" { # Reddit OAuth creds pulled from Vault secret/viktor. 
resource "kubernetes_manifest" "external_secret_examples_reddit" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -701,6 +713,9 @@ resource "kubernetes_manifest" "external_secret_examples_reddit" { # claude-agent-service bearer pulled separately so its rotation cadence # is decoupled from the Reddit creds. resource "kubernetes_manifest" "external_secret_examples_claude" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/forgejo/email-secret.tf b/stacks/forgejo/email-secret.tf index 034d45f2..d0e44c1c 100644 --- a/stacks/forgejo/email-secret.tf +++ b/stacks/forgejo/email-secret.tf @@ -6,6 +6,9 @@ # (stacks/authentik/email-secret.tf) — one credential, one rotation point. The # reloader annotation rolls the Forgejo pod if the password is ever rotated. resource "kubernetes_manifest" "forgejo_email_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/freedify/main.tf b/stacks/freedify/main.tf index 3e2cf8b4..2f017003 100644 --- a/stacks/freedify/main.tf +++ b/stacks/freedify/main.tf @@ -3,6 +3,9 @@ variable "tls_secret_name" { sensitive = true } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/freshrss/main.tf b/stacks/freshrss/main.tf index 31c5d20e..61e2122e 100644 --- a/stacks/freshrss/main.tf +++ b/stacks/freshrss/main.tf @@ -18,6 +18,9 @@ resource "kubernetes_namespace" "immich" { } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/goldmane-edge-aggregator/main.tf b/stacks/goldmane-edge-aggregator/main.tf index 
f2da273d..1c6fa58a 100644 --- a/stacks/goldmane-edge-aggregator/main.tf +++ b/stacks/goldmane-edge-aggregator/main.tf @@ -57,16 +57,19 @@ resource "kubernetes_namespace" "goldmane_edge_aggregator" { # ----------------------------------------------------------------------------- # The aggregator dials goldmane:7443 over mutual TLS. We mint a client cert # signed by the Tigera CA (the same CA that issues Goldmane's serving cert), so -# Goldmane trusts the client and the client trusts Goldmane's server cert via -# the published CA bundle. -# -# The Tigera CA private key lives in the `tigera-ca-private` Secret in -# tigera-operator (Opaque; verified keys: tls.crt + tls.key). The stack's apply -# identity needs RBAC get on that secret — see the Role/RoleBinding below. -data "kubernetes_secret" "tigera_ca" { +# Goldmane requires mutual TLS on :7443 and verifies the client cert chains to +# the Tigera CA — it does NOT authorize by client identity, so ANY Tigera-CA- +# signed cert is accepted. Rather than copy the Tigera CA PRIVATE KEY into TF +# state to mint our own (a needless CA-key exposure; the hashicorp/tls provider +# is also incompatible with this repo's global generate-providers/lockfile +# pattern), we REUSE the operator-minted, Tigera-CA-signed client cert +# `whisker-backend-key-pair` (calico-system). We never touch the CA key. +# Trade-off: if the operator rotates that cert, re-apply to re-sync (hardening +# follow-up: mint an own-identity cert in-namespace if Whisker is ever removed). +data "kubernetes_secret" "whisker_backend" { metadata { - name = "tigera-ca-private" - namespace = "tigera-operator" + name = "whisker-backend-key-pair" + namespace = "calico-system" } } @@ -93,46 +96,11 @@ resource "kubernetes_config_map" "tigera_ca_bundle" { data = data.kubernetes_config_map.tigera_ca_bundle.data } -# Client private key. -resource "tls_private_key" "goldmane_client" { - algorithm = "RSA" - rsa_bits = 2048 -} - -# CSR for the client cert. 
CN identifies the client; the service-DNS SAN mirrors -# how Felix/whisker-backend present a client identity to Goldmane. -resource "tls_cert_request" "goldmane_client" { - private_key_pem = tls_private_key.goldmane_client.private_key_pem - subject { - common_name = "goldmane-edge-aggregator" - organization = "goldmane-edge-aggregator" - } - dns_names = [ - "goldmane-edge-aggregator", - "goldmane-edge-aggregator.goldmane-edge-aggregator.svc.cluster.local", - ] -} - -# Sign the CSR with the Tigera CA. 10-year validity (87600h): re-apply rotates -# it well before expiry; a long horizon avoids surprise mTLS outages from an -# unattended stack. The Tigera CA itself outlives this (operator-managed). -resource "tls_locally_signed_cert" "goldmane_client" { - cert_request_pem = tls_cert_request.goldmane_client.cert_request_pem - ca_private_key_pem = data.kubernetes_secret.tigera_ca.data["tls.key"] - ca_cert_pem = data.kubernetes_secret.tigera_ca.data["tls.crt"] - - validity_period_hours = 87600 # 10y - early_renewal_hours = 720 # re-sign on apply when <30d remain - - allowed_uses = [ - "client_auth", - "digital_signature", - "key_encipherment", - ] -} - -# The minted client cert + key, mounted at TLS_CERT_PATH / TLS_KEY_PATH defaults -# (/etc/goldmane-client-tls/tls.crt and .../tls.key). +# Client cert + key for mTLS to goldmane:7443, mounted at TLS_CERT_PATH / +# TLS_KEY_PATH defaults (/etc/goldmane-client-tls/tls.crt and .../tls.key). +# Sourced verbatim from the operator's whisker-backend client key-pair (read +# above) — already Tigera-CA-signed, which is all Goldmane verifies. No CA key +# is touched and no cross-namespace CA RBAC is needed. 
resource "kubernetes_secret" "goldmane_client_tls" { metadata { name = "goldmane-client-tls" @@ -140,47 +108,8 @@ resource "kubernetes_secret" "goldmane_client_tls" { } type = "Opaque" data = { - "tls.crt" = tls_locally_signed_cert.goldmane_client.cert_pem - "tls.key" = tls_private_key.goldmane_client.private_key_pem - } -} - -# Narrow RBAC so this stack's apply identity (and ESO/Reloader are unaffected) -# can `get` the Tigera CA private key in tigera-operator. The data source above -# reads it at apply time; this Role/RoleBinding documents + grants that access -# rather than relying on cluster-admin. The subject is the same SA the other -# Tier-1 stacks apply as (claude-agent/terraform-state for headless, the human -# OIDC identity interactively) — both are cluster-admin today, so this is -# belt-and-braces / least-privilege intent for when apply identities tighten. -resource "kubernetes_role" "read_tigera_ca" { - metadata { - name = "goldmane-edge-aggregator-read-tigera-ca" - namespace = "tigera-operator" - } - rule { - api_groups = [""] - resources = ["secrets"] - resource_names = ["tigera-ca-private"] - verbs = ["get"] - } -} - -resource "kubernetes_role_binding" "read_tigera_ca" { - metadata { - name = "goldmane-edge-aggregator-read-tigera-ca" - namespace = "tigera-operator" - } - role_ref { - api_group = "rbac.authorization.k8s.io" - kind = "Role" - name = kubernetes_role.read_tigera_ca.metadata[0].name - } - # The headless apply identity (claude-agent-service runs Tier-1 applies as the - # `terraform-state` Vault K8s role in the claude-agent namespace). 
- subject { - kind = "ServiceAccount" - name = "default" - namespace = "claude-agent" + "tls.crt" = data.kubernetes_secret.whisker_backend.data["tls.crt"] + "tls.key" = data.kubernetes_secret.whisker_backend.data["tls.key"] } } @@ -227,6 +156,11 @@ resource "kubernetes_job" "db_init" { timeouts { create = "2m" } + lifecycle { + # KYVERNO_LIFECYCLE_V1: Kyverno injects dns_config (ndots=2); ignore it so + # this idempotent Job isn't replaced (Jobs are immutable) on every apply. + ignore_changes = [spec[0].template[0].spec[0].dns_config] + } } # ExternalSecret projecting the Vault-rotated (7-day) credential into a K8s @@ -234,6 +168,9 @@ resource "kubernetes_job" "db_init" { # place in the CNPG connection allowlist are added in stacks/vault/main.tf # (see this stack's terragrunt.hcl note). remoteRef key: static-creds/pg-goldmane-edges. resource "kubernetes_manifest" "db_external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -276,6 +213,9 @@ resource "kubernetes_manifest" "db_external_secret" { # into this namespace as SLACK_WEBHOOK_URL via an ExternalSecret (no new # webhook). The digest CronJob defaults to #security. resource "kubernetes_manifest" "slack_external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -295,7 +235,7 @@ resource "kubernetes_manifest" "slack_external_secret" { data = [{ secretKey = "SLACK_WEBHOOK_URL" remoteRef = { - key = "monitoring" + key = "viktor" property = "alertmanager_slack_api_url" } }] @@ -515,8 +455,13 @@ resource "kubernetes_cron_job_v1" "digest" { } } env { - name = "SLACK_CHANNEL" - value = "#security" + name = "SLACK_CHANNEL" + # Posts to #alerts. The dedicated #security channel was abandoned + # 2026-06-25 — the shared alertmanager_slack_api_url webhook's + # Slack app isn't a member of it (channel override 404s), so all + # Slack (incl. 
alertmanager's security-lane alerts) consolidated + # to #alerts. See docs/runbooks/goldmane-flow-trail.md. + value = "#alerts" } resources { diff --git a/stacks/grampsweb/main.tf b/stacks/grampsweb/main.tf index 2d434ec7..139c6595 100644 --- a/stacks/grampsweb/main.tf +++ b/stacks/grampsweb/main.tf @@ -5,6 +5,9 @@ variable "tls_secret_name" { variable "nfs_server" { type = string } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/hackmd/main.tf b/stacks/hackmd/main.tf index bbe6db40..2e065c99 100644 --- a/stacks/hackmd/main.tf +++ b/stacks/hackmd/main.tf @@ -208,6 +208,9 @@ module "ingress" { } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/health/main.tf b/stacks/health/main.tf index 36fd17d6..7baf5f9c 100644 --- a/stacks/health/main.tf +++ b/stacks/health/main.tf @@ -250,6 +250,9 @@ module "ingress_test" { } resource "kubernetes_manifest" "external_secret_db" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -284,6 +287,9 @@ resource "kubernetes_manifest" "external_secret_db" { } resource "kubernetes_manifest" "external_secret_kv" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/hermes-agent/main.tf b/stacks/hermes-agent/main.tf index 1293d7a5..fff8578b 100644 --- a/stacks/hermes-agent/main.tf +++ b/stacks/hermes-agent/main.tf @@ -37,6 +37,9 @@ module "tls_secret" { # --- Secrets (ESO from Vault) --- resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/immich/frame-emo.tf 
b/stacks/immich/frame-emo.tf new file mode 100644 index 00000000..577d84af --- /dev/null +++ b/stacks/immich/frame-emo.tf @@ -0,0 +1,155 @@ +# Immich photo-frame for Emo (emil.barzin@gmail.com) — a second instance cloned +# from the London frame in frame.tf, scoped to Emo's Immich account + Sofia +# weather. Served at highlights-immich-emo.viktorbarzin.me and shown on Emo's +# Portal Mini (Sofia) via the portal-immich-frame app. +# API key: Vault secret/immich -> frame_api_key_emo (minted on Emo's account). + +resource "kubernetes_config_map" "frame_config_emo" { + metadata { + name = "config-emo" + namespace = "immich" + + labels = { + app = "frame-config-emo" + } + annotations = { + "reloader.stakater.com/match" = "true" + } + } + + data = { + "Settings.yml" = <<-EOF + General: + Layout: single + Interval: 45 + ImageZoom: true + ShowAlbumName: false + ShowProgressBar: false + ClockFormat: "HH:mm" + PhotoDateFormat: "dd/MM/yyyy" + WeatherApiKey: ${data.vault_kv_secret_v2.secrets.data["frame_weather_api_key"]} + UnitSystem: metric + WeatherLatLong: "42.6977,23.3219" + Language: en + Accounts: + - ImmichServerUrl: http://immich.viktorbarzin.me + ApiKey: ${data.vault_kv_secret_v2.secrets.data["frame_api_key_emo"]} + ImagesFromDays: 730 + EOF + } +} + + +resource "kubernetes_deployment" "immich-frame-emo" { + metadata { + name = "immich-frame-emo" + namespace = "immich" + annotations = { + "reloader.stakater.com/search" = "true" + } + labels = { + tier = local.tiers.gpu + } + } + + spec { + replicas = 1 + selector { + match_labels = { + app = "immich-frame-emo" + } + } + strategy { + type = "RollingUpdate" + } + template { + metadata { + labels = { + app = "immich-frame-emo" + } + annotations = { + "dependency.kyverno.io/wait-for" = "immich-server.immich:2283" + } + } + spec { + container { + image = "ghcr.io/immichframe/immichframe:v1.0.32.0" + name = "immich-frame-emo" + resources { + requests = { + cpu = "10m" + memory = "64Mi" + } + limits = { + memory = "128Mi" + 
} + } + port { + container_port = 8080 + protocol = "TCP" + name = "http" + } + volume_mount { + name = "config" + mount_path = "/app/Config" + read_only = true + } + } + volume { + name = "config" + config_map { + name = "config-emo" + } + } + } + } + } + lifecycle { + ignore_changes = [ + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 + metadata[0].annotations["keel.sh/policy"], + metadata[0].annotations["keel.sh/trigger"], + metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2 + metadata[0].annotations["keel.sh/match-tag"], + metadata[0].annotations["kubernetes.io/change-cause"], + metadata[0].annotations["deployment.kubernetes.io/revision"], + spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1 + spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE + ] + } +} + + +resource "kubernetes_service" "immich-frame-emo" { + metadata { + name = "immich-frame-emo" + namespace = "immich" + labels = { + "app" = "immich-frame-emo" + } + } + + spec { + selector = { + app = "immich-frame-emo" + } + port { + port = 80 + target_port = 8080 + } + } +} + +module "ingress_emo" { + source = "../../modules/kubernetes/ingress_factory" + # Photo-frame kiosk display on Emo's Portal — headless browser pulling images + # via an Immich API key (no user login). Forward-auth would 302 the device to + # Authentik with no way to complete login. + # auth = "none": photo-frame kiosk; headless browser with API key; no user login. 
+ auth = "none" + dns_type = "proxied" + namespace = "immich" + name = "highlights-immich-emo" + tls_secret_name = var.tls_secret_name + service_name = "immich-frame-emo" +} diff --git a/stacks/immich/main.tf b/stacks/immich/main.tf index 3009be5e..809d6a2e 100644 --- a/stacks/immich/main.tf +++ b/stacks/immich/main.tf @@ -162,6 +162,9 @@ resource "kubernetes_resource_quota" "immich" { } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/insta2spotify/main.tf b/stacks/insta2spotify/main.tf index 9770afd3..5e1cc4ef 100644 --- a/stacks/insta2spotify/main.tf +++ b/stacks/insta2spotify/main.tf @@ -20,6 +20,9 @@ resource "kubernetes_namespace" "insta2spotify" { } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/instagram-poster/modules/instagram-poster/main.tf b/stacks/instagram-poster/modules/instagram-poster/main.tf index 65714739..7dc3f846 100644 --- a/stacks/instagram-poster/modules/instagram-poster/main.tf +++ b/stacks/instagram-poster/modules/instagram-poster/main.tf @@ -35,6 +35,14 @@ resource "kubernetes_namespace" "instagram_poster" { # - immich_tag_instagram (optional — auto-resolved if missing) # - immich_tag_posted (optional — auto-resolved if missing) resource "kubernetes_manifest" "external_secret" { + # The external-secrets controller takes server-side-apply ownership of + # .spec.refreshInterval, so a plain TF apply conflicts. force_conflicts lets + # TF win (values match, so it's stable) — same pattern as grafana/woodpecker/ + # traefik/k8s-version-upgrade. Surfaced 2026-06-24 by the first IG apply since + # the ESO v1 migration (the scale-to-0 push). 
+ field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -139,6 +147,11 @@ resource "kubernetes_manifest" "external_secret" { # ESO refreshes the K8s Secret every 15m. `reloader.stakater.com/match` # bounces the pod when the password changes. resource "kubernetes_manifest" "benchmark_db_external_secret" { + # See external_secret above — ESO owns .spec.refreshInterval; force_conflicts + # lets the TF apply win instead of erroring on the field-manager conflict. + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -227,7 +240,11 @@ resource "kubernetes_deployment" "instagram_poster" { } spec { - replicas = 1 + # Scaled to 0 (2026-06-24): Instagram Graph integration is unused and its + # ExternalSecret is dead (missing ig_graph_long_lived_token / + # ig_business_account_id in Vault secret/instagram-poster). Set back to 1 + # after minting a Meta long-lived token and populating those keys. + replicas = 0 # RWO PVC — cannot rolling-update. strategy { type = "Recreate" diff --git a/stacks/job-hunter/main.tf b/stacks/job-hunter/main.tf index a008e83c..94927bf6 100644 --- a/stacks/job-hunter/main.tf +++ b/stacks/job-hunter/main.tf @@ -41,6 +41,9 @@ resource "kubernetes_namespace" "job_hunter" { # digest_to_address — where the weekly digest goes # digest_from_address — From: header for the digest resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -105,6 +108,9 @@ resource "kubernetes_manifest" "external_secret" { # DB credentials from Vault database engine (7-day rotation). # Template builds the asyncpg DSN consumed by the FastAPI app as DB_CONNECTION_STRING. 
resource "kubernetes_manifest" "db_external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -325,6 +331,9 @@ resource "kubernetes_service" "job_hunter" { # references it as $__env{JOB_HUNTER_PG_PASSWORD}. Reloader restarts # Grafana whenever ESO updates this secret (every 7d on rotation). resource "kubernetes_manifest" "grafana_job_hunter_db_external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/k8s-dashboard/oauth2_proxy.tf b/stacks/k8s-dashboard/oauth2_proxy.tf index 5ed73793..032d5057 100644 --- a/stacks/k8s-dashboard/oauth2_proxy.tf +++ b/stacks/k8s-dashboard/oauth2_proxy.tf @@ -5,6 +5,9 @@ # ----------------------------------------------------------------------------- resource "kubernetes_manifest" "oauth2_proxy_externalsecret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/+page.svelte b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/+page.svelte index 2d13fa39..7b617fd0 100644 --- a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/+page.svelte +++ b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/+page.svelte @@ -5,9 +5,11 @@

Kubernetes Access Portal

-
- VPN Required — The cluster is on a private network. You need Headscale VPN access before kubectl will work. - See the Getting Started guide for VPN setup instructions. +
+ Fastest way in: open the web terminal or the + dashboard and sign in — no install, no VPN needed. Prefer your + own machine? The local-setup guide covers VPN + kubectl, and the + Getting Started page compares all three access paths.
@@ -26,6 +28,7 @@

Assigned namespaces: {data.namespaces.join(', ')}

Quick Commands

+

Run these as-is in the web terminal — it's already signed in as you.

 # Check your pods
 kubectl get pods -n {data.namespaces[0]}
@@ -47,16 +50,23 @@ vault write kubernetes/creds/{data.namespaces[0]}-deployer \
 
 	

Get Started

+

No setup — start now

+
+ 1. Open the web terminal — a ready shell with kubectl, Vault and your repos already set up
+ 2. Open the dashboard — point-and-click view of your workloads
+
+

On your own machine

    {#if data.role === 'namespace-owner'}
-   1. Complete the namespace-owner onboarding guide
+   1. Follow the namespace-owner setup (VPN, kubectl, Vault, encrypted state)
    {:else}
-   1. Complete the onboarding guide (VPN, kubectl, git)
+   1. Follow the local setup (VPN, kubectl, git)
    {/if}
    2. Install kubectl and kubelogin
    3. Download your kubeconfig
    4. Run kubectl get namespaces to verify access
+

Compare all three access paths →

@@ -91,12 +101,12 @@ vault write kubernetes/creds/{data.namespaces[0]}-deployer \ border-radius: 6px; margin: 1rem 0; } - .callout.warning { - background: #fff3cd; - border-left: 4px solid #ffc107; + .callout.info { + background: #e8f4fd; + border-left: 4px solid #2196f3; } .callout a { - color: #856404; + color: #0d47a1; font-weight: 600; } diff --git a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/onboarding/+page.svelte b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/onboarding/+page.svelte index d6ec35b9..6b2d73dd 100644 --- a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/onboarding/+page.svelte +++ b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/onboarding/+page.svelte @@ -5,87 +5,175 @@

Getting Started

-

Welcome! Follow these steps to get access to the home Kubernetes cluster.

- - +

+ Welcome! There are three ways to reach the home Kubernetes cluster. Pick the one that fits — + the first two need zero setup and open right in your browser. +

-

Step 0 — Join the VPN

-

The cluster is on a private network (10.0.20.0/24). You need VPN access first.

+

Three ways in

+ + + + + + + + + + + + + + + + + + + +
Path | Best for | Setup
A — Web terminal | Just want to start working now | None — opens in your browser
B — Web dashboard | Click around, watch your app, read logs | None — opens in your browser
C — Your own machine | kubectl / Terraform locally, full control | VPN + one-line installer
+
+ Not sure? Start with the web terminal (Path A). + Everything is already installed and your repos are already cloned — you can run your first + kubectl command within a minute, from any device. +
+
+ +
+

Path A — Web terminal Recommended No setup

+

+ A full terminal that runs in your browser — nothing to install, works from any device + (even a tablet). It drops you into your own account on the shared workstation, with every + tool already set up. +

- 1. Install Tailscale for your OS
- 2. Run this in your terminal:
-    tailscale login --login-server https://headscale.viktorbarzin.me
+ 1. Open t3.viktorbarzin.me
+ 2. Sign in with your Authentik account (the same SSO login as this portal)
+ 3. You land in a ready-to-use shell. Try it:
+    kubectl get pods -n YOUR_NAMESPACE
- 3. A browser window will open with a registration URL
- 4. Send that URL to Viktor via email (vbarzin@gmail.com) or Slack
- 5. Wait for approval (usually within a few hours)
- 6. Once approved, test:
-    ping 10.0.20.100
+
+ Already done for you on the workstation: +
+ • kubectl + your kubeconfig, scoped to your namespaces (no login dance)
+ • vault, terragrunt, terraform, sops, kubeseal
+ • Your repos cloned under ~/code — the infra repo plus your own project repos
+ • Claude Code, ready to pair with you on changes
+
+
+
+ No access yet? The workstation is provisioned per person. If + t3.viktorbarzin.me says you're not authorized, ask Viktor to add you + (vbarzin@gmail.com or Slack). +
-
-

Step 1 — Log in to the portal

-

Visit k8s-portal.viktorbarzin.me and sign in with your Authentik account.

-

If you don't have an account yet, ask Viktor to create one.

+
+

Path B — Web dashboard No setup

+

+ A point-and-click view of the cluster — browse your pods, read logs, restart a deployment, + check events. Nothing to install. +

+
+ 1. Open k8s.viktorbarzin.me
+ 2. Sign in with your Authentik account
+ 3. You're dropped straight into the Kubernetes Dashboard, already authenticated as you —
+    no token to paste. The portal injects your personal access token for you.
+
+
+ Scoped to your namespace(s): you can see and manage your own workloads, but not other + tenants'. This path uses a per-user token that does not depend on CLI login, so it + keeps working even if kubectl OIDC login is having a bad day — making it the + reliable fallback for Path C. +
-
-

Step 2 — Set up kubectl

-

Run one of these commands in your terminal to install everything automatically:

-

macOS

-

Requires Homebrew. Install it first if you don't have it.

-
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=mac)
-

Linux

-
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=linux)
-

Windows

-

Use WSL2 and follow the Linux instructions.

-
+
+

Path C — From your own machine

+

+ For running kubectl, vault and Terraform locally. This is the most + powerful path and the one to use for infrastructure changes — it just needs a bit more setup + because the cluster API lives on a private network. +

+ + +

+ {#if showNamespaceOwner} + Namespace owner — you'll also set up Vault and encrypted Terraform state so you can deploy + your own app stacks. + {:else} + General user — VPN, kubectl and git access. (Managing your own app stack? Switch to the + Namespace Owner tab above.) + {/if} +

- {#if showNamespaceOwner}
-

Step 3 — Log into Vault

-

Vault manages your secrets and issues dynamic Kubernetes credentials.

-
vault login -method=oidc
-

This opens your browser for Authentik SSO. After login, your token is saved to ~/.vault-token.

+

Step 1 — Join the VPN

+

The cluster API is on a private network (10.0.20.0/24), so you need VPN access first.

+
+ 1. Install Tailscale for your OS
+ 2. Run this in your terminal:
+    tailscale login --login-server https://headscale.viktorbarzin.me
+ 3. A browser window opens with a registration URL
+ 4. Send that URL to Viktor via email (vbarzin@gmail.com) or Slack
+ 5. Wait for approval (usually within a few hours)
+ 6. Once approved, test:
+    ping 10.0.20.100
+
-

Step 4 — Verify kubectl access

-

Run this command. It will open your browser for OIDC login the first time:

-
kubectl get pods -n YOUR_NAMESPACE
-

You should see an empty list (no resources) or your running pods.

+

Step 2 — Install the tools

+

Run one of these to install everything automatically (kubectl, kubelogin, vault, terragrunt, terraform, kubeseal) and write your kubeconfig to ~/.kube/config-home:

+

macOS

+

Requires Homebrew. Install it first if you don't have it.

+
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=mac)
+

Linux

+
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=linux)
+

Windows

+

Use WSL2 and follow the Linux instructions.

-

Step 5 — Clone the infra repo

-
git clone https://github.com/ViktorBarzin/infra.git
+			

Step 3 — Verify access

+

Run this. The first time, it opens your browser for SSO login:

+
kubectl get {showNamespaceOwner ? 'pods -n YOUR_NAMESPACE' : 'namespaces'}
+

You should see your resources (or an empty list if you haven't deployed anything yet).

+
+ Browser login loops, or kubectl says "Unauthorized"? Command-line SSO + (OIDC) can occasionally be unavailable. When that happens, use the + web dashboard (Path B) or the + web terminal (Path A) — both authenticate a different way and + keep working — and let Viktor know. +
+

Connection error instead? Make sure the VPN is up: tailscale status.

+
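The two troubleshooting hints above (an "Unauthorized" reply versus a connection error) suggest a fixed order of checks. A minimal shell sketch of that order, assuming `tailscale`, `ping` and `kubectl` are on the PATH. `check_access` is a hypothetical helper name used for illustration and is not part of the portal's setup script:

```shell
# Hypothetical helper: report the first layer of Path C that fails.
check_access() {
  # 1. VPN: the cluster API is only reachable over the private network.
  if ! tailscale status >/dev/null 2>&1; then
    echo "vpn-down"
    return 1
  fi
  # 2. Reachability: the same test address the guide uses after VPN approval.
  if ! ping -c 1 10.0.20.100 >/dev/null 2>&1; then
    echo "unreachable"
    return 2
  fi
  # 3. kubectl: stderr is left visible so any SSO login prompt is not hidden.
  if ! kubectl get namespaces >/dev/null; then
    echo "kubectl-failed"
    return 3
  fi
  echo "ok"
}
```

Running `check_access` prints the first failing layer. A `kubectl-failed` result with the VPN up is the case where the guide points to Path A or Path B as the fallback.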
+ + {#if showNamespaceOwner} +
+

Step 4 — Log into Vault

+

Vault manages your secrets and issues dynamic Kubernetes credentials.

+
vault login -method=oidc
+

This opens your browser for Authentik SSO. After login, your token is saved to ~/.vault-token.

+
+ +
+

Step 5 — Clone the infra repo

+
git clone https://github.com/ViktorBarzin/infra.git
 cd infra
-

This is where all the infrastructure configuration lives. Terraform state is committed as encrypted files.

-
+

This is where all the infrastructure configuration lives. Terraform state is committed as encrypted files.

+
-
-

Step 6 — Install tools

-

You need sops and terragrunt to work with infrastructure state:

-

macOS

-
brew install sops terragrunt
-

Linux

-
# sops
-curl -LO https://github.com/getsops/sops/releases/latest/download/sops-v3.9.4.linux.amd64
-sudo mv sops-*.linux.amd64 /usr/local/bin/sops && sudo chmod +x /usr/local/bin/sops
-
-# terragrunt
-curl -LO https://github.com/gruntwork-io/terragrunt/releases/latest/download/terragrunt_linux_amd64
-sudo mv terragrunt_linux_amd64 /usr/local/bin/terragrunt && sudo chmod +x /usr/local/bin/terragrunt
-
- -
-

Step 7 — Decrypt your state

-

Terraform state is encrypted with SOPS. Your Vault login gives you access to only your stacks.

-
# Make sure you're logged into Vault
+			
+

Step 6 — Decrypt your state

+

Terraform state is encrypted with SOPS. Your Vault login gives you access to only your stacks.

+
# Make sure you're logged into Vault
 vault login -method=oidc
 
 # Decrypt your stack's state
@@ -95,160 +183,157 @@ scripts/state-sync decrypt YOUR_NAMESPACE
 cd stacks/YOUR_NAMESPACE
 ../../scripts/tg plan
-
-

How state encryption works

-
-
-
vault login -method=oidc
-
-
Authentik SSO
-
-
~/.vault-token
-
-
-
-
scripts/tg plan
-
-
state-sync decrypt
-
-
Vault Transit
sops-state-YOUR_NS
-
-
-
-
terragrunt plan/apply
-
-
state-sync encrypt
-
-
git commit + push
+
+

How state encryption works

+
+
+
vault login -method=oidc
+
+
Authentik SSO
+
+
~/.vault-token
+
+
+
+
scripts/tg plan
+
+
state-sync decrypt
+
+
Vault Transit
sops-state-YOUR_NS
+
+
+
+
terragrunt plan/apply
+
+
state-sync encrypt
+
+
git commit + push
+
-
-
- Access control: You can only decrypt state for your own namespaces. - Each namespace has its own Vault Transit encryption key. Your Vault policy - (sops-user-YOUR_USERNAME) only grants access to your keys. -
-
+
+ Access control: You can only decrypt state for your own namespaces. + Each namespace has its own Vault Transit encryption key. Your Vault policy + (sops-user-YOUR_USERNAME) only grants access to your keys. +
+
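For orientation, the flow in the diagram can be written out as the manual steps it appears to wrap. This is a sketch only: `state_cycle` is a hypothetical helper that prints the commands instead of running them. It assumes that `state-sync encrypt` takes the same namespace argument as `state-sync decrypt`, and that the stack directory is named after the namespace. In normal use `scripts/tg` is described as doing the decrypt and encrypt steps automatically.

```shell
# Hypothetical helper: print (not run) the manual equivalent of one state cycle.
state_cycle() {
  ns="${1:-}"
  if [ -z "$ns" ]; then
    echo "usage: state_cycle NAMESPACE" >&2
    return 64
  fi
  # Order follows the diagram: login, decrypt, terragrunt, encrypt, commit.
  # The encrypt arguments and the commit message are assumptions.
  cat <<EOF
vault login -method=oidc
scripts/state-sync decrypt $ns
(cd stacks/$ns && terragrunt apply)
scripts/state-sync encrypt $ns
git add stacks/$ns/ state/stacks/$ns/terraform.tfstate.enc
git commit -m "update $ns state" && git push
EOF
}
```

Calling `state_cycle YOUR_NAMESPACE` lists the sequence for review. Only the encrypted `.enc` state file is staged, matching the commit step shown in the guide.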
-
-

Step 8 — Create your first app stack

-
    -
  1. Copy the template:
    cp -r stacks/_template stacks/myapp
    +			
    +

    Step 7 — Create your first app stack

    +
      +
    1. Copy the template:
      cp -r stacks/_template stacks/myapp
       mv stacks/myapp/main.tf.example stacks/myapp/main.tf
    2. -
    3. Edit stacks/myapp/main.tf — replace all <placeholders>
    4. -
    5. Store secrets in Vault: -
      vault kv put secret/YOUR_USERNAME/myapp DB_PASSWORD=secret123
      -
    6. -
    7. Apply your stack: -
      cd stacks/myapp && ../../scripts/tg apply
      -
    8. -
    9. Commit encrypted state: -
      cd ../..
      +					
    10. Edit stacks/myapp/main.tf — replace all <placeholders>
    11. +
    12. Store secrets in Vault: +
      vault kv put secret/YOUR_USERNAME/myapp DB_PASSWORD=secret123
      +
    13. +
    14. Apply your stack: +
      cd stacks/myapp && ../../scripts/tg apply
      +
    15. +
    16. Commit encrypted state: +
      cd ../..
       git add stacks/myapp/ state/stacks/myapp/terraform.tfstate.enc
       git commit -m "add myapp stack"
       git push
      -
    17. -
    -
    +
  2. +
+
-
-

Architecture Overview

-

Here's how your changes flow through the system:

+
+

Architecture Overview

+

Here's how your changes flow through the system:

-
-

Apply workflow

-
-
-
Your Machine
-
git pull
-
-
scripts/tg plan
-
auto-decrypt
-
scripts/tg apply
-
auto-encrypt
-
git push
-
-
-
Vault
-
OIDC auth
Authentik SSO
-
-
Transit decrypt
sops-state-*
-
-
Transit encrypt
per-stack key
-
-
-
Cluster
-
K8s API
-
-
Your namespace
pods, services
-
-
Traefik ingress
*.viktorbarzin.me
+
+

Apply workflow

+
+
+
Your Machine
+
git pull
+
+
scripts/tg plan
+
auto-decrypt
+
scripts/tg apply
+
auto-encrypt
+
git push
+
+
+
Vault
+
OIDC auth
Authentik SSO
+
+
Transit decrypt
sops-state-*
+
+
Transit encrypt
per-stack key
+
+
+
Cluster
+
K8s API
+
+
Your namespace
pods, services
+
+
Traefik ingress
*.viktorbarzin.me
+
-
-
-

Security model

- - - - - - - - - -
Layer | What | How
Authentication | Who are you? | Authentik SSO (OIDC) → Vault token
Authorization | What can you access? | Vault policy (sops-user-*) scoped to your namespaces
Encryption at rest | State in git | SOPS + Vault Transit (per-stack key)
Encryption fallback | Bootstrap / DR | age keys (admin only)
Network | Cluster access | Headscale VPN (private 10.0.20.0/24)
-
-
- {:else} -
-

Step 3 — Verify access

-

Run this command. It will open your browser for login the first time:

-
kubectl get namespaces
-

You should see output like:

-
NAME              STATUS   AGE
-default           Active   200d
-kube-system       Active   200d
-monitoring        Active   200d
-...
-

If you get a connection error, make sure your VPN is connected (tailscale status).

-
- -
-

Step 4 — Clone the repo

-
git clone https://github.com/ViktorBarzin/infra.git
+				
+

Security model

+ + + + + + + + + +
Layer | What | How
Authentication | Who are you? | Authentik SSO (OIDC) → Vault token
Authorization | What can you access? | Vault policy (sops-user-*) scoped to your namespaces
Encryption at rest | State in git | SOPS + Vault Transit (per-stack key)
Encryption fallback | Bootstrap / DR | age keys (admin only)
Network | Cluster access | Headscale VPN (private 10.0.20.0/24)
+
+
+ {:else} +
+

Step 4 — Clone the repo

+
git clone https://github.com/ViktorBarzin/infra.git
 cd infra
-

This is where all the infrastructure configuration lives.

-
+

This is where all the infrastructure configuration lives.

+
-
-

Step 5 — Your first change

-
- 1. Create a branch:
-    git checkout -b my-first-change
- 2. Edit a service file (e.g., change an image tag in stacks/echo/main.tf)
- 3. Commit and push:
-    git add . && git commit -m "my first change" && git push -u origin my-first-change
- 4. Open a Pull Request on GitHub
- 5. Viktor reviews and merges
- 6. Woodpecker CI automatically applies the change to the cluster
- 7. Slack notification confirms it worked
-
-
- {/if} +
+

Step 5 — Your first change

+
+ 1. Create a branch:
+    git checkout -b my-first-change
+ 2. Edit a service file (e.g., change an image tag in stacks/echo/main.tf)
+ 3. Commit and push:
+    git add . && git commit -m "my first change" && git push -u origin my-first-change
+ 4. Open a Pull Request on GitHub
+ 5. Viktor reviews and merges
+ 6. Woodpecker CI automatically applies the change to the cluster
+ 7. Slack notification confirms it worked
+
+
+ {/if} +