diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index 9c873a07..496f30d7 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -16,6 +16,7 @@ **ALL infrastructure changes MUST go through Terraform/Terragrunt.** Never use `kubectl apply/edit/patch/set`, `helm install/upgrade`, or any manual cluster mutation as the final state. - **No exceptions for "quick fixes"** — even one-line changes must be in `.tf` files and applied via `scripts/tg apply` +- **Apply locally OR let CI do it — but ALWAYS commit.** You don't have to wait for CI: with apply access you MAY run the apply yourself (`scripts/tg apply ` / `homelab tf apply `), but **from the main checkout, never a worktree** (git-crypt'd `*.tfvars` come through as ciphertext under the worktree filter-bypass, so a worktree apply reads garbage). **Every applied change MUST be committed and pushed to `master` the same session** — the repo is the source of truth, so applied-but-uncommitted HCL is drift that the next CI apply / daily drift-detection will try to revert. Order either way: apply locally then commit + push (CI's changed-stack apply then no-ops), or commit + push and let CI apply. Never apply an uncommitted edit; never leave a committed change unapplied. - **kubectl is for read-only operations and temporary debugging only** (get, describe, logs, exec, port-forward) - **If a resource isn't in Terraform yet**, evaluate whether it can be added before making manual changes. If manual change is unavoidable (e.g., emergency), document it immediately and create the Terraform resource in the same session - **kubectl scale/patch during migrations is acceptable** as a transient step, but the final state must be in Terraform and applied via `scripts/tg apply` @@ -203,7 +204,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`). - **PDBs**: minAvailable=2 on Traefik and Authentik. - **Fallback proxies**: basicAuth when Authentik is down, fail-open when poison-fountain is down. 
- **CrowdSec enforcement is out-of-band** (no Traefik plugin/middleware — the dead Yaegi `crowdsec-bouncer-traefik-plugin` was removed on Traefik 3.7.5): banned IPs are dropped **in-kernel via nftables** by the `cs-firewall-bouncer` DaemonSet on **direct** hosts (drops in BOTH the `input` and `forward` hooks — Traefik is ETP=Local so client traffic is DNAT'd to the pod via `forward`; pulls ALL decisions incl. the ~31k CAPI blocklist), and **blocked at the Cloudflare edge** for **proxied** hosts (one `crowdsec_ban` Rules List + a zone WAF block rule, fed by the `crowdsec-cf-sync` CronJob in `rybbit` ns every 2 min — excludes CAPI). Zero per-request latency; **fails open** (LAPI down → no new bans, existing drops persist, legit traffic never blocked). Whitelist covers RFC1918 + tailnet + internal CIDRs. Full as-built: `docs/architecture/security.md`. -- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations). +- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations), authentik 100/1000 on `/`+`/static` (login SPA cold-loads ~70 flow chunks from `/static`; default burst 429'd them → blank login screen). - **Retry middleware**: 2 attempts, 100ms — in default ingress chain. - **Entrypoint transport timeouts** (`websecure` `respondingTimeouts`): `writeTimeout=0` (unlimited download duration), `readTimeout=3600s` (uploads ≤1h), `idleTimeout=600s`. These are **HARD total-duration caps**, not nginx-style per-read idle timeouts — a finite `writeTimeout` truncates *any* large download at that wall-clock mark (a prior `writeTimeout=60s` silently cut Immich videos at 60s). 
**Do NOT re-tighten `writeTimeout`**; keep `readTimeout` finite (slow-loris backstop) but ≥ longest expected upload. Full rationale: `docs/architecture/networking.md` → "Entrypoint Transport Timeouts". - **HTTP/3 (QUIC)**: Enabled on Traefik. Works for **direct (non-proxied) apps** via the dedicated LB IP below (ETP=Local). Proxied apps get QUIC at the Cloudflare edge. @@ -218,7 +219,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`). | Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). 
The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. 
To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. | | CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob. 
**Enforcement is out-of-band, NOT a Traefik plugin** (the Yaegi `crowdsec-bouncer-traefik-plugin` was dead on Traefik 3.7.5 and removed): `cs-firewall-bouncer` DaemonSet drops in-kernel via nftables on direct hosts (bouncer key `firewall`, v0.0.34 binary fetched at runtime, hostNetwork+NET_ADMIN, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`); `crowdsec-cf-sync` CronJob blocks at the CF edge for proxied hosts (bouncer key `kvsync`, `stacks/rybbit/crowdsec_edge.tf`). Both fail open. See `docs/architecture/security.md` | | Frigate | GPU stall detection in liveness probe (inference speed check), high CPU | -| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control. | +| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control; `/`+`/static` use a dedicated `authentik-rate-limit` (100/1000) so the cold-load chunk burst isn't 429'd into a blank screen. 
**Reliability (2026-06-28): the chart key is `deploymentStrategy`, NOT `strategy`** — the old `strategy:` key was inert, so live ran the chart default 25%/25% and dropped a server pod out of rotation on every roll; now `maxSurge:1/maxUnavailable:0`. Readiness `failureThreshold:8` (~80s, was 30s): the DB-coupled `/-/health/ready/` returns 503 on a PG/pgbouncer blip, and with too-tight tolerance all 3 server pods left the Service at once → Traefik 502/504 (the episodic blank-screen + 30s-hang). gunicorn `max_requests=10000`/jitter=1000 decorrelates worker recycles from DB blips. Redis is GONE since 2026.2 (sessions+cache+channels on PostgreSQL, no external-cache option) — a short PG transient is now survived, but a TOTAL CNPG outage still takes authentik down. **Custom overlay image (2026-06-28):** server+worker run `ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3` (built by `.github/workflows/build-authentik.yml` from `stacks/authentik/Dockerfile` + `patch-compat-sfe.py`) with TWO guarded patches: **#1 SLOW-1a** — narrows the identification-stage `select_subclasses()` query (~1.4s→~14ms; bare upstream call LEFT-JOINs every source subtype); **#2 old-browser blank login** — `patch-compat-sfe.py` (a) extends `compat_needs_sfe()` to serve authentik's built-in no-JS **SFE** login to old Safari/WebKit AND **any iOS browser** (Chrome/CriOS, Firefox/FxiOS — all share the system WebKit) on iOS≤16.3, and (b) **injects static social-login `` links into the SFE shell** (`flow-sfe.html`) since the SFE can't render Identification-stage sources — required for password-less accounts (e.g. emo = Google-only). The modern flow SPA is ES2022 (needs Safari 16.4+) and renders BLANK on older WebKit; every iOS browser shares that WebKit, so it's not browser-choice (emo's iPadOS-15.8 iPad hit this). 
SFE = the *real* authentik login (password + MFA + reputation, no auth downgrade) — chosen over a Traefik basic-auth fallback which would have put a spoofable-UA single password in front of `vbarzin→wizard` passwordless-root. Social link = plain redirect to `/source/oauth/login//` (works on any browser); slugs (google/github/facebook) are static — re-verify on source changes. **Keel un-enrolled** for the ns → image pinned in `global.image` (repo+tag), **upgraded manually**: bump the Dockerfile `FROM` + the values tag (+ re-verify both patches) together, GHA rebuilds, then apply. | | Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version | | MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. | | phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). | @@ -231,9 +232,10 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`). - Alertmanager is now scraped (`extraScrapeConfigs` job `alertmanager`) → `alertmanager_notifications_total`/`_alerts`/`_notifications_failed_total` available; it had no `prometheus.io/scrape` annotation so notification volume was previously unmeasurable. 
- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions. - **External monitoring**: `[External] ` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param). -- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction). +- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction), AuthentikRootRouter5xxHigh (all-3-server-pods-NotReady cascade → 502/503/504 on the authentik `/` router). 
**The Traefik scrape keeps `traefik_router_requests_total`** (per-router `code` label) — the drop-regex in the `traefik` scrape job drops only the high-cardinality `*_duration_seconds_bucket` histogram, NOT the request counter, so per-router 429/5xx is queryable + alertable. - **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay). -- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → `#security` Slack). 
**To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI ''` must NOT show a Location to `authentik.viktorbarzin.me` before adding. +- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → posts to `#alerts` via the `slack-security` receiver, which keeps its `[SECURITY]` styling; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI ''` must NOT show a Location to `authentik.viktorbarzin.me` before adding. +- **pfSense egress / WAN monitoring** (added 2026-06-28 after the 2026-06-27 egress-only incident — pfSense VMID 101 stopped passing internet egress for ~20 min while internal routing + Unbound stayed up, and NOTHING alerted: no egress probe existed and the cloudflared replica metric stayed green): `blackbox-exporter` gained `icmp_egress` + `dns_external` modules (+ `NET_RAW` on the pod) in `authentik_walloff_probe.tf`. 
Three in-cluster probe jobs (`wan-gateway-icmp` → 192.168.1.1, `internet-egress-icmp` → 9.9.9.9/1.1.1.1, `internet-egress-dns` → cloudflare.com via both) traverse the pod→node→pfSense-NAT path that fails. Alerts (group `Egress / pfSense` in `alerting_rules.yml`): `WANGatewayUnreachable`, `InternetEgressDown` (`max()==0` = both providers dead, not a single-provider blip), `ExternalDNSResolutionDown`, `EgressOnlyDivergence` (t3-probe `cloudflare` leg down WHILE `internal` leg up — the incident signature, reuses the existing t3-probe), `PfSenseVMDown` (`pve_up{id="qemu/101"}==0` while host up — does NOT catch a guest-internal reboot, `pve_up` tracks the qemu process). Plus Loki ruler `CloudflaredTunnelConnLoss` (>20 edge-conn failures/5m; calibrated live: steady-state ~2/6h vs 37-85/5m in-incident; the cloudflared replica metric is blind to tunnel-connection loss). `WANGatewayUnreachable`/`InternetEgressDown` **inhibit** the downstream egress symptoms (ExternalDNSResolutionDown/EgressOnlyDivergence/CloudflaredTunnelConnLoss/Email*/ExternalAccessDivergence). Runbook: `docs/runbooks/pfsense-egress.md`. **Deferred (needs a live-pfSense change, not in this monitoring-only change):** point dpinger's monitor at the local gateway + widen thresholds, disable `gw_down_kill_states`, add a failover gateway group + auto-recovery watchdog, and ship pfSense system/gateway/routing syslog to Loki (today only filterlog → CrowdSec; those logs are NOT centrally queryable — id #6717). No Uptime-Kuma egress monitor was added (the `external-monitor-sync` is purpose-built for `*.viktorbarzin.me` Cloudflare-path discovery; the blackbox probes cover egress directly). ## Security Posture (Wave 1 — locked 2026-05-18) @@ -241,9 +243,10 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se - **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). 
emo's identity scheme is unknown — ask before assuming. - **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.) -- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging. +- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts` (it keeps its `[SECURITY/]` title styling so security-lane alerts stand out). Severity labels carried in the alert (critical/warning/info). No paging. The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred. -- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. 
+- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`). +- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. 
**WHISKER-WEDGE GOTCHA** (2026-06-28): the operator's `whisker` NetworkPolicy allows DNS egress only to kube-dns *pods*, but whisker-backend resolves goldmane via the kube-dns *ClusterIP* — Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule, so when whisker-backend's gRPC stream breaks and it re-resolves, it wedges and the UI goes **empty** (the aggregator, a separate pod, is unaffected). FIX = additive egress NP `whisker-allow-dns-clusterip` (`stacks/calico`, allows whisker→10.96.0.10/32:53); the `whisker-watchdog` CronJob is a backstop. Manual heal `kubectl -n calico-system delete pod -l k8s-app=whisker`. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.) - **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2). ## Storage & Backup Architecture diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md index cd7b5274..ca1ee262 100644 --- a/.claude/reference/service-catalog.md +++ b/.claude/reference/service-catalog.md @@ -13,6 +13,8 @@ | authentik | Identity provider (SSO) | authentik | | cloudflared | Cloudflare tunnel | cloudflared | | authelia | Auth middleware (may be merged into ebooks or removed) | platform | +| goldmane | Calico 3.30 OSS flow aggregator (`goldmane.calico-system.svc:7443`, gRPC/mTLS). Stamps identity (ns/pod/workload/labels + allow-deny) on every flow from Felix into a ~60-min in-memory ring buffer — no etcd/API writes. East-west "who-talks-to-whom" source (ADR-0014). Enabled via operator CR (`kubectl_manifest.goldmane`). 
| calico | +| whisker | Calico 3.30 OSS live flow-observability UI (`whisker.calico-system.svc:8081`) at `whisker.viktorbarzin.me` (Authentik-gated, `auth=required` — no own login; additive NP ORs Traefik past the operator default-deny). ~60-min live view of Goldmane flows, NOT history. Enabled via operator CR (`kubectl_manifest.whisker`). | calico | | monitoring | Prometheus/Grafana/Loki stack | monitoring | ## Storage & Security (Tier: cluster) @@ -37,6 +39,7 @@ ## Active Use | Service | Description | Stack | |---------|-------------|-------| +| goldmane-edge-aggregator | Durable who-talks-to-whom audit trail (ADR-0014 / #58). Go service: `aggregate` Deployment streams Goldmane's gRPC `Flows.Stream` (mTLS) and upserts the low-cardinality namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`) into CNPG DB `goldmane_edges`; `goldmane-edges-digest` CronJob posts first-seen edges daily to `#alerts` (the `#security` channel was abandoned 2026-06-25 — shared webhook's app isn't a member of it). mTLS client cert REUSES the operator's `whisker-backend-key-pair` (re-apply if rotated). Tier-4-aux. Image `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (private). Runbook: [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md). 
| goldmane-edge-aggregator | | mailserver | Email (docker-mailserver) | mailserver | | shadowsocks | Proxy | shadowsocks | | webhook_handler | Webhook processing | webhook_handler | @@ -161,3 +164,4 @@ procedures) are documented in `infra/docs/runbooks/`: | pfSense + Unbound DNS | [pfsense-unbound.md](../../docs/runbooks/pfsense-unbound.md) | | Mailserver PROXY-protocol / HAProxy | [mailserver-pfsense-haproxy.md](../../docs/runbooks/mailserver-pfsense-haproxy.md) | | Technitium apply flow | [technitium-apply.md](../../docs/runbooks/technitium-apply.md) | +| Goldmane flow trail (east-west who-talks-to-whom) | [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md) | diff --git a/.claude/skills/home-assistant/SKILL.md b/.claude/skills/home-assistant/SKILL.md index 61aaa6af..ab07a27f 100644 --- a/.claude/skills/home-assistant/SKILL.md +++ b/.claude/skills/home-assistant/SKILL.md @@ -11,8 +11,8 @@ description: | There are TWO Home Assistant deployments: ha-london (default) and ha-sofia. Always use Home Assistant for smart home control. author: Claude Code -version: 2.0.0 -date: 2026-02-07 +version: 2.1.0 +date: 2026-06-24 --- # Home Assistant Control @@ -395,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr ## ha-london Knowledge Map ### Overview -- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi) +- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied). - **Location**: London, UK -- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone) -- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access) -- **Config path**: `/config/` (requires `sudo` for file access) +- **Platform**: Raspberry Pi 4, HA OS +- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). 
Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs. +- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access) +- **Config path**: `/config/` - **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea - **Zone**: London (home) +### Dashboards (redesigned 2026-06-24) +**Glossary** (HA terms — keep distinct): +- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config. +- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config). +- **Card** = a widget inside a view. + +- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card. + - **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night). + - **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*. +- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.) +- Built via the WS `lovelace/config/save` API (london is remote — no SSH path). + ### Key Systems #### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring @@ -424,10 +437,15 @@ Named plugs with power/energy tracking: - PM1.0/2.5/4.0/10 particulate sensors - VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors -#### 3. Cowboy E-Bike -- `sensor.bike_state_of_charge`: Battery % -- `sensor.bike_total_distance`: Total km -- `sensor.bike_total_co2_saved`: CO2 saved (grams) +#### 3. 
Cowboy E-Bike (`elsbrock/cowboy-ha`) +Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration). +- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`) +- `sensor.classic_performance_remaining_range`: Range km +- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`) +- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`) +- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc. +- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless. +- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`). #### 4. Uptime Monitoring (UptimeRobot) - `sensor.blog`: blog uptime @@ -446,12 +464,17 @@ Named plugs with power/energy tracking: - Scripts: `script.start_netflix`, `script.start_stremio` - Scene: `scene.night` (turns off Livia + Michelle plugs) -### Custom Components -- **cowboy**: Cowboy e-bike integration (HACS) -- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS) +### Custom Components (HACS integrations) +- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it. +- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken. 
+ +### HACS frontend cards (plugins) +- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode. ### Integrations -ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB +ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB. +- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy). +- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is. ### AI / Voice Assistants - 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air @@ -466,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL - Anca arrival/departure notifications - Night scene: turns off Livia + Michelle -### Docker Setup -```bash -docker run -d --name homeassistant --privileged \ - -e TZ=Europe/London \ - -v /home/pi/docker/homeAssistant:/config \ - -v /run/dbus:/run/dbus:ro \ - --network=host --restart=unless-stopped \ - homeassistant/home-assistant:2025.9 -``` +### Platform (HAOS — ignore any legacy `docker run` snippet) +ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. 
Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker). ### SSH Access ```bash diff --git a/.github/workflows/build-authentik.yml b/.github/workflows/build-authentik.yml new file mode 100644 index 00000000..bb43502f --- /dev/null +++ b/.github/workflows/build-authentik.yml @@ -0,0 +1,39 @@ +name: Build Custom Authentik Image + +# ADR-0002: infra-owned image built off-infra on GHA → ghcr. +# Thin SLOW-1a overlay over the official authentik server (narrows the login +# identification stage's select_subclasses() to the login-capable source subtypes; +# see stacks/authentik/Dockerfile). Rebuild only when the Dockerfile changes — on +# every authentik bump, edit the FROM tag + the patchN suffix here + the image tag +# in modules/authentik/values.yaml together. +on: + push: + branches: [master] + paths: + - 'stacks/authentik/Dockerfile' + workflow_dispatch: {} + +permissions: + contents: read + packages: write + +jobs: + build: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: docker/setup-buildx-action@v3 + - uses: docker/login-action@v3 + with: + registry: ghcr.io + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} + - uses: docker/build-push-action@v6 + with: + context: stacks/authentik + platforms: linux/amd64 + provenance: false + push: true + tags: | + ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3 + ghcr.io/viktorbarzin/authentik-server:latest diff --git a/.woodpecker/default.yml b/.woodpecker/default.yml index ef94ccee..d46f5ae1 100644 --- a/.woodpecker/default.yml +++ b/.woodpecker/default.yml @@ -65,6 +65,21 @@ steps: # don't need explicit token propagation. 
VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200 commands: + # ── Forge guard: apply ONLY on the canonical Forgejo forge ── + # infra is registered in Woodpecker on BOTH the Forgejo canonical repo and + # the legacy GitHub mirror, and BOTH fire this push pipeline. Without this + # guard both run `terragrunt apply` on every push and race each other for + # the per-stack PG state lock — the dominant cause of the "Error acquiring + # the state lock" failures + push-supersede "killed" runs. The GitHub-mirror + # registration keeps running the CRONS (drift-detection, renew-tls, …) — only + # its duplicate push-apply no-ops here. Fail-open: an unknown forge (neither + # env var set) still applies, preserving prior behaviour. + - | + if echo "${CI_REPO_URL:-}${CI_FORGE_URL:-}" | grep -qi 'github\.com'; then + echo "[forge-guard] GitHub-mirror push — apply runs only on the Forgejo canonical repo (avoids double-apply + state-lock races). Skipping." + exit 0 + fi + # ── Skip CI commits ── - | if echo "$CI_COMMIT_MESSAGE" | grep -q '\[CI SKIP\]\|\[ci skip\]'; then @@ -213,23 +228,40 @@ steps: if [ -s .platform_apply ]; then echo "=== Applying platform stacks (serial, locked) ===" while read -r stack; do + # Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role + # lacks Vault-admin perms (sys/mounts + sys/policies/acl), so a CI + # apply always 403s and fails the pipeline. Kept in PLATFORM_STACKS + # (so the app-stack detector still excludes it) but skipped here. + # (2026-06-27 — see docs/architecture/ci-cd.md) + if [ "$stack" = "vault" ]; then echo "[vault] SKIPPED (Tier-0, human-applied via OIDC)"; continue; fi echo "[$stack] Starting apply..." - set +e - OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1) - EXIT=$? 
- set -e - if [ $EXIT -ne 0 ]; then - if echo "$OUTPUT" | grep -q "is locked by"; then - echo "[$stack] SKIPPED (locked by another session)" - else - echo "$OUTPUT" | tail -50 - echo "[$stack] FAILED (exit $EXIT)" - FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack" + ATTEMPT=0 + while :; do + ATTEMPT=$((ATTEMPT + 1)) + set +e + OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1) + EXIT=$? + set -e + if [ $EXIT -eq 0 ]; then + echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break fi - else - echo "$OUTPUT" | tail -3 - echo "[$stack] OK" - fi + # Lock contention → SKIP, not fail. Match BOTH the Tier-0 Vault lock + # ("is locked by", from scripts/tg) AND the Tier-1 PG-backend lock + # ("Error acquiring the state lock" / "already locked"). The PG case + # was previously counted as a failure — the #1 source of false reds. + if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then + echo "[$stack] SKIPPED (locked by another session/run)"; break + fi + # Transient: provider-registry download timeout / Vault 5xx → bounded + # retry. Deliberately NOT helm atomic-timeouts or config errors + # (missing arg, invalid index) — those must fail fast, retry can't fix + # them and can worsen a stuck helm release. + if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then + echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue + fi + echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)" + FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"; break + done done < .platform_apply fi # Deferred until after app stacks so both lists get a chance to run. @@ -242,22 +274,27 @@ steps: echo "=== Applying app stacks (serial, locked) ===" while read -r stack; do echo "[$stack] Starting apply..." 
- set +e - OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1) - EXIT=$? - set -e - if [ $EXIT -ne 0 ]; then - if echo "$OUTPUT" | grep -q "is locked by"; then - echo "[$stack] SKIPPED (locked by another session)" - else - echo "$OUTPUT" | tail -50 - echo "[$stack] FAILED (exit $EXIT)" - FAILED_APP_STACKS="$FAILED_APP_STACKS $stack" + ATTEMPT=0 + while :; do + ATTEMPT=$((ATTEMPT + 1)) + set +e + OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1) + EXIT=$? + set -e + if [ $EXIT -eq 0 ]; then + echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break fi - else - echo "$OUTPUT" | tail -3 - echo "[$stack] OK" - fi + # Lock contention → SKIP, not fail (Tier-0 Vault + Tier-1 PG; see platform loop). + if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then + echo "[$stack] SKIPPED (locked by another session/run)"; break + fi + # Transient provider-download / Vault 5xx → bounded retry (see platform loop). + if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then + echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue + fi + echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)" + FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"; break + done done < .app_apply fi # Fail the step loudly so the pipeline `default` workflow state diff --git a/.woodpecker/drift-detection.yml b/.woodpecker/drift-detection.yml index b2e303ff..b2a552f4 100644 --- a/.woodpecker/drift-detection.yml +++ b/.woodpecker/drift-detection.yml @@ -85,6 +85,13 @@ steps: stack=$(basename "$stack_dir") [ -f "$stack_dir/terragrunt.hcl" ] || continue + # Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role lacks + # Vault-admin perms (sys/mounts + sys/policies/acl), so `terragrunt plan` + # on it ERRORs (detailed-exitcode 1) and fails the whole 
nightly drift + # run. Skip it — drift on Tier-0 vault is caught at human apply time. + # (2026-06-27) + [ "$stack" = "vault" ] && continue + echo -n "[$stack] planning... " OUTPUT=$(cd "$stack_dir" && terragrunt plan -detailed-exitcode -input=false 2>&1) EXIT=$? diff --git a/AGENTS.md b/AGENTS.md index 7fbc838d..4e3ea2de 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -273,8 +273,11 @@ To land a finished change from such a clone: Slack audit feed; a no-op CI apply on a docs-only commit is harmless. 4. Leave the clone on clean `master` so auto-refresh keeps working. 5. Tell the user in plain language what happened. Stack changes are - auto-applied by CI — verify the live result with the user's read-only - kubectl before saying "it's live". + auto-applied by CI on push — or, with apply access, applied locally yourself + (`scripts/tg apply`, from the main checkout, not a worktree); either path is + fine, but the change must always be committed here, never applied + uncommitted. Verify the live result with the user's read-only kubectl before + saying "it's live". If a push to `master` is rejected by branch protection (user not on the whitelist — e.g. new users before Viktor grants it), fall back to a diff --git a/CONTEXT.md b/CONTEXT.md index 2b9bb8b3..548fa40d 100644 --- a/CONTEXT.md +++ b/CONTEXT.md @@ -125,7 +125,7 @@ How a **Service** is named in flow/audit data — its **namespace** is the prima _Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object. **Goldmane / Whisker**: -Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. 
The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). Durable history requires emitting Goldmane flows to **Loki**; the in-memory buffer alone is not an audit trail. +Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#alerts` digest (the `#security` channel was abandoned 2026-06-25). As-built: `docs/runbooks/goldmane-flow-trail.md`. _Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh). ### Storage diff --git a/cli/README.md b/cli/README.md index 186c1ee5..fa9ff3ec 100644 --- a/cli/README.md +++ b/cli/README.md @@ -202,6 +202,69 @@ runs on the devvm, `setInputFiles` streams local files to the remote browser ove CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md` and `docs/adr/0013`. +### v0.9 verbs — edges (east-west "who-talks-to-whom" trail) + +Read-only investigation helper over the `goldmane_edges` CNPG trail (ADR-0014): +filters render to a single safe `SELECT` (namespace values validated to the k8s +name charset) run via the dbaas primary pod — the same exec path as `k8s db`. 
+ +| Command | Tier | What it does | +| --- | --- | --- | +| `edges --ns ` | read | edges touching `` (either direction) | +| `edges --src ` / `--dst ` | read | directional: ``'s egress / ingress peers | +| `edges --peers-of ` | read | distinct peer namespaces of `` (both directions) | +| `edges --new-since <24h\|7d\|YYYY-MM-DD>` | read | edges first seen since a duration or date | +| `edges --denied` | read | only `action='deny'` edges (blocked / lateral-movement) | +| `edges --json` / `--limit N` | read | JSON array output / row cap (default 200) | + +### v0.10 — `vault get --all` (browse every field) + +`vault get --all` returns the **whole item** as a normalized JSON object, +so an agent can discover and read fields the single-field `--field` allowlist +can't reach — notably arbitrary **custom fields**. + +| Command | Tier | What it does | +| --- | --- | --- | +| `vault get --all` | read | all fields as JSON: `{name, username?, password?, uris?, totp?, notes?, fields?}` | + +Shape notes: present standard fields only (empty ones omitted); `fields` is a +custom `name→value` map (duplicate names → last-wins; `linked` fields skipped). +The TOTP **seed is never emitted** — `totp` is a presence flag (`true`), so the +only seed-derived path stays the specially-audited `vault code`. Like +`get --json`, the dump is all secret values, so it **refuses a terminal** — pipe +it (`homelab vault get --all | jq`). + +### v0.10.1 — reads `bw sync` first (always fresh) + +Every vault read (`get`, `get --all`, `list`, `code`, `status`) now runs `bw +sync` when opening its session, so it reflects the latest server-side values. +`bw unlock` only decrypts the *local* cache, so without this a persisted +(already-logged-in) session served stale data — a password changed in the web +vault wouldn't show up until the next login. The sync is **best-effort**: a +transient failure warns on stderr and falls back to the cached vault rather than +failing the read. 
+ +### v0.11 — `vault kv` (HashiCorp Vault / OpenBao infra secrets) + +`homelab vault` now fronts **two unrelated stores**, made explicit in the bare +`homelab vault` help and via `[vaultwarden]` / `[hashicorp-vault]` summary tags: + +- **Vaultwarden** — your personal password manager (`vault get/list/code/…`, unchanged). +- **HashiCorp Vault / OpenBao** — homelab infra secrets, the `secret/…` KV store, under `vault kv`. + +| Command | Tier | What it does | +| --- | --- | --- | +| `vault kv get [--field K]` | read | read a secret: `--field K` → one value (TTY-aware clipboard/stdout); no field → all fields as JSON (refuses a bare TTY) | +| `vault kv list ` | read | list sub-paths under `` (no values) | +| `vault kv put ` | write | write one key; **value via stdin** (piped or no-echo prompt, never argv); creates the path or **merges** (never clobbers siblings) | + +**Different credentials:** the Vaultwarden verbs use the per-user *scoped* token +(bound to `claude-users/`); `vault kv` uses your **own** Vault token +(`vault login -method=oidc` → `~/.vault-token`, or `$VAULT_TOKEN`) — the kv +handlers set `VAULT_ADDR` but never inject the scoped token (which would 403 off +its own path). Access is whatever your policy grants. Writes are merge-only; +`put` (replace) / `delete` are out of scope — use the raw `vault` CLI. 
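The merge-only write semantics described above (create the path, or update one key without clobbering siblings) can be sketched as follows. This is an illustration only: `mergeKV` is a hypothetical helper operating on an in-memory map, and how the real `vault kv put` handler reads and writes the secret is not shown here.

```go
package main

import "fmt"

// mergeKV returns a copy of existing with key set to value. Sibling keys are
// preserved, and a nil existing map behaves like a path that does not exist yet.
func mergeKV(existing map[string]string, key, value string) map[string]string {
	out := make(map[string]string, len(existing)+1)
	for k, v := range existing {
		out[k] = v
	}
	out[key] = value
	return out
}

func main() {
	current := map[string]string{"user": "svc", "password": "old"}
	updated := mergeKV(current, "password", "new")
	// The sibling "user" survives; only "password" changes.
	fmt.Println(len(updated), updated["user"], updated["password"]) // prints "2 svc new"
}
```

Copying into a fresh map rather than mutating the input keeps the previously read secret intact if the subsequent write fails.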
+ ## Build / install Built from source to `/usr/local/bin/homelab` during devvm provisioning diff --git a/cli/VERSION b/cli/VERSION index 85f7059b..fd2726c9 100644 --- a/cli/VERSION +++ b/cli/VERSION @@ -1 +1 @@ -v0.8.1 +v0.11.0 diff --git a/cli/cmd_edges.go b/cli/cmd_edges.go new file mode 100644 index 00000000..7ee528fd --- /dev/null +++ b/cli/cmd_edges.go @@ -0,0 +1,69 @@ +package main + +import "fmt" + +func edgesCommands() []Command { + return []Command{ + {Path: []string{"edges"}, Tier: TierRead, + Summary: "who-talks-to-whom trail: edges [--ns|--src|--dst|--peers-of N] [--new-since 24h] [--denied] [--json] [--limit N]", + Run: edgesRun}, + } +} + +// edgesRun renders the filter flags to SQL and runs it read-only against the +// goldmane_edges CNPG DB via the dbaas primary pod (same exec path as `k8s db`). +func edgesRun(args []string) error { + for _, a := range args { + if a == "-h" || a == "--help" { + fmt.Print(edgesUsage()) + return nil + } + } + o, err := parseEdgesArgs(args) + if err != nil { + return fmt.Errorf("%w\n\n%s", err, edgesUsage()) + } + sql, err := buildEdgesQuery(o) + if err != nil { + return err + } + // pg-cluster-rw is a Service (not exec-able); resolve the primary POD. + pod, err := kubectlCapture("dbaas", "get", "pod", "-l", "cnpg.io/instanceRole=primary", + "-o", "jsonpath={.items[0].metadata.name}") + if err != nil || pod == "" { + return fmt.Errorf("could not resolve CNPG primary pod in dbaas: %v", err) + } + exec := []string{"exec", pod, "-c", "postgres", "--", "psql", "-U", "postgres", "-d", "goldmane_edges"} + if o.asJSON { + exec = append(exec, "-tAc", sql) // raw tuple → the JSON array + } else { + exec = append(exec, "-P", "pager=off", "-c", sql) // aligned table for humans + } + return kubectlStream("dbaas", exec...) 
+} + +func edgesUsage() string { + return `homelab edges — query the who-talks-to-whom trail (goldmane_edges, ADR-0014) + +Usage: homelab edges [filters] + +Filters (AND-combined; namespace values are validated to the k8s name charset): + --ns NAME edges touching NAME (either direction) + --src NAME edges where source namespace = NAME + --dst NAME edges where destination namespace = NAME + --peers-of NAME distinct peer namespaces of NAME (both directions) + --new-since SPEC first seen since SPEC: a duration (24h, 7d, 30m, 90s) or a date (YYYY-MM-DD) + --denied only denied (action='deny') edges — blocked / lateral-movement attempts + --json output a JSON array (for agents/pipelines) + --limit N cap rows (default 200) + +Examples: + homelab edges --ns immich # everything immich talks to / is talked to by + homelab edges --peers-of authentik # authentik's peer namespaces + homelab edges --src recruiter-responder # that namespace's egress peers + homelab edges --new-since 24h # edges first seen in the last day + homelab edges --denied --json # blocked flows, machine-readable + +Read-only SELECT against CNPG DB goldmane_edges via the dbaas primary pod. +` +} diff --git a/cli/cmd_memory.go b/cli/cmd_memory.go index 94f3a482..7ae11ea0 100644 --- a/cli/cmd_memory.go +++ b/cli/cmd_memory.go @@ -54,10 +54,7 @@ func printMemories(raw []byte, jsonOut bool) error { return nil } for _, m := range r.Memories { - c := strings.ReplaceAll(m.Content, "\n", " ") - if len(c) > 240 { - c = c[:240] + "…" - } + c := truncatePreview(strings.ReplaceAll(m.Content, "\n", " "), 240) fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c) if m.Tags != "" { fmt.Printf(" tags: %s\n", m.Tags) @@ -66,6 +63,21 @@ func printMemories(raw []byte, jsonOut bool) error { return nil } +// truncatePreview shortens s to at most maxRunes RUNES, appending "…" when it +// trims. 
Counting runes (not bytes) is load-bearing: a byte slice like s[:240] +// can cut through the middle of a multibyte UTF-8 character (e.g. 2-byte +// Cyrillic), leaving a dangling lead byte = invalid UTF-8. That crashed strict +// decoders downstream — notably the homelab-memory-recall.py UserPromptSubmit +// hook (subprocess text=True), which surfaced as a recurring "UserPromptSubmit +// hook error" for Cyrillic-language users. +func truncatePreview(s string, maxRunes int) string { + r := []rune(s) + if len(r) <= maxRunes { + return s + } + return string(r[:maxRunes]) + "…" +} + func memoryRecall(args []string) error { req := memRecallReq{} jsonOut := false diff --git a/cli/cmd_vault.go b/cli/cmd_vault.go index bf270886..1a28ff14 100644 --- a/cli/cmd_vault.go +++ b/cli/cmd_vault.go @@ -4,6 +4,7 @@ import ( "bufio" "encoding/base64" "encoding/json" + "errors" "fmt" "os" "os/exec" @@ -15,43 +16,60 @@ import ( // Identity is the kernel UID; per-user creds live in that user's isolated Vault // path (secret/workstation/claude-users/) read via their scoped token, and // decryption is done by the official `bw` CLI. See -// docs/superpowers/specs/2026-06-24-homelab-vault-design.md. +// docs/runbooks/homelab-vault-onboarding.md. func vaultCommands() []Command { - return []Command{ + cmds := []Command{ + // Vaultwarden — your personal password manager (logins/passwords/TOTP). 
{Path: []string{"vault", "setup"}, Tier: TierWrite, - Summary: "one-time: store your Vaultwarden master password + API key in your Vault path", Run: vaultSetup}, + Summary: "[vaultwarden] one-time: store your master password + API key in your Vault path", Run: vaultSetup}, {Path: []string{"vault", "status"}, Tier: TierRead, - Summary: "show whether your vault is configured/reachable (no secrets)", Run: vaultStatus}, + Summary: "[vaultwarden] show whether your vault is configured/reachable (no secrets)", Run: vaultStatus}, {Path: []string{"vault", "list"}, Tier: TierRead, - Summary: "list your item names: vault list [--search Q]", Run: vaultList}, + Summary: "[vaultwarden] list your item names: vault list [--search Q]", Run: vaultList}, {Path: []string{"vault", "get"}, Tier: TierRead, - Summary: "fetch one item: vault get [--field password|username|uri|notes|totp] [--json]", Run: vaultGet}, + Summary: "[vaultwarden] fetch one login: vault get [--field password|username|uri|notes|totp] [--json] [--all]", Run: vaultGet}, {Path: []string{"vault", "search"}, Tier: TierRead, - Summary: "search your item names: vault search ", Run: vaultSearch}, + Summary: "[vaultwarden] search your item names: vault search ", Run: vaultSearch}, {Path: []string{"vault", "code"}, Tier: TierRead, - Summary: "current TOTP code for an item: vault code ", Run: vaultCode}, + Summary: "[vaultwarden] current TOTP code for an item: vault code ", Run: vaultCode}, {Path: []string{"vault", "lock"}, Tier: TierWrite, - Summary: "lock/log out the local bw session", Run: vaultLock}, + Summary: "[vaultwarden] lock/log out the local bw session", Run: vaultLock}, {Path: []string{"vault"}, Tier: TierRead, - Summary: "Vaultwarden access for your own vault (run `homelab vault` for help)", + Summary: "two stores: Vaultwarden (logins) + HashiCorp Vault/OpenBao kv (infra secrets) — run `homelab vault` for help", Run: func([]string) error { fmt.Print(vaultHelp()); return nil }}, } + // HashiCorp Vault / OpenBao — 
homelab INFRA secrets (the secret/… KV store). + return append(cmds, vaultKVCommands()...) } -// vaultHelp is shown for bare `homelab vault`. +// vaultHelp is shown for bare `homelab vault`. It LEADS with the distinction +// between the two unrelated "vaults" this command fronts, because the name +// collides: Vaultwarden (a password manager) vs HashiCorp Vault / OpenBao (the +// infra secrets store). func vaultHelp() string { - return `homelab vault — read YOUR OWN Vaultwarden logins (no-HITL after one-time setup) + return `homelab vault — two different secret stores under one command: + • Vaultwarden your personal PASSWORD MANAGER (logins / passwords / TOTP) + • HashiCorp Vault / OpenBao homelab INFRA secrets (the secret/… KV store) → 'vault kv …' + +── Vaultwarden (reads YOUR OWN vault; no-HITL after one-time setup) ── homelab vault setup one-time: store your master password + API key in your Vault path homelab vault status configured / unlocked / reachable (no secrets) homelab vault list [--search Q] list your item names (no secrets) homelab vault get [--field password|username|uri|notes|totp] [--json] TTY → clipboard (auto-clears); piped → stdout + homelab vault get --all all fields (incl. custom) as JSON; piped only. + TOTP shown as presence flag — use 'vault code' for a code. homelab vault code current TOTP code homelab vault lock lock / log out the local bw session -Creds live only in your own Vault path; the admin never sees them. Identity is -your unix UID. Security model: docs/superpowers/specs/2026-06-24-homelab-vault-design.md +── HashiCorp Vault / OpenBao (infra secrets; uses your own OIDC vault token) ── + homelab vault kv get [--field K] read an infra KV secret + homelab vault kv list list sub-paths + homelab vault kv put write one key (value via stdin) + +Vaultwarden creds live only in your own Vault path; the admin never sees them. 
+Security model: docs/runbooks/homelab-vault-onboarding.md (note: anything running as your user can decrypt your vault — the accepted no-HITL trade). ` } @@ -79,7 +97,33 @@ func realRunner(name string, argv, envv []string) (string, error) { out, err := cmd.Output() // Trim only the trailing newline the tool appends — NOT all whitespace, so a // fetched secret with significant leading/trailing spaces is preserved. - return strings.TrimRight(string(out), "\r\n"), err + return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err)) +} + +// exitStderr returns the stderr captured by cmd.Output() on a failed exec (it +// stows it on *exec.ExitError), or nil. The tools we shell out to (vault, bw) +// write the actionable message there — "connection refused", "permission +// denied" — which the caller would otherwise never see behind a bare +// "exit status N". +func exitStderr(err error) []byte { + var ee *exec.ExitError + if errors.As(err, &ee) { + return ee.Stderr + } + return nil +} + +// augmentErr appends captured stderr to an error so failures are diagnosable +// (not just "exit status 2"). Returns nil when err is nil, and err unchanged +// when there's no stderr; preserves the wrapped error for errors.Is/As. 
+func augmentErr(err error, stderr []byte) error { + if err == nil { + return nil + } + if s := strings.TrimSpace(string(stderr)); s != "" { + return fmt.Errorf("%w: %s", err, s) + } + return err } // realRunnerStdin runs a command feeding `stdin` to it, for secret values that @@ -92,7 +136,7 @@ func realRunnerStdin(name string, argv, envv []string, stdin string) (string, er } cmd.Stdin = strings.NewReader(stdin) out, err := cmd.Output() - return strings.TrimRight(string(out), "\r\n"), err + return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err)) } func vwCredsPath(user string) string { return vwUserPathPrefix + user } @@ -128,6 +172,89 @@ func loadCreds(run cmdRunner, user string) (vwCreds, error) { var vaultCurrentUser = func() string { return os.Getenv("USER") } var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) } +// scopedTokenPath is where claude-auth-sync keeps the user's scoped Vault token. +// MUST match CAS_VAULT_TOKEN_FILE in scripts/workstation/claude-auth-sync.sh. +func scopedTokenPath(home string) string { + return home + "/.config/claude-auth-sync/vault-token" +} + +// vaultTokenSource decides which Vault token the `vault` child processes should +// use. Precedence: an explicit $VAULT_TOKEN (deliberate override), then the +// per-user scoped token claude-auth-sync maintains at scopedTokenPath(HOME) +// (policy workstation-claude-, which grants exactly the create/read/update +// this tool needs on the user's own path), then a native ~/.vault-token. +// +// The scoped token MUST beat ~/.vault-token: this tool only ever touches the +// caller's own secret/workstation/claude-users/ path, and a power-user who +// ran `vault login -method=oidc` carries a read-only ~/.vault-token whose +// capability on that path is `deny` — letting it win shadows the scoped token +// and every op fails 403/deny (emo, 2026-06-28). ~/.vault-token is only the +// right credential when there is no scoped token (admins). 
Returns the token to +// export — "" when the vault CLI should read the ambient/native credential — +// plus a source tag for tests/logging. +func vaultTokenSource(envToken string, haveVaultTokenFile bool, scopedToken string) (token, source string) { + switch { + case envToken != "": + return "", "env" + case strings.TrimSpace(scopedToken) != "": + return strings.TrimSpace(scopedToken), "scoped" + case haveVaultTokenFile: + return "", "file" + default: + return "", "none" + } +} + +// vaultAddrDefault is the cluster Vault the workstation talks to. The bw server +// is likewise hardcoded (openSession), so a sane default here is consistent. +const vaultAddrDefault = "https://vault.viktorbarzin.me" + +// vaultAddrToSet returns the VAULT_ADDR to export when the caller's environment +// doesn't already set one, else "". homelab vault is invoked by AFK agent +// sessions — frequently non-login shells (tmux panes, agent subprocesses) that +// never sourced /etc/environment — so, like claude-auth-sync, the CLI must NOT +// depend on an ambient VAULT_ADDR; otherwise every `vault` child falls back to +// the 127.0.0.1:8200 default and fails "connection refused" (exit 2). +func vaultAddrToSet(envAddr string) string { + if strings.TrimSpace(envAddr) == "" { + return vaultAddrDefault + } + return "" +} + +// ensureVaultAddr exports the default VAULT_ADDR when none is set, so the vault +// child processes reach the cluster Vault regardless of the caller's shell. An +// explicit VAULT_ADDR (admins, CI) is left untouched. +func ensureVaultAddr() { + if a := vaultAddrToSet(os.Getenv("VAULT_ADDR")); a != "" { + os.Setenv("VAULT_ADDR", a) + } +} + +// fileNonEmpty reports whether path exists and has content. 
+func fileNonEmpty(path string) bool { + fi, err := os.Stat(path) + return err == nil && fi.Size() > 0 +} + +// ensureVaultToken wires vaultTokenSource to the real environment: when the user +// has no ambient Vault credential, it exports the claude-auth-sync scoped token +// so the `vault` child processes authenticate as workstation-claude-. It +// is idempotent and safe for admins, whose explicit $VAULT_TOKEN / ~/.vault-token +// take precedence and are left untouched. +func ensureVaultToken() { + // Every vault verb funnels through here, so this is the one place that also + // guarantees VAULT_ADDR is set (see vaultAddrToSet for why it can't be + // assumed from the caller's shell). + ensureVaultAddr() + home := os.Getenv("HOME") + scoped, _ := os.ReadFile(scopedTokenPath(home)) + tok, src := vaultTokenSource(os.Getenv("VAULT_TOKEN"), home != "" && fileNonEmpty(home+"/.vault-token"), string(scoped)) + if src == "scoped" { + os.Setenv("VAULT_TOKEN", tok) + } +} + // bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately // do NOT inherit the full parent env (keeps stray secrets out of the child). 
func bwBaseEnv(appdata string) []string { @@ -157,10 +284,12 @@ func bwSecretEnv(appdata string, c vwCreds, session string) []string { return env } -func bwLoginArgs() []string { return []string{"login", "--apikey"} } -func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} } +func bwLoginArgs() []string { return []string{"login", "--apikey"} } +func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} } func bwGetArgs(field, name string) []string { return []string{"get", field, name} } -func bwStatusArgs() []string { return []string{"status"} } +func bwItemArgs(name string) []string { return []string{"get", "item", name} } +func bwStatusArgs() []string { return []string{"status"} } +func bwSyncArgs() []string { return []string{"sync"} } // bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is // required. Unparseable/empty output → true (safer to attempt login). @@ -327,13 +456,23 @@ func openSession(run cmdRunner, user, uid string) (session, error) { if err != nil { return session{}, err } - return session{env: bwSecretEnv(appdata, creds, sess)}, nil + sessEnv := bwSecretEnv(appdata, creds, sess) + // Pull the latest server-side state so reads reflect current values. `bw + // unlock` only decrypts the LOCAL cache, so a persisted (already-logged-in) + // session would otherwise serve stale data until the next login. Best-effort: + // a transient sync failure must not break a read — fall back to the cached + // vault and warn (status reports reachability separately). + if _, err := run("bw", bwSyncArgs(), sessEnv); err != nil { + fmt.Fprintln(os.Stderr, "homelab vault: warning: bw sync failed; using cached vault (values may be stale): "+err.Error()) + } + return session{env: sessEnv}, nil } type getOpts struct { name string field string json bool + all bool // dump every field (incl. 
custom) as normalized JSON } var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true} @@ -345,6 +484,8 @@ func parseGetArgs(args []string) (getOpts, error) { switch { case a == "--json": o.json = true + case a == "--all": + o.all = true case a == "--field" && i+1 < len(args): o.field = args[i+1] i++ @@ -355,9 +496,10 @@ func parseGetArgs(args []string) (getOpts, error) { } } if o.name == "" { - return o, fmt.Errorf("usage: homelab vault get [--field password|username|uri|notes|totp] [--json]") + return o, fmt.Errorf("usage: homelab vault get [--field password|username|uri|notes|totp] [--json] [--all]") } - if !validGetFields[o.field] { + // --all dumps the whole item, so --field is irrelevant — skip its allowlist. + if !o.all && !validGetFields[o.field] { return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field) } return o, nil @@ -373,6 +515,81 @@ func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) { return bwGet(run, s.env, o.field, o.name) } +// getItem opens a session and returns the whole item as raw `bw get item` JSON. +// Used by `get --all`; normalization is a separate, pure step (normalizeItem). +func getItem(run cmdRunner, user, uid, name string) (string, error) { + s, err := openSession(run, user, uid) + if err != nil { + return "", err + } + return run("bw", bwItemArgs(name), s.env) +} + +// normalizedItem is the browse-all-fields projection of a Vaultwarden item: the +// standard login fields that are present, notes, and a flat map of custom field +// name→value. bw internals (id, object, reprompt, passwordHistory) are dropped, +// and the TOTP *seed* is reduced to a presence flag — the only seed-derived path +// stays the specially-audited `vault code` (see the design §10/§16). 
+type normalizedItem struct { + Name string `json:"name"` + Username string `json:"username,omitempty"` + Password string `json:"password,omitempty"` + URIs []string `json:"uris,omitempty"` + TOTP bool `json:"totp,omitempty"` // presence only, never the seed + Notes string `json:"notes,omitempty"` + Fields map[string]string `json:"fields,omitempty"` // custom field name→value +} + +// bwFieldLinked is the Bitwarden custom-field type for a "linked" field: it +// references another field and carries a null value, so it is not real data. +const bwFieldLinked = 3 + +// normalizeItem parses a `bw get item` payload into the browse projection. It is +// pure (no I/O), so it is the unit-tested heart of `get --all`. +func normalizeItem(raw string) (normalizedItem, error) { + var it struct { + Name string `json:"name"` + Notes string `json:"notes"` + Login *struct { + Username string `json:"username"` + Password string `json:"password"` + Totp string `json:"totp"` + URIs []struct { + URI string `json:"uri"` + } `json:"uris"` + } `json:"login"` + Fields []struct { + Name string `json:"name"` + Value string `json:"value"` + Type int `json:"type"` + } `json:"fields"` + } + if err := json.Unmarshal([]byte(raw), &it); err != nil { + return normalizedItem{}, fmt.Errorf("parse bw item: %w", err) + } + n := normalizedItem{Name: it.Name, Notes: it.Notes} + if it.Login != nil { + n.Username = it.Login.Username + n.Password = it.Login.Password + n.TOTP = it.Login.Totp != "" + for _, u := range it.Login.URIs { + if u.URI != "" { + n.URIs = append(n.URIs, u.URI) + } + } + } + for _, f := range it.Fields { + if f.Type == bwFieldLinked { + continue // references another field, no value of its own + } + if n.Fields == nil { + n.Fields = map[string]string{} + } + n.Fields[f.Name] = f.Value // duplicate names: last-wins (rare; documented) + } + return n, nil +} + // clipboardDecision picks how to return a secret value. 
"stdout" prints it (a // pipe/agent — the intended machine path); "clipboard" copies via OSC52; // "refuse" emits nothing sensitive (would otherwise risk dumping the secret's @@ -443,6 +660,7 @@ func runList(run cmdRunner, user, uid, search string) ([]string, error) { func vaultList(args []string) error { hardenProcess() + ensureVaultToken() search := "" for i := 0; i < len(args); i++ { if args[i] == "--search" && i+1 < len(args) { @@ -477,6 +695,7 @@ func vaultSearch(args []string) error { func vaultCode(args []string) error { hardenProcess() + ensureVaultToken() if len(args) == 0 { return fmt.Errorf("usage: homelab vault code ") } @@ -508,7 +727,9 @@ func statusSummary(run cmdRunner, user, uid string) string { if err != nil { return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error() } - if _, err := run("bw", []string{"sync"}, s.env); err != nil { + // openSession already did a best-effort sync; status re-runs it explicitly so + // a reachability failure surfaces in this report rather than only on stderr. + if _, err := run("bw", bwSyncArgs(), s.env); err != nil { return "vault: configured + unlocked, but sync/reachability failed: " + err.Error() } return "vault: configured, unlocked, reachable ✓" @@ -516,6 +737,7 @@ func statusSummary(run cmdRunner, user, uid string) string { func vaultStatus(args []string) error { hardenProcess() + ensureVaultToken() uid := vaultCurrentUID() unlock, err := withUserLock(uid) if err != nil { @@ -542,32 +764,61 @@ func vaultLock(args []string) error { return nil // lock/logout best-effort; never error the caller } -// vaultPatchPublicArgs writes the non-secret identifiers via argv. Neither the +// kvWriteVerb selects the KV write semantics. merge=true → `kv patch -method=rw` +// (read-modify-write: needs only read+update, NOT the `patch` capability the +// scoped workstation-claude- policy lacks, and preserves co-located keys +// such as claude-auth-sync's claude_ai_oauth_json). 
merge=false → `kv put` +// (creates the path on first use, before any sibling keys exist). +func kvWriteVerb(merge bool) []string { + if merge { + return []string{"kv", "patch", "-method=rw"} + } + return []string{"kv", "put"} +} + +// vaultWritePublicArgs writes the non-secret identifiers via argv. Neither the // email nor the API client_id is a usable credential on its own. -func vaultPatchPublicArgs(user, email, clientID string) []string { - return []string{"kv", "patch", vwCredsPath(user), - "vaultwarden_email=" + email, - "vaultwarden_client_id=" + clientID, - } +func vaultWritePublicArgs(merge bool, user, email, clientID string) []string { + return append(kvWriteVerb(merge), vwCredsPath(user), + "vaultwarden_email="+email, + "vaultwarden_client_id="+clientID, + ) } -// vaultPatchSecretArgs writes ONE secret value via the `key=-` stdin form, so -// the value never appears in argv (ps / /proc//cmdline). The value is fed -// on stdin by realRunnerStdin. -func vaultPatchSecretArgs(user, key string) []string { - return []string{"kv", "patch", vwCredsPath(user), key + "=-"} +// vaultWriteSecretArgs writes ONE secret value via the `key=-` stdin form, so the +// value never appears in argv (ps / /proc//cmdline). Fed on stdin by +// realRunnerStdin. +func vaultWriteSecretArgs(merge bool, user, key string) []string { + return append(kvWriteVerb(merge), vwCredsPath(user), key+"=-") } -// writeCreds stores all four fields in the user's Vault path. The two real -// secrets (master password, API client_secret) go via stdin — never argv. -func writeCreds(user string, c vwCreds) error { - if _, err := realRunner("vault", vaultPatchPublicArgs(user, c.Email, c.ClientID), nil); err != nil { +// credsPathExists reports whether the user's KV path already holds data. 
Used to +// pick create (`kv put`) vs merge (`kv patch -method=rw`) for the first write: +// claude-auth-sync usually creates the path first (Claude OAuth backup), but a +// user could run `homelab vault setup` before that ever happens. +func credsPathExists(run cmdRunner, user string) bool { + _, err := run("vault", []string{"kv", "get", "-format=json", vwCredsPath(user)}, nil) + return err == nil +} + +// cmdRunnerStdin is realRunnerStdin's shape, injected so writeCreds is testable. +type cmdRunnerStdin func(name string, argv, envv []string, stdin string) (string, error) + +// writeCreds stores all four fields in the user's Vault path using only the +// capabilities the scoped policy grants (create/read/update — NOT `patch`). The +// first (public) write creates the path when absent; the two real secrets then +// merge in via read-modify-write so the public keys — and any claude-auth-sync +// keys already present — survive. Secret values travel on stdin, never argv. +func writeCreds(run cmdRunner, runStdin cmdRunnerStdin, user string, c vwCreds) error { + merge := credsPathExists(run, user) + if _, err := run("vault", vaultWritePublicArgs(merge, user, c.Email, c.ClientID), nil); err != nil { return err } - if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil { + // The path now exists regardless of the branch above → merge the secrets in. 
+ if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil { return err } - if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil { + if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil { return err } return nil @@ -593,6 +844,7 @@ func promptLine(prompt string) (string, error) { func vaultSetup(args []string) error { hardenProcess() + ensureVaultToken() fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.") fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.") email, err := promptLine("Vaultwarden email: ") @@ -615,7 +867,7 @@ func vaultSetup(args []string) error { return fmt.Errorf("all fields are required") } c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret} - if err := writeCreds(vaultCurrentUser(), c); err != nil { + if err := writeCreds(realRunner, realRunnerStdin, vaultCurrentUser(), c); err != nil { return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err) } fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…") @@ -634,6 +886,7 @@ func vaultSetup(args []string) error { func vaultGet(args []string) error { hardenProcess() + ensureVaultToken() o, err := parseGetArgs(args) if err != nil { return err @@ -645,6 +898,9 @@ func vaultGet(args []string) error { } defer unlock() user := vaultCurrentUser() + if o.all { + return getAllFields(user, uid, o.name) + } val, err := getValue(realRunner, user, uid, o) if err != nil { return err @@ -661,3 +917,28 @@ func vaultGet(args []string) error { return nil } +// getAllFields prints every field of one item as normalized JSON. 
Like +// `get --json`, the payload is all secret values, so it refuses a terminal +// (pipe it). The TOTP seed is never emitted — only a presence flag — so no extra +// TOTP audit is needed; the op-log uses a distinct verb so a bulk dump is +// distinguishable from a single-field get (the item name is still never logged). +func getAllFields(user, uid, name string) error { + if !jsonToStdoutOK(stdoutIsTTY()) { + return fmt.Errorf("refusing to print all fields as JSON to a terminal; pipe it (e.g. | jq)") + } + raw, err := getItem(realRunner, user, uid, name) + if err != nil { + return err + } + item, err := normalizeItem(raw) + if err != nil { + return err + } + out, err := json.Marshal(item) + if err != nil { + return err + } + writeOpLog(opRecord{User: user, Verb: "get-all", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name}) + fmt.Println(string(out)) + return nil +} diff --git a/cli/cmd_vault_kv.go b/cli/cmd_vault_kv.go new file mode 100644 index 00000000..5f70e6b5 --- /dev/null +++ b/cli/cmd_vault_kv.go @@ -0,0 +1,248 @@ +package main + +import ( + "encoding/json" + "fmt" + "io" + "os" + "strings" +) + +// The `vault kv` verbs talk to HashiCorp Vault / OpenBao — the homelab INFRA +// secrets store (the `secret/…` KV-v2 mount at vault.viktorbarzin.me) — NOT +// Vaultwarden. They are a thin, TTY-aware wrapper over the `vault` CLI that adds +// the same conveniences as the Vaultwarden verbs: a self-defaulted VAULT_ADDR +// (so non-login agent shells work) and clipboard/refuse-on-TTY secret handling. +// +// CREDENTIALS DIFFER FROM THE VAULTWARDEN VERBS. Those use the per-user *scoped* +// token (bound only to secret/workstation/claude-users/). A general kv read +// of e.g. secret/viktor must use the caller's OWN Vault token (the OIDC +// ~/.vault-token or an explicit $VAULT_TOKEN) — the scoped token has `deny` +// everywhere else and would 403. 
So the kv handlers call ensureVaultAddr() to +// guarantee VAULT_ADDR but deliberately do NOT call ensureVaultToken() (which +// injects the scoped token). Access is then whatever the caller's policy grants. +func vaultKVCommands() []Command { + return []Command{ + {Path: []string{"vault", "kv", "get"}, Tier: TierRead, + Summary: "[hashicorp-vault] read an infra KV secret: vault kv get [--field K]", Run: vaultKVGet}, + {Path: []string{"vault", "kv", "list"}, Tier: TierRead, + Summary: "[hashicorp-vault] list infra KV sub-paths: vault kv list ", Run: vaultKVList}, + {Path: []string{"vault", "kv", "put"}, Tier: TierWrite, + Summary: "[hashicorp-vault] write one KV key (value via stdin): vault kv put ", Run: vaultKVPut}, + {Path: []string{"vault", "kv"}, Tier: TierRead, + Summary: "[hashicorp-vault] infra secrets (run `homelab vault kv` for help)", + Run: func([]string) error { fmt.Print(vaultKVHelp()); return nil }}, + } +} + +func vaultKVHelp() string { + return `homelab vault kv — HashiCorp Vault / OpenBao (homelab INFRA secrets, the secret/… KV store) + + homelab vault kv get [--field K] read a secret + --field K → one value (TTY → clipboard; piped → stdout) + no --field → all fields as JSON (piped only) + homelab vault kv list list sub-paths under (no values) + homelab vault kv put write one key; value read from stdin + (piped, or no-echo prompt); merges — never clobbers siblings + +Uses YOUR Vault token (vault login -method=oidc → ~/.vault-token); access is +whatever your policy grants. This is NOT Vaultwarden — for your personal logins +use 'homelab vault get' (see 'homelab vault'). 
+` +} + +// --- arg builders (pure; values never travel via argv) -------------------- + +func vaultKVGetFieldArgs(path, field string) []string { + return []string{"kv", "get", "-field=" + field, path} +} +func vaultKVGetJSONArgs(path string) []string { return []string{"kv", "get", "-format=json", path} } +func vaultKVListArgs(path string) []string { return []string{"kv", "list", "-format=json", path} } + +// vaultKVPutArgs builds the write argv. merge=true → `kv patch -method=rw` +// (read-modify-write: merges, needs only read+update — not the `patch` capability +// — and preserves sibling keys); merge=false → `kv put` (creates the path on +// first write). The value is ALWAYS read from stdin via the `=-` form, so it +// never appears in argv (visible via ps / /proc//cmdline to same-UID procs). +func vaultKVPutArgs(merge bool, path, key string) []string { + return append(kvWriteVerb(merge), path, key+"=-") +} + +// --- pure parsers ---------------------------------------------------------- + +// extractKVData returns the inner secret object from a `vault kv get -format=json` +// envelope (`{"data":{"data":{…},"metadata":{…}}}`), dropping the metadata/request +// wrapper so only the secret's own key→value data is emitted. +func extractKVData(jsonOut string) (string, error) { + var env struct { + Data struct { + Data json.RawMessage `json:"data"` + } `json:"data"` + } + if err := json.Unmarshal([]byte(jsonOut), &env); err != nil { + return "", fmt.Errorf("parse vault kv json: %w", err) + } + if len(env.Data.Data) == 0 { + return "", fmt.Errorf("no secret data at that path") + } + return string(env.Data.Data), nil +} + +// parseKVList parses the JSON array `vault kv list -format=json` prints. 
+func parseKVList(jsonOut string) ([]string, error) { + var keys []string + if err := json.Unmarshal([]byte(jsonOut), &keys); err != nil { + return nil, fmt.Errorf("parse vault kv list json: %w", err) + } + return keys, nil +} + +// --- testable cores (injected cmdRunner) ----------------------------------- + +func kvGetField(run cmdRunner, path, field string) (string, error) { + return run("vault", vaultKVGetFieldArgs(path, field), nil) +} + +func kvGetJSON(run cmdRunner, path string) (string, error) { + out, err := run("vault", vaultKVGetJSONArgs(path), nil) + if err != nil { + return "", err + } + return extractKVData(out) +} + +func kvList(run cmdRunner, path string) ([]string, error) { + out, err := run("vault", vaultKVListArgs(path), nil) + if err != nil { + return nil, err + } + return parseKVList(out) +} + +// kvPathExists reports whether the KV path already holds data, to pick create +// (`kv put`) vs merge (`kv patch -method=rw`) — so a write never clobbers +// sibling keys on an existing path. +func kvPathExists(run cmdRunner, path string) bool { + _, err := run("vault", vaultKVGetJSONArgs(path), nil) + return err == nil +} + +// kvPut writes one key, creating the path when absent and merging when present. +// The value travels on stdin only (never argv). 
+func kvPut(run cmdRunner, runStdin cmdRunnerStdin, path, key, value string) error { + merge := kvPathExists(run, path) + _, err := runStdin("vault", vaultKVPutArgs(merge, path, key), nil, value) + return err +} + +// --- handlers -------------------------------------------------------------- + +func vaultKVGet(args []string) error { + hardenProcess() + ensureVaultAddr() // own token, NOT the scoped one (see file header) + var path, field string + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "--field" && i+1 < len(args): + field = args[i+1] + i++ + case strings.HasPrefix(a, "--field="): + field = strings.TrimPrefix(a, "--field=") + case !strings.HasPrefix(a, "-") && path == "": + path = a + } + } + if path == "" { + return fmt.Errorf("usage: homelab vault kv get [--field ]") + } + if field != "" { + val, err := kvGetField(realRunner, path, field) + if err != nil { + return err + } + emitSecret(val) // TTY-aware: clipboard on a terminal, stdout when piped + return nil + } + // No --field → the whole secret. All values, so refuse a bare TTY (like + // `vault get --json`): pick a --field for the clipboard path, or pipe it. + if !jsonToStdoutOK(stdoutIsTTY()) { + return fmt.Errorf("refusing to print all KV fields as JSON to a terminal; use --field , or pipe it (e.g. 
| jq)") + } + out, err := kvGetJSON(realRunner, path) + if err != nil { + return err + } + fmt.Println(out) + return nil +} + +func vaultKVList(args []string) error { + ensureVaultAddr() + var path string + for _, a := range args { + if !strings.HasPrefix(a, "-") { + path = a + break + } + } + if path == "" { + return fmt.Errorf("usage: homelab vault kv list ") + } + keys, err := kvList(realRunner, path) + if err != nil { + return err + } + for _, k := range keys { + fmt.Println(k) + } + return nil +} + +func vaultKVPut(args []string) error { + hardenProcess() + ensureVaultAddr() + var path, key string + for _, a := range args { + if strings.HasPrefix(a, "-") { + continue + } + switch { + case path == "": + path = a + case key == "": + key = a + } + } + if path == "" || key == "" { + return fmt.Errorf("usage: homelab vault kv put (value read from stdin)") + } + value, err := readSecretValue("Value for " + key + ": ") + if err != nil { + return err + } + if value == "" { + return fmt.Errorf("empty value; aborting (nothing written)") + } + if err := kvPut(realRunner, realRunnerStdin, path, key, value); err != nil { + return fmt.Errorf("writing %q to %s failed (does your token have write access? path correct?): %w", key, path, err) + } + fmt.Fprintln(os.Stderr, "wrote "+key+" to "+path) + return nil +} + +// readSecretValue obtains a secret value WITHOUT putting it in argv: piped stdin +// is read verbatim (trailing newline trimmed, internal newlines preserved so +// multi-line values like PEM keys survive); an interactive TTY is prompted +// without echo. 
+func readSecretValue(prompt string) (string, error) { + fi, err := os.Stdin.Stat() + if err == nil && fi.Mode()&os.ModeCharDevice == 0 { + b, rerr := io.ReadAll(os.Stdin) + if rerr != nil { + return "", rerr + } + return strings.TrimRight(string(b), "\r\n"), nil + } + return promptNoEcho(prompt) +} diff --git a/cli/cmd_vault_test.go b/cli/cmd_vault_test.go index 36aab1f4..fbfd876d 100644 --- a/cli/cmd_vault_test.go +++ b/cli/cmd_vault_test.go @@ -2,6 +2,8 @@ package main import ( "encoding/base64" + "encoding/json" + "errors" "fmt" "os" "reflect" @@ -70,7 +72,7 @@ func (f *fakeRunner) run(name string, argv, envv []string) (string, error) { func TestLoadCredsReadsFourFields(t *testing.T) { f := &fakeRunner{out: map[string]string{ - "vault kv get -field=vaultwarden_email secret/workstation/claude-users/emo": "emo@x.me", + "vault kv get -field=vaultwarden_email secret/workstation/claude-users/emo": "emo@x.me", "vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "hunter2", "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.abc", "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "sek", @@ -233,12 +235,181 @@ func TestStatusSummaryUnconfigured(t *testing.T) { } } -func TestVaultPatchPublicArgs(t *testing.T) { - got := vaultPatchPublicArgs("emo", "e@x.me", "user.ci") - want := []string{"kv", "patch", "secret/workstation/claude-users/emo", +func TestEnsureVaultTokenSetsScopedFallback(t *testing.T) { + dir := t.TempDir() + cfg := dir + "/.config/claude-auth-sync" + if err := os.MkdirAll(cfg, 0o700); err != nil { + t.Fatal(err) + } + if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK\n"), 0o600); err != nil { + t.Fatal(err) + } + t.Setenv("HOME", dir) + t.Setenv("VAULT_TOKEN", "") // no ambient token + + ensureVaultToken() + if got := os.Getenv("VAULT_TOKEN"); got != "SCOPED-TOK" { + t.Fatalf("VAULT_TOKEN = %q, want scoped fallback to be exported", got) + } 
+} + +func TestEnsureVaultTokenKeepsExplicitEnv(t *testing.T) { + dir := t.TempDir() + cfg := dir + "/.config/claude-auth-sync" + if err := os.MkdirAll(cfg, 0o700); err != nil { + t.Fatal(err) + } + if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK"), 0o600); err != nil { + t.Fatal(err) + } + t.Setenv("HOME", dir) + t.Setenv("VAULT_TOKEN", "ADMIN-TOK") + + ensureVaultToken() + if got := os.Getenv("VAULT_TOKEN"); got != "ADMIN-TOK" { + t.Fatalf("VAULT_TOKEN = %q, must not override an explicit token", got) + } +} + +func TestEnsureVaultTokenPrefersScopedOverFile(t *testing.T) { + // Regression: a power-user's read-only OIDC ~/.vault-token must NOT shadow the + // purpose-built scoped token (emo's setup hit 403 because it did, 2026-06-28). + dir := t.TempDir() + cfg := dir + "/.config/claude-auth-sync" + if err := os.MkdirAll(cfg, 0o700); err != nil { + t.Fatal(err) + } + if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK"), 0o600); err != nil { + t.Fatal(err) + } + if err := os.WriteFile(dir+"/.vault-token", []byte("STALE-OIDC-TOK"), 0o600); err != nil { + t.Fatal(err) + } + t.Setenv("HOME", dir) + t.Setenv("VAULT_TOKEN", "") + + ensureVaultToken() + if got := os.Getenv("VAULT_TOKEN"); got != "SCOPED-TOK" { + t.Fatalf("VAULT_TOKEN = %q, want the scoped token to win over a stale ~/.vault-token", got) + } +} + +func TestScopedTokenPath(t *testing.T) { + if got := scopedTokenPath("/home/emo"); got != "/home/emo/.config/claude-auth-sync/vault-token" { + t.Fatalf("scopedTokenPath = %q", got) + } +} + +func TestVaultTokenSource(t *testing.T) { + // Precedence: explicit $VAULT_TOKEN > the claude-auth-sync per-user scoped + // token > a native ~/.vault-token. Scoped beats the file so a power-user's + // read-only OIDC ~/.vault-token can't shadow the scoped token on the user's + // own path (emo, 2026-06-28). 
+ cases := []struct { + name string + env string + haveVaultToken bool + scoped string + wantTok, wantSrc string + }{ + {"explicit env wins", "abc", true, "S", "", "env"}, + {"scoped beats a stale ~/.vault-token", "", true, "S-TOK", "S-TOK", "scoped"}, + {"scoped used when no file", "", false, "S-TOK", "S-TOK", "scoped"}, + {"native ~/.vault-token only when no scoped", "", true, "", "", "file"}, + {"scoped value is trimmed", "", false, " S-TOK\n", "S-TOK", "scoped"}, + {"whitespace-only scoped falls back to file", "", true, " \n", "", "file"}, + {"nothing configured", "", false, "", "", "none"}, + } + for _, c := range cases { + tok, src := vaultTokenSource(c.env, c.haveVaultToken, c.scoped) + if tok != c.wantTok || src != c.wantSrc { + t.Errorf("%s: vaultTokenSource(%q,%v,%q) = (%q,%q), want (%q,%q)", + c.name, c.env, c.haveVaultToken, c.scoped, tok, src, c.wantTok, c.wantSrc) + } + } +} + +func TestVaultAddrToSet(t *testing.T) { + // homelab vault is invoked by AFK agent sessions (non-login shells that + // never sourced /etc/environment), so the CLI must self-default VAULT_ADDR + // rather than rely on the ambient env — else every `vault` child hits the + // 127.0.0.1:8200 default and fails "connection refused" (exit 2). 
+ cases := []struct { + name, env, want string + }{ + {"unset -> default", "", vaultAddrDefault}, + {"whitespace-only -> default", " \n", vaultAddrDefault}, + {"explicit kept (empty = leave alone)", "https://vault.example.com", ""}, + } + for _, c := range cases { + if got := vaultAddrToSet(c.env); got != c.want { + t.Errorf("%s: vaultAddrToSet(%q) = %q, want %q", c.name, c.env, got, c.want) + } + } +} + +func TestEnsureVaultTokenSetsDefaultAddr(t *testing.T) { + dir := t.TempDir() // no scoped token, no ~/.vault-token + t.Setenv("HOME", dir) + t.Setenv("VAULT_TOKEN", "") + t.Setenv("VAULT_ADDR", "") // emo's non-login-shell situation + + ensureVaultToken() + if got := os.Getenv("VAULT_ADDR"); got != vaultAddrDefault { + t.Fatalf("VAULT_ADDR = %q, want default %q to be exported", got, vaultAddrDefault) + } +} + +func TestEnsureVaultTokenKeepsExplicitAddr(t *testing.T) { + dir := t.TempDir() + t.Setenv("HOME", dir) + t.Setenv("VAULT_TOKEN", "") + t.Setenv("VAULT_ADDR", "https://vault.example.com") + + ensureVaultToken() + if got := os.Getenv("VAULT_ADDR"); got != "https://vault.example.com" { + t.Fatalf("VAULT_ADDR = %q, must not override an explicit addr", got) + } +} + +func TestAugmentErrSurfacesStderr(t *testing.T) { + if got := augmentErr(nil, []byte("ignored")); got != nil { + t.Fatalf("augmentErr(nil, …) = %v, want nil", got) + } + base := errors.New("exit status 2") + got := augmentErr(base, []byte(" dial tcp 127.0.0.1:8200: connect: connection refused\n")) + if got == nil || !strings.Contains(got.Error(), "connection refused") || !strings.Contains(got.Error(), "exit status 2") { + t.Fatalf("augmentErr did not surface stderr: %v", got) + } + if !errors.Is(got, base) { + t.Fatal("augmentErr lost the wrapped error (errors.Is failed)") + } + if got := augmentErr(base, []byte(" ")); got != base { + t.Fatalf("augmentErr with blank stderr = %v, want the original error unchanged", got) + } +} + +func TestKvWriteVerb(t *testing.T) { + // merge=true → 
read-modify-write patch (needs only read+update, NOT the + // `patch` capability the scoped workstation policy lacks). + if got := kvWriteVerb(true); !reflect.DeepEqual(got, []string{"kv", "patch", "-method=rw"}) { + t.Fatalf("kvWriteVerb(true) = %v", got) + } + // merge=false → put (creates the path on first use) + if got := kvWriteVerb(false); !reflect.DeepEqual(got, []string{"kv", "put"}) { + t.Fatalf("kvWriteVerb(false) = %v", got) + } +} + +func TestVaultWritePublicArgs(t *testing.T) { + got := vaultWritePublicArgs(true, "emo", "e@x.me", "user.ci") + want := []string{"kv", "patch", "-method=rw", "secret/workstation/claude-users/emo", "vaultwarden_email=e@x.me", "vaultwarden_client_id=user.ci"} if !reflect.DeepEqual(got, want) { - t.Fatalf("vaultPatchPublicArgs = %v", got) + t.Fatalf("vaultWritePublicArgs(merge) = %v", got) + } + if got := vaultWritePublicArgs(false, "emo", "e@x.me", "user.ci"); got[0] != "kv" || got[1] != "put" { + t.Fatalf("vaultWritePublicArgs(create) must use `kv put`, got %v", got) } for _, a := range got { if strings.Contains(a, "master_password") || strings.Contains(a, "client_secret") { @@ -247,12 +418,12 @@ func TestVaultPatchPublicArgs(t *testing.T) { } } -func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) { +func TestVaultWriteSecretArgsNoValueInArgv(t *testing.T) { for _, key := range []string{"vaultwarden_master_password", "vaultwarden_client_secret"} { - got := vaultPatchSecretArgs("emo", key) - want := []string{"kv", "patch", "secret/workstation/claude-users/emo", key + "=-"} + got := vaultWriteSecretArgs(true, "emo", key) + want := []string{"kv", "patch", "-method=rw", "secret/workstation/claude-users/emo", key + "=-"} if !reflect.DeepEqual(got, want) { - t.Fatalf("vaultPatchSecretArgs(%q) = %v", key, got) + t.Fatalf("vaultWriteSecretArgs(%q) = %v", key, got) } if got[len(got)-1] != key+"=-" { t.Fatalf("secret value must be read from stdin (`%s=-`), got %v", key, got) @@ -260,6 +431,90 @@ func 
TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) { } } +// recStdin records a stdin-bearing call for assertions. +type recStdin struct { + argv []string + stdin string +} + +// TestWriteCredsCreatesThenMerges: when the path is ABSENT the first (public) +// write must `kv put` (create), and the two secrets must merge via patch -rw +// with values on stdin only — never the buggy plain `kv patch` (needs `patch`). +func TestWriteCredsCreatesThenMerges(t *testing.T) { + var calls [][]string + var stdinCalls []recStdin + run := func(name string, argv, envv []string) (string, error) { + calls = append(calls, append([]string{name}, argv...)) + if len(argv) >= 2 && argv[0] == "kv" && argv[1] == "get" { + return "", fmt.Errorf("no value found") // path absent + } + return "", nil + } + runStdin := func(name string, argv, envv []string, stdin string) (string, error) { + stdinCalls = append(stdinCalls, recStdin{append([]string{name}, argv...), stdin}) + return "", nil + } + c := vwCreds{Email: "e@x.me", MasterPassword: "PW", ClientID: "user.ci", ClientSecret: "CS"} + if err := writeCreds(run, runStdin, "emo", c); err != nil { + t.Fatalf("writeCreds: %v", err) + } + var sawPut, sawPlainPatch bool + for _, cl := range calls { + j := strings.Join(cl, " ") + if strings.Contains(j, "kv put") { + sawPut = true + } + if strings.Contains(j, "kv patch") && !strings.Contains(j, "-method=rw") { + sawPlainPatch = true + } + } + if !sawPut { + t.Fatalf("path absent → public write must be `kv put`; calls=%v", calls) + } + if sawPlainPatch { + t.Fatalf("must never use plain `kv patch` (needs `patch` capability); calls=%v", calls) + } + if len(stdinCalls) != 2 { + t.Fatalf("want 2 stdin secret writes, got %d", len(stdinCalls)) + } + for _, sc := range stdinCalls { + if !strings.Contains(strings.Join(sc.argv, " "), "kv patch -method=rw") { + t.Errorf("secret write must use patch -method=rw: %v", sc.argv) + } + for _, a := range sc.argv { + if strings.Contains(a, "PW") || strings.Contains(a, 
"CS") { + t.Errorf("secret leaked into argv: %v", sc.argv) + } + } + } + if stdinCalls[0].stdin != "PW" || stdinCalls[1].stdin != "CS" { + t.Errorf("stdin values wrong: %q,%q", stdinCalls[0].stdin, stdinCalls[1].stdin) + } +} + +// TestWriteCredsMergesWhenPresent: when the path EXISTS, every write must merge +// (patch -rw) — a `kv put` would wipe sibling keys (e.g. claude_ai_oauth_json). +func TestWriteCredsMergesWhenPresent(t *testing.T) { + var calls [][]string + run := func(name string, argv, envv []string) (string, error) { + calls = append(calls, append([]string{name}, argv...)) + return "{}", nil // get succeeds → path exists + } + runStdin := func(name string, argv, envv []string, stdin string) (string, error) { + calls = append(calls, append([]string{name}, argv...)) + return "", nil + } + c := vwCreds{Email: "e@x.me", MasterPassword: "PW", ClientID: "user.ci", ClientSecret: "CS"} + if err := writeCreds(run, runStdin, "emo", c); err != nil { + t.Fatalf("writeCreds: %v", err) + } + for _, cl := range calls { + if strings.Contains(strings.Join(cl, " "), "kv put") { + t.Fatalf("path exists → must NOT `kv put` (wipes siblings): %v", cl) + } + } +} + // TestNoSecretInArgvAcrossFlow is the load-bearing security test: across the // whole get flow (vault reads, bw config/status/login/unlock/get) NO secret // value may appear in any command's argv — secrets travel via env/stdin only. 
@@ -267,8 +522,8 @@ func TestNoSecretInArgvAcrossFlow(t *testing.T) { uid := fmt.Sprintf("%d", os.Getuid()) f := &fakeRunner{out: map[string]string{ "vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "SUPERSECRETPW", - "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x", - "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "CLIENTSEKRET", + "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x", + "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "CLIENTSEKRET", "bw status": `{"status":"locked"}`, "bw unlock": "SESSIONXYZ", "bw get password github": "p@ss", @@ -353,8 +608,8 @@ func TestVaultBareGroupRegistered(t *testing.T) { func TestGetValueFlow(t *testing.T) { f := &fakeRunner{out: map[string]string{ "vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw", - "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x", - "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs", + "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x", + "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs", "bw status": `{"status":"locked"}`, "bw unlock": "SESS", "bw get password github": "p@ss", @@ -366,3 +621,437 @@ func TestGetValueFlow(t *testing.T) { t.Fatalf("getValue = %q, %v", val, err) } } + +// --- vault get --all (browse all fields) ---------------------------------- + +func TestParseGetArgsAll(t *testing.T) { + o, err := parseGetArgs([]string{"github", "--all"}) + if err != nil || o.name != "github" || !o.all { + t.Fatalf("parseGetArgs(--all) = %+v err=%v", o, err) + } + // --all must skip --field validation (field is irrelevant for a full dump). 
+ if _, err := parseGetArgs([]string{"github", "--all", "--field", "evil"}); err != nil { + t.Fatalf("--all must ignore an otherwise-invalid --field, got err=%v", err) + } + // A name is still required. + if _, err := parseGetArgs([]string{"--all"}); err == nil { + t.Fatal("get --all with no name must error") + } + // Without --all, the field allowlist still applies. + if _, err := parseGetArgs([]string{"github", "--field", "evil"}); err == nil { + t.Fatal("invalid --field without --all must still error") + } +} + +func TestBwItemArgs(t *testing.T) { + argv := bwItemArgs("github") + if !reflect.DeepEqual(argv, []string{"get", "item", "github"}) { + t.Fatalf("bwItemArgs = %v", argv) + } + for _, a := range argv { + if strings.Contains(a, "SESSION") || a == "--session" { + t.Fatalf("session must travel via env, not argv: %v", argv) + } + } +} + +// a representative `bw get item` payload: login fields, multiple URIs, a TOTP +// seed, notes, custom fields (text/hidden/boolean), plus bw internals that MUST +// be dropped (id/object/reprompt/passwordHistory). 
+const sampleLoginItemJSON = `{ + "object":"item","id":"abc-123","folderId":null,"type":1,"reprompt":0, + "name":"GitHub","notes":"my notes","favorite":false, + "fields":[ + {"name":"PIN","value":"1234","type":1}, + {"name":"endpoint","value":"https://api.gh","type":0}, + {"name":"enabled","value":"true","type":2} + ], + "login":{ + "username":"octocat","password":"hunter2", + "totp":"otpauth://totp/GitHub:octocat?secret=SEEDSEEDSEED", + "uris":[{"match":null,"uri":"https://github.com"},{"match":null,"uri":"https://gist.github.com"}] + }, + "passwordHistory":[{"password":"OLD-PASSWORD-XYZ"}] +}` + +func TestNormalizeItemLogin(t *testing.T) { + n, err := normalizeItem(sampleLoginItemJSON) + if err != nil { + t.Fatalf("normalizeItem: %v", err) + } + if n.Name != "GitHub" || n.Username != "octocat" || n.Password != "hunter2" || n.Notes != "my notes" { + t.Fatalf("standard fields wrong: %+v", n) + } + if !n.TOTP { + t.Fatal("TOTP presence flag must be true when a seed exists") + } + if !reflect.DeepEqual(n.URIs, []string{"https://github.com", "https://gist.github.com"}) { + t.Fatalf("URIs = %v", n.URIs) + } + want := map[string]string{"PIN": "1234", "endpoint": "https://api.gh", "enabled": "true"} + if !reflect.DeepEqual(n.Fields, want) { + t.Fatalf("custom fields = %v want %v", n.Fields, want) + } +} + +// The load-bearing security test: the raw TOTP seed (more powerful than a +// one-time code) and the password history must NEVER appear in the dump. 
+func TestNormalizeItemNeverLeaksSeedOrHistory(t *testing.T) { + n, err := normalizeItem(sampleLoginItemJSON) + if err != nil { + t.Fatalf("normalizeItem: %v", err) + } + out, err := json.Marshal(n) + if err != nil { + t.Fatalf("marshal: %v", err) + } + for _, leak := range []string{"SEEDSEEDSEED", "otpauth", "OLD-PASSWORD-XYZ", "passwordHistory", "abc-123"} { + if strings.Contains(string(out), leak) { + t.Fatalf("dump leaked %q: %s", leak, out) + } + } +} + +func TestNormalizeItemNoTOTP(t *testing.T) { + n, err := normalizeItem(`{"name":"X","type":1,"login":{"username":"u","password":"p"}}`) + if err != nil { + t.Fatalf("normalizeItem: %v", err) + } + if n.TOTP { + t.Fatal("TOTP must be false when no seed present") + } + out, _ := json.Marshal(n) + if strings.Contains(string(out), "totp") { + t.Fatalf("no-totp item must omit the totp key entirely: %s", out) + } +} + +func TestNormalizeItemEmptyStandardFieldsOmitted(t *testing.T) { + n, err := normalizeItem(`{"name":"Bare","type":1,"login":{"username":"","password":"","totp":"","uris":[]},"fields":[{"name":"only","value":"x","type":0}]}`) + if err != nil { + t.Fatalf("normalizeItem: %v", err) + } + out, _ := json.Marshal(n) + for _, k := range []string{"username", "password", "uris", "notes", "totp"} { + if strings.Contains(string(out), `"`+k+`"`) { + t.Fatalf("empty standard field %q must be omitted: %s", k, out) + } + } + if !strings.Contains(string(out), `"name":"Bare"`) || !strings.Contains(string(out), `"only":"x"`) { + t.Fatalf("name + custom field must survive: %s", out) + } +} + +func TestNormalizeItemSecureNoteNullLogin(t *testing.T) { + // type 2 (secure note): login is null — must not panic; notes + custom fields survive. 
+ n, err := normalizeItem(`{"name":"SN","type":2,"notes":"secret note","login":null,"fields":[{"name":"k","value":"v","type":1}]}`) + if err != nil { + t.Fatalf("normalizeItem(null login): %v", err) + } + if n.Name != "SN" || n.Notes != "secret note" || n.Fields["k"] != "v" { + t.Fatalf("secure-note normalize wrong: %+v", n) + } + if n.Username != "" || n.Password != "" || n.TOTP { + t.Fatalf("login fields must be empty for a login-less item: %+v", n) + } +} + +func TestNormalizeItemDuplicateCustomNames(t *testing.T) { + // Bitwarden permits duplicate custom-field names; a JSON object can't hold + // dups, so last-wins (documented). + n, err := normalizeItem(`{"name":"D","fields":[{"name":"k","value":"first","type":0},{"name":"k","value":"second","type":0}]}`) + if err != nil { + t.Fatalf("normalizeItem: %v", err) + } + if n.Fields["k"] != "second" { + t.Fatalf("duplicate custom names must be last-wins, got %q", n.Fields["k"]) + } +} + +func TestNormalizeItemLinkedFieldSkipped(t *testing.T) { + // type 3 (linked) fields reference another field and carry a null value — + // they are not real data and must be skipped. + n, err := normalizeItem(`{"name":"L","login":{"username":"u"},"fields":[{"name":"linked","value":null,"type":3},{"name":"real","value":"r","type":0}]}`) + if err != nil { + t.Fatalf("normalizeItem: %v", err) + } + if _, ok := n.Fields["linked"]; ok { + t.Fatalf("linked field must be skipped: %v", n.Fields) + } + if n.Fields["real"] != "r" { + t.Fatalf("real custom field dropped: %v", n.Fields) + } +} + +func TestNormalizeItemMalformed(t *testing.T) { + if _, err := normalizeItem("not json"); err == nil { + t.Fatal("malformed item JSON must error") + } +} + +// getItem opens a session and runs `bw get item `, returning raw JSON. 
+func TestGetItemFlow(t *testing.T) { + f := &fakeRunner{out: map[string]string{ + "vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw", + "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x", + "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs", + "bw status": `{"status":"locked"}`, + "bw unlock": "SESS", + "bw get item github": sampleLoginItemJSON, + }} + uid := fmt.Sprintf("%d", os.Getuid()) + raw, err := getItem(f.run, "emo", uid, "github") + if err != nil || !strings.Contains(raw, `"name":"GitHub"`) { + t.Fatalf("getItem = %q, %v", raw, err) + } + // The session key must reach bw via env, never argv. + for _, call := range f.calls { + for _, arg := range call { + if strings.Contains(arg, "SESS") { + t.Errorf("session leaked into argv: %v", call) + } + } + } +} + +func TestVaultHelpMentionsAll(t *testing.T) { + if !strings.Contains(vaultHelp(), "--all") { + t.Error("vault help must document --all") + } +} + +// --- bw sync on read (freshness) ------------------------------------------ + +func TestBwSyncArgs(t *testing.T) { + if got := bwSyncArgs(); !reflect.DeepEqual(got, []string{"sync"}) { + t.Fatalf("bwSyncArgs = %v", got) + } +} + +// Every read opens a session that first `bw sync`s, so reads reflect the latest +// server-side values: `bw unlock` is local-only, so without a sync a persisted +// (already-logged-in) session serves a stale local cache. 
+func TestOpenSessionSyncsBeforeRead(t *testing.T) { + f := &fakeRunner{out: map[string]string{ + "vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw", + "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x", + "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs", + "bw status": `{"status":"locked"}`, + "bw unlock": "SESS", + "bw sync": "Syncing complete.", + "bw get password github": "p@ss", + }} + uid := fmt.Sprintf("%d", os.Getuid()) + if _, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"}); err != nil { + t.Fatalf("getValue: %v", err) + } + idx := func(prefix string) int { + for i, c := range f.calls { + if strings.HasPrefix(strings.Join(c, " "), prefix) { + return i + } + } + return -1 + } + syncAt, unlockAt, getAt := idx("bw sync"), idx("bw unlock"), idx("bw get password github") + if syncAt < 0 { + t.Fatal("expected a `bw sync` before the read") + } + if !(unlockAt < syncAt && syncAt < getAt) { + t.Fatalf("order wrong: unlock=%d sync=%d get=%d (want unlock= 2 && argv[0] == "kv" && argv[1] == "get" { + if tc.exists { + return `{"data":{"data":{}}}`, nil + } + return "", fmt.Errorf("No value found at secret/x") + } + return "", nil + } + runStdin := func(name string, argv, envv []string, stdin string) (string, error) { + stdinCalls = append(stdinCalls, recStdin{append([]string{name}, argv...), stdin}) + return "", nil + } + if err := kvPut(run, runStdin, "secret/x", "api_key", "SECRETVALUE"); err != nil { + t.Fatalf("kvPut: %v", err) + } + if len(stdinCalls) != 1 { + t.Fatalf("want exactly 1 stdin write, got %d", len(stdinCalls)) + } + sc := stdinCalls[0] + joined := strings.Join(sc.argv, " ") + if tc.wantCreate && !strings.Contains(joined, "kv put") { + t.Fatalf("absent path must use `kv put`: %v", sc.argv) + } + if !tc.wantCreate && !strings.Contains(joined, "kv patch -method=rw") { + t.Fatalf("present path must merge 
via `kv patch -method=rw`: %v", sc.argv) + } + if strings.Contains(joined, "kv patch") && !strings.Contains(joined, "-method=rw") { + t.Fatalf("must never use plain `kv patch`: %v", sc.argv) + } + if sc.stdin != "SECRETVALUE" { + t.Fatalf("value must travel via stdin, got %q", sc.stdin) + } + for _, a := range sc.argv { + if strings.Contains(a, "SECRETVALUE") { + t.Fatalf("value leaked into argv: %v", sc.argv) + } + } + }) + } +} + +func TestVaultHelpMentionsBothSystems(t *testing.T) { + h := vaultHelp() + for _, want := range []string{"Vaultwarden", "vault kv"} { + if !strings.Contains(h, want) { + t.Errorf("vault help must mention %q (distinguish the two systems)", want) + } + } + // Must name the infra-secrets system so the distinction is unambiguous. + if !strings.Contains(h, "HashiCorp") && !strings.Contains(h, "OpenBao") { + t.Error("vault help must name HashiCorp Vault / OpenBao (the infra secrets store)") + } +} diff --git a/cli/edges.go b/cli/edges.go new file mode 100644 index 00000000..396cc5b9 --- /dev/null +++ b/cli/edges.go @@ -0,0 +1,164 @@ +package main + +import ( + "fmt" + "regexp" + "strconv" + "strings" +) + +// edgesOpts is the parsed filter set for `homelab edges` (the who-talks-to-whom +// investigation helper over the goldmane_edges trail; see ADR-0014). +type edgesOpts struct { + ns string // edges touching this namespace (either direction) + src string // edges where src_ns = this + dst string // edges where dst_ns = this + peersOf string // distinct peers of this namespace (both directions) + newSince string // first_seen >= duration (24h/7d/30m) or date (YYYY-MM-DD) + denied bool // action = 'deny' only + asJSON bool // wrap result as a JSON array + limit int // row cap (default 200) +} + +// parseEdgesArgs parses the edges flag surface. Unknown flags error out so a +// typo surfaces instead of silently dumping the whole table. 
+func parseEdgesArgs(args []string) (edgesOpts, error) { + o := edgesOpts{limit: 200} + i := 0 + for i < len(args) { + a := args[i] + key, inline, hasInline := a, "", false + if eq := strings.IndexByte(a, '='); eq >= 0 { + key, inline, hasInline = a[:eq], a[eq+1:], true + } + needVal := func() (string, error) { + if hasInline { + return inline, nil + } + if i+1 < len(args) { + i++ + return args[i], nil + } + return "", fmt.Errorf("flag %s needs a value", key) + } + var err error + switch key { + case "--ns": + o.ns, err = needVal() + case "--src": + o.src, err = needVal() + case "--dst": + o.dst, err = needVal() + case "--peers-of": + o.peersOf, err = needVal() + case "--new-since": + o.newSince, err = needVal() + case "--denied": + o.denied = true + case "--json": + o.asJSON = true + case "--limit": + var v string + if v, err = needVal(); err == nil { + if o.limit, err = strconv.Atoi(v); err != nil { + err = fmt.Errorf("--limit must be an integer: %q", v) + } + } + default: + return o, fmt.Errorf("unknown flag: %s", a) + } + if err != nil { + return o, err + } + i++ + } + return o, nil +} + +// nsRE is the safe namespace-token charset (k8s names + "Global"). Used as the +// injection guard — anything else is rejected rather than quoted-and-hoped. +var nsRE = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`) + +func validateNS(s string) error { + if s == "" || len(s) > 63 || !nsRE.MatchString(s) { + return fmt.Errorf("invalid namespace name: %q", s) + } + return nil +} + +// sqlStr renders a SQL string literal (belt-and-suspenders on top of validateNS). +func sqlStr(s string) string { return "'" + strings.ReplaceAll(s, "'", "''") + "'" } + +var ( + durRE = regexp.MustCompile(`^(\d+)([smhd])$`) + dateRE = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}([ T]\d{2}:\d{2}(:\d{2})?)?$`) +) + +// newSinceCond turns a duration (24h/7d/30m/90s) or a date (YYYY-MM-DD[ HH:MM]) +// into a first_seen predicate. 
+func newSinceCond(v string) (string, error) { + if m := durRE.FindStringSubmatch(v); m != nil { + unit := map[string]string{"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}[m[2]] + return fmt.Sprintf("first_seen >= now() - interval '%s %s'", m[1], unit), nil + } + if dateRE.MatchString(v) { + return "first_seen >= " + sqlStr(v), nil + } + return "", fmt.Errorf("--new-since must be a duration (e.g. 24h, 7d, 30m) or a date (YYYY-MM-DD): %q", v) +} + +// buildEdgesQuery renders the SQL for the given filters against the `edge` table. +func buildEdgesQuery(o edgesOpts) (string, error) { + limit := o.limit + if limit <= 0 { + limit = 200 + } + + // peers-of is a distinct-peer summary, a different shape from the row list. + if o.peersOf != "" { + if err := validateNS(o.peersOf); err != nil { + return "", err + } + p := sqlStr(o.peersOf) + return fmt.Sprintf("SELECT DISTINCT peer, action FROM ("+ + "SELECT dst_ns AS peer, action FROM edge WHERE src_ns = %s "+ + "UNION SELECT src_ns AS peer, action FROM edge WHERE dst_ns = %s"+ + ") t ORDER BY peer LIMIT %d", p, p, limit), nil + } + + var conds []string + for _, f := range []struct{ val, tmpl string }{ + {o.ns, "(src_ns = %[1]s OR dst_ns = %[1]s)"}, + {o.src, "src_ns = %s"}, + {o.dst, "dst_ns = %s"}, + } { + if f.val == "" { + continue + } + if err := validateNS(f.val); err != nil { + return "", err + } + conds = append(conds, fmt.Sprintf(f.tmpl, sqlStr(f.val))) + } + if o.denied { + conds = append(conds, "action = 'deny'") + } + if o.newSince != "" { + c, err := newSinceCond(o.newSince) + if err != nil { + return "", err + } + conds = append(conds, c) + } + + q := "SELECT src_ns, dst_ns, action, flow_count, first_seen, last_seen FROM edge" + if len(conds) > 0 { + q += " WHERE " + strings.Join(conds, " AND ") + } + q += fmt.Sprintf(" ORDER BY first_seen DESC LIMIT %d", limit) + + if o.asJSON { + q = "SELECT coalesce(json_agg(row_to_json(t)), '[]') FROM (" + q + ") t" + } + return q, nil +} diff --git 
a/cli/edges_test.go b/cli/edges_test.go new file mode 100644 index 00000000..c8ead29d --- /dev/null +++ b/cli/edges_test.go @@ -0,0 +1,163 @@ +package main + +import ( + "strings" + "testing" +) + +func TestParseEdgesArgs(t *testing.T) { + cases := []struct { + name string + args []string + want edgesOpts + }{ + {"defaults", nil, edgesOpts{limit: 200}}, + {"ns", []string{"--ns", "immich"}, edgesOpts{ns: "immich", limit: 200}}, + {"ns equals", []string{"--ns=immich"}, edgesOpts{ns: "immich", limit: 200}}, + {"src dst", []string{"--src", "a", "--dst", "b"}, edgesOpts{src: "a", dst: "b", limit: 200}}, + {"peers-of", []string{"--peers-of", "authentik"}, edgesOpts{peersOf: "authentik", limit: 200}}, + {"denied json", []string{"--denied", "--json"}, edgesOpts{denied: true, asJSON: true, limit: 200}}, + {"new-since", []string{"--new-since", "24h"}, edgesOpts{newSince: "24h", limit: 200}}, + {"limit", []string{"--limit", "50"}, edgesOpts{limit: 50}}, + } + for _, c := range cases { + t.Run(c.name, func(t *testing.T) { + got, err := parseEdgesArgs(c.args) + if err != nil { + t.Fatalf("parseEdgesArgs(%v) error: %v", c.args, err) + } + if got != c.want { + t.Fatalf("parseEdgesArgs(%v) = %+v, want %+v", c.args, got, c.want) + } + }) + } +} + +func TestParseEdgesArgsErrors(t *testing.T) { + for _, args := range [][]string{ + {"--limit", "abc"}, + {"--bogus"}, + } { + if _, err := parseEdgesArgs(args); err == nil { + t.Errorf("parseEdgesArgs(%v) expected error, got nil", args) + } + } +} + +func TestBuildEdgesQueryDefaults(t *testing.T) { + q, err := buildEdgesQuery(edgesOpts{limit: 200}) + if err != nil { + t.Fatal(err) + } + for _, want := range []string{"FROM edge", "ORDER BY first_seen DESC", "LIMIT 200"} { + if !strings.Contains(q, want) { + t.Errorf("query %q missing %q", q, want) + } + } + if strings.Contains(q, "WHERE") { + t.Errorf("no-filter query should have no WHERE: %q", q) + } +} + +func TestBuildEdgesQueryFilters(t *testing.T) { + cases := []struct { + name string 
+ o edgesOpts + want string + }{ + {"ns both directions", edgesOpts{ns: "immich", limit: 10}, "(src_ns = 'immich' OR dst_ns = 'immich')"}, + {"src only", edgesOpts{src: "authentik", limit: 10}, "src_ns = 'authentik'"}, + {"dst only", edgesOpts{dst: "dbaas", limit: 10}, "dst_ns = 'dbaas'"}, + {"denied", edgesOpts{denied: true, limit: 10}, "action = 'deny'"}, + } + for _, c := range cases { + t.Run(c.name, func(t *testing.T) { + q, err := buildEdgesQuery(c.o) + if err != nil { + t.Fatal(err) + } + if !strings.Contains(q, "WHERE") || !strings.Contains(q, c.want) { + t.Errorf("query %q missing WHERE/%q", q, c.want) + } + }) + } +} + +func TestBuildEdgesQueryCombinedFiltersAnded(t *testing.T) { + q, err := buildEdgesQuery(edgesOpts{src: "a", denied: true, limit: 5}) + if err != nil { + t.Fatal(err) + } + if !strings.Contains(q, "src_ns = 'a' AND action = 'deny'") { + t.Errorf("combined filters not AND'd: %q", q) + } +} + +func TestBuildEdgesQueryPeersOf(t *testing.T) { + q, err := buildEdgesQuery(edgesOpts{peersOf: "authentik", limit: 100}) + if err != nil { + t.Fatal(err) + } + for _, want := range []string{"DISTINCT", "src_ns = 'authentik'", "dst_ns = 'authentik'", "UNION"} { + if !strings.Contains(q, want) { + t.Errorf("peers-of query %q missing %q", q, want) + } + } +} + +func TestBuildEdgesQueryJSON(t *testing.T) { + q, err := buildEdgesQuery(edgesOpts{asJSON: true, limit: 200}) + if err != nil { + t.Fatal(err) + } + if !strings.Contains(q, "json_agg") || !strings.Contains(q, "row_to_json") { + t.Errorf("json query missing json_agg wrapper: %q", q) + } +} + +func TestBuildEdgesQueryRejectsInjection(t *testing.T) { + for _, bad := range []string{"a'; DROP TABLE edge;--", "a b", "a;b", "a\"b"} { + if _, err := buildEdgesQuery(edgesOpts{ns: bad, limit: 10}); err == nil { + t.Errorf("buildEdgesQuery(ns=%q) expected validation error, got nil", bad) + } + } +} + +func TestNewSinceCond(t *testing.T) { + cases := []struct { + in string + want string + }{ + {"24h", 
"first_seen >= now() - interval '24 hours'"}, + {"7d", "first_seen >= now() - interval '7 days'"}, + {"30m", "first_seen >= now() - interval '30 minutes'"}, + {"2026-06-28", "first_seen >= '2026-06-28'"}, + } + for _, c := range cases { + got, err := newSinceCond(c.in) + if err != nil { + t.Fatalf("newSinceCond(%q) error: %v", c.in, err) + } + if got != c.want { + t.Errorf("newSinceCond(%q) = %q, want %q", c.in, got, c.want) + } + } + for _, bad := range []string{"yesterday", "1y", "'; DROP", ""} { + if _, err := newSinceCond(bad); err == nil { + t.Errorf("newSinceCond(%q) expected error, got nil", bad) + } + } +} + +func TestValidateNS(t *testing.T) { + for _, ok := range []string{"immich", "calico-system", "kube-system", "Global", "pg-cluster-rw"} { + if err := validateNS(ok); err != nil { + t.Errorf("validateNS(%q) unexpected error: %v", ok, err) + } + } + for _, bad := range []string{"", "a b", "a'b", "a;b", "../x", "a$b"} { + if err := validateNS(bad); err == nil { + t.Errorf("validateNS(%q) expected error, got nil", bad) + } + } +} diff --git a/cli/homelab.go b/cli/homelab.go index 62c0c8aa..14b0afd4 100644 --- a/cli/homelab.go +++ b/cli/homelab.go @@ -20,6 +20,7 @@ func buildRegistry() []Command { reg = append(reg, deployCommands()...) reg = append(reg, netCommands()...) reg = append(reg, obsCommands()...) + reg = append(reg, edgesCommands()...) reg = append(reg, usageCommands()...) reg = append(reg, haCommands()...) reg = append(reg, browserCommands()...) diff --git a/cli/memory_test.go b/cli/memory_test.go index 7b14ef20..1c673c7b 100644 --- a/cli/memory_test.go +++ b/cli/memory_test.go @@ -5,8 +5,31 @@ import ( "os" "strings" "testing" + "unicode/utf8" ) +func TestTruncatePreviewKeepsValidUTF8(t *testing.T) { + // Byte-slicing a long Cyrillic string at 240 splits a 2-byte rune and emits + // invalid UTF-8 — the bug that crashed the recall hook. truncatePreview must + // cut on a rune boundary and always stay valid UTF-8. 
+ long := strings.Repeat("я", 300) // 300 runes / 600 bytes + got := truncatePreview(long, 240) + if !utf8.ValidString(got) { + t.Fatalf("truncatePreview produced invalid UTF-8: %q", got) + } + if r := []rune(got); len(r) != 241 || string(r[:240]) != strings.Repeat("я", 240) || r[240] != '…' { + t.Fatalf("truncatePreview = %d runes, want 240 Cyrillic + ellipsis", len(r)) + } + // Short multibyte strings pass through untouched (no ellipsis). + if got := truncatePreview("кратко", 240); got != "кратко" { + t.Fatalf("short string altered: %q", got) + } + // ASCII boundary still works. + if got := truncatePreview(strings.Repeat("a", 500), 240); got != strings.Repeat("a", 240)+"…" { + t.Fatalf("ascii truncation wrong: %q", got) + } +} + func TestResolveMemoryBase(t *testing.T) { old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL") defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }() diff --git a/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md b/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md index 9e0e2192..67022732 100644 --- a/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md +++ b/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md @@ -13,7 +13,7 @@ The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`: - Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.** -- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`. 
+- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`. `Plotting-Your-Dream-Book` (owned by Anca, dev in her org) keeps its GHA build in-place and pushes the image to **its own org's ghcr** (`ghcr.io/passionprojectsanca/book-plotter`, private) via the workflow's built-in `GITHUB_TOKEN` — no Forgejo mirror, no `viktorbarzin`-namespace push, no shared PAT in her repo (2026-06-27, migrated off DockerHub). - `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge. - Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror." diff --git a/docs/adr/0011-homelab-usage-telemetry.md b/docs/adr/0011-homelab-usage-telemetry.md index c383211b..fc0c4e76 100644 --- a/docs/adr/0011-homelab-usage-telemetry.md +++ b/docs/adr/0011-homelab-usage-telemetry.md @@ -5,6 +5,14 @@ exists to answer the question that drove the whole CLI — *which verbs are wort adding next* — with data instead of one maintainer's habits (the earlier mining covered a single user's ~51k commands, so the surface is shaped to that user). 
+> **Update (2026-06-26) — the cross-user privacy *norm* below is superseded by +> [ADR-0015](0015-os-is-the-authorization-boundary.md).** The prohibition this +> ADR leaned on ("reading another user's `~/.claude` is off-limits even for an +> owner in-session") no longer holds: the managed-settings policy now **defers +> to OS/sudo authorization**. The `usage top` telemetry design itself is +> unchanged and still current — only the "never read homes" framing in the +> third decision below is overtaken. + ## Decisions - **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows diff --git a/docs/adr/0014-service-identity-and-east-west-observability.md b/docs/adr/0014-service-identity-and-east-west-observability.md index 5eb1c83a..cdccac4f 100644 --- a/docs/adr/0014-service-identity-and-east-west-observability.md +++ b/docs/adr/0014-service-identity-and-east-west-observability.md @@ -27,3 +27,9 @@ As the Service count grows we want an audit-grade record of which Service talks - **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency. - **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**. - **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary. + +## As-built (2026-06-25) + +Implemented across infra issues #57–#63. 
**One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (all Slack consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48. + +Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`. 
diff --git a/docs/adr/0015-os-is-the-authorization-boundary.md b/docs/adr/0015-os-is-the-authorization-boundary.md new file mode 100644 index 00000000..8999682b --- /dev/null +++ b/docs/adr/0015-os-is-the-authorization-boundary.md @@ -0,0 +1,57 @@ +# OS is the authorization boundary: agents defer to Unix/sudo, not a stricter in-policy rule + +Supersedes the cross-user privacy *norm* that the devvm managed-settings policy +carried and that ADR-0011 leaned on ("never read another user's home / +`~/.claude`, off-limits even for an owner in-session"). ADR-0011's actual +subject — `usage top` telemetry and its emit design — is unchanged and still +current; only the privacy prohibition it referenced is superseded here. + +## Context + +The devvm managed-settings policy (`/etc/claude-code/managed-settings.json`, +`claudeMd`) carried two rules that were, in practice, *stricter than the OS*: +"you are not the admin, do not escalate privileges" and "never read another +user's home directory, credentials, tokens, or `~/.claude`." The OS told a +different story: `wizard` holds `(ALL) NOPASSWD: ALL` — full passwordless root. +The kernel had already granted total read access; the policy was layering an +artificial refusal on top of an authorization the OS already permits, and the +"not the admin" framing was factually wrong for a NOPASSWD-root user. + +Two honest ways to resolve the inconsistency: tighten sudo to match the policy, +or loosen the policy to match the OS. The owner chose the latter on 2026-06-26, +for analytics/debugging across the shared box. + +## Decision + +- **Authorization follows the OS, not this policy.** Agents may access whatever + their OS user can access — directly or via `sudo` where they hold sudo rights + — and must not impose restrictions stricter than the OS. On this box that + includes other users' home directories and `~/.claude` for users who hold + broad sudo. +- **No separate prompt or carve-out** for OS-authorized access. 
The Unix + permission model + sudoers is the single source of truth for who may read + what. Other homes are `0750`-owned, so a cross-home read necessarily transits + `sudo` and is therefore captured in the sudo/auth audit log. +- **Cluster/infra RBAC tiering is unchanged.** kubectl / Vault / infra access + stays scoped to each user's RBAC tier; "defer to the OS" is about OS-level + file access, not a licence to exceed cluster RBAC. +- **Scope is symmetric and multi-user.** The rule lives in the *shared* + managed-settings, so every user's agents defer to that user's own sudo grant. + Any user with broad sudo gets the same cross-home read capability over other + users' files. Accepted by the owner with that understanding; emo's and + ancamilea's `~/.claude` is now agent-readable by sudo-holders. +- **Takes effect in a fresh session.** managed-settings loads at session start; + the session that made the change keeps running under the old policy. + +## Consequences + +- The privacy-preserving telemetry rationale in ADR-0011 (`usage top` as the + "cross-user analytics without reading homes" answer) remains useful but is no + longer the *only* sanctioned path; direct reads via `sudo` are now permitted. +- Larger blast radius: if an agent session running as a sudo-holder is + prompt-injected or otherwise compromised, it can now read every user's secrets + with no in-agent friction (sudo here is passwordless). The sudo/auth audit log + is the remaining accountability control. +- Reversible: restore the prior `claudeMd` bullets (backup kept at + `/etc/claude-code/managed-settings.json.bak-2026-06-26`) and start a fresh + session. 
diff --git a/docs/architecture/authentication.md b/docs/architecture/authentication.md index 9decc8dc..620bcf6b 100644 --- a/docs/architecture/authentication.md +++ b/docs/architecture/authentication.md @@ -86,10 +86,56 @@ Signin latency is dominated by screen count and round trips, not server time use the explicit-consent flow (it re-prompted every 4 weeks per app). - **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache, - 15m policy cache, 60s persistent DB connections. + 15m policy cache, gunicorn `max_requests=10000`/jitter=1000 (recycle + hardening — decorrelates the 9 workers' recycles from PG blips). **No + `CONN_MAX_AGE`** — persistent Django connections pin a PgBouncer server conn + 1:1 and saturate the session-mode pool (reverted 2026-06-10). - **Static assets cached immutable**: `/static` ingress carve-out adds `Cache-Control: public, max-age=31536000, immutable` (assets are version-fingerprinted; authentik itself sends no max-age). +- **Rate-limit carve-out** (2026-06-28): `/` and `/static` use a dedicated + `authentik-rate-limit` (100/1000) instead of the shared 10/50 default — the + login SPA cold-loads ~70 flow-executor chunks from `/static`; the default + burst 429'd the tail and a failed ES-module import left a blank login screen. +- **Readiness tolerance** (2026-06-28): server `readinessProbe.failureThreshold:8` + (~80s, was the chart-default ~30s). The probe (`/-/health/ready/`) queries the + DB; too-tight tolerance let a sub-60s PG/pgbouncer transient return 503 on all + 3 server pods at once → Traefik had no healthy backend → 502/503/504 (episodic + blank login + 30s hangs). 80s absorbs a full CNPG failover reconnect. Sessions + + cache are PostgreSQL-only since Redis was removed in 2026.2 (no external-cache + option), so request-serving is coupled to PG — this survives a short transient, + not a total CNPG outage. 
+- **Rolling-update strategy** (2026-06-28): the chart key is `deploymentStrategy` + (the repo's old `strategy:` key was silently inert → live ran the chart-default + 25%/25% and dropped a server pod out of rotation on every roll). Now + `maxSurge:1/maxUnavailable:0` keeps all 3 ready throughout a roll. +- **Old-browser login (SFE)** (2026-06-28): authentik's modern flow SPA is ES2022 + and renders a **blank login** on Safari/WebKit ≤16.3 (every iOS browser shares + the system WebKit, so it's not browser-choice — e.g. iPadOS ≤15). The overlay + image patches `flows/views/interface.py::compat_needs_sfe()` to also serve + authentik's built-in no-JS **Simplified Flow Executor** (SFE, ES5) to old Safari + **and any iOS browser** (Chrome/Firefox on iOS are WebKit skins) on iOS ≤16.3, + so those clients get the *real* authentik login (password + MFA + reputation — + no auth downgrade). The SFE can't render Identification-stage **sources** + (authentik limitation), so the patch also injects static social-login `` + links into `flow-sfe.html` (→ `/source/oauth/login//`, plain redirects) — + required for password-less accounts (e.g. Google-only users). A Traefik + basic-auth fallback was rejected: it would have put a single spoofable-UA + password in front of `vbarzin→wizard` (passwordless root on the devvm). See + `stacks/authentik/patch-compat-sfe.py`. +- **SFE + forced-WebAuthn MFA gotcha** (2026-06-28): the `default-authentication-flow` + MFA stage (`not_configured_action=configure`, `conf_stages=[webauthn]`) force-enrols + a WebAuthn passkey for any **password**-path user with no MFA device — but the SFE + **cannot render WebAuthn** (enrol *or* validate), so that user gets + `unsupported state: ak-stage-authenticator-webauthn`. 
Two escape hatches, **no MFA + downgrade**: (1) **social login** — sources run `default-source-authentication` + (UserLoginStage only, **no MFA stage**), so the SFE's "Continue with " + button always completes; (2) **enrol TOTP** — the SFE *can* validate TOTP codes, and + ≥1 confirmed device flips the stage from force-enrol to validate. User MFA devices are + runtime data (not Terraform): enrol via `ak shell` + (`TOTPDevice.objects.create(user=…, confirmed=True)`) and store the secret in the + user's own Vaultwarden item. (Done for emo — the Google-only iPadOS-15 case: TOTP in + his `authentik.viktorbarzin.me` Bitwarden item; e2e-verified the BW code is accepted.) - **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`). - **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request TCP setup on the forward-auth subrequest path. diff --git a/docs/architecture/chrome-service.md b/docs/architecture/chrome-service.md index 6f9c1ee4..118c0895 100644 --- a/docs/architecture/chrome-service.md +++ b/docs/architecture/chrome-service.md @@ -205,6 +205,43 @@ healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts* wrapper in `main.tf` (so it applies deterministically even though the image is `:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix as the android-emulator stack. + +### noVNC black after a browser-container restart (x11vnc supervision) + +A **distinct** failure from the fd-sweep gotcha above: the noVNC client *connects* +but the view is **black**, and the novnc container logs spew +`connecting to: localhost:5900` → `Failed to connect ... [Errno 111] Connection +refused` (x11vnc is **down**, not slow). Cause: `x11vnc` and `websockify` both run +in the **novnc** container, but x11vnc attaches to the **chrome-service** (browser) +container's Xvfb over `localhost:6099` (shared pod network). 
When the browser +container restarts — Chrome exits cleanly (exit 0, "Completed") or crashes — its +Xvfb vanishes and x11vnc loses its X connection and exits. + +`entrypoint.sh` **supervises** x11vnc: it launches x11vnc and websockify as +background children and `wait -n`s on them, exiting non-zero if **either** dies, so +the kubelet restarts the novnc container, which re-waits for Xvfb on `:6099` and +relaunches x11vnc — the bridge **self-heals** across browser-container restarts. +(Before 2026-06-27, x11vnc was an unsupervised background child of an `exec`ed +websockify; a dead x11vnc was never relaunched, leaving `:5900` dead — a +`` zombie — and the view black until a manual pod restart. Same +supervision pattern as the android-emulator stack's entrypoint.) + +**Diagnose:** `kubectl exec -c novnc -- ps aux | grep x11vnc` (a ``/Z +entry = the bug); or the RFB-banner probe from a sibling container (`python3 -c +"import socket;s=socket.socket();s.settimeout(2);s.connect(('127.0.0.1',5900));print(s.recv(12))"` +— healthy returns `b'RFB 003.008\n'`, broken = `ConnectionRefused`). **Immediate +recovery** (no image change): restart just the novnc container with `kubectl exec +-n chrome-service deploy/chrome-service -c novnc -- kill 1` — re-runs its entrypoint +and relaunches x11vnc **without** touching the browser session/in-flight CDP jobs. + +> **Deploying a rebuilt novnc entrypoint:** Keel is **off** for this deployment +> (`keel.sh/policy=never`, because the browser container's playwright image is +> version-pinned to f1-stream) and the image is `:latest`/`IfNotPresent`, so a +> rebuilt `:latest` will **not** redeploy on its own. After the +> `build-chrome-service-novnc.yml` GHA build pushes `:latest` + `:`, +> **SHA-pin** the novnc `image` in `main.tf` to the new `:` to force the pull +> and rollout (the novnc image is TF-managed — not in the deployment's +> `lifecycle.ignore_changes`). 
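
The RFB-banner probe from the **Diagnose** step above can also be written as a small function, which is easier to reuse than the one-liner. This is a sketch under the same assumptions as the original probe (x11vnc listening on `127.0.0.1:5900`, a 2-second timeout); the function name `probe_rfb` and its return strings are hypothetical.

```python
import socket


def probe_rfb(host: str = "127.0.0.1", port: int = 5900, timeout: float = 2.0) -> str:
    """Return the RFB banner text, or a short reason the probe failed."""
    try:
        # create_connection applies the timeout to connect and to later reads.
        with socket.create_connection((host, port), timeout=timeout) as s:
            banner = s.recv(12)
    except ConnectionRefusedError:
        # Nothing is listening on the port: x11vnc is down.
        return "refused"
    except socket.timeout:
        # Connected (or tried to) but got no banner in time.
        return "timeout"
    if banner.startswith(b"RFB "):
        return banner.decode("ascii").strip()
    return "unexpected: %r" % banner
```

A healthy x11vnc should yield something like `RFB 003.008`; `refused` matches the black-screen failure described here, while `timeout` more likely points at a server that is up but not answering.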
- **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`) serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`, bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088 @@ -256,6 +293,42 @@ Key facts: byte-identical copy of `files/stealth.js`, guarded by a drift test — so the CLI's stealth never diverges from the in-cluster callers'. +## Multi-user access (sharing the browser) + +There is ONE chrome-service browser with ONE persistent profile, warmed with +**Viktor's** logged-in sessions. CDP has no per-context auth, so anyone who can +drive the browser — over the noVNC view OR the CDP/`homelab browser` path — can +reach the persistent profile (`browser.contexts[0]`) and therefore Viktor's +sessions. Access is gated accordingly, per user. + +**Decision (2026-06-28):** emo (`emil.barzin` / `emil.barzin@gmail.com`) SHARES +Viktor's browser for form-filling + captcha solving, rather than getting an +isolated instance. The session-exposure trade-off above was explicitly accepted. + +Two independent grants make up "browser access" for a user: + +1. **noVNC (interactive view, `chrome.viktorbarzin.me`)** — gated by the Authentik + `admin-services-restriction` policy: the `CHROME_ALLOWED` set + (`stacks/authentik/admin-services-restriction.tf`) matches the user's Authentik + username OR email. Add the user there. No kubeconfig/RBAC needed. +2. **CLI (`homelab browser`, CDP over port-forward)** — needs `pods/portforward` + in `chrome-service` PLUS a non-interactive credential (a normal devvm user's + kubeconfig is interactive-OIDC-only and can't authenticate a headless agent + session). Provided by a per-user **ServiceAccount** with a long-lived token + (`stacks/chrome-service/rbac.tf`, e.g. `emo-browser`): `pods/portforward` in + this namespace + cluster read-only (`oidc-power-user-readonly`, so it can also + resolve the Service and doesn't regress the user's normal read). 
The devvm + provisioner (`scripts/t3-provision-users.sh` → `install_browser_kubeconfig`) + reads that token and installs it as the user's DEFAULT kubeconfig context + (`-browser@homelab`), keeping their personal OIDC login as the + `oidc@homelab` named context. The SA's existence is the source of truth for who + gets the CLI — the provisioner no-ops for users without a `-browser` SA. + +**To grant another user:** add them to `CHROME_ALLOWED` (noVNC) and/or add a +`-browser` SA + bindings mirroring `emo-browser` in `rbac.tf` (CLI), then run +the provisioner. To revoke: remove from `CHROME_ALLOWED` and delete the SA (rotate +a token by deleting its `-browser-token` Secret). + ## Limits + risks - **Anti-bot vs stealth arms race** — when an upstream beats us (DRM diff --git a/docs/architecture/ci-cd.md b/docs/architecture/ci-cd.md index 35e041e6..5a9c3722 100644 --- a/docs/architecture/ci-cd.md +++ b/docs/architecture/ci-cd.md @@ -115,9 +115,67 @@ claude-agent-service, claude-memory-mcp, kms-website, Freedify, instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`), fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website, -k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify, +k8s-portal, apple-health-data, audiblez-web, insta2spotify, audiobook-search) now also land on ghcr. +**plotting-book** is a special case (a GitHub-first repo owned by Anca, +ADR-0003): the build runs in *her* GitHub repo +(`PassionProjectsAnca/Plotting-Your-Dream-Book`) and pushes to **private +`ghcr.io/passionprojectsanca/book-plotter`** — under her org's ghcr namespace, +not `viktorbarzin`, using the workflow's built-in `GITHUB_TOKEN` (no shared +PAT). The cluster pulls it via the Kyverno-synced `ghcr-credentials` secret (the +`plotting-book` namespace is on the allowlist; the shared `ghcr_pull_token` has +read access). 
Migrated off public DockerHub (`viktorbarzin/book-plotter`) on +2026-06-27. The Woodpecker deploy hook (repo 43, registered to Anca's repo) is +unchanged. Flow: + +```text + DEVELOP ─────────────────────────────────────────────────────────────────────── + Anca (Codex / t3 web agent) + │ git push → main + ▼ + ┌──────────────────────────────────────────────────────────────┐ + │ GitHub: PassionProjectsAnca/Plotting-Your-Dream-Book (private)│ ← canonical + │ .github/workflows/build-and-deploy.yml on: push → main │ + └───────────────────────────┬──────────────────────────────────┘ + │ GitHub Actions runner (off-infra build · ADR-0002) + ┌────────────────────┴─────────────────────────────────┐ + ▼ ▼ + ┌─────────────────────────────────────────────┐ ╔═══════════════════════════════════════╗ + │ build job │ push ║ GHCR · PRIVATE package ║ + │ • svu next --always → tag vX.Y.Z (→ repo) │═════▶║ ghcr.io/passionprojectsanca/ ║ + │ • buildx linux/amd64, provenance:false │ tags ║ book-plotter :vX.Y.Z :latest ║ + │ • login ghcr (GITHUB_TOKEN, packages:write)│ ╚═══════════════════╤═══════════════════╝ + │ • delete-package-versions (keep newest 10) │ │ + └───────────────────────┬─────────────────────┘ │ pull (private, + ▼ deploy job [gate: repo var DEPLOY_ENABLED ≠ "false"] via secret) + POST ci.viktorbarzin.me/api/repos/43/pipelines {IMAGE_TAG, IMAGE_NAME} │ + ▼ │ + ┌─────────────────────────────────────────────────────────────┐ │ + │ Woodpecker repo 43 · .woodpecker/deploy.yml (event: manual) │ │ + │ kubectl set image deployment/plotting-book = :vX.Y.Z │ │ + │ kubectl rollout status │ │ + └───────────────────────────┬─────────────────────────────────┘ │ + ▼ │ + ═══════════════ Kubernetes · ns: plotting-book ════════════════════════════ │ + ┌─────────────────────────────────────────────────────────────┐ │ + │ Deployment plotting-book (Recreate · image = ignore_changes)│ │ + │ imagePullSecrets: ghcr-credentials ────────pull───────────┼─────────────────┘ + │ Pod → Express :3001 + 
SQLite on PVC (proxmox-lvm) │ + └─────────────────────────────────────────────────────────────┘ + guards / supporting: + • Kyverno require-trusted-registries [Enforce] → ghcr.io/* ALLOWED (admission) + • Keel policy=patch @1h → watches GHCR via ghcr-credentials (backstop) + • ghcr-credentials ⇐ Kyverno generate-clone ⇐ Vault secret/viktor/ghcr_pull_token + + ═══════════════ Serving path (unchanged) ══════════════════════════════════ + Browser ─▶ plotting-book.viktorbarzin.me (non-proxied DNS → Traefik .203) + ─▶ Authentik forward-auth (gate) ─▶ Service :80 ─▶ Pod :3001 +``` + +Governance: the Deployment + Kyverno allowlist are Terraform (`stacks/plotting-book`, +`stacks/kyverno`); the live image *tag* is CI-owned (`ignore_changes`). + ### Infra-owned images (issues #29 / #30) Images owned by the infra repo build on GHA workflows **in the infra repo's own @@ -163,9 +221,9 @@ Woodpecker is **deploy + cluster-touching steps only**: | Pipeline | File | Purpose | |----------|------|---------| | per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) | -| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) | +| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`). **Skips Tier-0 `vault`** — it's human-applied via OIDC; the CI `ci` role lacks Vault-admin perms (`sys/mounts`, `sys/policies/acl`) so a CI apply 403s | | certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron | -| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) | +| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`). 
**Skips Tier-0 `vault`** (its `plan` 403s under the `ci` role and would fail the whole run) | | provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec | | registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change | | pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE | @@ -176,6 +234,38 @@ Woodpecker is **deploy + cluster-touching steps only**: **No build/test pipeline exists on any repo.** Do not (re)introduce one. +### `default.yml` apply: dual-registration de-dup + reliability (2026-06-28) + +infra is registered in Woodpecker on **both** the canonical Forgejo repo (id 82) +and the legacy GitHub mirror (id 1), and **both fire `default.yml` on every +push**. Left unguarded, two `terragrunt apply` runs race each other for the +per-stack PG state lock — historically the #1 source of `Error acquiring the +state lock` failures and push-supersede "killed" runs. + +- **Forge guard** (first command in the `apply` step): the push-apply runs **only + on the canonical Forgejo forge**; on the GitHub mirror it logs `[forge-guard]` + and `exit 0`s. Detection: `CI_REPO_URL`/`CI_FORGE_URL` contains `github.com` → + skip. Fail-open (unknown forge still applies). The mirror keeps running the + **crons** (drift-detection, renew-tls, …), which live on repo 1 — only its + duplicate push-apply no-ops. (Crons were NOT moved; deactivating repo 1 would + have killed them.) +- **Lock-skip matches both tiers**: a stack whose apply hits a lock is SKIPPED, + not failed. The grep now matches the Tier-0 Vault message (`is locked by`) **and** + the Tier-1 PG-backend message (`Error acquiring the state lock` / `already + locked`) — the PG case was previously miscounted as a hard failure. 
+- **Transient retry** (bounded, 3 attempts): only provider-registry download + timeouts (`Failed to install provider` / `Client.Timeout`) and Vault 5xx are + retried. Config errors (missing arg, invalid index) and helm `atomic` timeouts + are NOT retried — they fail fast. + +A pre-apply off-infra validate gate was evaluated and rejected: `terraform +validate` runs without state but catches ~0 of the observed failures (they are +provider-config-from-Vault-data, server-side-apply conflicts, helm installs, and +lock contention — all invisible to static validate), and `plan` cannot run +off-infra (no Vault/PG access). `terragrunt apply` already fails at its plan +phase without mutating on config errors, so a separate in-pipeline plan-gate was +also dropped as redundant. + ### Woodpecker API Uses **numeric repo IDs** (`/api/repos//pipelines`), NOT owner/name paths diff --git a/docs/architecture/monitoring.md b/docs/architecture/monitoring.md index 3c75a345..06ee943f 100644 --- a/docs/architecture/monitoring.md +++ b/docs/architecture/monitoring.md @@ -286,7 +286,7 @@ Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (por #### Security Alerts (Wave 1 — planned, beads `code-8ywc`) -Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only). +Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/]` title styling so security-lane alerts stand out there). Same handling path as infra alerts; severity labels carried in the alert (critical/warning/info). 
The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only). | # | Source | Event | Severity | |---|---|---|---| @@ -318,9 +318,20 @@ IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-poli Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out. - **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m). -- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). 
`probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages). +- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → posts to **`#alerts`** via the `slack-security` receiver, which keeps its `[SECURITY]` styling (Slack-only, no paging; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages). - **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' ''`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.) +#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014) + +Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both alerts route via the default `slack-warning` receiver → **`#alerts`**.
+ +| Alert | Expr (abridged) | For | Severity | +|---|---|---|---| +| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning | +| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning | + +The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`). + #### Backup Alerts - **PostgreSQLBackupStale**: >36h since last backup - **MySQLBackupStale**: >36h since last backup diff --git a/docs/architecture/multi-tenancy.md b/docs/architecture/multi-tenancy.md index c64a146c..2cabf9e7 100644 --- a/docs/architecture/multi-tenancy.md +++ b/docs/architecture/multi-tenancy.md @@ -541,7 +541,7 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1 **RBAC tiers:** `admin` (Viktor — cluster-admin, unlocked tree, secrets) · `power-user` (cluster-wide read-only, NO Secrets, via a dedicated `oidc-power-user-readonly` ClusterRole) · `namespace-owner` (admin in own namespace only). Each session acts as the user's **own** OIDC identity (kubelogin), never the admin's. -**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). 
Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour. +**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **(2026-06-26: the managed `claudeMd` now defers OS-level file access to the OS/sudo — a user holding broad `sudo` may read other users' files incl. `~/.claude`; the mode-600 / no-symlink posture is unchanged but is no longer reinforced by an agent "never read other homes" rule. 
See [ADR-0015](../adr/0015-os-is-the-authorization-boundary.md).)** **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour. **Memory — homelab CLI hooks (rolled out 2026-06-21, deploy-fixed 2026-06-22):** the per-user `claude_memory` MCP was retired for the **homelab-memory hooks** — the reconcile's `install_memory` (re)installs four scripts into `~/.claude/hooks/` each run (`homelab-memory-recall.py` UserPromptSubmit recall, `auto-learn.py` Stop-hook extraction, `pre-compact-backup.sh`/`post-compact-recovery.sh`), wires them into `settings.json` if-absent + additive, and removes the old `claude_memory` MCP. **The provisioner binary itself now self-deploys from the repo** (step 0: `bash -n`-gated `install` + re-exec when `scripts/t3-provision-users.sh` differs from `/usr/local/bin/t3-provision-users`, guarded against re-exec loops / DRY_RUN mutation) — added after this very rollout sat committed-but-undeployed for a day (only the manual `setup-devvm.sh` had ever deployed the binary), so the hourly reconcile kept running the pre-memory version and emo/anca silently lost memory (recall + auto-learn never wired). A latent `set -e` abort in `install_memory` (a bare `[[ -d plugin-dir ]] && …` returning non-zero) was also fixed; it had killed the reconcile after the first user the first time it actually ran. 
The hooks need a `MEMORY_API_KEY` (or `CLAUDE_MEMORY_API_KEY`) in the user's `settings.json` env — the `homelab` CLI defaults the API URL, so **the key is the only hard requirement**; `install_memory` reuses an existing key and only WARNs if absent (it does NOT mint one — that's an admin Vault step, see Remaining). wizard + emo carry a key from their original MCP setup; **ancamilea is keyless → her memory no-ops until a key is minted.** (`auto-learn.py`'s passive store calls the API directly, so it additionally needs `*_API_URL` in env to avoid its local-SQLite fallback; recall + manual `homelab memory store` go through the URL-defaulting CLI and need only the key.) diff --git a/docs/architecture/networking.md b/docs/architecture/networking.md index 4659038a..070cc59e 100644 --- a/docs/architecture/networking.md +++ b/docs/architecture/networking.md @@ -261,7 +261,7 @@ Traefik chain: 1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`). 2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend. -3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads) and ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load). +3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. 
Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), and authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients). 4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors). Additional middleware: @@ -550,7 +550,7 @@ chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Che **Diagnosis**: Check Traefik middleware config for the affected IngressRoute. -**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik--rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300. +**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik--rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, authentik 100/1000 (login SPA `/static` chunk burst → blank screen). 
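A rough way to see why ~70 parallel requests trip the shared default but not a dedicated middleware is to compare the request count to the burst size. The sketch below uses a deliberately simplified token-bucket model (an assumption, not Traefik's exact algorithm): all requests arrive at the same instant and refill during the burst is ignored. `rejected_in_burst` is a hypothetical helper name used only for illustration.

```sh
# Simplified model (assumption): a client that fires N requests at once can
# spend at most `burst` tokens; anything beyond that is answered with 429.
rejected_in_burst() {
  parallel=$1
  burst=$2
  if [ "$parallel" -gt "$burst" ]; then
    echo $((parallel - burst))
  else
    echo 0
  fi
}

rejected_in_burst 70 50     # shared default (burst 50): prints 20
rejected_in_burst 70 1000   # authentik-rate-limit (burst 1000): prints 0
```

Under this model the tail of a ~70-request cold load is rejected by the default, which is consistent with the blank-login-screen symptom described above. Real numbers will differ slightly because the bucket refills at the `average` rate while the burst is in flight.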
### Large Downloads or Uploads Truncate / Fail Partway diff --git a/docs/architecture/security.md b/docs/architecture/security.md index 7d3043ea..1cec0de6 100644 --- a/docs/architecture/security.md +++ b/docs/architecture/security.md @@ -132,6 +132,13 @@ for the supersession history — there is no longer an inline Traefik bouncer.) account hard-limits to **one** list), and CAPI is already covered in-kernel on direct hosts and by Cloudflare's own managed protections on proxied hosts. Registered bouncer key: **`kvsync`**. +- **Rate-limit resilient (2026-06-27):** Cloudflare's Lists-API *write* endpoint + is throttled (~per-60s; `429 retry-after`). The CronJob runs `backoff_limit=0` + (one POST per cycle — the `*/2` schedule IS the retry cadence) and treats a CF + `429` as a soft-skip (exit 0, retry next cycle), the same fail-safe pattern it + uses for LAPI. An earlier `backoff_limit=2` fired 3 rapid POSTs/cycle and + escalated the throttle into a stuck state that left the list empty — a + self-inflicted DoS that this change prevents. - **Block-only**: the single-list limit precludes a separate captcha/managed-challenge list, so both ban and captcha decisions are enforced as a plain block at the edge. @@ -272,7 +279,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.** The block below documents the locked design. -Response model: **(I) Slack-only, daily skim.** All security alerts land in a new `#security` Slack channel via Alertmanager. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection. +Response model: **(I) Slack-only, daily skim.** All security alerts post to **`#alerts`** via Alertmanager (the `slack-security` receiver keeps its distinct `[SECURITY/]` title styling so security-lane alerts still stand out). 
The dedicated `#security` channel was abandoned (2026-06-25) — the shared `alertmanager_slack_api_url` incoming webhook's Slack app isn't a member of it, so a channel override there returns HTTP `404 channel_not_found`; everything consolidated to `#alerts`. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection. #### Detection sources @@ -285,7 +292,7 @@ Response model: **(I) Slack-only, daily skim.** All security alerts land in a ne #### Alert rules (16 total) -Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) inside the single `#security` channel. +Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/]` title styling so security-lane alerts stand out there; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it). Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) carried in the alert. **K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):** @@ -364,6 +371,69 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.** - Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs. - Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972). 
+#### Deriving the per-namespace egress allowlist from the edge trail (Wave 1 W1.7) + +The durable **east-west flow trail** (below) is now the preferred data source for +the *internal* (namespace-to-namespace) half of each Wave-1 egress allowlist — +faster and identity-stamped vs the original iptables-`LOG`→journald→Loki path +(ADR-0014: "Enforcement gains a better data source"). The unique observed +namespace pairs live in CNPG DB `goldmane_edges`, table `edge`. To derive the +namespaces a source is observed talking to (the `allow` set that seeds its +NetworkPolicy): + +```sql +SELECT DISTINCT dst_ns FROM edge WHERE src_ns='' AND action='allow' ORDER BY dst_ns; +``` + +The full SQL recipe (whole-cluster matrix, deny sanity-checks, the ≥7-day +observation caveat) is in +[runbooks/goldmane-flow-trail.md → Deriving the Wave-1 egress allowlist](../runbooks/goldmane-flow-trail.md#deriving-the-wave-1-egress-allowlist-from-the-edge-table-infra-62). +**External / public-internet egress is NOT in this table** (empty-namespace flows +are dropped) — for those destinations keep using the Calico flow-log observation +(the W1.6 snapshot, `wave1-egress-observation-2026-05-22.md`). This feeds the +existing observe-then-enforce effort (beads `code-8ywc`); **enforce-flips remain +out of scope** of the trail — it is observe-and-derive only. + +### East-west flow observability (Goldmane / Whisker + edge trail) (ADR-0014) + +The "who-talks-to-whom" data plane that succeeds raw iptables-`LOG` lines (which +carried no identity). **Service identity = the workload's namespace** (primary), +refined by a `service-identity` label in the few multi-Service namespaces +(`monitoring`, `kube-system`, `dbaas`). End-to-end trail, three layers: + +1. 
**Calico Goldmane + Whisker** (`calico-system`) — Goldmane aggregates + identity-stamped flows (ns/pod/workload/labels + allow-deny + policy-trace) + streamed from Felix over gRPC into a **~60-min in-memory ring buffer** (no + etcd/API writes — the etcd-cost constraint that drove the design). **Whisker** + is its live web UI at `whisker.viktorbarzin.me` (Authentik-gated, + `auth = "required"` — Whisker has no own login; an additive NetworkPolicy ORs + Traefik past the operator's default-deny `whisker` NP). The ring buffer is + **not** a trail (lost on Goldmane restart). Enabled via operator CRs in + `stacks/calico/main.tf`; reversible toggle (Goldmane is OSS tech-preview). +2. **`goldmane-edge-aggregator`** (`stacks/goldmane-edge-aggregator`) — streams + Goldmane's gRPC `Flows.Stream` over **mTLS** and upserts the low-cardinality + namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen, + flow_count)`) into CNPG DB `goldmane_edges`. Self-edges and empty-namespace + (public-internet) flows are dropped — in-cluster relationships only. The mTLS + client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** + (Goldmane verifies CA-chain only, not identity) rather than copying the CA + private key into TF state — **re-apply the stack if the operator rotates that + Secret**. +3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to + **`#alerts`** (reuses the alert-digest webhook). All Slack now consolidates to + `#alerts`; the `#security` channel was abandoned 2026-06-25 because that + webhook's Slack app isn't a member of it (a `#security` override 404s). See + runbook. + +The trail is **attribution-grade, not cryptographic** (reconstructs events in a +trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model +limit; east-west stays plaintext, no mTLS between app pods). 
Health is covered by +the **`AggregatorDown`** + **`DigestFailing`** alerts and cluster-health check #48 +(see monitoring.md). Full as-built, query recipes, and troubleshooting: +[runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md). Decision: +[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md); glossary +`CONTEXT.md` → **Service identity**, **Goldmane / Whisker**. + ### TLS & HTTP/3 **Traefik** handles TLS termination: diff --git a/docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md b/docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md new file mode 100644 index 00000000..eaa24286 --- /dev/null +++ b/docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md @@ -0,0 +1,117 @@ +# k8s-upgrade compat-gate: classify "actionable" vs "held" blocks + +**Date:** 2026-06-28 +**Status:** design → implementation +**Stack:** `stacks/k8s-version-upgrade` (+ `stacks/monitoring` alert rules) + +## Problem + +The cluster is on k8s 1.35.6. The nightly `k8s-version-check` chain detects the +next minor (1.36.2), runs the preflight compat-gate, and the gate **refuses** +it — because no released kyverno/ESO supports k8s 1.36 yet, and gpu-operator is +deliberately pinned (its 26.3 bump needs a newer NVIDIA driver image + Ubuntu +release we're not ready for). The result, **every single night**: + +- a **Failed** preflight Job (`block()` exits 1), and +- `k8s_upgrade_blocked=1` → the **K8sUpgradeBlocked** alert. + +But this block is **not actionable** — there's nothing we can upgrade to clear +it; we can only wait for upstream (kyverno/ESO) and, separately, do the +gpu-operator/Ubuntu work. The gate is crying wolf: a "blocked, needs attention" +signal that's indistinguishable from a block we could actually fix. 
+ +## Goal + +Make the gate **classify** each blocker and behave accordingly: + +| Class | Definition | Behaviour | +|-------|-----------|-----------| +| **actionable** | the compat matrix has a newer version of the addon whose `max_k8s >= target`, and the running version is older — upgrading it would clear the block | **alert** (`k8s_upgrade_blocked=1` → K8sUpgradeBlocked), with the specific "upgrade X → Y" remediation in the nightly report | +| **waiting-upstream** | **no** matrix version of the addon supports the target yet (kyverno/ESO for 1.36) | **quiet** (`k8s_upgrade_held=1`, no alert) — nightly report only | +| **pinned** | a supporting version exists but the addon carries `"pinned": true` in the matrix (gpu-operator) | **quiet** (held) | + +Removed-API and containerd blocks are always **actionable**. **Held wins:** if +*any* blocker is waiting-or-pinned, the whole target is **HELD** (quiet) — +acting on the actionable blockers wouldn't unblock it yet. The nightly report +still lists everything so the full eventual scope is visible. + +Also (scope decision: "tidy the block path"): deliberate gate decisions +(actionable-block **and** held) now make the preflight Job **Complete cleanly** +(exit 0) instead of Failing. Chain progression is gated on the verdict, not the +exit code. Real failures (unhealthy nodes, kubeadm errors, crashes) still exit +1 → `K8sUpgradeChainJobFailed`. + +## Design + +### `compat-gate.py` +- New exit codes: `0` safe · `2` actionable-block · `3` gate-error (fail-safe) · **`4` held**. +- Each stdout reason line is tagged `[ACTIONABLE]` / `[WAITING]` / `[PINNED]`. +- `check_addons`: when an addon blocks, decide its class: + - `pinned: true` in its matrix entry → `[PINNED]`. + - else a higher matrix version with `max_k8s >= target` exists → `[ACTIONABLE]` (`upgrade X to >= V`). + - else → `[WAITING]` (`no released X version supports k8s T yet`). + - unreadable image / below-matrix → `[ACTIONABLE]` (fail-safe — a human must look). 
+- `check_removed_apis`, `check_containerd`: tag `[ACTIONABLE]`. +- `exit_code(reasons)`: `0` if none; `4` if any `held_reason` (WAITING/PINNED); else `2`. + +### `upgrade-step.sh` +- New global `HALT_CHAIN=0`; `spawn_next()` returns early (no next Job) when set. +- Replace `block()` with `record_blocked()` / `record_held()` — push the gauge, + set `HALT_CHAIN=1`, **do not exit**. +- `phase_preflight` gate handling routes on the gate's exit code: + - `0` → push `blocked=0`+`held=0`, proceed. + - `2`/`3` → `record_blocked`, `return 0` (Job Completes, K8sUpgradeBlocked fires). + - `4` → `record_held`, `return 0` (Job Completes, **no alert**). +- Push the gauge **definitively once** per run (remove the pre-reset `blocked=0` + at gate start) so a standing block doesn't flap 1→0→1 and re-notify. +- postflight also clears `held=0` alongside the existing gauge resets. + +### detector (`main.tf`, the `k8s-version-check` CronJob) +- Consequence of the tidy change: refusals now **Complete** instead of Failing, + so the old "re-spawn only a *Failed* preflight" idempotency would skip a + refused-but-Complete preflight until its 7d TTL. Fix: re-spawn nightly when the + preflight is **Complete but no `k8s-upgrade-master-` Job exists** (the + gate refused — chain never advanced) — **silently** (no Slack), so a standing + hold re-evaluates each night without noise. +- The per-night `slack "K8s upgrade available…"` becomes an `echo`; the spawn + Slack fires only for a genuinely new spawn or a Failed-respawn (`ANNOUNCE` + flag), not for silent re-evaluations — killing the last nightly-noise source. + +### `addon-compat.json` +- Add `"pinned": true` + `"pin_reason"` to the gpu-operator entry (its + `26.3 → 1.36` row stays; `pinned` overrides classification to held). Document + the `pinned` flag in `_comment`. Unpinning later = delete two keys. 
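The verdict flow described above (tagged reasons, then an exit code, then routing in `phase_preflight`) can be sketched roughly as follows. This is a minimal illustration, not the actual code: `push_gauge` is a hypothetical stand-in for the real gauge push, `gate_exit_code` mirrors the Python `exit_code(reasons)` logic in shell purely for illustration, and treating unknown exit codes as a real failure is an assumption the plan does not spell out.

```sh
HALT_CHAIN=0

# Hypothetical stand-in for the real metrics push.
push_gauge() { echo "gauge $1=$2"; }

# Record the verdict and halt the chain without exiting the Job.
record_blocked() { push_gauge k8s_upgrade_blocked 1; HALT_CHAIN=1; }
record_held()    { push_gauge k8s_upgrade_held 1;    HALT_CHAIN=1; }

# "Held wins": any WAITING/PINNED reason makes the whole target held (4);
# otherwise any remaining reason is an actionable block (2); none means safe (0).
gate_exit_code() {
  case "$1" in
    "") echo 0 ;;
    *"[WAITING]"*|*"[PINNED]"*) echo 4 ;;
    *) echo 2 ;;
  esac
}

# Route on the gate's exit code; deliberate refusals return 0 so the Job Completes.
route_gate_verdict() {
  case "$1" in
    0)   push_gauge k8s_upgrade_blocked 0; push_gauge k8s_upgrade_held 0 ;;
    2|3) record_blocked ;;
    4)   record_held ;;
    *)   return 1 ;;  # assumption: anything else is a genuine failure
  esac
}
```

With this shape, a caller would check `HALT_CHAIN` before spawning the next Job, which is what lets chain progression depend on the verdict rather than on the exit status of the preflight Job.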
+ +### `stacks/monitoring` alert rules (`prometheus_chart_values.tpl`) +- `K8sUpgradeBlocked` (`k8s_upgrade_blocked == 1`): unchanged trigger, now + actionable-only; reword annotation (reasons are in the nightly report, not a + per-run chain Slack). +- `K8sUpgradeChainJobFailed`: **drop** the `unless on() (k8s_upgrade_blocked == 1)` + clause — deliberate blocks no longer create Failed Jobs, so the alert again + means a genuine wedge. +- **No alert** for `k8s_upgrade_held` (intentional — nothing to action; the + nightly report surfaces it). Add a comment recording this. + +### `nightly-report.py` +- Read `k8s_upgrade_held`. New `⏸️ HELD — not yet upgradable` headline. +- Group reasons by tag: *Action needed* / *Waiting on upstream* / *Pinned (held by us)* + (fallback bullets for untagged lines, so older reason strings still render). +- Fetch reasons when avail AND (blocked OR held). + +## Net effect on 1.36 today +**HELD, quiet** — waiting on kyverno + ESO (upstream) + gpu-operator (pinned); +Calico listed as the lone actionable piece. No nightly Failed Job, no alert — +just the nightly report's ⏸️ line. Flips to actionable (→ alert) only once +kyverno/ESO ship support **and** gpu-operator is unpinned. + +## Tests (TDD) +- `compat-gate`: waiting / actionable / pinned-is-held / mixed-held-wins, + removed-API & containerd are actionable, exit_code mapping, + existing + patch/safe cases stay green. +- `nightly-report`: held headline + grouped reasons; existing tests stay green. +- `upgrade-step.sh`: shellcheck; manual review of the HALT_CHAIN + gauge flow + (bash, not unit-tested). + +## Out of scope (separate follow-up) +Auto-refreshing the matrix when upstream ships 1.36 support (a periodic +addon-readiness probe). This change only *consumes* the matrix. 
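The reason grouping described for `nightly-report.py` above can be sketched as follows. The real script is Python; this is a shell sketch kept in the same language as the other examples in these docs, and the function names (`print_group`, `group_reasons`) and exact output layout are assumptions made for illustration.

```sh
# Print one report section for a tag, or nothing if no reason carries it.
print_group() {
  heading=$1
  tag=$2
  reasons=$3
  matches=$(printf '%s\n' "$reasons" | grep -F "[$tag]") || return 0
  printf '%s:\n' "$heading"
  printf '%s\n' "$matches" | sed 's/^/  - /'
}

# Group tagged reasons under the three headings; untagged lines fall back to
# plain bullets so older reason strings still render.
group_reasons() {
  reasons=$1
  print_group "Action needed" ACTIONABLE "$reasons"
  print_group "Waiting on upstream" WAITING "$reasons"
  print_group "Pinned (held by us)" PINNED "$reasons"
  printf '%s\n' "$reasons" \
    | grep -Ev '\[(ACTIONABLE|WAITING|PINNED)\]' \
    | grep -v '^$' \
    | sed 's/^/- /' || true
}
```

Calling `group_reasons` with a mix of tagged and untagged lines prints only the sections that have entries, followed by any untagged lines as plain bullets.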
diff --git a/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md b/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md new file mode 100644 index 00000000..daf5006a --- /dev/null +++ b/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md @@ -0,0 +1,128 @@ +# Post-Mortem: MetalLB ServiceL2Status Stuck Immutable → PG LB VIP Flap → Woodpecker CI Tier 1 Applies Broken + +| Field | Value | +|-------|-------| +| **Date** | 2026-05-16 (mitigated) / 2026-05-26 (closed) | +| **Duration** | ~25 days of degraded CI (2026-04-21 first observed → 2026-05-16 mitigated). Symptom-only; no human-visible service downtime. | +| **Severity** | SEV3 — Woodpecker CI default.yml apply step failed on Tier 1 (PG-backend) stacks. Drift-detection ran silently broken. Manual `scripts/tg apply` continued to work. No data loss, no app downtime. | +| **Affected Services** | Woodpecker CI pipelines applying any of the 28+ Tier 1 stacks (monitoring, crowdsec, authentik, headscale, etc.). PostgreSQL backend itself was healthy. | +| **Issue** | Beads `code-aoxk` (closed 2026-05-26). | +| **Status** | Closed | + +## Summary + +The failure surfaced in Woodpecker CI as `ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc` from `scripts/tg` whenever a pipeline tried to apply a Tier 1 stack. The error was misleading on two counts: + +1. **Vault was healthy.** A direct `vault read database/static-creds/pg-terraform-state` from inside a Woodpecker pipeline pod (using K8s SA JWT → `auth/kubernetes/login role=ci`) succeeded every time when run in isolation. +2. **The "Cannot read PG credentials" message in `scripts/tg` was a catch-all** that fired for *any* Terraform/Terragrunt failure during PG state-lock acquire-release, including TCP RSTs against the PG LoadBalancer VIP.
+ +Actual root cause: the MetalLB `ServiceL2Status` CR for the `postgresql-lb` service (`dbaas` namespace, VIP `10.0.20.200`) had a stuck `status.node` field that the controller treated as immutable. The L2 speaker kept failing to update it with `Invalid value: "k8s-nodeX": Value is immutable`, so the leader-elected announcer flapped between k8s-node3 and k8s-node4 every few seconds. Each flap dropped open TCP connections (RST). Terraform's state-lock acquire → operation → release sequence straddled flaps and failed mid-operation. `scripts/tg` surfaced this as the misleading "Cannot read PG credentials" message. + +Manual `scripts/tg apply` from the DevVM kept working because the developer's session happened to land on whichever node currently held the VIP and complete fast enough to not straddle a flap. CI pipelines, being slower (full stack walk), reliably straddled at least one flap. + +## Impact + +- **CI degradation**: Tier 1 stack changes pushed to master were NOT auto-applied. Required manual `scripts/tg apply` from DevVM after every push touching one of 28+ stacks. +- **Drift-detection broken**: The daily `drift-detection.yml` Woodpecker pipeline silently failed on every Tier 1 stack — meaning unannounced manual changes to those stacks could have persisted undetected for the duration. +- **No user-facing outage**: PG cluster itself, all apps that use PG, and all in-cluster traffic to `10.0.20.200` worked normally. Only the very specific `acquire-state-lock → run operation → release-state-lock` round-trip pattern from CI was unreliable. + +## Timeline (UTC) + +| Time | Event | +|------|-------| +| 2026-04-21 | First broken CI pipelines (#411, #412, #413). Drift-detection failures noticed. `code-aoxk` filed. Initial hypothesis: Vault auth/role mismatch. | +| 2026-04-22 — 2026-05-15 | Multiple investigation attempts. Verified Vault K8s `auth/kubernetes/role/ci` has correct policies (`terraform-state`, `ci`). 
Verified `database/static-creds/pg-terraform-state` exists, rotates on schedule, credentials valid. Could not reproduce the failure in isolated `vault read` from Woodpecker pods. | +| 2026-05-16 (~12:14 UTC) | `pg-cluster-3` came up (third CNPG replica); endpoint set churn likely triggered MetalLB L2 announcer to attempt to update the existing `ServiceL2Status` CR (was `l2-rgt9d`). Update was rejected as immutable. Speaker kept retrying. VIP flapped. | +| 2026-05-16 | RCA breakthrough: noticed `kubectl logs -n metallb-system -l component=speaker` was full of `Invalid value: "k8s-node…": Value is immutable` on the postgresql-lb ServiceL2Status. Correlated with `kubectl get servicel2status` returning multiple stale entries for the same service. | +| 2026-05-16 | **Mitigation**: `kubectl delete servicel2status.metallb.io l2-rgt9d -n metallb-system`. Speaker recreated the CR cleanly (became `l2-zj9ss`). Flap stopped. PG connections stable. Manual CI re-runs of `monitoring` stack apply succeeded immediately. | +| 2026-05-17 | Audit: acceptance criteria 1 + 2 met implicitly. #3 (post-mortem) remained pending. Beads task reverted from `in_progress` → `open`. | +| 2026-05-25 | Node2 SCSI LUN remap → encrypted PVC emergency_ro → containerd boltdb corruption outage. Unrelated, but pulled Woodpecker server off node2. Subsequent server pod restart on k8s-node4. | +| 2026-05-26 | Verification: from a live Woodpecker pipeline pod (`wp-01kshph6pa0w6ch0zf5x9bfqgr`), `vault write auth/kubernetes/login role=ci jwt=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)` succeeded. `vault read database/static-creds/pg-terraform-state` returned valid creds (`username=terraform_state`, last_vault_rotation 2026-05-21, TTL 58h). Live `default.yml` pipeline confirmed applying Tier 1 stacks: dbaas, authentik, crowdsec, monitoring, nvidia, cloudflared, kyverno, metallb — all `OK`. `postgresql-lb` ServiceL2Status currently single allocation (`l2-sv9vv` on k8s-node3, no flap). 
Beads task closed. | + +## Root Cause + +`metallb-speaker` reconciler in the deployed MetalLB version treats `ServiceL2Status.status.node` as immutable after first set. When the L2 announcer's leader-election picks a different node to announce a given VIP (which happens on speaker pod restart, node loss, endpoint set churn, or pod-anti-affinity reshuffles), the reconciler fails to patch the existing CR and gets stuck in a retry loop. Without manual deletion, the reconciler will not progress. + +Why it manifested as Vault credential errors: + +1. CI's `scripts/tg` pre-flight runs `vault read database/static-creds/pg-terraform-state` (line 83 in current code) to get PG credentials. That call succeeds. +2. CI then runs `terragrunt apply` against the Tier 1 stack. Terragrunt connects to `10.0.20.200:5432` for state-lock acquire (via `pg_advisory_lock`). The TCP connection lands on whichever node MetalLB last announced the VIP from. +3. Mid-operation, MetalLB tries to re-announce from a different node, sends gratuitous ARPs, and the upstream switch updates its MAC table. Open TCP sessions on the previous announcer's node are immediately RST. +4. Terragrunt's state-lock release (or any subsequent PG operation) fails with broken pipe / connection refused. +5. `scripts/tg` interpreted the wrapper-level failure as "PG creds bad" because that's the most common failure mode it handles. The actual error from terragrunt was buried in `2>/dev/null` suppression (since fixed — see Fix #1 below). + +## Detection + +We did not have any of: +- A direct alert for "MetalLB ServiceL2Status reconciler errors". +- An alert for "PG LB VIP node changed N times in M minutes". +- An end-to-end probe for the CI state-lock pattern (terragrunt against `10.0.20.200`). + +Detection mechanism was a human reading `kubectl logs -n metallb-system` for unrelated reasons. Took 25 days from first observed symptom to RCA. + +## Fixes & Mitigations + +### 1. 
Surface real error from `scripts/tg` (DONE) + +The original `scripts/tg` swallowed the real `vault read` / terragrunt error behind `2>/dev/null` and printed a static "Cannot read PG credentials from Vault" message. Fixed in the script: + +```sh +# scripts/tg lines 79-89 (current) +if ! command -v vault >/dev/null 2>&1; then + echo "ERROR: vault CLI not found on PATH. Install it or use an image that includes it (ci/Dockerfile)." >&2 + exit 1 +fi +VAULT_OUT=$(vault read -format=json database/static-creds/pg-terraform-state 2>&1) || { + echo "ERROR: Cannot read PG credentials from Vault. Vault output follows:" >&2 + echo "$VAULT_OUT" >&2 + echo "" >&2 + echo "Hint: humans run 'vault login -method=oidc'; CI auths via K8s SA (role=ci)." >&2 + exit 1 +} +``` + +Comment in the code explicitly references this incident. + +### 2. Stuck-CR cleanup procedure (DOCUMENTED) + +Reproduction check for future sessions (also in `code-aoxk` beads notes): + +```sh +kubectl logs -n metallb-system -l component=speaker --tail=200 | grep -iE 'Invalid value.*immutable' +# If matches found → same root cause. Delete the stuck CR: +kubectl get servicel2status -n metallb-system +kubectl delete servicel2status.metallb.io -n metallb-system +``` + +Speaker recreates the CR cleanly within seconds. + +### 3. Long-term MetalLB controller fix (DEFERRED) + +The underlying bug — speaker not recreating the CR when the immutable field needs to change — is upstream MetalLB behaviour. Two paths possible: + +- **Upgrade MetalLB** to a version where this is fixed (needs research — check changelogs). +- **File upstream issue / patch** with reproducer. + +Not done as part of this post-mortem; tracked separately. Risk acceptance: until then, the manual `delete servicel2status` workaround is the playbook, and is fast (<10s). + +### 4. Alerting (DEFERRED) + +Suggested but not implemented: +- Prometheus alert on `metallb_speaker_reconcile_errors_total{kind="ServiceL2Status"}` rate. 
+- Synthetic probe: a CronJob that does `pg_advisory_lock` + release against the PG VIP every 5min from the CI namespace, alerting if it ever fails. + +Tracked as future hardening (no beads task yet — only worth filing if recurrence happens). + +## Lessons + +1. **`2>/dev/null` is a time-bomb.** It hid the real error for weeks. Fix #1 already lands the principle; audit other places in `scripts/` for the same anti-pattern next time we touch them. +2. **CRD `status.*` immutability is a non-obvious failure mode.** When debugging weird LB / VIP / endpoint behaviour, always grep speaker logs for `immutable`, `cannot update`, and reconciler errors. Add to cluster-health checks. +3. **Misleading wrapper errors cost weeks.** `scripts/tg` claimed "Cannot read PG credentials" — that's what the operator believed. The actual `vault read` step worked. The real failure was three steps later in a completely different subsystem. When a wrapper script makes a definitive claim about which subsystem failed, distrust it; reproduce the subsystem in isolation before chasing the claim. +4. **CNPG primary changes / endpoint churn can trigger L2 announcer flap.** The trigger (within the timeline) was likely the `pg-cluster-3` pod coming up. Worth flagging for any future CNPG topology changes. + +## References + +- Beads: `code-aoxk` — closed 2026-05-26. +- `scripts/tg` lines 65-95 — current pre-flight with explicit error surfacing. +- `kubectl get servicel2status -A` — current state, single allocation per service. +- This file: `infra/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md`.
diff --git a/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md new file mode 100644 index 00000000..e6b11816 --- /dev/null +++ b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md @@ -0,0 +1,97 @@ +# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24) + +> Filename kept for inbound links. The originally-suspected cause (kubeadm-config +> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC +> drift was a real *separate* latent bug fixed in the same change. + +**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached +the master control-plane phase for the first time — preflight passed, etcd +snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the +kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute +static-pod-hash window across all internal retries, then auto-rolled-back to +v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but +the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**. +No data loss; no user-facing outage (the master carries control-plane taints, so +no workloads were displaced). + +**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the +first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane +static pods, i.e. the first time the upgrade pushes real write-IO at etcd. + +## Root cause — etcd IO starvation on the shared HDD + +The new kube-apiserver could not establish/keep a working connection to etcd +during the upgrade because **etcd was IO-starved**. 
etcd's surviving container log +from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows: + +- **1,180** `apply request took too long` warnings in 16 minutes; +- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms), + clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying + to bring the new apiserver up. + +A reproduced 1.35.6 apiserver with no etcd dies with +`F instance.go:233 Error creating leases: error creating storage factory: context +deadline exceeded` — the same failure mode a multi-second etcd produces. etcd +lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on +shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto +that spindle: + +1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected); +2. kubeadm dumping a full **~400MB etcd DB backup** to + `/etc/kubernetes/tmp/kubeadm-backup-etcd-/` (on the same HDD) before the + etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never + cleans them up), pushing master root fs to **73%**, above the 70% kubelet + image-GC threshold, so image GC churned during the drain too; +3. master-drain pod evictions. + +### Correction — it was NOT the OIDC flag swap + +`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps +`--authentication-config` (structured multi-issuer OIDC) back to legacy +single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That +was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with +those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly +(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test +etcd. So the auth swap does **not** crash the apiserver; it was a red herring for +the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full +were also ruled out. 
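The slow-apply tally cited under the root cause can be reproduced with a one-line count over the etcd container log. This is a minimal sketch: `count_slow_applies` is a hypothetical helper name, and the log path is passed in as an argument because the exact pod directory is elided above.

```sh
# Count etcd "apply request took too long" warnings in a container log file.
# grep -c prints 0 and exits non-zero when nothing matches, hence `|| true`.
count_slow_applies() {
  grep -c 'apply request took too long' "$1" || true
}
```

For example, `count_slow_applies "$ETCD_LOG"` with `ETCD_LOG` pointing at the surviving `0.log` gives the number of warnings in that file; narrowing to a time window would additionally need a filter on the log's timestamp field.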
+ +## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift + +apiserver auth is configured in three places that must agree: +(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes` ++ `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest +(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM — +which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates +the manifest from (3), so it would have reverted structured auth → **dashboard + +kubectl SSO break after a successful upgrade** (recoverable: the chain's +post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash. + +## Resolution + +1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%. +2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps. +3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run). + +## Prevention (landed in this change) + +| Gap | Fix | +|-----|-----| +| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. 
| +| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. | +| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. | + +## Lessons + +- **Capture the failing component's own logs before concluding.** The `kubeadm + upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second + applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is + "what config changes," not "why it crashed." +- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm + 2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB + backup copy + drain) onto that spindle. code-oflt is the real fix. +- **Tools that leave per-operation scratch must be reaped.** kubeadm's + `/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never + GC'd; 28GB had silently accumulated. +- **Out-of-band control-plane edits must be written back to kubeadm-config** — else + `kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags). 
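
The scratch-pruning rule from the Prevention table (drop `kubeadm-backup-*` and `kubeadm-upgraded-manifests*` entries older than 3 days, best-effort, never abort) could look roughly like the sketch below. It is an illustration of the rule, not the actual `upgrade-step.sh` code: the function name `prune_kubeadm_scratch` and its two parameters are assumptions, and it relies on GNU `find`.

```bash
#!/usr/bin/env bash
# Sketch only: best-effort reaper for kubeadm's leaked scratch directories.
# Arguments (both hypothetical): scratch dir, max age in minutes (4320 = 3 days).
prune_kubeadm_scratch() {
  local dir="${1:-/etc/kubernetes/tmp}" max_age_min="${2:-4320}"

  # Nothing to do if the scratch directory does not exist.
  [ -d "$dir" ] || return 0

  # Only look at direct children, only the two kubeadm name patterns,
  # and only entries last modified more than max_age_min minutes ago.
  find "$dir" -mindepth 1 -maxdepth 1 \
    \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) \
    -mmin "+$max_age_min" -exec rm -rf -- {} + 2>/dev/null || true

  # Best-effort: a failed prune must never abort the caller.
  return 0
}
```

Matching on the two name patterns rather than clearing the whole directory keeps anything else kubeadm or an operator placed there intact, and returning 0 unconditionally mirrors the "never aborts" requirement.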
diff --git a/docs/runbooks/claude-auth-renew-workstation.md b/docs/runbooks/claude-auth-renew-workstation.md index f5ce6625..8156530e 100644 --- a/docs/runbooks/claude-auth-renew-workstation.md +++ b/docs/runbooks/claude-auth-renew-workstation.md @@ -11,6 +11,11 @@ inference every six hours and backs up only the `claudeAiOauth` object to: secret/workstation/claude-users/ ``` +The backup **merges** into that path (`vault kv patch -method=rw`, falling back to +`kv put` only when the path does not exist yet), so keys that other tools +co-locate there — notably `homelab vault`'s `vaultwarden_*` credentials — survive. +A blind `kv put` here silently wiped them on every six-hourly run (fixed 2026-06-26). + The user's unrelated `mcpOAuth` credentials never leave their home directory. Each renewal service has a distinct 32-day periodic Vault token, mode `0600`, at `~/.config/claude-auth-sync/vault-token`. Its policy can access only that user's @@ -75,8 +80,64 @@ sudo --preserve-env=VAULT_ADDR,VAULT_TOKEN /usr/local/bin/t3-provision-users ``` Never copy another user's `.credentials.json` or scoped Vault token. Never restore -the old shared `CLAUDE_CODE_OAUTH_TOKEN`; environment credentials outrank per-user -login and would silently collapse all users onto one identity. +a **shared** `CLAUDE_CODE_OAUTH_TOKEN` across users; environment credentials +outrank per-user login and would silently collapse all users onto one identity. +(A **per-user**, non-rotating setup-token tied to the user's OWN Enterprise +identity is a different, sanctioned thing — see "Long-lived per-user token" below.) + +## Long-lived per-user token (heavy concurrent-agent users) + +The six-hourly renewal above assumes Claude owns refresh-token rotation in a +single `~/.claude/.credentials.json`. 
A user who runs **many concurrent Claude +sessions** (interactive tmux panes + their `t3-serve` instance + always-on +`start-claude.sh` agents) breaks that assumption: when the shared access token +expires, the processes refresh **simultaneously**, the OAuth server rotates the +refresh token, and the losing writer persists an **empty** refresh token — +logging the user out roughly every access-token lifetime (~8h). Re-issuing the +credential does not help; the race recurs. + +The fix is a **per-user, long-lived setup-token** (`sk-ant-oat01-…`, ~1y, +**non-rotating**). With `CLAUDE_CODE_OAUTH_TOKEN` set, Claude uses it directly and +never touches `.credentials.json` — so there is nothing to race on. This is the +user's OWN Enterprise identity (scope `user:inference`; local MCP servers are +client-side and unaffected), stored only in their OWN Vault path — **NOT** the +forbidden shared token, and it never crosses OS users. + +**Enable it (one-time, per user):** + +1. The user mints their own token (interactive Enterprise SSO): + + ```bash + claude setup-token # opens an SSO URL; paste the code back -> prints sk-ant-oat01-… + ``` + +2. An admin stores it in that user's Vault path (MERGE, never `kv put` — siblings + like `claude_ai_oauth_json` / `vaultwarden_*` must survive): + + ```bash + vault kv patch -method=rw secret/workstation/claude-users/ \ + setup_token=sk-ant-oat01-… + ``` + +3. Materialize + activate (or just wait ≤6h for the timer): + + ```bash + systemctl start claude-auth-sync@.service + ``` + + `claude-auth-sync` writes `~/.config/claude-auth-sync/claude-oauth.env` + (`CLAUDE_CODE_OAUTH_TOKEN=…`, mode 0600) and, while a token is present, **skips** + the rotating-credential validate/backup/restore (so no false + `WorkstationClaudeAuthInvalid`). `start-claude.sh` and `t3-serve@.service` load + that env file. **Sessions started before activation keep the old credential + until relaunched** — the user must restart their agents / `t3-serve` to cut over. 
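
After step 3, a quick sanity check on the materialized env file can confirm the cut-over is in place before relaunching agents. The sketch below is hypothetical (the function name `check_oauth_env_file` is not part of `claude-auth-sync`); it assumes the file uses plain `KEY=value` lines and relies on GNU `stat`. It deliberately never prints the token itself.

```bash
#!/usr/bin/env bash
# Sketch only: verify the per-user OAuth env file exists, is mode 0600,
# and has a non-empty CLAUDE_CODE_OAUTH_TOKEN entry.
check_oauth_env_file() {
  local env_file="${1:-$HOME/.config/claude-auth-sync/claude-oauth.env}"

  [ -f "$env_file" ] || { echo "missing: $env_file"; return 1; }

  # The file holds a long-lived credential, so anything but 600 is a failure.
  [ "$(stat -c '%a' "$env_file")" = "600" ] \
    || { echo "wrong mode on $env_file"; return 1; }

  # Require at least one character after the "=" without echoing the value.
  grep -q '^CLAUDE_CODE_OAUTH_TOKEN=.' "$env_file" \
    || { echo "token not set in $env_file"; return 1; }

  echo "ok"
}
```

Run as the target user, the check prints `ok` when the file is ready; a failure most likely means the sync has not run yet or the `setup_token` field is empty.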
+ +**Disable it:** clear the field (`vault kv patch -method=rw +secret/workstation/claude-users/ setup_token=""`) — the next sync removes +the env file and the user reverts to the per-user SSO credential flow. + +**Rotate before expiry:** setup-tokens expire 1y after mint. Re-mint (step 1) and +re-store (step 2); the env file refreshes on the next sync. ## Verification diff --git a/docs/runbooks/goldmane-flow-trail.md b/docs/runbooks/goldmane-flow-trail.md new file mode 100644 index 00000000..dbf6f6d4 --- /dev/null +++ b/docs/runbooks/goldmane-flow-trail.md @@ -0,0 +1,346 @@ +# Goldmane Flow Trail — east-west "who-talks-to-whom" observability + +> As-built runbook for the Calico Goldmane + Whisker flow plane and the +> `goldmane-edge-aggregator` durable audit trail. Design + rationale: +> [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md). +> Glossary: `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**. +> Implements infra issues #57 (Whisker ingress), #58 (aggregator), #61 +> (monitoring), #62 (egress allowlist queries), #63 (these docs). + +## What the trail is + +Three layers turn raw east-west traffic into a queryable, durable record of +which Service talks to which. **Service identity = the workload's namespace** +(primary), refined by a `service-identity` label in the few multi-Service +namespaces (`monitoring`, `kube-system`, `dbaas`) — see ADR-0014. 
+ +| Layer | Component | Lifetime | Where it lives | +|---|---|---|---| +| **Live map** | Calico **Goldmane** + **Whisker** | ~60-min in-memory ring buffer (lost on Goldmane restart) | `calico-system`; Whisker UI at `whisker.viktorbarzin.me` | +| **Durable trail** | `goldmane-edge-aggregator` (`aggregate` mode) | persistent | CNPG Postgres DB `goldmane_edges`, table `edge` | +| **Notification** | `goldmane-edges-digest` CronJob (`digest` mode) | daily | Slack `#alerts` | + +**Goldmane** aggregates identity-stamped flows (namespace / pod / workload / +labels + allow-deny + policy-trace) streamed from Felix (the existing +`calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer — +**nothing is written to etcd or the K8s API** (the etcd-cost constraint that +drove the whole design). **Whisker** is its live web UI. Because the ring +buffer is *not* a trail (a Goldmane restart loses the window), the +`goldmane-edge-aggregator` consumes Goldmane's gRPC `Flows.Stream` API over +mTLS and upserts the unique **namespace-pair edge set** into Postgres; a daily +CronJob posts first-seen edges to Slack. + +The edge set is deliberately **low-cardinality** — one row per +`(src_ns, dst_ns, action)`, *not* per-pod or per-port — so the table stays +small no matter how much traffic flows. + +## Where the data lives + +### Whisker UI — live, ~60 min +- `https://whisker.viktorbarzin.me` (Authentik-gated — Whisker ships no own + login; `auth = "required"`). Shows the live flow stream + a service graph for + roughly the last hour. Use it for "what is talking right now"; it is **not** + history. +- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081` + (HTTP), both in `calico-system`. +- **DNS fix + self-heal:** whisker's egress to the kube-dns ClusterIP is allowed + by `whisker-allow-dns-clusterip` (`stacks/calico`) — without it the UI goes + empty after any gRPC-stream break (see Troubleshooting → "Whisker UI empty"). 
+ The `whisker-watchdog` CronJob (every 10 min) is a backstop that restarts + whisker if its backend ever wedges for another reason. + +### CNPG `goldmane_edges` — durable +- Postgres DB `goldmane_edges` on the CNPG cluster + (`pg-cluster-rw.dbaas.svc.cluster.local:5432`). One table: + + ``` + edge(src_ns text, dst_ns text, action text, + first_seen timestamptz, last_seen timestamptz, flow_count bigint, + PRIMARY KEY (src_ns, dst_ns, action)) + ``` + + - `action` ∈ `allow` / `deny` / `pass` / `unspecified` (normalised Goldmane + action). + - **Self-edges (`src_ns == dst_ns`) and empty-namespace flows** (host-endpoint + / public-internet) are **dropped** — the trail is about in-cluster service + relationships only. (Egress to the public internet is therefore NOT in this + table; it lives in the Wave-1 Calico flow-log path — see security.md.) + - A **"new edge"** = a row whose `first_seen` falls inside the digest window. + - Role `goldmane_edges` (Vault-rotated, 7-day) owns the DB. The `edge` table + is created idempotently by the aggregator at startup (canonical DDL also in + the repo at `migrations/0001_edge.sql`). + +### Slack `#alerts` — daily digest + +> **Channel note (2026-06-25):** posts to **`#alerts`**. The dedicated `#security` channel was abandoned — the shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of it, so a channel override there returns HTTP `404 channel_not_found`. Everything now posts to `#alerts` (this digest plus alertmanager's `slack-security` receiver, which keeps its `[SECURITY]` styling so security-lane alerts still stand out there). + +- CronJob `goldmane-edges-digest` (08:00 Europe/London) posts edges first seen + in the last 24h. Quiet when there are none. Reuses the existing alert-digest + Slack incoming webhook (Vault `secret/viktor` → `alertmanager_slack_api_url`) + — no new webhook was created. 
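
The edge-set rules described above (one row per `(src_ns, dst_ns, action)`, with self-edges and empty-namespace flows dropped) can be illustrated offline with a few lines of `awk`. This is only a sketch of the aggregation rule, not the aggregator's implementation, which upserts into Postgres; the tab-separated input format and the function name `aggregate_edges` are assumptions made for the example.

```bash
#!/usr/bin/env bash
# Sketch only: collapse a stream of flows into the low-cardinality edge set.
# Input  (stdin): one flow per line, tab-separated: src_ns, dst_ns, action.
# Output (stdout): tab-separated src_ns, dst_ns, action, flow_count; sorted.
aggregate_edges() {
  awk -F'\t' '
    # Empty namespace = host-endpoint / public-internet flow: not an edge.
    $1 == "" || $2 == "" { next }
    # Self-edge (traffic inside one namespace): not an edge either.
    $1 == $2 { next }
    # Everything else counts toward its (src_ns, dst_ns, action) row.
    { count[$1 FS $2 FS $3]++ }
    END { for (key in count) print key FS count[key] }
  ' | sort
}
```

Because the key has no pod or port component, the output size is bounded by the number of namespace pairs rather than by traffic volume, which is the property the table design relies on.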
+ +## How to enable / disable + +### Goldmane + Whisker (the flow plane) +Operator CRs in **`stacks/calico/main.tf`** — NOT the Helm `goldmane`/`whisker` +flags (those stay `false`; the operator's own `installation`/`apiServer` are +operator-managed via the `goldmanes`/`whiskers.operator.tigera.io` CRDs): + +- `kubectl_manifest.goldmane` (kind `Goldmane`) — creating it makes the operator + re-render `calico-node` with the `FELIX_FLOWLOGSGOLDMANESERVER` env (the + operator auto-wires Felix — **do NOT patch FelixConfiguration**), triggering a + supervised `calico-node` DaemonSet roll. Yields `Deployment` + `Service + goldmane:7443`. +- `kubectl_manifest.whisker` (kind `Whisker`, `depends_on` goldmane; + `notifications = Disabled`). Yields `Deployment` + `Service whisker:8081`. + +**To disable:** delete those two CRs and re-apply `stacks/calico`. Reversible +toggle (Goldmane is tech-preview in OSS Calico 3.30 — the main standing risk per +ADR-0014). + +### Whisker public ingress (infra #57) +Also in `stacks/calico/main.tf`: +- `module "ingress_whisker"` (`ingress_factory`, `auth = "required"`, + `dns_type = "proxied"`) → `whisker.viktorbarzin.me`. +- `kubernetes_network_policy_v1.whisker_allow_traefik` — **required alongside the + ingress**: the operator's own `whisker` NetworkPolicy (owned by the Whisker CR) + is `policyTypes: [Ingress]` with no rules = default-deny ingress to the pod. + This additive NP ORs in an allow for `namespaceSelector + kubernetes.io/metadata.name=traefik` on TCP 8081. Without it Traefik 502s. + +### The aggregator + digest (the durable trail) — `stacks/goldmane-edge-aggregator` +A Tier-1 stack (PG state) mirroring the claude-memory pattern. `scripts/tg +apply` from `stacks/goldmane-edge-aggregator/`. It provisions: the namespace, +the mTLS client material, the Postgres DB-init Job, the `DATABASE_URL` +ExternalSecret (Vault static role `pg-goldmane-edges`), the Slack ExternalSecret, +the `aggregate` Deployment, and the `digest` CronJob. 
**To disable the trail +without touching the flow plane:** scale `deployment/goldmane-edge-aggregator` to +0 (transient) or remove the stack (permanent) — Goldmane/Whisker keep running. + +Image: `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (PRIVATE) — the +`goldmane-edge-aggregator` namespace must be in the `ghcr-credentials` Kyverno +allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`, +`local.ghcr_private_namespaces`) or pulls 401. Code repo: +`~/code/goldmane-edge-aggregator` (see its `README.md` + `DEPLOY.md`). + +## mTLS cert — the REUSE decision (cert-reuse gotcha) + +The aggregator dials `goldmane:7443` over **mutual TLS**. Goldmane requires the +client cert to chain to the **Tigera CA**, but it does **NOT authorize by client +identity** — any Tigera-CA-signed cert is accepted. + +Rather than copy the Tigera CA **private key** into Terraform state to mint our +own cert (a needless CA-key exposure; the `hashicorp/tls` provider also clashes +with this repo's global generate-providers/lockfile pattern), the stack +**REUSES the operator-minted, Tigera-CA-signed `whisker-backend-key-pair` +Secret** (`calico-system`), copying its `tls.crt`/`tls.key` into the +`goldmane-client-tls` Secret in the aggregator namespace. The CA *bundle* that +verifies Goldmane's serving cert (`tigera-ca-bundle` ConfigMap, key +`tigera-ca-bundle.crt`) is likewise copied verbatim (a ConfigMap can't be +cross-namespace-mounted). + +> **GOTCHA — if the operator rotates `whisker-backend-key-pair`, re-apply +> `stacks/goldmane-edge-aggregator`** to re-sync the copied cert. Symptom of a +> stale copy: the `aggregate` pod logs TLS handshake / `Flows.Stream` failures +> and no `last_seen` updates land in the `edge` table. Hardening follow-up +> (noted in the stack): mint an own-identity cert in-namespace if Whisker is ever +> removed (which would delete the reused source Secret). 
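
One way to spot a stale copied cert before it shows up as stream failures is to compare the source Secret with the copy. The sketch below is an assumption-laden illustration, not something the stack ships: the function names are hypothetical, and it assumes the copy keeps the `tls.crt` key name and lives in the `goldmane-edge-aggregator` namespace. It uses only read-only `kubectl get`.

```bash
#!/usr/bin/env bash
# Sketch only: detect drift between the operator-minted cert and our copy.

# Print a SHA-256 digest of the (still base64-encoded) tls.crt in a Secret.
# Comparing the encoded form is enough to detect a mismatch.
secret_crt_digest() {
  local ns="$1" name="$2" crt
  crt="$(kubectl -n "$ns" get secret "$name" -o 'jsonpath={.data.tls\.crt}')" || return 1
  [ -n "$crt" ] || return 1
  printf '%s' "$crt" | sha256sum | cut -d' ' -f1
}

# Exit status: 0 = copy is stale, 1 = copy matches, 2 = a Secret was unreadable.
client_cert_is_stale() {
  local src copy
  src="$(secret_crt_digest calico-system whisker-backend-key-pair)" || return 2
  copy="$(secret_crt_digest goldmane-edge-aggregator goldmane-client-tls)" || return 2
  [ "$src" != "$copy" ]
}
```

If `client_cert_is_stale` returns 0, re-applying `stacks/goldmane-edge-aggregator` re-syncs the copy; a return of 2 means the comparison itself could not be made and says nothing about staleness.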
+ +The Deployment leaves `GOLDMANE_HOST=goldmane.calico-system.svc.cluster.local:7443` +and the default cert/CA paths; the default ServerName (host sans port) is a SAN +on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` / +`GOLDMANE_TLS_INSECURE` override is needed. + +## How to query who-talks-to-whom + +**Quickest — the `homelab edges` CLI** (the investigation helper; read-only +SELECT against the DB via the dbaas primary pod, no creds/SQL to remember): + +``` +homelab edges --ns # edges touching (either direction) +homelab edges --peers-of # 's distinct peer namespaces +homelab edges --src # 's egress peers (--dst for ingress) +homelab edges --new-since 24h # edges first seen in the last day (or a date) +homelab edges --denied # blocked / lateral-movement attempts +homelab edges --json [...] # machine-readable, for agents/pipelines +homelab edges --help # full flag list +``` + +For ad-hoc SQL, `psql` into the DB (creds: Vault static role +`static-creds/pg-goldmane-edges`, or exec a CNPG pod). All queries are against +the single `edge` table. 
+ +```sql +-- Everything talking to a namespace (inbound), most-active first +SELECT src_ns, action, flow_count, first_seen, last_seen +FROM edge WHERE dst_ns = '' ORDER BY flow_count DESC; + +-- Everything a namespace talks TO (outbound) +SELECT dst_ns, action, flow_count, first_seen, last_seen +FROM edge WHERE src_ns = '' ORDER BY last_seen DESC; + +-- New edges in the last 24h (what the digest reports) +SELECT src_ns, dst_ns, action, flow_count, first_seen +FROM edge WHERE first_seen > now() - interval '24 hours' +ORDER BY first_seen DESC; + +-- Any DENIED edges (policy is dropping this pair) +SELECT src_ns, dst_ns, flow_count, last_seen +FROM edge WHERE action = 'deny' ORDER BY last_seen DESC; + +-- Full edge set as a graph adjacency list +SELECT src_ns, dst_ns, action, flow_count FROM edge ORDER BY src_ns, dst_ns; +``` + +For the **live** (sub-hour) view including pod/port detail, use the Whisker UI — +the `edge` table intentionally aggregates that away. + +## Deriving the Wave-1 egress allowlist from the edge table (infra #62) + +The durable edge set is a faster, identity-stamped data source for the existing +**observe-then-enforce** egress effort (beads `code-8ywc`; snapshot +`docs/architecture/wave1-egress-observation-2026-05-22.md`) than the original +iptables-`LOG` → journald → Loki path (ADR-0014 consequence: "Enforcement gains +a better data source"). It replaces the *internal* (namespace-to-namespace) leg +of the allowlist; **external/public-internet egress is NOT in this table** (empty +dst namespace, dropped) — for those destinations keep using the Calico flow-log +path described in security.md. 
+ +**Per-namespace internal egress allowlist** — the set of in-cluster namespaces a +given source is *observed* talking to with `action='allow'`: + +```sql +-- Internal egress allowlist for one namespace (feeds its NetworkPolicy) +SELECT DISTINCT dst_ns +FROM edge +WHERE src_ns = '' AND action = 'allow' +ORDER BY dst_ns; +``` + +```sql +-- Full internal egress matrix for all namespaces at once +SELECT src_ns, array_agg(DISTINCT dst_ns ORDER BY dst_ns) AS allowed_dst_ns +FROM edge +WHERE action = 'allow' +GROUP BY src_ns +ORDER BY src_ns; +``` + +```sql +-- Sanity: namespaces with a DENY edge already (policy is biting; investigate +-- before tightening further) +SELECT DISTINCT src_ns, dst_ns FROM edge WHERE action = 'deny'; +``` + +**How this feeds enforcement (scope):** the derived `dst_ns` set is the +*internal* half of a namespace's egress allowlist — it tells you which +in-cluster namespaces to permit before flipping that namespace to default-deny. +The universal baseline (kube-dns :53, often dbaas :3306/:5432, redis :6379) and +the external destinations still come from the Wave-1 observation snapshot. +**Enforce-flips remain OUT OF SCOPE** here — this is observe-and-derive only; +the phased per-namespace default-deny rollout (starting `recruiter-responder`) +is tracked under `code-8ywc`. Cross-links: +[security.md → NetworkPolicy Default-Deny Egress](../architecture/security.md#networkpolicy-default-deny-egress-wave-1--observe-then-enforce-tier-34), +[wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md), +[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md). + +> **Caveat (same as the Wave-1 snapshot):** an edge only exists if it was +> *observed*. A weekly CronJob or a 7-day Vault rotation may not have fired yet — +> collect ≥7 days of edges before treating a namespace's `allow` set as +> complete. 
The `first_seen` column tells you how long an edge has been known; +> the digest surfaces brand-new ones daily. + +## Monitoring & health (infra #61) + +The aggregator pod has **no `/metrics` endpoint** — health is inferred from +kube-state-metrics. Three complementary signals (memory ids 6598, 6599; +see also [monitoring.md → Security Alerts](../architecture/monitoring.md#security-alerts-wave-1--planned-beads-code-8ywc)): + +| Signal | What | Where | +|---|---|---| +| **`AggregatorDown`** | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` for 15m → warning | Prometheus alert group `Network Observability (Goldmane)` in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`; routes `slack-warning` → `#alerts` | +| **`DigestFailing`** | `kube_job_status_failed{...job_name=~"goldmane-edges-digest.*"} > 0` within 24h, for 30m → warning | same alert group → `#alerts` | +| **cluster-health #48** | `check_goldmane_aggregator` reads the Deployment's `Available` condition (missing or not-Available → FAIL) | `scripts/cluster_healthcheck.sh` (human / `--quiet` / `--json` modes; emits `goldmane_aggregator`) | + +The two alert layers are deliberately complementary: `AggregatorDown` → +**no new edges land** in the DB; `DigestFailing` → **edges still land but nobody +is told**. A freshness probe (#61b) was intentionally skipped — `AggregatorDown` +is the agreed floor. + +## Troubleshooting + +**Whisker UI 502 / unreachable.** The additive +`kubernetes_network_policy_v1.whisker_allow_traefik` is missing or the +operator's default-deny `whisker` NP regenerated — re-apply `stacks/calico`. A +brand-new ingress host is also invisible to LAN split-horizon until the hourly +`technitium-ingress-dns-sync` runs (memory #5349); test meanwhile with +`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me` +(expect a 302 to Authentik — the gate working). 
+ +**Whisker UI empty (but reachable — 302s to Authentik fine).** ROOT CAUSE (the +2026-06-28 incident): the operator's own `whisker` NetworkPolicy is +policyTypes:[Ingress,**Egress**], and its egress allows DNS only to the kube-dns +*pods* (podSelector `k8s-app=kube-dns`). But whisker-backend resolves +`goldmane.calico-system.svc` via the kube-dns **ClusterIP** (10.96.0.10), and +**Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule**. +Verified: from the whisker pod's netns, ClusterIP DNS = 100% timeout while direct +kube-dns *pod-IP* DNS = OK, and a pod with no egress policy resolves fine. +whisker-backend resolves goldmane ONCE in the brief startup window before the +policy programs, holds its long-lived gRPC stream, and only re-resolves when that +stream breaks (e.g. a node-reboot blip) — at which point the blocked ClusterIP +DNS wedges its Go resolver (`failed to stream flows` / `code = Unavailable: dns +... i/o timeout` forever) and the UI goes blank. The durable **aggregator is a +SEPARATE pod in its own (unrestricted) namespace** and is unaffected. + +FIX (applied 2026-06-28): `kubernetes_network_policy_v1.whisker_allow_dns_clusterip` +(`stacks/calico`) — an additive egress NP allowing whisker → the kube-dns +ClusterIP (`10.96.0.10/32`) on 53/UDP+TCP; k8s egress policies are additive so +the operator NP is untouched. Backstop: the `whisker-watchdog` CronJob restarts +the pod if it ever wedges for another reason. Immediate manual heal: +`kubectl -n calico-system delete pod -l k8s-app=whisker`. Diagnose by comparing, +from the whisker pod's netns, `nslookup goldmane.calico-system.svc.cluster.local +10.96.0.10` (the ClusterIP — times out if the NP fix is missing) against the same +query aimed at a kube-dns *pod IP* (always works). + +**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate` +pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`). +Common causes, in order: +1. 
**Stale mTLS cert** — the operator rotated `whisker-backend-key-pair`; re-apply + `stacks/goldmane-edge-aggregator` (see cert-reuse gotcha above). Symptom: TLS + handshake / `Flows.Stream` errors. +2. **Stale DB password** — the 7-day Vault rotation bounced the credential but + the pod kept the old one. The Deployment carries + `secret.reloader.stakater.com/reload: goldmane-edges-db-creds`; if it's not + restarting on rotation, verify the Reloader annotation and the ExternalSecret. +3. **Goldmane restarted** — the in-memory window was lost (expected); the stream + reconnects automatically and resumes upserting. No data loss in the DB + (only the sub-hour live window in Whisker is gone). + +**Digest never posts / `DigestFailing` firing.** Inspect the most recent +`goldmane-edges-digest-*` Job (`kubectl get jobs -n goldmane-edge-aggregator`; +`kubectl logs job/`). The CronJob's `ttl_seconds_after_finished=86400` GCs +pods after a day, so check soon after a failed run. With `SLACK_WEBHOOK_URL` +empty the binary forces a dry-run (no post) — verify the `goldmane-edges-slack` +ExternalSecret resolved. A dry run / smoke test: run the image with `args: +["digest"]` + `DRY_RUN=1` to print the message instead of POSTing. +> Resolved (2026-06-28): the digest posts cleanly to `#alerts` +> (`lastSuccessfulTime` current, `DigestFailing` clear; e.g. the 2026-06-28 08:00 +> London run reported "8 new edges in last 24h"). The 2026-06-25 failures were +> the `#security` channel override returning HTTP 404 — the shared +> `alertmanager_slack_api_url` webhook's Slack app isn't a member of `#security`; +> consolidating all Slack output to `#alerts` fixed it. + +**No edges at all in the table.** Confirm Goldmane is enabled +(`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the +`FELIX_FLOWLOGSGOLDMANESERVER` env; confirm the `goldmane-edges-db-init` Job +completed; confirm the aggregator pod is `Running` and not `ImagePullBackOff` +(ghcr allowlist). 
+ +## Related +- [ADR-0014 — Service identity & east-west observability](../adr/0014-service-identity-and-east-west-observability.md) +- [security.md — NetworkPolicy Default-Deny Egress + east-west flow observability](../architecture/security.md) +- [monitoring.md — east-west flow observability + alerts](../architecture/monitoring.md) +- [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md) +- `CONTEXT.md` glossary — **Service identity**, **Goldmane / Whisker** +- Code: `~/code/goldmane-edge-aggregator` (`README.md`, `DEPLOY.md`); stacks + `stacks/goldmane-edge-aggregator`, `stacks/calico` diff --git a/docs/runbooks/homelab-vault-onboarding.md b/docs/runbooks/homelab-vault-onboarding.md new file mode 100644 index 00000000..b4bacced --- /dev/null +++ b/docs/runbooks/homelab-vault-onboarding.md @@ -0,0 +1,164 @@ +# `homelab vault` onboarding (Vaultwarden access + `vault kv` infra secrets) + +## Scope + +`homelab vault` fronts **two unrelated secret stores** — the name collides, so +the command keeps them clearly separated: + +- **Vaultwarden** — your personal *password manager* (logins/passwords/TOTP). + The verbs below give each devvm roster user no-HITL access to **their own** + Vaultwarden vault (and any Organization Collection shared with their account). + It shells out to the official `bw` CLI; the user's Vaultwarden credentials live + only in their isolated Vault path `secret/workstation/claude-users/` + and are decrypted as that OS user — the admin never sees them. +- **HashiCorp Vault / OpenBao** — the homelab *infra* secrets store (the + `secret/…` KV mount at `vault.viktorbarzin.me`), under `homelab vault kv`. + These use the caller's **own** Vault token (`vault login -method=oidc` → + `~/.vault-token`), **not** the scoped Vaultwarden token (which only reads the + `claude-users/` path); access is whatever your Vault policy grants. 
+ +```text +# Vaultwarden (password manager) +homelab vault setup one-time: store VW email + master password + API key +homelab vault status configured / unlocked / reachable (no secrets) +homelab vault list [--search Q] item names (no secrets) +homelab vault get [--field password|username|uri|notes|totp] [--json] +homelab vault get --all all fields (incl. custom) as JSON; pipe it (| jq) +homelab vault code current TOTP code +homelab vault lock lock / log out the local bw session + +# HashiCorp Vault / OpenBao (infra secrets; uses your own OIDC token) +homelab vault kv get [--field K] read an infra KV secret +homelab vault kv list list sub-paths +homelab vault kv put write one key (value via stdin; merges) +``` + +## How auth works (why a non-admin can use it) + +`homelab vault` runs `vault` as the calling user. It resolves a Vault token in +this order (`ensureVaultToken`, `cli/cmd_vault.go`): + +1. an explicit `$VAULT_TOKEN` (a deliberate override), then +2. the per-user **scoped token** that `claude-auth-sync` maintains at + `~/.config/claude-auth-sync/vault-token` (policy `workstation-claude-`), then +3. a native `~/.vault-token` (admins who carry one; non-admins usually don't). + +**The scoped token deliberately beats `~/.vault-token`.** This tool only touches +your own `secret/workstation/claude-users/` path, and a power-user who ran +`vault login -method=oidc` carries a read-only `~/.vault-token` (capability +`deny` on that path); letting it win would shadow the scoped token and fail every +op with `403 permission denied` (this is exactly what bit emo, 2026-06-28). The +CLI also **self-defaults `VAULT_ADDR`** to `https://vault.viktorbarzin.me` when +unset, so it works from non-login shells (tmux panes, AFK agent subprocesses) +that never sourced `/etc/environment` — otherwise every `vault` child hits the +`127.0.0.1:8200` default and fails `connection refused` (exit 2). 
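
The resolution order above can be restated as a small shell sketch. It is not the CLI's code (that logic lives in `ensureVaultToken` in `cli/cmd_vault.go`); the function names `vault_token_source` and `effective_vault_addr` are hypothetical, and the sketch reports only which source would win rather than printing any token.

```bash
#!/usr/bin/env bash
# Sketch only: mirror the documented token precedence and VAULT_ADDR default.

# Print which token source wins: "env", "scoped" or "native".
# Returns 1 when no token is available at all.
vault_token_source() {
  local scoped="$HOME/.config/claude-auth-sync/vault-token"
  local native="$HOME/.vault-token"

  if [ -n "${VAULT_TOKEN:-}" ]; then
    echo "env"        # 1. explicit override always wins
  elif [ -s "$scoped" ]; then
    echo "scoped"     # 2. per-user scoped token beats ~/.vault-token
  elif [ -s "$native" ]; then
    echo "native"     # 3. native token is the last resort
  else
    return 1
  fi
}

# Print the address to use, defaulting when VAULT_ADDR is unset or empty.
effective_vault_addr() {
  echo "${VAULT_ADDR:-https://vault.viktorbarzin.me}"
}
```

Running `vault_token_source` in a shell that fails with `403 permission denied` is a quick way to tell whether an exported `VAULT_TOKEN` is shadowing the scoped token.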
+ +That scoped policy grants exactly `create`/`read`/`update` on the user's own +`secret/workstation/claude-users/` path — no `patch` capability — so the +tool writes with `vault kv patch -method=rw` (read-modify-write), falling back to +`kv put` only when the path does not exist yet. This preserves the +`claude_ai_oauth_json` key that [claude-auth-sync](claude-auth-renew-workstation.md) +co-locates there. (The admin-only bugs were fixed 2026-06-27; the +`VAULT_ADDR`/token-precedence bugs above were fixed 2026-06-28.) + +## Prerequisites (per user) + +- The user is in `scripts/workstation/roster.yaml` and the **vault** stack has + been applied → their `workstation-claude-` policy exists. +- The user's workstation was provisioned (`setup-devvm.sh`) → their scoped Vault + token exists at `~/.config/claude-auth-sync/vault-token`. +- `bw` is installed **system-wide** at `/usr/bin/bw` (see below). +- The user has a Vaultwarden account at `https://vaultwarden.viktorbarzin.me` + (self-service signup is open; admin panel is disabled). + +## One-time admin steps (devvm) + +`bw` must be system-wide so every user resolves it (it is a Node script, and +`node` is already system-wide at `/usr/bin/node`). `setup-devvm.sh` installs it +to the npm `/usr` prefix; the guard checks the **system** path, not +`command -v bw` (an admin's own `~/.local/bin/bw` used to mask the system +install, leaving non-admins with no backend). To install on a running box: + +```bash +sudo npm install -g --prefix /usr "@bitwarden/cli@^2024" +bw --version # confirm /usr/bin/bw resolves +``` + +After landing a `cli/` change, rebuild the binary so users pick it up: + +```bash +# version is stamped from cli/VERSION, exactly as setup-devvm.sh does it +sudo bash -c 'cd /home/wizard/code/infra/cli && \ + go build -ldflags "-X main.version=$(cat VERSION 2>/dev/null || echo dev)" \ + -o /usr/local/bin/homelab .' +``` + +(or just re-run `scripts/workstation/setup-devvm.sh` as root, which rebuilds it.) 
+ +## User onboarding + +The user runs these as themselves. The master password / API key are entered +interactively (never on the command line) and stored only in the user's Vault +path. + +1. In the Vaultwarden web vault → **Settings → Security → Keys → View API key**, + copy the `client_id` (`user.xxxx`) and `client_secret`. +2. Configure: + + ```bash + homelab vault setup # prompts: VW email, API client_id/secret, master password + homelab vault status # → "vault: configured, unlocked, reachable ✓" + homelab vault list # item names (own vault + any shared Collections) + ``` + +## Shared-Collection access (sharing passwords with a user) + +`homelab vault` surfaces Organization Collection items automatically once the +user's Vaultwarden account is a confirmed member. These steps are done by the +vault owner in the **Vaultwarden web UI** (they need the owner's master +password — not an infra/Terraform operation): + +1. Create or reuse an **Organization** and a **Collection** of shared logins. +2. **Invite** the user's Vaultwarden account to the Organization, granting + **"Can view"** on that Collection (least privilege). +3. The user accepts the email invite and confirms membership. +4. The user runs `homelab vault list` — the shared items now appear alongside + their own (a `homelab vault status` sync picks them up). + +## Security model (the no-HITL trade) + +Identity is the kernel UID. Anything running as the user can decrypt the user's +vault — this is the accepted trade for no-human-in-the-loop fetches. Secrets +never appear in `argv` (passed via env or stdin), core dumps are disabled, TOTP +fetches are logged to syslog/Loki, and on a TTY values go to the clipboard +(auto-clearing) rather than scrollback. The admin's Vault token is never used by +a non-admin: each user authenticates with their own scoped token. 
+ +## Verification + +```bash +# the scoped token carries the right policy +VAULT_TOKEN="$(sudo cat /home//.config/claude-auth-sync/vault-token)" \ + vault token lookup -format=json | jq '.data.display_name, .data.policies' +# → "token-devvm-claude-auth-", [..., "workstation-claude-"] + +sudo -u -i bw --version # /usr/bin/bw resolves for the user +sudo -u -i homelab vault status +``` + +## Troubleshooting + +**`homelab vault setup` (or any verb) fails with `exit status 2`** — older +binaries swallowed the underlying `vault` error; the message now includes it. +Two historical causes (both fixed in-CLI 2026-06-28, kept here for diagnosis): + +- `... connection refused` to `127.0.0.1:8200` → `VAULT_ADDR` wasn't set in the + caller's shell. The CLI now self-defaults it, but if you see this on an old + binary: `export VAULT_ADDR=https://vault.viktorbarzin.me`. +- `403 permission denied` on `PUT .../secret/data/workstation/claude-users/` + → a stale read-only `~/.vault-token` (e.g. from `vault login -method=oidc`, + policy `default`, capability `deny` on that path) was shadowing the scoped + token. The CLI now prefers the scoped token; on an old binary, `rm + ~/.vault-token` (or `unset VAULT_TOKEN`) and retry. Confirm with + `VAULT_TOKEN="$(sudo cat /home//.config/claude-auth-sync/vault-token)" vault token capabilities secret/data/workstation/claude-users/` + → must be `create, read, update`. 
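
The two fixes described in the troubleshooting notes (a self-defaulted `VAULT_ADDR`, and the scoped token taking precedence over a stale `~/.vault-token`) can be sketched in shell. This is an assumption-laden illustration of the behaviour, not the CLI's actual code: the function names are hypothetical, the home directory is passed in as a parameter so the logic can be tried in a scratch directory, and the fallback to `~/.vault-token` when no scoped token exists is assumed rather than confirmed.

```bash
#!/usr/bin/env bash
# Sketch only: choose the token file, preferring the scoped token.
# $1 = home directory to inspect (hypothetical parameter).
resolve_vault_token_file() {
  local home="$1"
  local scoped="$home/.config/claude-auth-sync/vault-token"
  local legacy="$home/.vault-token"
  if [[ -r "$scoped" ]]; then
    printf '%s\n' "$scoped"        # scoped token wins even if a legacy token exists
  elif [[ -r "$legacy" ]]; then
    printf '%s\n' "$legacy"        # assumed fallback
  else
    return 1                       # no token available
  fi
}

# Sketch only: use the caller's VAULT_ADDR if set and non-empty,
# otherwise the address given in this runbook.
resolve_vault_addr() {
  printf '%s\n' "${VAULT_ADDR:-https://vault.viktorbarzin.me}"
}
```

With this ordering, a leftover token from an interactive OIDC login cannot shadow the scoped token, which is the failure mode behind the `403 permission denied` case above.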
diff --git a/docs/runbooks/k8s-version-upgrade.md b/docs/runbooks/k8s-version-upgrade.md index 08d43926..4b4b42b0 100644 --- a/docs/runbooks/k8s-version-upgrade.md +++ b/docs/runbooks/k8s-version-upgrade.md @@ -36,11 +36,13 @@ envsubst on /template/job-template.yaml | kubectl apply -f - ▼ Job 0 — preflight (pinned: k8s-node1) - ├── compat-gate: addon/API/containerd support for target (else BLOCK+alert) + ├── compat-gate: addon/API/containerd support for target (else BLOCK-actionable+alert / HOLD-quiet) ├── All nodes Ready + no Mem/Disk pressure ├── halt-on-alert (kured-style ignore-list) ├── 24h-quiet baseline (no Ready transitions <24h ago) ├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume) + ├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block) + ├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups) ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s) ├── Trigger backup-etcd Job, wait, verify snapshot byte count ├── SSH master: containerd skew fix (if master < workers) @@ -112,18 +114,36 @@ inert for a patch (no API removal or containerd floor occurs inside a minor). This is the **"auto-upgrade when we can, halt + alert when we can't"** contract. -**On a block**, the gate: -- pushes `k8s_upgrade_blocked=1` to Pushgateway (→ the `K8sUpgradeBlocked` - Prometheus alert), -- Slacks the **specific reasons** (which addon/API/node, current vs required), and -- **halts the chain** — it exits **non-fatal** (the upgrade simply isn't safe yet, - this is not a failure). Because the block happens **before any mutation, no - rollback is involved**; nothing was changed. 
+**The gate classifies each refusal** (2026-06-28) so it only cries wolf when +there's something to do — `compat-gate.py` exit code + a `[TAG]` on every reason: -**To clear a block**: upgrade the named addon (or migrate the API caller off the -deprecated group/version, or bump containerd on the named node) so the offending -condition no longer holds. The **next nightly run then proceeds automatically** — -no manual chain restart needed. +- **`[ACTIONABLE]`** (exit 2) — a newer version of the lagging addon **exists in + the compat matrix** and upgrading it would clear the block (or an in-use + deprecated API must be migrated / a node's containerd bumped). +- **`[WAITING]`** (exit 4 = held) — **no released addon version supports the + target yet** (e.g. kyverno/ESO behind a brand-new k8s minor). Only an upstream + release can clear it. +- **`[PINNED]`** (exit 4 = held) — a supporting version exists but the addon is + **deliberately pinned** in the matrix (`"pinned": true`, e.g. gpu-operator, + whose bump is coupled to a newer NVIDIA driver image + Ubuntu/kernel). +- **Held wins on a mix**: if any blocker is waiting/pinned the whole target is + held — acting on the actionable ones wouldn't unblock it yet. + +**On any refusal** the preflight pushes the verdict gauge (`k8s_upgrade_blocked=1` +for actionable, `k8s_upgrade_held=1` for held), sets `HALT_CHAIN` so the chain +doesn't advance, and **exits 0 — the Job Completes cleanly** (a refusal is a +decision, not a failure: no Failed Job, no `K8sUpgradeChainJobFailed`). It's +before any mutation, so no rollback. Reasons (grouped by class) appear in the +**morning nightly report**, not a per-run Slack. + +- **Actionable** → `K8sUpgradeBlocked` fires (once, via alert-on-change). Clear + it by doing the named upgrade/migration; the next nightly run proceeds. +- **Held** → **deliberately NO alert** — only the nightly report's `⏸️ HELD` + line, because it can't be actioned now (a nightly alert would cry wolf). 
It + clears itself once upstream ships support (refresh `addon-compat.json`) or the + pin is lifted (delete `pinned`+`pin_reason`). The detector re-evaluates every + night, silently re-spawning the refused-but-Complete preflight (so a cleared + block is picked up next run, not after the 7d Job TTL). The **compat matrix** lives in `stacks/k8s-version-upgrade/scripts/addon-compat.json` — a map of `addon → highest @@ -163,6 +183,8 @@ Pushed by upgrade-step.sh during phase execution; observed by the | `k8s_upgrade_in_flight` (1/0) | preflight Job (set to 1) | postflight Job (set to 0) | | `k8s_upgrade_started_timestamp` (epoch s) | preflight Job | postflight Job (set to 0) | | `k8s_upgrade_snapshot_taken` (1/0) | preflight Job (set to 1 after Job=`pre-upgrade-etcd-*` completes with `Backup done:` log of ≥1 KiB) | postflight Job (0) | +| `k8s_upgrade_blocked` (1/0) | preflight Job — set 1 on an **actionable** compat refusal (→ `K8sUpgradeBlocked`) | preflight (definitive each run; 0 when safe) / postflight (0) | +| `k8s_upgrade_held` (1/0) | preflight Job — set 1 on a **held** (waiting-upstream/pinned) refusal; **no alert** | preflight (definitive each run; 0 when safe) / postflight (0) | | `k8s_upgrade_available{kind,running,target}` | detection CronJob | next detection run (overwrite) | | `k8s_version_check_last_run_timestamp` | detection CronJob | (cumulative) | @@ -171,8 +193,8 @@ Pushed by upgrade-step.sh during phase execution; observed by the - **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout. - **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently. - **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor. 
-- **`K8sUpgradeChainJobFailed`** — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The `unless k8s_upgrade_blocked == 1` clause (added 2026-06-21) excludes a preflight that failed because the **compat gate deliberately refused** the target — that's owned by `K8sUpgradeBlocked` and was double-firing here; a genuine wedge exits without setting the blocked gauge, so it still fires. -- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). A k8s **auto-upgrade was refused** by the compat gate because a critical addon, an in-use deprecated API, or a node's containerd is too old for the detected target. The **specific reasons are in Slack**; clear it by upgrading the named addon / migrating the API caller / bumping containerd, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert. +- **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). 
Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The old `unless on() (k8s_upgrade_blocked == 1)` clause was **dropped 2026-06-28**: compat-gate refusals now Complete cleanly (exit 0) instead of Failing, so a terminally-Failed chain Job again means a genuine wedge with nothing to exclude. +- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). An **ACTIONABLE** compat-gate refusal — a newer version of the lagging addon exists and upgrading it would clear the block (or an in-use deprecated API must be migrated / a node's containerd bumped). Reasons (grouped by class) are in the **morning nightly report**; clear it by doing the named upgrade/migration, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert. **There is deliberately NO companion alert for the held verdict** (`k8s_upgrade_held=1` — waiting-on-upstream / pinned): nothing can be actioned now, so it is surfaced only by the nightly report's `⏸️ HELD` line. - The first four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade. ### Nightly upgrade report (Slack) @@ -181,8 +203,8 @@ CronJob `k8s-upgrade-nightly-report` (k8s-upgrade ns, `var.report_schedule`, default `7 6 * * *` = 06:07 UTC — after the 23:00 chain, before the 08:00 London alert-digest) posts ONE Slack summary each morning of the previous night's run: running version, detector freshness, detected target + kind, the outcome -(⚪ no upgrade needed / 🔴 blocked + live blocker reasons / 🟢 upgraded / -🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads +(⚪ no upgrade needed / 🔴 blocked-actionable + reasons / ⏸️ held = waiting-upstream/pinned / +🟢 upgraded / 🟡 in progress / ⚠️ detector stale), and recent chain jobs. 
Read-only — it reads the Pushgateway gauges + live nodes/jobs and re-runs `compat-gate.py` for fresh blocker reasons; reuses the chain's SA + `slack_webhook` + scripts ConfigMap. Logic + unit tests: `scripts/nightly-report.py`, `scripts/test_nightly_report.py`. @@ -222,22 +244,34 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names ## Common Operations -### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19) +### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24) `kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml` -and drops the `--authentication-config` flag**, silently disabling apiserver -OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get -401). This used to require a manual re-apply after **every** control-plane bump. +from kubeadm-config**. apiserver auth uses a structured multi-issuer +`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to +still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade +reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does +NOT crash on this — verified by isolated repro; it's recoverable via the restore +script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue — +etcd IO starvation**, not this drift; post-mortem: +`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`. -**Now automated:** the `rbac` stack publishes its OIDC restore script to the -`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's -`phase_master` re-runs it on master immediately after `kubeadm upgrade apply` -(while tigera-operator is still quiesced, so the flag-add apiserver restart can't -crashloop the operator). 
It's idempotent, health-gates `/livez` with -auto-rollback, and is **non-fatal** — a failure only lags SSO until the next rbac -apply (the version upgrade itself already succeeded). So a chain-driven -control-plane bump no longer breaks SSO. The master phase self-skips when master -is already at target, so this only runs when master was actually upgraded. +**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now +**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting +`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of +its remote script. So kubeadm regenerates a **correct** manifest and the apiserver +upgrades with a pure image bump — `kubeadm upgrade diff ` shows only the +image change. Zero live impact (the CM is read only during an upgrade). + +**Backstops:** +- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does + NOT block — the drift only breaks SSO, which is recoverable) if + `--authentication-config` would still be dropped. +- The `rbac` stack still publishes its restore script to the + `kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on + master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with + auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also* + re-reconciles kubeadm-config. Self-skips when master is already at target. **Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the chain logged `WARN: --authentication-config absent after re-apply`: diff --git a/docs/runbooks/pfsense-egress.md b/docs/runbooks/pfsense-egress.md new file mode 100644 index 00000000..39bca116 --- /dev/null +++ b/docs/runbooks/pfsense-egress.md @@ -0,0 +1,72 @@ +# Runbook: pfSense WAN / egress outage + +**Scope:** the cluster (and home) loses **internet egress** while pfSense is +otherwise alive — internal VLAN routing and DNS keep working. 
This is the +**2026-06-27 incident class**: pfSense (Proxmox **VMID 101**) stopped passing +IPv4 egress for ~20 min (00:02→00:23 UTC) while LAN/OPT1 routing + Unbound +stayed up; recovery required a manual reboot, and **nothing alerted** (no egress +probe existed; the cloudflared replica metric stayed green). The alerts + +probes below close that gap. Incident detail: memory ids #6715–#6723. + +pfSense is a **single point of failure** (no HA): it is the k8s default gateway +(`10.0.20.1`), Kea DHCP, Unbound DNS, NAT, and the WireGuard hub. WAN is +**static** `192.168.1.2/24`, upstream gateway `WANGW = 192.168.1.1` (the TP-Link +Archer AX6000). The sole IPv4 default gateway, no gateway-group/failover. + +## Alerts (all in `stacks/monitoring/modules/monitoring/`) + +| Alert | Signal | Means | +|-------|--------|-------| +| `WANGatewayUnreachable` (critical) | in-cluster ICMP to `192.168.1.1` fails >3m | pfSense's upstream gateway is unreachable from the cluster | +| `InternetEgressDown` (critical) | in-cluster ICMP to **both** `9.9.9.9` and `1.1.1.1` fails >2m | internet egress through pfSense NAT is black-holed | +| `ExternalDNSResolutionDown` (warning) | UDP/53 to both public resolvers fails >3m | egress or external-DNS path broken | +| `EgressOnlyDivergence` (critical) | t3-probe `cloudflare` leg down **while** `internal` leg up >3m | egress-specific failure, internal healthy (the exact 2026-06-27 signature) | +| `PfSenseVMDown` (critical) | `pve_up{id="qemu/101"}==0` while host up >2m | the pfSense VM stopped/crashed (host fine) | +| `CloudflaredTunnelConnLoss` (warning, Loki) | >20 cloudflared edge-conn failures/5m | tunnel/egress trouble (canary that fires first; replica metric is blind) | + +Probes run **from inside the cluster** (blackbox-exporter, pod → node → pfSense +NAT), so they exercise the exact egress path that fails. `WANGatewayUnreachable` +/ `InternetEgressDown` **inhibit** the downstream egress symptoms so one root +alert pages, not a storm. 
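
Two of the conditions in the table above combine more than one probe, and the sketch below spells out that logic. It is illustrative only: the real rules live in the monitoring stack as Prometheus expressions, the function names are hypothetical, the `for` durations are not modelled, and inputs are `1` for a succeeding probe and `0` for a failing one.

```bash
#!/usr/bin/env bash
# Sketch only: InternetEgressDown needs BOTH public targets to fail,
# so a single flaky resolver does not page.
# $1 = 9.9.9.9 probe result, $2 = 1.1.1.1 probe result (1 = ok, 0 = failed).
internet_egress_down() {
  local quad9_ok="$1" cloudflare_ok="$2"
  (( quad9_ok == 0 && cloudflare_ok == 0 ))
}

# Sketch only: EgressOnlyDivergence is the external leg down WHILE the
# internal leg is still up, i.e. an egress-specific failure.
# $1 = t3-probe `cloudflare` leg, $2 = t3-probe `internal` leg (1 = up, 0 = down).
egress_only_divergence() {
  local cloudflare_leg="$1" internal_leg="$2"
  (( cloudflare_leg == 0 && internal_leg == 1 ))
}

# Usage:
#   internet_egress_down 0 0 && echo "egress black-holed"
#   egress_only_divergence 0 1 && echo "egress-only failure, internal healthy"
```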
+ +`PfSenseVMDown` **does not** catch a *guest-internal* reboot — `pve_up` tracks +the qemu process, which survives an in-guest reboot (this is why 2026-06-27 was +metric-invisible). `CloudflaredTunnelConnLoss` + the probe alerts cover that case. + +## Diagnose (read-only first) + +1. **Confirm scope** — is it egress-only or total? + - `kubectl -n monitoring` Grafana → `probe_success{job=~"wan-gateway-icmp|internet-egress-icmp"}` and `t3probe_connected` by `leg`. + - Internal still up? `pve_up{id="qemu/101"}` should be `1`; internal k8s DNS (`10.0.20.1`) still resolving = pfSense alive, egress-only. +2. **Capture pfSense on-box logs BEFORE rebooting** (they persist on disk — no RAM-disk — and are the only source that proves the mechanism; they are NOT shipped to Loki): + ``` + ssh -i ~/.ssh/id_ed25519 admin@10.0.20.1 # devvm wizard key (id #6784) + clog /var/log/gateways.log | grep -iE 'WANGW|down|up|delay|loss' # dpinger gateway alarms + clog /var/log/routing.log | grep -iE 'default|route' # default-route add/delete + clog /var/log/system.log | tail -200 + netstat -rn | head # is the default route present? + ls -la /var/crash/ # panic/textdump? + ``` + (If SSH is rejected post-reboot, the reboot regenerated `authorized_keys` from + config.xml — re-add the key via console or WebGUI; see id #6718.) +3. **Upstream check** — is the TP-Link / ISP up? It held the same public IP with + clean DHCP renewals through the 2026-06-27 event, so a *sustained* upstream + fault is unlikely; a reboot fixing it points at **pfSense-side state**. + +## Recover + +- **Fast path (known fix):** reboot pfSense — re-adds the default route, re-arms + dpinger, flushes pf state. **Capture the logs above FIRST** (a reboot wipes + the volatile evidence needed to find the real mechanism). +- Targeted (if logs show a dpinger gateway-down): System → Routing → Gateways → + WANGW; check the monitor IP + dpinger state; re-enable the gateway / let it + re-eval. 
Confirm `netstat -rn` shows the default route restored. + +## Prevent / harden (deferred, needs a live-pfSense change) + +Not done in this monitoring change — tracked for a follow-up with hands-on +pfSense access: point dpinger's monitor at the local gateway (`192.168.1.1`) +instead of an external IP + widen thresholds; disable `gw_down_kill_states` for +the single WAN; add a failover gateway group; a 60s auto-recovery watchdog; +ship pfSense system/gateway/routing syslog to the cluster so these logs become +centrally queryable. diff --git a/scripts/cluster_healthcheck.sh b/scripts/cluster_healthcheck.sh index 51a13b5d..a5088137 100755 --- a/scripts/cluster_healthcheck.sh +++ b/scripts/cluster_healthcheck.sh @@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}" [[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config" KUBECTL="" JSON_RESULTS=() -TOTAL_CHECKS=47 +TOTAL_CHECKS=48 # Parallel execution settings. Each check function is self-contained — it # only reads cluster state and mutates the in-memory counters / JSON_RESULTS @@ -3156,6 +3156,44 @@ PYEOF esac } +# --- 48. Goldmane edge-aggregator availability --- +# +# The goldmane-edge-aggregator Deployment (ADR-0014 / infra #58) streams Calico +# Goldmane flows into the goldmane_edges CNPG DB — the durable who-talks-to-whom +# trail. The pod has NO /metrics endpoint, so its liveness can't be scraped; +# this check reads the Deployment's Available condition directly so the trail +# silently dying surfaces in the health board (mirrors the AggregatorDown +# Prometheus alert). Missing Deployment / not-Available -> FAIL. +check_goldmane_aggregator() { + section 48 "Goldmane Edge-Aggregator" + local ns="goldmane-edge-aggregator" dep="goldmane-edge-aggregator" + local avail desired ready + + # One get; absent Deployment is a hard fail (the trail isn't deployed). + if ! 
$KUBECTL get deploy "$dep" -n "$ns" >/dev/null 2>&1; then + [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator" + fail "Deployment $ns/$dep not found — who-talks-to-whom edge trail is not running" + json_add "goldmane_aggregator" "FAIL" "deployment missing" + return 0 + fi + + avail=$($KUBECTL get deploy "$dep" -n "$ns" \ + -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null) + ready=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.status.readyReplicas}' 2>/dev/null) + desired=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.spec.replicas}' 2>/dev/null) + ready=${ready:-0} + desired=${desired:-0} + + if [[ "$avail" == "True" ]]; then + pass "Edge-aggregator Available ($ready/$desired ready)" + json_add "goldmane_aggregator" "PASS" "${ready}/${desired} ready" + else + [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator" + fail "Edge-aggregator NOT Available ($ready/$desired ready) — edge trail has stopped recording" + json_add "goldmane_aggregator" "FAIL" "${ready}/${desired} ready; Available=${avail:-unknown}" + fi +} + # --- Summary --- print_summary() { if [[ "$JSON" == true ]]; then @@ -3224,7 +3262,7 @@ main() { check_monitoring_prom_am check_monitoring_vault check_monitoring_css check_external_replicas check_external_divergence check_pve_thermals check_pve_load check_external_traefik_5xx check_ha_status_dashboard - check_immich_search check_csi_ghost_drift + check_immich_search check_csi_ghost_drift check_goldmane_aggregator ) # Auto-fix mutates cluster state inside individual checks — keep that diff --git a/scripts/t3-provision-users.sh b/scripts/t3-provision-users.sh index 9cbc6c1e..1714596a 100644 --- a/scripts/t3-provision-users.sh +++ b/scripts/t3-provision-users.sh @@ -240,6 +240,79 @@ EOF log "wrote OIDC kubeconfig -> $user:~/.kube/config" } +# Hands-off chrome-service browser credential. 
For a user who has a +# `-browser` ServiceAccount in the chrome-service namespace (created in +# stacks/chrome-service/rbac.tf), install a DUAL-CONTEXT kubeconfig whose DEFAULT +# context authenticates with that SA's long-lived token — so `homelab browser` +# (which shells out to `kubectl port-forward -n chrome-service`) works +# non-interactively, even from a headless agent session (the user's interactive +# OIDC login can't authenticate a headless kubectl). The user's personal OIDC +# identity is retained as the `oidc@homelab` named context +# (`kubectl --context oidc@homelab`). TF (the SA's existence) is the source of +# truth for WHO gets this — there is no roster flag. Idempotent (cmp-guarded; SA +# tokens are stable) + best-effort (cluster/secret unreachable -> WARN, never aborts). +install_browser_kubeconfig() { + local user="$1" home kc sa secret token server ca tmp + home="$(getent passwd "$user" | cut -d: -f6)" + [[ -z "$home" ]] && return 0 + sa="${user}-browser" + secret="${sa}-token" + [[ -r "$ADMIN_KUBECONFIG" ]] || return 0 + # Gate: only users with a chrome-service browser SA (TF-driven). Best-effort read. 
+ KUBECONFIG="$ADMIN_KUBECONFIG" kubectl --request-timeout=10s -n chrome-service get serviceaccount "$sa" >/dev/null 2>&1 || return 0 + token="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl --request-timeout=10s -n chrome-service get secret "$secret" -o jsonpath='{.data.token}' 2>/dev/null | base64 -d 2>/dev/null || true)" + [[ -n "$token" ]] || { log "WARN: browser SA token not ready for $user (secret chrome-service/$secret) — skipped"; return 0; } + server="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl config view --raw --minify -o jsonpath='{.clusters[0].cluster.server}')" + ca="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl config view --raw --minify -o jsonpath='{.clusters[0].cluster.certificate-authority-data}')" + [[ -n "$server" && -n "$ca" ]] || { log "WARN: could not read cluster server/CA -> skip browser kubeconfig for $user"; return 0; } + kc="$home/.kube/config" + tmp="$(mktemp)" + cat > "$tmp" </dev/null; then rm -f "$tmp"; return 0; fi # already current -> no churn + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] dual-context (SA default + OIDC) browser kubeconfig -> $user:$kc"; rm -f "$tmp"; return 0; fi + install -d -o "$user" -g "$user" -m 0700 "$home/.kube" + install -o "$user" -g "$user" -m 0600 "$tmp" "$kc" || { log "WARN: failed to write browser kubeconfig for $user"; rm -f "$tmp"; return 0; } + rm -f "$tmp" + log "wrote dual-context browser kubeconfig (SA default + OIDC) -> $user:~/.kube/config" + return 0 +} + # Idempotently set KEY=VALUE in a t3-serve env file, PRESERVING other lines — so writing # T3_PORT never clobbers an injected CLAUDE_CODE_OAUTH_TOKEN, and vice-versa. Mode 0600. 
env_set() { @@ -594,6 +667,7 @@ while IFS=$'\t' read -r os_user tier shell groups_csv code_layout repos_csv; do refresh_user_clone "$os_user" code fi install_user_kubeconfig "$os_user" + install_browser_kubeconfig "$os_user" # hands-off chrome-service CLI cred (no-op unless the user has a browser SA) deploy_user_launcher "$os_user" # keep ~/start-claude.sh current (skel only seeds new accounts) fi refresh_codex_mirror "$os_user" # all tiers — mirror of the managed claudeMd diff --git a/scripts/t3-serve@.service b/scripts/t3-serve@.service index 4109b36b..7f3d765d 100644 --- a/scripts/t3-serve@.service +++ b/scripts/t3-serve@.service @@ -11,6 +11,12 @@ Environment=HOME=/home/%i Environment=PATH=/usr/local/bin:/usr/bin:/bin:/home/%i/.local/bin Environment=NODE_ENV=production EnvironmentFile=/etc/t3-serve/%i.env +# Optional per-user long-lived CLAUDE_CODE_OAUTH_TOKEN, materialized by +# claude-auth-sync from the user's own Vault path. Non-rotating, so t3's +# concurrent agent sessions can't race on OAuth refresh-token rotation and wipe +# the shared ~/.claude/.credentials.json. Leading '-' = optional (absent for +# users on the normal per-user Enterprise-SSO credential flow). 
+EnvironmentFile=-/home/%i/.config/claude-auth-sync/claude-oauth.env WorkingDirectory=/home/%i ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3 Restart=on-failure diff --git a/scripts/test-claude-auth-sync.sh b/scripts/test-claude-auth-sync.sh index 10f07746..62c54e8b 100755 --- a/scripts/test-claude-auth-sync.sh +++ b/scripts/test-claude-auth-sync.sh @@ -28,5 +28,61 @@ ok "accept own scoped Vault token" cas_vault_identity_ok token-devvm-claude-auth no "reject another user's token" cas_vault_identity_ok token-devvm-claude-auth-anca default,workstation-claude-anca no "reject wrong policy" cas_vault_identity_ok token-devvm-claude-auth-emo default,workstation-claude-anca +# --- Regression: cas_backup must MERGE into the shared Vault path, preserving +# sibling keys that other tools co-locate there (e.g. `homelab vault`'s +# vaultwarden_* creds) — NOT overwrite the whole KV document. A blind `kv put` +# wiped them every 6h (claude-auth-sync clobber, 2026-06-26). +fakebin="$tmp/bin"; mkdir -p "$fakebin" +store="$tmp/vault-store.json" +cat > "$fakebin/vault" <<'FAKE' +#!/usr/bin/env bash +# Minimal KV-v2 fake backed by $VAULT_FAKE_STORE (a flat JSON object). +[[ "$1" == kv ]] || { echo '{}'; exit 0; } # token lookup etc. 
-> ignore +op="$2"; shift 2 +store="$VAULT_FAKE_STORE" +case "$op" in + get) + for a in "$@"; do [[ "$a" == -field=* ]] && field="${a#-field=}"; done + if [[ "$*" == *-format=json* ]]; then + [[ -f "$store" ]] || { echo "No value found"; exit 2; } + jq -n --argjson d "$(cat "$store")" '{data:{data:$d}}'; exit 0 + fi + [[ -f "$store" ]] || exit 2 # bare get == existence check + if [[ -n "${field:-}" ]]; then + v="$(jq -r --arg k "$field" '.[$k] // empty' "$store")"; [[ -n "$v" ]] || exit 1 + printf '%s' "$v"; exit 0 + fi + exit 0 ;; + put) echo '{}' > "$store" ;; # full replace + patch) [[ -f "$store" ]] || { echo "No value found"; exit 2; } ;; # merge (rw) + *) exit 1 ;; +esac +for a in "$@"; do + case "$a" in + -*|secret/*) continue ;; # flags + the path arg + *=*) k="${a%%=*}"; v="${a#*=}" + t="$(mktemp)"; jq --arg k "$k" --arg v "$v" '.[$k]=$v' "$store" > "$t" && mv "$t" "$store" ;; + esac +done +exit 0 +FAKE +chmod +x "$fakebin/vault" + +CAS_VAULT_PATH="secret/workstation/claude-users/test" +CAS_CREDENTIALS="$tmp/credentials.json" +CAS_STATE_DIR="$tmp/state" +_oldpath="$PATH"; PATH="$fakebin:$PATH"; export VAULT_FAKE_STORE="$store" + +printf '{"vaultwarden_master_password":"keep-me"}\n' > "$store" # pretend `homelab vault setup` ran +ok "backup succeeds (existing doc)" cas_backup +eq "merge preserves sibling key" keep-me "$(jq -r '.vaultwarden_master_password' "$store")" +eq "merge writes claude oauth" access "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")" + +rm -f "$store" # fresh user: no doc yet +ok "backup succeeds (creates doc)" cas_backup +eq "create writes claude oauth" access "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")" + +PATH="$_oldpath"; unset VAULT_FAKE_STORE + printf '\n%d passed, %d failed\n' "$pass" "$fail" (( fail == 0 )) diff --git a/scripts/workstation/claude-auth-sync.sh b/scripts/workstation/claude-auth-sync.sh index dc3d780d..b9676df9 100755 --- a/scripts/workstation/claude-auth-sync.sh +++ 
b/scripts/workstation/claude-auth-sync.sh @@ -13,6 +13,10 @@ CAS_VAULT_TOKEN_FILE="${CLAUDE_AUTH_VAULT_TOKEN_FILE:-$CAS_CONFIG_DIR/vault-toke CAS_VAULT_PATH="${CLAUDE_AUTH_VAULT_PATH:-secret/workstation/claude-users/$CAS_USER}" CAS_STATE_DIR="${CLAUDE_AUTH_STATE_DIR:-$CAS_HOME/.local/state/claude-auth-sync}" CAS_LOG="$CAS_STATE_DIR/sync.log" +# Where a long-lived per-user setup-token is materialized as an env file +# (KEY=VALUE) for start-claude.sh + t3-serve@.service to load. Lives under the +# already-ReadWritePaths config dir so the sandboxed service may write it. +CAS_TOKEN_ENV_FILE="${CLAUDE_AUTH_TOKEN_ENV_FILE:-$CAS_CONFIG_DIR/claude-oauth.env}" cas_log() { mkdir -p "$CAS_STATE_DIR" @@ -82,7 +86,17 @@ cas_backup() { return 1 } expires="$(jq -r '.expiresAt' <<<"$oauth")" - vault kv put "$CAS_VAULT_PATH" \ + # MERGE into the shared path so sibling keys other tools co-locate there + # (e.g. `homelab vault`'s vaultwarden_* creds) survive. `kv patch -method=rw` + # is read+update (needs no `patch` capability) but requires the secret to + # already exist, so create it with `kv put` on the very first backup only. + local -a write_cmd + if vault kv get "$CAS_VAULT_PATH" >/dev/null 2>&1; then + write_cmd=(vault kv patch -method=rw "$CAS_VAULT_PATH") + else + write_cmd=(vault kv put "$CAS_VAULT_PATH") + fi + "${write_cmd[@]}" \ claude_ai_oauth_json="$oauth" \ credential_expires_at_ms="$expires" \ backed_up_at="$(date -Is)" >/dev/null || { @@ -123,6 +137,41 @@ cas_restore() { cas_log "RECOVERED restored Claude OAuth state from Vault" } +# A user-scoped, long-lived setup-token (`sk-ant-oat01-…`, ~1y, NON-rotating) may +# be stored in this user's OWN Vault path (field `setup_token`). 
When present it +# is the authoritative credential: it bypasses the shared +# ~/.claude/.credentials.json OAuth refresh-token rotation entirely — the fix for +# users running many concurrent Claude sessions (interactive + t3-serve + always-on +# agents) that otherwise race on refresh and wipe each other's refresh token. +# We materialize it to a user-owned env file that start-claude.sh and +# t3-serve@.service load as CLAUDE_CODE_OAUTH_TOKEN. This is the user's OWN +# Enterprise identity, NOT the forbidden legacy SHARED token — it never crosses +# OS users. Returns 0 when a token is active, so the caller skips the +# rotating-credential validate/backup/restore (probing the now-vestigial +# credential would otherwise emit false WorkstationClaudeAuthInvalid alerts). +cas_sync_setup_token() { + local token desired tmp + token="$(vault kv get -field=setup_token "$CAS_VAULT_PATH" 2>/dev/null)" || token="" + if [[ "$token" != sk-ant-oat01-* ]]; then + if [[ -e "$CAS_TOKEN_ENV_FILE" ]]; then + rm -f "$CAS_TOKEN_ENV_FILE" + cas_log "removed stale CLAUDE_CODE_OAUTH_TOKEN env (no setup-token in Vault)" + fi + return 1 + fi + desired="CLAUDE_CODE_OAUTH_TOKEN=$token" + if [[ -r "$CAS_TOKEN_ENV_FILE" && "$(<"$CAS_TOKEN_ENV_FILE")" == "$desired" ]]; then + cas_log "OK long-lived setup-token active (CLAUDE_CODE_OAUTH_TOKEN current); credential checks skipped" + return 0 + fi + tmp="$(mktemp "${CAS_TOKEN_ENV_FILE}.XXXXXX")" || { cas_log "FAIL could not stage token env file"; return 1; } + printf '%s\n' "$desired" > "$tmp" + chmod 0600 "$tmp" + mv "$tmp" "$CAS_TOKEN_ENV_FILE" + cas_log "OK long-lived setup-token active; CLAUDE_CODE_OAUTH_TOKEN materialized; credential checks skipped" + return 0 +} + cas_main() { umask 077 for bin in jq vault claude timeout flock; do @@ -133,6 +182,11 @@ cas_main() { flock -n 9 || { cas_log "SKIP another sync is already running"; return 0; } cas_prepare_vault || return 1 + # A long-lived per-user setup-token, if provisioned, is authoritative and + # 
non-rotating — materialize it and skip the rotating-credential dance. + if cas_sync_setup_token; then + return 0 + fi if cas_live_auth_ok; then cas_backup return diff --git a/scripts/workstation/claude-hooks/homelab-memory-recall.py b/scripts/workstation/claude-hooks/homelab-memory-recall.py index 7315f116..c9e1d1c3 100755 --- a/scripts/workstation/claude-hooks/homelab-memory-recall.py +++ b/scripts/workstation/claude-hooks/homelab-memory-recall.py @@ -45,9 +45,15 @@ def main() -> None: try: res = subprocess.run( [homelab, "memory", "recall", prompt, "--limit", "5"], - capture_output=True, text=True, timeout=4, env=os.environ, + capture_output=True, text=True, errors="replace", timeout=4, + env=os.environ, ) - except (subprocess.TimeoutExpired, OSError): + except Exception: + # Best-effort: ANY failure — timeout, OSError, or a UnicodeDecodeError on + # truncated multibyte (Cyrillic) output — must silently skip recall this + # turn, exactly like the MCP being unavailable. errors="replace" above + # also keeps a mid-rune-truncated payload from raising here at all. Never + # let this hook surface a "UserPromptSubmit hook error". return out = (res.stdout or "").strip() diff --git a/scripts/workstation/claude-skills/README.md b/scripts/workstation/claude-skills/README.md index 816cbcb7..1fa06d94 100644 --- a/scripts/workstation/claude-skills/README.md +++ b/scripts/workstation/claude-skills/README.md @@ -19,13 +19,29 @@ unpinned-CLI dependencies out of the hourly **root** reconcile. - `mattpocock/skills` (https://github.com/mattpocock/skills) — all except `find-skills` - `vercel-labs/skills` (https://github.com/vercel-labs/skills) — `find-skills` +- **homelab-local, emo-PERSONALIZED** — `cluster-health` here is an + **emo-specific variant**, not a copy of the canonical skill. 
It started as a + copy of this repo's `.claude/skills/cluster-health/` but was rewritten on + 2026-06-26 to focus on ha-sofia + emo's Sofia devices (emo is the only entry + in `SKILL_USERS`, a read-only power-user). The canonical admin skill + (`.claude/skills/cluster-health/`) is the full 47-check version and is left + untouched. **Do NOT `cp -a` the canonical copy over this one** — that would + clobber the personalization. Maintain the two independently. ## Refreshing -Re-snapshot from a current install and commit the diff: +Re-snapshot the upstream skills from a current install and commit the diff: ```sh cp -a ~/.agents/skills/. scripts/workstation/claude-skills/ ``` -Snapshot taken 2026-06-23. +`cluster-health` is hand-maintained (emo variant) — it is **not** covered by the +`cp -a` above and must **not** be overwritten from `.claude/skills/`. Edit it in +place here when emo's needs change, then refresh his live copy (the provisioner's +`install_skills()` is if-absent, so it won't update an existing `~/.agents/skills` +copy — `cp` the new `SKILL.md` to `/home/emo/.agents/skills/cluster-health/` and +`chown emo:emo`, or remove emo's copy and re-run the reconcile). + +Snapshot taken 2026-06-23 (upstream); `cluster-health` vendored 2026-06-26, +personalized for emo 2026-06-26. diff --git a/scripts/workstation/claude-skills/cluster-health/SKILL.md b/scripts/workstation/claude-skills/cluster-health/SKILL.md new file mode 100644 index 00000000..20d13211 --- /dev/null +++ b/scripts/workstation/claude-skills/cluster-health/SKILL.md @@ -0,0 +1,146 @@ +--- +name: cluster-health +description: | + Personalized for emo. Check whether the homelab Kubernetes cluster is + affecting ha-sofia or the Sofia smart-home devices it runs (Tuya devices, + the MPPT ATS, lights, climate, security, irrigation). 
Use when: + (1) "is ha-sofia ok", "are my devices / the ATS / the lights down", + (2) "is the cluster affecting Sofia / my devices", + (3) "check the cluster", "cluster health", "is everything running", + (4) a device on the Барзини → Статус dashboard looks offline. + Runs the cluster-wide healthcheck read-only and triages it by what + ha-sofia actually depends on; the rest of the cluster is the admin's area. +author: Claude Code +version: 3.0.0-emo +date: 2026-06-26 +--- + +# Cluster Health — personalized for emo (ha-sofia focus) + +## What you actually care about + +You care about **ha-sofia** and the **Sofia smart-home devices** it runs — +the Tuya devices, the **MPPT ATS**, and the lights / climate / security / +irrigation on your **Барзини → Статус** dashboard. The wider Kubernetes +cluster matters to you **only when it's breaking something ha-sofia or your +devices depend on.** Anything else is the admin's (wizard's) area — note it in +one line and move on; don't chase it. + +You have **read-only** cluster access. You can SEE everything but change +nothing — so when something on your chain is broken, the job is to confirm it +and hand it off, not to repair it. + +## How ha-sofia depends on the cluster + +ha-sofia itself runs at the house (HAOS at https://ha-sofia.viktorbarzin.me) — +**not** in the cluster. The cluster reaches it through exactly two things: + +1. **tuya-bridge** (namespace `tuya-bridge`) — the REST API ha-sofia calls for + every Tuya device **and the MPPT ATS**. If it's unhealthy, your Tuya devices + + ATS stop responding. **This is the #1 thing to check.** +2. **The path that carries ha-sofia ⇄ tuya-bridge and keeps ha-sofia + reachable**: cloudflared (tunnel) → Traefik (LB) → the ingress + TLS cert + for `tuya-bridge.viktorbarzin.me` and `ha-sofia.viktorbarzin.me`, plus + Technitium DNS. If any of these break, ha-sofia can't reach tuya-bridge and + you can't reach ha-sofia remotely. 
+ +Everything else in the cluster is unrelated to you unless it's hosting one of +those pods. + +## Step 1 — run the healthcheck (read-only, with your HA token) + +Your account can't read Vault, so load your own ha-sofia token first (it was +minted for you and lives at `~/.config/cluster-health/haos_token`). Then run +the script from YOUR clone, read-only: + +```bash +cd /home/emo/code +export HOME_ASSISTANT_SOFIA_TOKEN="$(cat ~/.config/cluster-health/haos_token)" +bash scripts/cluster_healthcheck.sh --no-fix --quiet +# machine-readable instead: +# bash scripts/cluster_healthcheck.sh --no-fix --quiet --json | tee /tmp/cluster-health.json +``` + +- **Never pass `--fix`** — it deletes pods (a write); you're read-only and it + will fail. +- Exit codes: `0` healthy, `1` warnings, `2` failures. + +With the token exported, the **ha-sofia checks run for you**: +26 Entity Availability · 27 Integration Health · 28 Automation Status · +29 System Resources · **45 Status Dashboard** — your Барзини → Статус view, +classifying every device tile as OK / ⚠️ / Offline across Сигурност, Мрежа & +IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна. Check 30 also +covers the **tuya** exporter. + +## Step 2 — triage the output by relevance to YOU + +Read the PASS/WARN/FAIL summary, then split the WARN/FAIL items in two: + +- **On your chain → this is what matters.** Anything touching: `tuya-bridge`, + `cloudflared`, `traefik`, DNS (check 21), the TLS cert / ingress for your two + hosts (checks 12, 22, 31, 32), or a **node** hosting those pods — plus all the + **ha-sofia** checks (26–29, 45) and the **tuya** exporter (30). +- **Not on your chain → one line, then drop it.** Summarise as "N unrelated + cluster issues (admin's area)" and don't investigate. 
+ +## Step 3 — read-only checks for your chain + +All of these work with your read-only access: + +```bash +# tuya-bridge — your devices + the ATS +kubectl get pods -n tuya-bridge +kubectl rollout status deploy/tuya-bridge -n tuya-bridge +kubectl logs -n tuya-bridge deploy/tuya-bridge --tail=50 + +# the reachability path ha-sofia uses +kubectl get pods -n cloudflared +kubectl get pods -n traefik +kubectl get ingress -A | grep -Ei 'tuya-bridge|ha-sofia' + +# whole external path in one shot (DNS + tunnel + Traefik + cert): +curl -sI --max-time 10 https://tuya-bridge.viktorbarzin.me | head -1 +# reachable -> HTTP/2 200 / 401 / 403 (any HTTP response = path is up) +# broken -> curl: timeout / could not resolve host +``` + +The fastest **device-level** signal is your own dashboard: open +**https://ha-sofia.viktorbarzin.me → Барзини → Статус**. If devices show +Offline / Разкачен / ⚠️ **but tuya-bridge is healthy**, the problem is at the +house (device power / Wi-Fi / the Sofia TP-Link network) — **not** the cluster. + +## Step 4 — if something on your chain is broken + +You can't fix the cluster (read-only), so **capture + hand off**: + +```bash +kubectl describe pod -n tuya-bridge +kubectl logs -n tuya-bridge --previous --tail=200 +``` + +Then file it for the admin with the **`/file-issue`** skill — e.g. *"ha-sofia +Tuya devices + ATS unresponsive; tuya-bridge pod CrashLooping"* with the output +above. cloudflared / Traefik / DNS outages are cluster-wide — the admin's +alerting is already firing, but file it so it's tracked from your side too. + +## What will skip for you (expected — not failures) + +A few checks need access your account doesn't have. They warn/skip — that's +normal, and **none of them are on your ha-sofia chain**: + +- **Uptime Kuma (14)** — needs an admin password from Vault. +- **PVE host checks** — 36 (LVM snapshots), 43 (host thermals), 44 (host load), + and the Proxmox CSI ghost-disk check — all need root SSH to the Proxmox host. 
+- **`--fix`** — pod deletion (a write); not available to you. + +(The ha-sofia checks are **not** in this list — your token makes them work.) + +## Your ha-sofia token + +- Stored at `~/.config/cluster-health/haos_token` (yours, mode 600). +- It's a **dedicated** long-lived token, named `emo-cluster-health` under + ha-sofia → your profile → **Long-Lived Access Tokens**. Revoking it there + affects only you. +- It currently carries admin-level HA scope (Home Assistant only lets a token + be minted for the account that created it, and it was minted via the admin + account). If it ever stops working, tell wizard and a fresh one can be minted. diff --git a/scripts/workstation/managed-settings.json b/scripts/workstation/managed-settings.json index de214a1b..6e8a13a5 100644 --- a/scripts/workstation/managed-settings.json +++ b/scripts/workstation/managed-settings.json @@ -1,4 +1,4 @@ { - "claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner). Do not attempt to escalate privileges or reach another user's resources.\n- Secrets are per-user. Never read another user's home directory, credentials, tokens, or ~/.claude secrets. Your own secrets live in your home at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. 
Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. Org red-lines on top:\n - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n - Push rejected by branch protection (user not whitelisted) → fall back to a / branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n - Keep every clone on a clean master when done; tell the user in plain words what happened.\n - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.\n", + "claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier for cluster/infra. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner) — don't try to exceed it. At the OS level, authorization follows the OS: where your user holds `sudo`, using it is permitted and audited.\n- OS access follows OS permissions, not a separate rule here. You may read whatever your OS user can read — directly or via `sudo` where you hold it — including other users' home directories, credentials, tokens, and `~/.claude`. 
Do not impose restrictions stricter than the OS; no extra prompt is required for OS-authorized access. Protect your own secrets at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. Org red-lines on top:\n - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n - Push rejected by branch protection (user not whitelisted) → fall back to a / branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n - Keep every clone on a clean master when done; tell the user in plain words what happened.\n - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. 
In locked infra clones secret files read as ciphertext — that is expected, not an error.\n", "model": "claude-opus-4-8" } diff --git a/scripts/workstation/setup-devvm.sh b/scripts/workstation/setup-devvm.sh index 2969b803..02bd9257 100755 --- a/scripts/workstation/setup-devvm.sh +++ b/scripts/workstation/setup-devvm.sh @@ -72,11 +72,14 @@ if [[ -n "$want_t3" && "$(t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/ fi # 2c) Bitwarden CLI — backs `homelab vault` (per-user no-HITL Vaultwarden access). -# npm-global so every user's PATH resolves it. Pinned major; best-effort (a -# failure only disables `homelab vault`, nothing else on the box). -if ! command -v bw >/dev/null; then - log "npm: installing @bitwarden/cli (homelab vault backend)" - npm install -g "@bitwarden/cli@^2024" >/dev/null 2>&1 || log "WARN: @bitwarden/cli install failed; homelab vault unavailable" +# Install SYSTEM-WIDE (npm prefix /usr → /usr/bin/bw) so EVERY user's PATH +# resolves it. The guard tests the SYSTEM path, NOT `command -v bw`: the +# latter is satisfied by an admin's own ~/.local/bin/bw and would skip the +# system install, leaving non-admins (emo, anca, …) with no backend. Pinned +# major; best-effort (a failure only disables `homelab vault`). +if [ ! -x /usr/bin/bw ] && [ ! -x /usr/local/bin/bw ]; then + log "npm: installing @bitwarden/cli system-wide (homelab vault backend)" + npm install -g --prefix /usr "@bitwarden/cli@^2024" >/dev/null 2>&1 || log "WARN: @bitwarden/cli install failed; homelab vault unavailable" fi # 3) kubelogin (kubectl oidc-login) system-wide — NOT the apt 'kubelogin' (= Azure tool). 
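The guard change in the `setup-devvm.sh` hunk above (test the system path rather than `command -v bw`) can be illustrated with a minimal sketch. It assumes a throwaway sandbox directory in place of the real `/usr/bin` and `~/.local/bin`, and a fake `bw` script; both function names are hypothetical and exist only for this demo.

```shell
#!/usr/bin/env bash
# Sketch only: the sandbox layout and the fake bw are assumptions for this demo.
sandbox="$(mktemp -d)"
trap 'rm -rf "$sandbox"' EXIT

# Simulate an admin-private bw and an empty system bin directory.
mkdir -p "$sandbox/home/.local/bin" "$sandbox/usr/bin"
printf '#!/bin/sh\necho fake-bw\n' > "$sandbox/home/.local/bin/bw"
chmod +x "$sandbox/home/.local/bin/bw"

# Old-style guard: a PATH lookup, as seen by a user whose PATH has a private bin dir.
path_guard_says_installed() {
  ( PATH="$sandbox/home/.local/bin:$PATH"; command -v bw >/dev/null 2>&1 )
}

# New-style guard: test the shared location that every user's PATH resolves.
system_guard_says_installed() {
  [ -x "$sandbox/usr/bin/bw" ]
}

if path_guard_says_installed; then
  echo "PATH guard: bw found, system install would be skipped"
fi
if ! system_guard_says_installed; then
  echo "system guard: bw missing, system install would run"
fi
```

In this sandbox the PATH-based guard is satisfied by the private copy while the system location stays empty, which is the situation the hunk's comment describes for non-admin users.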
diff --git a/scripts/workstation/skel/start-claude.sh b/scripts/workstation/skel/start-claude.sh index b3e25744..45ed9c4a 100755 --- a/scripts/workstation/skel/start-claude.sh +++ b/scripts/workstation/skel/start-claude.sh @@ -93,6 +93,15 @@ ensure_onboarding() { } ensure_onboarding +# Load a per-user long-lived CLAUDE_CODE_OAUTH_TOKEN if claude-auth-sync has +# materialized one from this user's own Vault path. A non-rotating setup-token +# sidesteps the shared ~/.claude/.credentials.json OAuth refresh-token race that +# logs out users running many concurrent agents (interactive + t3 + always-on). +# Absent file -> no-op (normal per-user Enterprise-SSO flow). The user's OWN +# token; never shared between OS users. +_oauth_env="$HOME/.config/claude-auth-sync/claude-oauth.env" +if [ -r "$_oauth_env" ]; then set -a; . "$_oauth_env"; set +a; fi + # Deliberately not `exec` so we can branch on the exit code: clean quit ends the # pane (ttyd closes the terminal); a crash drops to a shell so the tmux session # isn't destroyed-and-recreated in a ttyd auto-reconnect loop. 
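The `set -a; . "$_oauth_env"; set +a` line in the `start-claude.sh` hunk above relies on the shell's allexport option so that the sourced assignment reaches child processes. A minimal sketch of that behaviour follows, assuming a temporary env file and a made-up token value; the helper function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the file location and the token value are made up for this demo.
demo_dir="$(mktemp -d)"
trap 'rm -rf "$demo_dir"' EXIT
demo_env="$demo_dir/claude-oauth.env"
printf '%s\n' 'CLAUDE_CODE_OAUTH_TOKEN=demo-token-not-real' > "$demo_env"

# Plain sourcing sets a shell variable only; a child process does not inherit it.
plain_child_sees() {
  ( unset CLAUDE_CODE_OAUTH_TOKEN
    . "$demo_env"
    sh -c 'printf %s "${CLAUDE_CODE_OAUTH_TOKEN:-unset}"' )
}

# With allexport on, assignments made while sourcing are exported to children.
allexport_child_sees() {
  ( unset CLAUDE_CODE_OAUTH_TOKEN
    set -a; . "$demo_env"; set +a
    sh -c 'printf %s "${CLAUDE_CODE_OAUTH_TOKEN:-unset}"' )
}

echo "plain source: child sees $(plain_child_sees)"
echo "set -a source: child sees $(allexport_child_sees)"
```

Because the env file holds a bare `NAME=value` line with no `export`, the `set -a` wrapper is what makes the variable visible to the process launched afterwards.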
diff --git a/stacks/actualbudget/main.tf b/stacks/actualbudget/main.tf index 33012033..13da68a8 100644 --- a/stacks/actualbudget/main.tf +++ b/stacks/actualbudget/main.tf @@ -5,6 +5,9 @@ variable "tls_secret_name" { variable "nfs_server" { type = string } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/affine/main.tf b/stacks/affine/main.tf index bc63381c..10a94ad7 100644 --- a/stacks/affine/main.tf +++ b/stacks/affine/main.tf @@ -5,6 +5,9 @@ variable "tls_secret_name" { variable "nfs_server" { type = string } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -42,6 +45,9 @@ data "kubernetes_secret" "eso_secrets" { # DB credentials from Vault database engine (rotated automatically) # Provides DATABASE_URL that auto-updates when password rotates resource "kubernetes_manifest" "db_external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/authentik/Dockerfile b/stacks/authentik/Dockerfile new file mode 100644 index 00000000..e60c5319 --- /dev/null +++ b/stacks/authentik/Dockerfile @@ -0,0 +1,46 @@ +# SLOW-1a overlay over the official authentik server image. +# +# The login flow's identification stage renders each enabled source's UI login +# button. Upstream authentik/stages/identification/stage.py does: +# current_stage.sources.filter(enabled=True).order_by("name").select_subclasses() +# The bare no-arg select_subclasses() (django-model-utils InheritanceManager) +# LEFT-JOINs EVERY Source subtype table; on the cold-login hot path that is ~1.5s +# (verified live on 2026.2.4: 1527ms vs 14ms). 
Passing only the subtypes that +# actually render a UI login button — every concrete Source type that overrides +# ui_login_button: oauth/saml/plex/telegram/kerberos, NOT the sync-only ldap/scim — +# is ~100x faster and BYTE-IDENTICAL output (verified: concrete types + rendered +# buttons match). django-model-utils accepts the lowercase subclass *accessor +# names* as strings, so no new import is needed (no circular-import risk) — the +# patch is a single, reviewable line edit. +# +# RE-VERIFY ON EVERY AUTHENTIK BUMP: bump the FROM tag below AND the image tag in +# modules/authentik/values.yaml together. The grep guards fail the build LOUDLY if +# the upstream target line moved. If a future authentik version adds a NEW +# login-capable source type, add its lowercase accessor to the list below. +# Upstream: the bare select_subclasses() is still present in main (no fix/PR as of +# 2026-06-28) — drop this overlay once upstream narrows the query. +FROM ghcr.io/goauthentik/server:2026.2.4 + +USER root +RUN set -eux; \ + F=/authentik/stages/identification/stage.py; \ + grep -q 'order_by("name").select_subclasses()' "$F"; \ + sed -i 's/order_by("name")\.select_subclasses()/order_by("name").select_subclasses("oauthsource", "samlsource", "plexsource", "telegramsource", "kerberossource")/' "$F"; \ + grep -q 'select_subclasses("oauthsource", "samlsource", "plexsource", "telegramsource", "kerberossource")' "$F"; \ + PY="$(command -v python || command -v python3)"; "$PY" -c "import ast,sys; ast.parse(open('$F').read())"; \ + rm -f /authentik/stages/identification/__pycache__/stage.*.pyc + +# PATCH #2 — old-browser BLANK LOGIN. authentik's modern flow SPA is ES2022 and +# hard-fails (blank login) on Safari<=16.3 (e.g. iPadOS<=16.3). authentik already +# ships a no-JS Simplified Flow Executor (SFE, ES5) but only serves it to +# IE/old-Edge/PKeyAuth. 
patch-compat-sfe.py (a) extends compat_needs_sfe() to +# serve the SFE to old Safari AND any iOS browser (Chrome/CriOS, Firefox/FxiOS — +# all share the system WebKit) on iOS<=16.3, and (b) injects static social-login +# links into the SFE shell (the SFE can't render Identification-stage sources; +# needed for password-less Google-only accounts). Clients get the REAL authentik +# login (password + MFA + reputation, NO auth downgrade) instead of a blank page. +# The script is guarded (asserts both upstream anchors + ast-parses) so the build +# fails loudly if upstream moves — re-verify on every authentik bump. +COPY patch-compat-sfe.py /tmp/patch-compat-sfe.py +RUN python3 /tmp/patch-compat-sfe.py && rm -f /tmp/patch-compat-sfe.py +USER authentik diff --git a/stacks/authentik/admin-services-restriction.tf b/stacks/authentik/admin-services-restriction.tf index 806dd417..293c78b5 100644 --- a/stacks/authentik/admin-services-restriction.tf +++ b/stacks/authentik/admin-services-restriction.tf @@ -49,14 +49,15 @@ resource "authentik_policy_expression" "admin_services_restriction" { host = request.context.get("host", "") - # chrome-service noVNC (chrome.viktorbarzin.me) exposes Viktor's LIVE - # logged-in browser sessions, so lock it to Viktor's own accounts ONLY. - # "Home Server Admins" is NOT sufficient — emo (emil.barzin@gmail.com) is a - # member. akadmin kept as break-glass. The homelab-browser CDP path is - # already RBAC-gated (emo = oidc-power-user-readonly, no pods/portforward), - # so this closes the only remaining, human, noVNC path. Match username OR - # email so neither attribute alone can lock Viktor out. - CHROME_ALLOWED = {"akadmin", "akadmin@viktorbarzin.me", "vbarzin@gmail.com"} + # chrome-service noVNC (chrome.viktorbarzin.me) exposes LIVE logged-in browser + # sessions from the SHARED persistent profile. Originally Viktor-only. 
+ # 2026-06-28 (Viktor's explicit decision): emo SHARES Viktor's browser, so emo + # (emil.barzin / emil.barzin@gmail.com) is allowed in for noVNC form-filling + + # captcha solving. Trade-off accepted: emo can therefore reach Viktor's warmed + # sessions (the CLI half is the emo-browser ServiceAccount in + # stacks/chrome-service/rbac.tf). akadmin kept as break-glass. Match username OR + # email so neither attribute alone can lock anyone out. + CHROME_ALLOWED = {"akadmin", "akadmin@viktorbarzin.me", "vbarzin@gmail.com", "emil.barzin", "emil.barzin@gmail.com"} if host == "chrome.viktorbarzin.me": return request.user.username in CHROME_ALLOWED or request.user.email in CHROME_ALLOWED diff --git a/stacks/authentik/email-secret.tf b/stacks/authentik/email-secret.tf index b3a7f201..87be65d4 100644 --- a/stacks/authentik/email-secret.tf +++ b/stacks/authentik/email-secret.tf @@ -6,6 +6,9 @@ # are non-secret and live in values.yaml. The reloader annotation rolls the # authentik pods if the password ever changes. resource "kubernetes_manifest" "authentik_email_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/authentik/modules/authentik/main.tf b/stacks/authentik/modules/authentik/main.tf index 3ae6d7c6..5c688452 100644 --- a/stacks/authentik/modules/authentik/main.tf +++ b/stacks/authentik/modules/authentik/main.tf @@ -29,7 +29,12 @@ resource "kubernetes_namespace" "authentik" { labels = { tier = var.tier "resource-governance/custom-quota" = "true" - "keel.sh/enrolled" = "true" + # Keel intentionally NOT enrolled: server+worker run our custom overlay image + # (ghcr.io/viktorbarzin/authentik-server — see values.yaml global.image + + # stacks/authentik/Dockerfile). 
The tag is pinned explicitly and bumped + # manually (rebuild the overlay FROM the new authentik version + repoint), so + # a Keel auto-bump would only risk re-introducing the upstream tag / the + # 2026-06-10 downgrade-boot-storm class. Re-enroll only if the overlay is dropped. } } lifecycle { @@ -82,6 +87,11 @@ module "ingress" { service_name = "goauthentik-server" tls_secret_name = var.tls_secret_name anti_ai_scraping = false + # Swap the shared 10/50 default limiter for a dedicated 100/1000 carve-out: + # the login SPA + flow-executor API burst on a cold load otherwise 429s into + # a blank screen (see traefik middleware "authentik-rate-limit"). + skip_default_rate_limit = true + extra_middlewares = ["traefik-authentik-rate-limit@kubernetescrd"] extra_annotations = { "gethomepage.dev/enabled" = "true" "gethomepage.dev/name" = "Authentik" @@ -140,14 +150,21 @@ module "ingress-static" { # Same-host path carve-out of the public authentik UI ingress above, only # adding the cache-headers middleware for the static asset prefix. # auth = "none": versioned static assets of the (already public) Authentik login UI. - auth = "none" - namespace = kubernetes_namespace.authentik.metadata[0].name - name = "authentik-static" - host = "authentik" - service_name = "goauthentik-server" - ingress_path = ["/static"] - tls_secret_name = var.tls_secret_name - anti_ai_scraping = false - homepage_enabled = false - extra_middlewares = ["authentik-static-cache-headers@kubernetescrd"] + auth = "none" + namespace = kubernetes_namespace.authentik.metadata[0].name + name = "authentik-static" + host = "authentik" + service_name = "goauthentik-server" + ingress_path = ["/static"] + tls_secret_name = var.tls_secret_name + anti_ai_scraping = false + homepage_enabled = false + # /static serves ALL the SPA JS/CSS chunks; the default 10/50 limiter 429s the + # cold-load fan-out → blank screen. 
Dedicated 100/1000 carve-out (note the two + # namespaces: cache-headers is in ns authentik, rate-limit is in ns traefik). + skip_default_rate_limit = true + extra_middlewares = [ + "authentik-static-cache-headers@kubernetescrd", + "traefik-authentik-rate-limit@kubernetescrd", + ] } diff --git a/stacks/authentik/modules/authentik/values.yaml b/stacks/authentik/modules/authentik/values.yaml index bfe755cd..f4b1b3f2 100644 --- a/stacks/authentik/modules/authentik/values.yaml +++ b/stacks/authentik/modules/authentik/values.yaml @@ -39,6 +39,16 @@ server: value: "3" - name: AUTHENTIK_WEB__THREADS value: "4" + # Gunicorn worker recycle hardening (defaults max_requests=1000/jitter=50). + # A worker recycle that coincides with a transient PG/pgbouncer blip stalls + # in-flight requests (sessions+cache are on PostgreSQL since Redis was removed + # in 2026.2), and with 9 workers recycling on a tight 50-jitter window the + # recycles cluster — feeding the episodic all-pods-NotReady 502/504 cascade. + # 10x rarer recycles + 20x wider jitter (1000) decorrelate them from DB blips. + - name: AUTHENTIK_WEB__MAX_REQUESTS + value: "10000" + - name: AUTHENTIK_WEB__MAX_REQUESTS_JITTER + value: "1000" # Cache flow plans for 30m and policy evaluations for 15m (defaults 300s). # Authentik 2026.2 stores cache in Postgres, so a TTL hit is still a # SELECT — but a single indexed lookup beats re-planning the flow @@ -87,11 +97,28 @@ server: livenessProbe: failureThreshold: 6 timeoutSeconds: 5 - strategy: + # Readiness widened from the chart default (3x10s/3s ~= 30s) to ~80s. The + # readiness probe (/-/health/ready/) queries the DB, so a sub-~60s PG/pgbouncer + # transient otherwise returns 503 and drops ALL 3 server pods from the Service + # at once -> Traefik has no healthy backend -> 502/504 (the episodic blank + # screen + 30s hang). 80s absorbs a full CNPG failover reconnect; liveness + # still reaps a truly hung pod. 
Partial override — the chart deep-merges the + # httpGet path /-/health/ready/ (same as the livenessProbe override above). + readinessProbe: + failureThreshold: 8 + periodSeconds: 10 + timeoutSeconds: 5 + # RollingUpdate strategy. The chart key is `deploymentStrategy`, NOT `strategy` + # (authentik.server reads .Values.server.deploymentStrategy) — the old + # `strategy:` key was silently ignored, so live ran the chart default 25%/25% + # and every rolling event dropped a server pod out of rotation, amplifying the + # NotReady cascade. maxSurge:1 + maxUnavailable:0 keeps all 3 ready throughout + # a roll (PDB minAvailable:2 + ResourceQuota headroom allow the transient pod). + deploymentStrategy: type: RollingUpdate rollingUpdate: - maxSurge: 0 - maxUnavailable: 1 + maxSurge: 1 + maxUnavailable: 0 resources: requests: cpu: 100m @@ -118,15 +145,23 @@ server: global: addPrometheusAnnotations: true image: - # Pin to the Keel-managed live tag. Keel (diun-annotated, keel.sh/enrolled - # namespace) bumps the IMAGE between chart releases, while helm defaults - # the tag to the chart appVersion — so any helm upgrade silently - # DOWNGRADES the running pods to the chart pin (2026-06-10: a values-only - # apply rolled live 2026.2.4 back to 2026.2.2 against a 2026.2.4-migrated - # DB → boot storm, see docs/post-mortems/2026-06-10-authentik-downgrade- - # boot-storm.md). Keep this tag in sync with what Keel has deployed when - # touching this chart; clear it only when bumping the chart version itself. - tag: "2026.2.4" + # CUSTOM OVERLAY: two thin patches over the official authentik server image + # (see stacks/authentik/Dockerfile): (1) SLOW-1a — narrows the login-flow + # select_subclasses() query, ~1.4s -> ~14ms; (2) serve authentik's no-JS SFE + # login to old Safari/WebKit AND any iOS browser (Chrome/Firefox = WebKit) on + # iOS<=16.3 so old devices (e.g. 
iPadOS<=15) get a working login instead of a + # blank page, and injects social-login links into the SFE (it can't render + # sources; needed for password-less Google-only accounts). Built by + # .github/workflows/build-authentik.yml to ghcr.io/viktorbarzin/authentik-server + # (public package, anonymous pull — no imagePullSecret needed, like the + # upstream goauthentik image). Keel is NO LONGER enrolled for this namespace + # (see main.tf) so it can't bump/downgrade the tag; helm also defaults the tag + # to the chart appVersion (2026.2.2) — so BOTH repository AND tag are pinned + # explicitly here to prevent the 2026-06-10 downgrade-boot-storm class. + # UPGRADE = bump the Dockerfile FROM tag + this tag together (e.g. -> + # 2026.3.0-patch1), let GHA rebuild, then apply. + repository: ghcr.io/viktorbarzin/authentik-server + tag: "2026.2.4-patch3" worker: # 2 replicas: workers handle background tasks (LDAP sync, email, @@ -166,7 +201,10 @@ worker: secretKeyRef: name: authentik-email key: AUTHENTIK_EMAIL__PASSWORD - strategy: + # Chart key is `deploymentStrategy`, not `strategy` (see server above). Workers + # serve no user traffic, so maxSurge:0/maxUnavailable:1 is fine — this is just + # the dead-key cleanup so the declared intent actually takes effect. + deploymentStrategy: type: RollingUpdate rollingUpdate: maxSurge: 0 diff --git a/stacks/authentik/patch-compat-sfe.py b/stacks/authentik/patch-compat-sfe.py new file mode 100644 index 00000000..014603b7 --- /dev/null +++ b/stacks/authentik/patch-compat-sfe.py @@ -0,0 +1,96 @@ +#!/usr/bin/env python3 +"""Overlay patch — make authentik usable on OLD browsers (no modern-JS SPA). + +authentik's modern flow SPA is ES2022 (static{} init blocks) that hard-fail on +Safari/WebKit <= 16.3 (e.g. iPadOS <= 16.3) and render a COMPLETELY BLANK login. 
+authentik ships a no-JS Simplified Flow Executor (SFE, ES5) but only serves it to +IE / old-Edge / PKeyAuth, and the SFE itself canNOT render Identification-stage +sources (social-login buttons) — authentik docs list "Sources" as unsupported. + +This patch does TWO things, both guarded (assert the upstream anchor + verify the +result) so the image build fails LOUDLY if upstream moves. RE-VERIFY on every +authentik upgrade. + + 1. flows/views/interface.py::compat_needs_sfe() -> also return True for old + Safari/WebKit: (a) Safari/Mobile Safari Version <= 16.3 (covers desktop-mode + iPadOS which reports as Mac Safari), and (b) ANY iOS browser (Chrome/CriOS, + Firefox/FxiOS, Edge — all share the system WebKit) on iOS <= 16.3. So old + iPads get the SFE on EVERY browser, not just Safari. + + 2. flows/templates/if/flow-sfe.html -> inject static social-login links + (plain redirects to /source/oauth/login//, work on ANY browser) so SFE + users (who otherwise see only username/password) can use social login — + required for accounts with no password (e.g. Google-only users like emo). +""" +import ast +import glob +import os + +# --- Patch 1: compat_needs_sfe() UA gate ------------------------------------- +INTERFACE = "/authentik/flows/views/interface.py" +ANCHOR = ( + ' if "PKeyAuth" in ua["string"]:\n' + " return True\n" + " return False" +) +REPLACEMENT = ( + ' if "PKeyAuth" in ua["string"]:\n' + " return True\n" + " # OVERLAY: old WebKit can't parse the modern ES2022 flow SPA (blank\n" + " # login) -> serve the SFE (real authentik login). 
(a) desktop-mode\n" + " # Safari/iPadOS reports as Mac Safari with Version<=16.3:\n" + ' if ua["user_agent"]["family"] in ("Safari", "Mobile Safari"):\n' + " try:\n" + ' _maj = int(ua["user_agent"]["major"] or 0)\n' + ' _min = int(ua["user_agent"]["minor"] or 0)\n' + " except (TypeError, ValueError):\n" + " _maj = _min = 0\n" + " if _maj and (_maj < 16 or (_maj == 16 and _min <= 3)):\n" + " return True\n" + " # (b) ANY iOS browser (Chrome/CriOS, Firefox/FxiOS, Edge) shares the\n" + " # system WebKit, so iOS<=16.3 fails regardless of the browser family:\n" + ' if ua["os"]["family"] == "iOS":\n' + " try:\n" + ' _omaj = int(ua["os"]["major"] or 0)\n' + ' _omin = int(ua["os"]["minor"] or 0)\n' + " except (TypeError, ValueError):\n" + " _omaj = _omin = 0\n" + " if _omaj and (_omaj < 16 or (_omaj == 16 and _omin <= 3)):\n" + " return True\n" + " return False" +) +src = open(INTERFACE).read() +assert "def compat_needs_sfe" in src, "compat_needs_sfe() not found — upstream changed" +assert src.count(ANCHOR) == 1, f"anchor not found exactly once in {INTERFACE}" +src = src.replace(ANCHOR, REPLACEMENT) +open(INTERFACE, "w").write(src) +ast.parse(src) +assert 'ua["os"]["family"] == "iOS"' in open(INTERFACE).read() +for pyc in glob.glob("/authentik/flows/views/__pycache__/interface.*.pyc"): + os.remove(pyc) + +# --- Patch 2: social-login links on the SFE shell ---------------------------- +SFE_HTML = "/authentik/flows/templates/if/flow-sfe.html" +HTML_ANCHOR = ( + " \n" + " {% trans 'Powered by authentik' %}" +) +HTML_REPLACEMENT = ( + " \n" + " \n" + ' \n" + " {% trans 'Powered by authentik' %}" +) +html = open(SFE_HTML).read() +assert html.count(HTML_ANCHOR) == 1, f"SFE html anchor not found exactly once in {SFE_HTML}" +html = html.replace(HTML_ANCHOR, HTML_REPLACEMENT) +open(SFE_HTML, "w").write(html) +assert "Continue with Google" in open(SFE_HTML).read() + +print("patch-compat-sfe: SFE for old Safari + all iOS<=16.3; social-login links added to SFE") diff --git 
a/stacks/beads-server/main.tf b/stacks/beads-server/main.tf index 5b71373e..eebed876 100644 --- a/stacks/beads-server/main.tf +++ b/stacks/beads-server/main.tf @@ -601,6 +601,9 @@ resource "kubernetes_config_map" "beadboard_config" { # Pulls the claude-agent-service bearer token from Vault so BeadBoard can # dispatch agent jobs via the in-cluster HTTP API. resource "kubernetes_manifest" "beadboard_agent_service_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/broker-sync/main.tf b/stacks/broker-sync/main.tf index 2de168a1..76d822d8 100644 --- a/stacks/broker-sync/main.tf +++ b/stacks/broker-sync/main.tf @@ -28,6 +28,9 @@ resource "kubernetes_namespace" "broker_sync" { # trading212_api_keys — JSON array of {account_id, account_type, api_key, name, currency} # imap_host, imap_user, imap_password, imap_directory — for InvestEngine + Schwab email ingest resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/calico/main.tf b/stacks/calico/main.tf index 39550024..956534fb 100644 --- a/stacks/calico/main.tf +++ b/stacks/calico/main.tf @@ -22,7 +22,7 @@ resource "kubernetes_namespace" "calico_system" { name = "calico-system" labels = { name = "calico-system" -# calico-system namespace is managed by tigera-operator — auto-update is + # calico-system namespace is managed by tigera-operator — auto-update is # incompatible (operator reverts DaemonSet image from its Installation CR). # "keel.sh/enrolled" = "true" } @@ -161,8 +161,8 @@ resource "helm_release" "tigera_operator" { # render before their crds/ (which helm skips on upgrade) -> "ensure CRDs # are installed first". We instead enable them via the operator CRs applied # directly below (kubectl_manifest) now that the CRDs exist — see ADR-0014. 
- goldmane = { enabled = false } - whisker = { enabled = false } + goldmane = { enabled = false } + whisker = { enabled = false } # 512Mi (was 256Mi): the operator idles at ~38Mi but its STARTUP spike # (re-listing resources to build informer caches) exceeded 256Mi and # OOM-crashlooped on 2026-06-23 the first time the pod restarted (a latent @@ -212,3 +212,229 @@ resource "kubectl_manifest" "whisker" { spec = { notifications = "Disabled" } }) } + +# --------------------------------------------------------------------------- +# Gated public ingress for the Whisker UI (infra #57 / ADR-0014). +# +# whisker.viktorbarzin.me -> whisker:8081, Authentik-gated (auth="required": +# Whisker ships NO own login — it's an admin observability UI, so Authentik +# forward-auth is the only gate between strangers and the flow view). The +# operator replicated `tls-secret` into calico-system already. +# +# TWO coupled pieces are required because the operator's own `whisker` +# NetworkPolicy (owned by the Whisker CR above) sets policyTypes:[Ingress] +# with NO ingress rules => default-deny on ingress to the whisker pod. The +# additive NP below ORs in a Traefik allow (k8s NetworkPolicies are additive +# across policies selecting the same pod), so we never edit the operator NP. +module "ingress_whisker" { + source = "../../modules/kubernetes/ingress_factory" + dns_type = "proxied" + namespace = "calico-system" + name = "whisker" + service_name = "whisker" + port = 8081 + auth = "required" + tls_secret_name = "tls-secret" + extra_annotations = { + "gethomepage.dev/enabled" = "true" + "gethomepage.dev/name" = "Whisker" + "gethomepage.dev/description" = "Calico flow observability (who-talks-to-whom)" + "gethomepage.dev/icon" = "calico.png" + "gethomepage.dev/group" = "Infrastructure" + } +} + +# Additive NetworkPolicy: permit Traefik -> whisker:8081. 
ORs with the +# operator's default-deny `whisker` NP (selecting the same pod) so Traefik +# can reach the UI without touching the operator-owned policy. +resource "kubernetes_network_policy_v1" "whisker_allow_traefik" { + metadata { + name = "whisker-allow-traefik" + namespace = "calico-system" + } + spec { + pod_selector { + match_labels = { + "app.kubernetes.io/name" = "whisker" + } + } + policy_types = ["Ingress"] + ingress { + from { + namespace_selector { + match_labels = { + "kubernetes.io/metadata.name" = "traefik" + } + } + } + ports { + port = "8081" + protocol = "TCP" + } + } + } +} + +# Additive egress NetworkPolicy: permit whisker -> the kube-dns ClusterIP for DNS. +# +# ROOT CAUSE of the 2026-06-28 "Whisker UI empty" incident: the operator's own +# `whisker` NetworkPolicy is policyTypes:[Ingress,Egress] and its egress allows +# DNS only to the kube-dns *pods* (podSelector k8s-app=kube-dns). But +# whisker-backend resolves `goldmane...svc` via the kube-dns *ClusterIP* +# (10.96.0.10), and Calico drops UDP DNS to a ClusterIP under a podSelector-only +# egress rule (verified: from whisker's netns, ClusterIP DNS = 100% timeout +# while direct kube-dns pod-IP DNS = OK; a pod with no egress policy resolves +# fine). whisker-backend resolves once in the brief startup window before the +# policy programs, establishes its long-lived gRPC stream, and only re-resolves +# when that stream breaks — at which point the blocked ClusterIP DNS wedges its +# Go resolver and the UI goes empty (the durable aggregator, in its own +# unrestricted namespace, is unaffected). k8s egress policies are additive, so +# this ORs in an allow for the ClusterIP; the operator NP is left untouched. +# (Empirically: adding this ipBlock rule flips ClusterIP DNS from 100% fail to +# 100% ok.) See docs/runbooks/goldmane-flow-trail.md. 
+resource "kubernetes_network_policy_v1" "whisker_allow_dns_clusterip" { + metadata { + name = "whisker-allow-dns-clusterip" + namespace = "calico-system" + } + spec { + pod_selector { + match_labels = { + "app.kubernetes.io/name" = "whisker" + } + } + policy_types = ["Egress"] + egress { + # 10.96.0.10 is the kube-dns ClusterIP (cluster invariant — service CIDR + # 10.96.0.0/12, DNS always .10; the same IP CoreDNS/Technitium configs pin). + to { + ip_block { + cidr = "10.96.0.10/32" + } + } + ports { + port = "53" + protocol = "UDP" + } + ports { + port = "53" + protocol = "TCP" + } + } + } +} + +# --------------------------------------------------------------------------- +# Whisker self-heal watchdog (ADR-0014; added 2026-06-28 after a live incident). +# +# BACKSTOP. The REAL fix is kubernetes_network_policy_v1.whisker_allow_dns_clusterip +# above (it unblocks the root-cause ClusterIP DNS). This watchdog stays as +# defense-in-depth: whisker-backend has NO operator liveness probe, so if its +# long-lived goldmane gRPC stream ever wedges for any OTHER reason (the Go +# resolver spams `failed to stream flows` / `code = Unavailable` and never +# reconnects -> empty UI, while the durable aggregator in its own namespace is +# unaffected), nothing else would restart it. Whisker is operator-managed +# (Whisker CR) so we can't inject a probe; this is the supported-pattern +# alternative. With the DNS fix in place it should rarely, if ever, fire. +# +# It restarts the pod ONLY when the wedged signature is present AND Goldmane is +# Ready (so a real Goldmane outage doesn't cause restart-thrash). A fresh pod +# reconnects cleanly. See docs/runbooks/goldmane-flow-trail.md. 
+resource "kubernetes_service_account" "whisker_watchdog" { + metadata { + name = "whisker-watchdog" + namespace = kubernetes_namespace.calico_system.metadata[0].name + } +} + +# Namespaced Role (least privilege — only calico-system): read pod logs to +# detect the wedge, delete the whisker pod to heal it. +resource "kubernetes_role" "whisker_watchdog" { + metadata { + name = "whisker-watchdog" + namespace = kubernetes_namespace.calico_system.metadata[0].name + } + rule { + api_groups = [""] + resources = ["pods"] + verbs = ["get", "list", "delete"] + } + rule { + api_groups = [""] + resources = ["pods/log"] + verbs = ["get"] + } +} + +resource "kubernetes_role_binding" "whisker_watchdog" { + metadata { + name = "whisker-watchdog" + namespace = kubernetes_namespace.calico_system.metadata[0].name + } + role_ref { + api_group = "rbac.authorization.k8s.io" + kind = "Role" + name = kubernetes_role.whisker_watchdog.metadata[0].name + } + subject { + kind = "ServiceAccount" + name = kubernetes_service_account.whisker_watchdog.metadata[0].name + namespace = kubernetes_namespace.calico_system.metadata[0].name + } +} + +resource "kubernetes_cron_job_v1" "whisker_watchdog" { + metadata { + name = "whisker-watchdog" + namespace = kubernetes_namespace.calico_system.metadata[0].name + } + spec { + schedule = "*/10 * * * *" + successful_jobs_history_limit = 1 + failed_jobs_history_limit = 1 + concurrency_policy = "Forbid" + job_template { + metadata { + name = "whisker-watchdog" + } + spec { + template { + metadata { + name = "whisker-watchdog" + } + spec { + service_account_name = kubernetes_service_account.whisker_watchdog.metadata[0].name + container { + name = "watchdog" + image = "bitnami/kubectl:latest" + command = ["/bin/sh", "-c", <<-EOT + set -eu + NS=calico-system + # Don't thrash if Goldmane itself is down — that's not a whisker bug. + if ! 
kubectl -n "$NS" get pod -l k8s-app=goldmane \ + -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}' 2>/dev/null | grep -q True; then + echo "goldmane not Ready — skipping (not a whisker problem)"; exit 0 + fi + ERRS=$(kubectl -n "$NS" logs -l k8s-app=whisker -c whisker-backend --since=11m --tail=500 2>/dev/null \ + | grep -cE 'failed to stream flows|failed to list filter hints|code = Unavailable|i/o timeout' || true) + ERRS=$${ERRS:-0} + if [ "$ERRS" -ge 10 ]; then + echo "whisker-backend WEDGED: $ERRS goldmane-connection errors in 11m — restarting whisker pod" + kubectl -n "$NS" delete pod -l k8s-app=whisker --ignore-not-found + else + echo "whisker-backend healthy: $ERRS goldmane-connection errors in 11m" + fi + EOT + ] + } + restart_policy = "Never" + } + } + } + } + } + lifecycle { + # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 + ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] + } +} diff --git a/stacks/changedetection/main.tf b/stacks/changedetection/main.tf index ee203e7b..319ebcf1 100644 --- a/stacks/changedetection/main.tf +++ b/stacks/changedetection/main.tf @@ -19,6 +19,9 @@ resource "kubernetes_namespace" "changedetection" { } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/chrome-service/files/novnc/entrypoint.sh b/stacks/chrome-service/files/novnc/entrypoint.sh index fae5c641..aeff9408 100644 --- a/stacks/chrome-service/files/novnc/entrypoint.sh +++ b/stacks/chrome-service/files/novnc/entrypoint.sh @@ -19,14 +19,14 @@ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do sleep 2 done -# websockify runs as PID 1; x11vnc is a child so its logs land on container stdout -# `-noshm` skips MIT-SHM probes that fail across container boundaries (each -# container has its own /dev/shm); `-noxdamage` skips XDAMAGE which Xvfb -# doesn't 
expose; `-quiet` keeps the polling chatter out of pod logs. +# Both x11vnc and websockify run as supervised children of this entrypoint (PID +# 1) so their logs land on container stdout and the `wait -n` at the end can catch +# either one dying. `-noshm` skips MIT-SHM probes that fail across container +# boundaries (each container has its own /dev/shm); `-noxdamage` skips XDAMAGE +# which Xvfb doesn't expose; `-quiet` keeps the polling chatter out of pod logs. echo "starting x11vnc -> :5900" x11vnc -display localhost:99 -nopw -listen 0.0.0.0 -rfbport 5900 \ -forever -shared -noshm -noxdamage -quiet 2>&1 & -X11VNC_PID=$! for i in 1 2 3 4 5 6 7 8 9 10; do if echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then @@ -43,4 +43,18 @@ if ! echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then fi echo "starting websockify -> :6080" -exec websockify --web=/usr/share/novnc 6080 localhost:5900 +# Run websockify in the background (it was `exec`ed before) so BOTH it and x11vnc +# are supervised. x11vnc attaches to the chrome-service container's Xvfb over +# localhost:6099 (shared pod network); when that container restarts, x11vnc loses +# its X connection and exits. Previously websockify was PID 1 and x11vnc was an +# unsupervised child, so a dead x11vnc was never relaunched: :5900 stayed dead and +# the noVNC view went black until a manual pod restart. Now if EITHER process +# exits, `wait -n` returns and we exit non-zero so the kubelet restarts this +# container, which re-waits for Xvfb and relaunches x11vnc — the bridge self-heals +# across browser-container restarts. (Same supervision pattern as the +# android-emulator stack's entrypoint.) +websockify --web=/usr/share/novnc 6080 localhost:5900 & + +wait -n || true +echo "novnc: a supervised process (x11vnc or websockify) exited; exiting so the kubelet restarts this container." 
>&2 +exit 1 diff --git a/stacks/chrome-service/main.tf b/stacks/chrome-service/main.tf index 2f679c00..82e8fe45 100644 --- a/stacks/chrome-service/main.tf +++ b/stacks/chrome-service/main.tf @@ -41,6 +41,9 @@ resource "kubernetes_namespace" "chrome_service" { # --- Secrets (single-key extract: api_bearer_token) --- resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -330,15 +333,23 @@ resource "kubernetes_deployment" "chrome_service" { container { name = "novnc" # Phase 3 cutover 2026-05-07 — Forgejo registry consolidation. - image = "ghcr.io/viktorbarzin/chrome-service-novnc:latest" + # SHA-pinned (not :latest): Keel is OFF for this deployment + # (keel.sh/policy=never, below) and :latest/IfNotPresent won't re-pull a + # rebuilt image, so a new noVNC entrypoint only deploys when this digest + # is bumped here. Bump after build-chrome-service-novnc.yml pushes a new + # SHA tag — then WAIT for that apply pipeline to finish before pushing + # anything else: Woodpecker cancel-previous SIGKILLs an in-flight apply + # mid-run (memory id=1957), which is exactly how the 2026-06-27 apply got + # killed. 2026-06-27: bumped to land the x11vnc-supervision self-heal fix + # (noVNC went black after a browser-container restart; see + # docs/architecture/chrome-service.md "x11vnc supervision"). + image = "ghcr.io/viktorbarzin/chrome-service-novnc:19d0f0933a8ec75be6cfa077db88e0f8c3760f40" image_pull_policy = "IfNotPresent" # Cap RLIMIT_NOFILE before the entrypoint runs. Containerd grants pods # nofile=2^31; x11vnc sweeps the whole fd table on each client connect, # so every VNC connection hangs on "Connecting" until it times out - # (fd-sweep bug, same as android-emulator). 
entrypoint.sh now also sets - # this, but the image is :latest/IfNotPresent so a rebuilt entrypoint - # isn't guaranteed to be pulled — this wrapper applies the cap - # deterministically on every rollout off the cached image. + # (fd-sweep bug, same as android-emulator). entrypoint.sh also sets this; + # the wrapper keeps the cap deterministic even off a cached image. command = ["bash", "-c", "ulimit -n 65536; exec /entrypoint.sh"] port { name = "http" @@ -348,9 +359,13 @@ resource "kubernetes_deployment" "chrome_service" { # x11vnc connects to the chrome-service container's Xvfb over # localhost TCP (shared pod network). Same uid 1000 as chrome # container so we can read MIT-MAGIC-COOKIE if Xvfb adds one. + # 256Mi (was 96Mi): the 96Mi cap OOMKilled (exit 137) the sidecar under + # ACTIVE VNC use — x11vnc + websockify framebuffer/encode buffers spike + # well past idle (~37Mi) when a client streams the 1280x720 screen, so the + # noVNC view froze/hung on connect. Bumped 2026-06-28. resources { - requests = { cpu = "10m", memory = "32Mi" } - limits = { memory = "96Mi" } + requests = { cpu = "10m", memory = "64Mi" } + limits = { memory = "256Mi" } } } diff --git a/stacks/chrome-service/rbac.tf b/stacks/chrome-service/rbac.tf new file mode 100644 index 00000000..f0043f1a --- /dev/null +++ b/stacks/chrome-service/rbac.tf @@ -0,0 +1,95 @@ +# emo's hands-off "homelab browser" credential + chrome-service port-forward RBAC. +# +# Access decision (2026-06-28, Viktor's explicit call): emo SHARES Viktor's single +# chrome-service browser rather than getting an isolated instance. The noVNC half of +# that grant is the Authentik allowlist in +# stacks/authentik/admin-services-restriction.tf (CHROME_ALLOWED); THIS file is the +# CLI half — it lets emo's `homelab browser` reach the headed Chrome over CDP. +# +# `homelab browser` shells out to `kubectl port-forward -n chrome-service svc/chrome-service` +# (cli/browser.go). 
emo's normal kubeconfig is interactive-OIDC-only (kubelogin) and +# can't authenticate a headless agent session, and his power-user tier has no +# pods/portforward. So we mint a dedicated ServiceAccount with a long-lived token +# (the dashboard-sa.tf pattern) that the devvm provisioner installs as emo's DEFAULT +# kubeconfig context (scripts/t3-provision-users.sh install_browser_kubeconfig); his +# personal OIDC login stays available as the `oidc@homelab` named context. +# +# TRADE-OFF (accepted): CDP access == full control of the shared browser, including +# the persistent profile (browser.contexts[0]) where Viktor's warmed logins live. +# CDP has no per-context auth, so this SA can reach Viktor's sessions. That is inherent +# to sharing one browser (the isolated per-user instance was declined). +# See docs/architecture/chrome-service.md "Multi-user access". + +resource "kubernetes_service_account" "emo_browser" { + metadata { + name = "emo-browser" + namespace = kubernetes_namespace.chrome_service.metadata[0].name + } +} + +# Long-lived (non-expiring) token for the SA — the devvm provisioner reads this and +# writes it into emo's kubeconfig. Same pattern as stacks/rbac/.../dashboard-sa.tf. +resource "kubernetes_secret" "emo_browser_token" { + metadata { + name = "emo-browser-token" + namespace = kubernetes_namespace.chrome_service.metadata[0].name + annotations = { + "kubernetes.io/service-account.name" = kubernetes_service_account.emo_browser.metadata[0].name + } + } + type = "kubernetes.io/service-account-token" + wait_for_service_account_token = true +} + +# The ONLY verb emo's SA lacks for `kubectl port-forward svc/chrome-service`: the +# port-forward subresource. (get/list of pods + services + endpoints comes from the +# cluster-read binding below.) Namespace-scoped to chrome-service. 
+resource "kubernetes_role" "browser_portforward" { + metadata { + name = "chrome-service-portforward" + namespace = kubernetes_namespace.chrome_service.metadata[0].name + } + rule { + api_groups = [""] + resources = ["pods/portforward"] + verbs = ["create"] + } +} + +resource "kubernetes_role_binding" "emo_browser_portforward" { + metadata { + name = "emo-browser-portforward" + namespace = kubernetes_namespace.chrome_service.metadata[0].name + } + role_ref { + api_group = "rbac.authorization.k8s.io" + kind = "Role" + name = kubernetes_role.browser_portforward.metadata[0].name + } + subject { + kind = "ServiceAccount" + name = kubernetes_service_account.emo_browser.metadata[0].name + namespace = kubernetes_namespace.chrome_service.metadata[0].name + } +} + +# Cluster-wide read-only (NO secrets), mirroring emo's power-user OIDC access, bound +# to the SA. Needed because the SA becomes emo's DEFAULT kubectl context, so without +# this his everyday `kubectl get ...` would regress — AND port-forward itself needs +# get/list on services + pods + endpoints (all covered by oidc-power-user-readonly). +# That ClusterRole is defined in stacks/rbac (modules/rbac/main.tf); referenced by name. +resource "kubernetes_cluster_role_binding" "emo_browser_readonly" { + metadata { + name = "emo-browser-readonly" + } + role_ref { + api_group = "rbac.authorization.k8s.io" + kind = "ClusterRole" + name = "oidc-power-user-readonly" + } + subject { + kind = "ServiceAccount" + name = kubernetes_service_account.emo_browser.metadata[0].name + namespace = kubernetes_namespace.chrome_service.metadata[0].name + } +} diff --git a/stacks/ci-pipeline-health/main.tf b/stacks/ci-pipeline-health/main.tf index 17378f84..44aacbec 100644 --- a/stacks/ci-pipeline-health/main.tf +++ b/stacks/ci-pipeline-health/main.tf @@ -49,6 +49,9 @@ resource "kubernetes_namespace" "ci_pipeline_health" { # billing on PRIVATE mirrors, which a future scoped read:packages rotation of # the alias could not do. 
Blast radius = this single-CronJob namespace. resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/claude-agent-service/main.tf b/stacks/claude-agent-service/main.tf index 9f8b6478..a039f699 100644 --- a/stacks/claude-agent-service/main.tf +++ b/stacks/claude-agent-service/main.tf @@ -38,6 +38,9 @@ resource "kubernetes_namespace" "claude_agent" { # --- Secrets --- resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/claude-breakglass/main.tf b/stacks/claude-breakglass/main.tf index 6b996b9e..ca700945 100644 --- a/stacks/claude-breakglass/main.tf +++ b/stacks/claude-breakglass/main.tf @@ -57,6 +57,9 @@ resource "kubernetes_service_account" "breakglass" { # DENIED this path (see stacks/vault/main.tf) so the shared, prompt-injectable # pod can never read it. resource "kubernetes_manifest" "external_secret_ssh" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -82,6 +85,9 @@ resource "kubernetes_manifest" "external_secret_ssh" { # Env secrets: the Anthropic OAuth token (shared with claude-agent-service — # same account) and the app bearer token (in-cluster/CLI fallback caller auth). 
resource "kubernetes_manifest" "external_secret_env" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/claude-memory/main.tf b/stacks/claude-memory/main.tf index 18c21fe5..fad08b42 100644 --- a/stacks/claude-memory/main.tf +++ b/stacks/claude-memory/main.tf @@ -29,6 +29,9 @@ resource "kubernetes_namespace" "claude-memory" { } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -57,6 +60,9 @@ resource "kubernetes_manifest" "external_secret" { # DB credentials from Vault database engine (rotated every 24h) resource "kubernetes_manifest" "db_external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/coturn/main.tf b/stacks/coturn/main.tf index caeb9a66..9ab23e5d 100644 --- a/stacks/coturn/main.tf +++ b/stacks/coturn/main.tf @@ -5,6 +5,9 @@ variable "tls_secret_name" { variable "public_ip" { type = string } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/dawarich/main.tf b/stacks/dawarich/main.tf index 2432e9c3..3eeb1540 100644 --- a/stacks/dawarich/main.tf +++ b/stacks/dawarich/main.tf @@ -23,6 +23,9 @@ resource "kubernetes_namespace" "dawarich" { } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/dbaas/modules/dbaas/main.tf b/stacks/dbaas/modules/dbaas/main.tf index 479263ed..d940f642 100644 --- a/stacks/dbaas/modules/dbaas/main.tf +++ b/stacks/dbaas/modules/dbaas/main.tf @@ -745,7 +745,10 @@ resource "kubernetes_deployment" "phpmyadmin" { labels = { "app" = 
"phpmyadmin" tier = var.tier - + # ADR-0014 service identity: dbaas is a multi-Service namespace, so the + # namespace alone can't attribute Goldmane flows. Value = the fronting + # Service name (kubernetes_service.phpmyadmin is named "pma"). + "service-identity" = "pma" } annotations = { "reloader.stakater.com/search" = "true" @@ -762,6 +765,10 @@ resource "kubernetes_deployment" "phpmyadmin" { metadata { labels = { "app" = "phpmyadmin" + # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the + # disambiguating identity must live on the pod template (not just + # the Deployment metadata above). Not in selector → no replace. + "service-identity" = "pma" } } spec { @@ -812,8 +819,19 @@ resource "kubernetes_deployment" "phpmyadmin" { } } lifecycle { - # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 - ignore_changes = [spec[0].template[0].spec[0].dns_config] + ignore_changes = [ + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 + # This Deployment is Keel-enrolled (keel.sh/policy=patch). Ignore the + # attributes Keel/Kyverno mutate at runtime so `terragrunt apply` (incl. + # the daily drift plan) doesn't fight them or revert the live image — + # canonical KEEL/KYVERNO lifecycle guard, matches linkwarden/chrome-service. + metadata[0].annotations["keel.sh/policy"], + metadata[0].annotations["keel.sh/trigger"], + metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2 + metadata[0].annotations["keel.sh/match-tag"], + spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates + spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1 + ] } } @@ -1499,6 +1517,10 @@ resource "kubernetes_deployment" "pgadmin" { } labels = { tier = var.tier + # ADR-0014 service identity: dbaas is a multi-Service namespace, so the + # namespace alone can't attribute Goldmane flows. 
Value = the fronting + # Service name (kubernetes_service.pgadmin is named "pgadmin"). + "service-identity" = "pgadmin" } } spec { @@ -1514,6 +1536,10 @@ resource "kubernetes_deployment" "pgadmin" { metadata { labels = { app = "pgadmin" + # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the + # disambiguating identity must live on the pod template (not just + # the Deployment metadata above). Not in selector → no replace. + "service-identity" = "pgadmin" } } spec { @@ -1568,8 +1594,20 @@ resource "kubernetes_deployment" "pgadmin" { } } lifecycle { - # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 - ignore_changes = [spec[0].template[0].spec[0].dns_config] + ignore_changes = [ + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 + # This Deployment is Keel-enrolled (keel.sh/policy=patch) and Keel has + # bumped the live image (dpage/pgadmin4:9.16). Ignore the Keel/Kyverno + # runtime-mutated attributes so `terragrunt apply` (incl. the daily drift + # plan) doesn't revert the image to bare `dpage/pgadmin4` or strip Keel's + # annotations — canonical guard, matches linkwarden/chrome-service. 
+ metadata[0].annotations["keel.sh/policy"], + metadata[0].annotations["keel.sh/trigger"], + metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2 + metadata[0].annotations["keel.sh/match-tag"], + spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates + spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1 + ] } } resource "kubernetes_service" "pgadmin" { diff --git a/stacks/diun/main.tf b/stacks/diun/main.tf index 9933f064..81294806 100644 --- a/stacks/diun/main.tf +++ b/stacks/diun/main.tf @@ -20,6 +20,9 @@ resource "kubernetes_namespace" "diun" { } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/ebooks/main.tf b/stacks/ebooks/main.tf index a5754590..0813b45a 100644 --- a/stacks/ebooks/main.tf +++ b/stacks/ebooks/main.tf @@ -20,6 +20,9 @@ resource "kubernetes_namespace" "ebooks" { # ExternalSecrets for all three sources resource "kubernetes_manifest" "calibre_external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -47,6 +50,9 @@ resource "kubernetes_manifest" "calibre_external_secret" { } resource "kubernetes_manifest" "audiobookshelf_external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -74,6 +80,9 @@ resource "kubernetes_manifest" "audiobookshelf_external_secret" { } resource "kubernetes_manifest" "servarr_external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/f1-stream/main.tf b/stacks/f1-stream/main.tf index a62ad01a..bcd66c7f 100644 --- a/stacks/f1-stream/main.tf +++ b/stacks/f1-stream/main.tf @@ -33,6 +33,9 @@ resource "kubernetes_namespace" 
"f1-stream" { } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -62,6 +65,9 @@ resource "kubernetes_manifest" "external_secret" { # Pull the chrome-service bearer token into this namespace as a separate # Secret so the verifier can reach the in-cluster Playwright pool. resource "kubernetes_manifest" "chrome_service_client_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/fire-planner/main.tf b/stacks/fire-planner/main.tf index 21503a37..be478699 100644 --- a/stacks/fire-planner/main.tf +++ b/stacks/fire-planner/main.tf @@ -53,6 +53,9 @@ resource "kubernetes_namespace" "fire_planner" { # Seed before applying: # secret/fire-planner -> property `recompute_bearer_token` resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -115,6 +118,9 @@ resource "kubernetes_manifest" "external_secret" { # Template builds the asyncpg DSN consumed by the FastAPI app + CronJob # as DB_CONNECTION_STRING. resource "kubernetes_manifest" "db_external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -159,6 +165,9 @@ resource "kubernetes_manifest" "db_external_secret" { # pg-sync sidecar populates `daily_account_valuation` etc. hourly; the # fire-planner ingest reads those tables via this role. 
resource "kubernetes_manifest" "wealthfolio_sync_db_external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -450,6 +459,90 @@ resource "kubernetes_cron_job_v1" "fire_planner_recompute" { ] } +# Monthly FIRE-countdown target solve on the 2nd at 10:00 UTC (an hour after +# recompute-all, so account_snapshot is fresh). Binary-searches each Case's FIRE +# number per country at the 99% Guyton-Klinger bar and upserts fire_target, which +# the wealth Grafana dashboard's "FIRE Countdown" section reads. +resource "kubernetes_cron_job_v1" "fire_planner_fire_targets" { + metadata { + name = "fire-planner-fire-targets" + namespace = kubernetes_namespace.fire_planner.metadata[0].name + } + spec { + schedule = "0 10 2 * *" + concurrency_policy = "Forbid" + successful_jobs_history_limit = 3 + failed_jobs_history_limit = 5 + starting_deadline_seconds = 600 + + job_template { + metadata { + labels = local.labels + } + spec { + backoff_limit = 1 + ttl_seconds_after_finished = 86400 + # The full country sweep is CPU-bound (binary search × ~22 cities × + # 3 cases). Give it room rather than letting it run forever. + active_deadline_seconds = 3600 + template { + metadata { + labels = local.labels + } + spec { + restart_policy = "OnFailure" + image_pull_secrets { + name = "registry-credentials" + } + image_pull_secrets { + name = "ghcr-credentials" + } + container { + name = "fire-targets" + image = local.image + # --horizon 72: Viktor retires ~age 28 and plans to live to 100, so + # the portfolio must last 72 years (was the 60y default ≈ to age 88). 
+ command = ["python", "-m", "fire_planner", "recompute-fire-targets", + "--countries", "all", "--horizon", "72"] + + env_from { + secret_ref { + name = "fire-planner-secrets" + } + } + env_from { + secret_ref { + name = "fire-planner-db-creds" + } + } + + resources { + requests = { + cpu = "500m" + memory = "1Gi" + } + limits = { + memory = "2Gi" + } + } + } + } + } + } + } + } + + lifecycle { + # KYVERNO_LIFECYCLE_V1 + ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] + } + + depends_on = [ + kubernetes_manifest.external_secret, + kubernetes_manifest.db_external_secret, + ] +} + # Weekly refresh of the COL cache: walks col_snapshot for rows # expiring within 7 days, re-scrapes Numbeo + Expatistan, upserts. With # the user-chosen 1-year TTL, a healthy cache has 0 stale rows on most @@ -569,16 +662,53 @@ module "ingress_api" { auth = "none" } -# Plan-time read of the ESO-created K8s Secret for Grafana datasource -# password. First-apply gotcha: must -# `terragrunt apply -target=kubernetes_manifest.db_external_secret` so -# the Secret exists before this data source plans. -data "kubernetes_secret" "fire_planner_db_creds" { - metadata { - name = "fire-planner-db-creds" - namespace = kubernetes_namespace.fire_planner.metadata[0].name +# ExternalSecret in the monitoring namespace mirroring the rotating +# fire_planner DB password. Grafana mounts this via envFromSecrets in +# monitoring/grafana_chart_values.yaml; the datasource ConfigMap below +# references it as $__env{FIRE_PLANNER_PG_PASSWORD}. Reloader restarts +# Grafana whenever ESO updates this secret (on the 7d static-role +# rotation), so the provisioned datasource never goes stale — replaces +# the old plan-time `data.kubernetes_secret` bake that broke weekly. +# Mirrors the wealth-pg / payslips-pg pattern. 
+resource "kubernetes_manifest" "grafana_fire_planner_pg_creds" { + field_manager { + force_conflicts = true + } + manifest = { + apiVersion = "external-secrets.io/v1" + kind = "ExternalSecret" + metadata = { + name = "grafana-fire-planner-pg-creds" + namespace = "monitoring" + } + spec = { + refreshInterval = "15m" + secretStoreRef = { + name = "vault-database" + kind = "ClusterSecretStore" + } + target = { + name = "grafana-fire-planner-pg-creds" + template = { + metadata = { + annotations = { + "reloader.stakater.com/match" = "true" + } + } + data = { + FIRE_PLANNER_PG_PASSWORD = "{{ .password }}" + } + } + } + data = [{ + secretKey = "password" + remoteRef = { + key = "static-creds/pg-fire-planner" + property = "password" + } + }] + } } - depends_on = [kubernetes_manifest.db_external_secret] } # Grafana datasource for fire_planner PostgreSQL DB. @@ -615,12 +745,15 @@ resource "kubernetes_config_map" "grafana_fire_planner_datasource" { timescaledb = false } secureJsonData = { - password = data.kubernetes_secret.fire_planner_db_creds.data["DB_PASSWORD"] + # Live env from grafana-fire-planner-pg-creds (above), injected into + # Grafana via envFromSecrets; reloader refreshes it on rotation. + password = "$__env{FIRE_PLANNER_PG_PASSWORD}" } editable = true }] }) } + depends_on = [kubernetes_manifest.grafana_fire_planner_pg_creds] } # CI retrigger 2026-05-16T13:42:57+00:00 — bulk enrollment apply (pipeline #689 killed) @@ -661,6 +794,9 @@ variable "run_examples_bulk_ingest" { # Reddit OAuth creds pulled from Vault secret/viktor. resource "kubernetes_manifest" "external_secret_examples_reddit" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -701,6 +837,9 @@ resource "kubernetes_manifest" "external_secret_examples_reddit" { # claude-agent-service bearer pulled separately so its rotation cadence # is decoupled from the Reddit creds. 
resource "kubernetes_manifest" "external_secret_examples_claude" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/forgejo/email-secret.tf b/stacks/forgejo/email-secret.tf index 034d45f2..d0e44c1c 100644 --- a/stacks/forgejo/email-secret.tf +++ b/stacks/forgejo/email-secret.tf @@ -6,6 +6,9 @@ # (stacks/authentik/email-secret.tf) — one credential, one rotation point. The # reloader annotation rolls the Forgejo pod if the password is ever rotated. resource "kubernetes_manifest" "forgejo_email_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/freedify/main.tf b/stacks/freedify/main.tf index 3e2cf8b4..2f017003 100644 --- a/stacks/freedify/main.tf +++ b/stacks/freedify/main.tf @@ -3,6 +3,9 @@ variable "tls_secret_name" { sensitive = true } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/freshrss/main.tf b/stacks/freshrss/main.tf index 31c5d20e..61e2122e 100644 --- a/stacks/freshrss/main.tf +++ b/stacks/freshrss/main.tf @@ -18,6 +18,9 @@ resource "kubernetes_namespace" "immich" { } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/goldmane-edge-aggregator/main.tf b/stacks/goldmane-edge-aggregator/main.tf new file mode 100644 index 00000000..1c6fa58a --- /dev/null +++ b/stacks/goldmane-edge-aggregator/main.tf @@ -0,0 +1,499 @@ +# ============================================================================= +# goldmane-edge-aggregator — durable who-talks-to-whom audit trail (ADR-0014 / #58) +# ============================================================================= +# A small Go service that 
streams Calico Goldmane's gRPC Flows API (mTLS) and +# upserts the unique service-to-service edge set into Postgres, plus a daily +# Slack digest CronJob of first-seen edges. Code lives in the standalone +# `goldmane-edge-aggregator` repo; the authoritative deploy spec is its +# DEPLOY.md. This stack is the infra side of that spec. +# +# Goldmane runs as `Service goldmane:7443` (gRPC/mTLS) in calico-system, enabled +# via the operator CR in stacks/calico/main.tf. The durable Loki path is NOT +# the operator CRs; this service IS the durable trail. +# +# Structure mirrors stacks/claude-memory (the canonical Tier-1 pattern): a +# per-service namespace, a CNPG Postgres DB + role + Vault 7-day rotation + +# ExternalSecret -> DATABASE_URL, the Reloader annotation, and the +# Terragrunt-generated backend.tf/providers.tf/tiers.tf layout. The novel bit is +# the Goldmane mTLS client cert, reused from the operator's key pair (section 2). +# +# IMAGE: ghcr.io/viktorbarzin/goldmane-edge-aggregator is PRIVATE. Onboarding +# MUST add the "goldmane-edge-aggregator" namespace to the ghcr-credentials +# Kyverno allowlist (stacks/kyverno/modules/kyverno/ghcr-credentials.tf, +# local.ghcr_private_namespaces) so the Kyverno-synced `ghcr-credentials` secret +# is cloned into this namespace — otherwise the pulls 401. The imagePullSecrets +# reference below assumes that entry exists. +# ============================================================================= + +variable "postgresql_host" { type = string } + +# Plan-time root creds for the idempotent DB-init Job (mirrors claude-memory). +data "vault_kv_secret_v2" "secrets" { + mount = "secret" + name = "goldmane-edge-aggregator" +} + +# ----------------------------------------------------------------------------- +# 1.
Namespace +# ----------------------------------------------------------------------------- +resource "kubernetes_namespace" "goldmane_edge_aggregator" { + metadata { + name = "goldmane-edge-aggregator" + labels = { + name = "goldmane-edge-aggregator" + # Tier 4-aux: a small off-path consumer service, like claude-memory. + tier = local.tiers.aux + "keel.sh/enrolled" = "true" + } + } + lifecycle { + # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace + ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]] + } +} + +# ----------------------------------------------------------------------------- +# 2. Goldmane mTLS client certificate (minted from the Tigera CA) +# ----------------------------------------------------------------------------- +# The aggregator dials goldmane:7443 over mutual TLS. Goldmane requires a +# client cert on :7443 and verifies only that it chains to the Tigera CA (the +# same CA that issues Goldmane's serving cert). It does NOT authorize by +# client identity, so ANY Tigera-CA-signed cert is accepted. +# Rather than copy the Tigera CA PRIVATE KEY into TF +# state to mint our own (a needless CA-key exposure; the hashicorp/tls provider +# is also incompatible with this repo's global generate-providers/lockfile +# pattern), we REUSE the operator-minted, Tigera-CA-signed client cert +# `whisker-backend-key-pair` (calico-system). We never touch the CA key. +# Trade-off: if the operator rotates that cert, re-apply to re-sync (hardening +# follow-up: mint an own-identity cert in-namespace if Whisker is ever removed). +data "kubernetes_secret" "whisker_backend" { + metadata { + name = "whisker-backend-key-pair" + namespace = "calico-system" + } +} + +# The CA bundle that verifies Goldmane's serving cert.
It lives ONLY in +# calico-system (verified: ConfigMap `tigera-ca-bundle`, 2 keys present — +# `ca-bundle.crt` AND `tigera-ca-bundle.crt`, both the trusted bundle). We read +# it and recreate it as a ConfigMap in this namespace so the pod can mount it +# (a ConfigMap cannot be cross-namespace-mounted). +data "kubernetes_config_map" "tigera_ca_bundle" { + metadata { + name = "tigera-ca-bundle" + namespace = "calico-system" + } +} + +resource "kubernetes_config_map" "tigera_ca_bundle" { + metadata { + name = "tigera-ca-bundle" + namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name + } + # Copy the upstream bundle verbatim. We mount the `tigera-ca-bundle.crt` key + # at /etc/tigera-ca/tigera-ca-bundle.crt so the service's default + # CA_CERT_PATH (/etc/tigera-ca/tigera-ca-bundle.crt) resolves with no override. + data = data.kubernetes_config_map.tigera_ca_bundle.data +} + +# Client cert + key for mTLS to goldmane:7443, mounted at TLS_CERT_PATH / +# TLS_KEY_PATH defaults (/etc/goldmane-client-tls/tls.crt and .../tls.key). +# Sourced verbatim from the operator's whisker-backend client key-pair (read +# above) — already Tigera-CA-signed, which is all Goldmane verifies. No CA key +# is touched and no cross-namespace CA RBAC is needed. +resource "kubernetes_secret" "goldmane_client_tls" { + metadata { + name = "goldmane-client-tls" + namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name + } + type = "Opaque" + data = { + "tls.crt" = data.kubernetes_secret.whisker_backend.data["tls.crt"] + "tls.key" = data.kubernetes_secret.whisker_backend.data["tls.key"] + } +} + +# ----------------------------------------------------------------------------- +# 3. 
Postgres: DB + role `goldmane_edges`, Vault 7-day rotation, DATABASE_URL +# ----------------------------------------------------------------------------- +# Idempotent create of the role + DB using the CNPG root creds from Vault +# (dbaas_root_password), exactly mirroring claude-memory's db_init Job. The +# service creates the `edge` table itself at startup (migrations/0001_edge.sql), +# so no migration Job is needed. +resource "kubernetes_job" "db_init" { + metadata { + name = "goldmane-edges-db-init" + namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name + } + spec { + template { + metadata {} + spec { + container { + name = "db-init" + image = "postgres:16-alpine" + command = [ + "sh", "-c", + <<-EOT + set -e + # -d postgres: psql defaults the database name to the username; + # the root user has no root-named database, so be explicit. + PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_roles WHERE rolname='goldmane_edges'" | grep -q 1 || \ + PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE ROLE goldmane_edges WITH LOGIN PASSWORD '${data.vault_kv_secret_v2.secrets.data["db_password"]}'" + PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_database WHERE datname='goldmane_edges'" | grep -q 1 || \ + PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE DATABASE goldmane_edges OWNER goldmane_edges" + PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "GRANT ALL PRIVILEGES ON DATABASE goldmane_edges TO goldmane_edges" + echo "Database init complete" + EOT + ] + } + restart_policy = "Never" + } + } + 
backoff_limit = 3 + } + wait_for_completion = true + timeouts { + create = "2m" + } + lifecycle { + # KYVERNO_LIFECYCLE_V1: Kyverno injects dns_config (ndots=2); ignore it so + # this idempotent Job isn't replaced (Jobs are immutable) on every apply. + ignore_changes = [spec[0].template[0].spec[0].dns_config] + } +} + +# ExternalSecret projecting the Vault-rotated (7-day) credential into a K8s +# Secret as DATABASE_URL. The Vault DB static role `pg-goldmane-edges` and its +# place in the CNPG connection allowlist are added in stacks/vault/main.tf +# (see this stack's terragrunt.hcl note). remoteRef key: static-creds/pg-goldmane-edges. +resource "kubernetes_manifest" "db_external_secret" { + field_manager { + force_conflicts = true + } + manifest = { + apiVersion = "external-secrets.io/v1" + kind = "ExternalSecret" + metadata = { + name = "goldmane-edges-db-creds" + namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name + } + spec = { + refreshInterval = "15m" + secretStoreRef = { + name = "vault-database" + kind = "ClusterSecretStore" + } + target = { + name = "goldmane-edges-db-creds" + template = { + data = { + DATABASE_URL = "postgresql://goldmane_edges:{{ .password }}@${var.postgresql_host}:5432/goldmane_edges" + } + } + } + data = [{ + secretKey = "password" + remoteRef = { + key = "static-creds/pg-goldmane-edges" + property = "password" + } + }] + } + } + depends_on = [kubernetes_namespace.goldmane_edge_aggregator] +} + +# ----------------------------------------------------------------------------- +# 4. Slack webhook (reuse the alert-digest incoming webhook) +# ----------------------------------------------------------------------------- +# The monitoring alert-digest CronJob posts with the Slack incoming webhook at +# Vault secret/monitoring -> key `alertmanager_slack_api_url` +# (stacks/monitoring/modules/monitoring/alert_digest.tf). 
Project that same URL +# into this namespace as SLACK_WEBHOOK_URL via an ExternalSecret (no new +# webhook). The digest CronJob defaults to #security. +resource "kubernetes_manifest" "slack_external_secret" { + field_manager { + force_conflicts = true + } + manifest = { + apiVersion = "external-secrets.io/v1" + kind = "ExternalSecret" + metadata = { + name = "goldmane-edges-slack" + namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name + } + spec = { + refreshInterval = "1h" + secretStoreRef = { + name = "vault-kv" + kind = "ClusterSecretStore" + } + target = { + name = "goldmane-edges-slack" + } + data = [{ + secretKey = "SLACK_WEBHOOK_URL" + remoteRef = { + key = "viktor" + property = "alertmanager_slack_api_url" + } + }] + } + } + depends_on = [kubernetes_namespace.goldmane_edge_aggregator] +} + +# ----------------------------------------------------------------------------- +# 5. aggregate — Deployment (long-running gRPC stream -> Postgres upserts) +# ----------------------------------------------------------------------------- +resource "kubernetes_deployment" "aggregate" { + depends_on = [ + kubernetes_job.db_init, + kubernetes_manifest.db_external_secret, + ] + metadata { + name = "goldmane-edge-aggregator" + namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name + labels = { + app = "goldmane-edge-aggregator" + tier = local.tiers.aux + } + annotations = { + # Credential is env-injected and read only at startup; the 7-day rotation + # must bounce the pod or it keeps the stale password and silently fails + # DB auth (infra CLAUDE.md Reloader rule). + "secret.reloader.stakater.com/reload" = "goldmane-edges-db-creds" + } + } + spec { + # 1 replica: the edge set is a global upsert keyed on (src_ns, dst_ns, + # action); a second replica only doubles writes for no benefit (Goldmane + # streams per-flow). Stateless (no PVC) so RollingUpdate is fine. 
+ replicas = 1 + selector { + match_labels = { + app = "goldmane-edge-aggregator" + } + } + template { + metadata { + labels = { + app = "goldmane-edge-aggregator" + } + } + spec { + # PRIVATE ghcr image — cloned into this namespace by the Kyverno + # sync-ghcr-credentials allowlist policy (add this ns to that list). + image_pull_secrets { + name = "ghcr-credentials" + } + container { + name = "aggregate" + # CI (GHA -> ghcr) overwrites this to : via `kubectl set image`; + # the image tag is in ignore_changes below so the SHA sticks across + # `terragrunt apply` (fleet image-pin convention). Placeholder :latest + # until the deploy pipeline runs. + image = "ghcr.io/viktorbarzin/goldmane-edge-aggregator:latest" + args = ["aggregate"] + + # Goldmane mTLS. GOLDMANE_HOST default host sans port => + # ServerName "goldmane.calico-system.svc.cluster.local", which is a SAN + # on the live Goldmane serving cert (verified 2026-06-24: + # DNS:goldmane{,.calico-system{,.svc{,.cluster.local}}}). So no + # GOLDMANE_SERVER_NAME override and no GOLDMANE_TLS_INSECURE needed. + env { + name = "GOLDMANE_HOST" + value = "goldmane.calico-system.svc.cluster.local:7443" + } + # TLS_CERT_PATH / TLS_KEY_PATH / CA_CERT_PATH are left at their image + # defaults (/etc/goldmane-client-tls/tls.{crt,key} and + # /etc/tigera-ca/tigera-ca-bundle.crt) — the mounts below match them. + + env { + name = "DATABASE_URL" + value_from { + secret_key_ref { + name = "goldmane-edges-db-creds" + key = "DATABASE_URL" + } + } + } + + volume_mount { + name = "goldmane-client-tls" + mount_path = "/etc/goldmane-client-tls" + read_only = true + } + volume_mount { + name = "tigera-ca" + mount_path = "/etc/tigera-ca" + read_only = true + } + + resources { + # Idles low: a single gRPC stream + periodic upserts. requests=limits + # per the repo memory rule; no CPU limit (CFS throttling). Right-size + # later with krr. 
+ requests = { + cpu = "10m" + memory = "64Mi" + } + limits = { + memory = "64Mi" + } + } + } + + volume { + name = "goldmane-client-tls" + secret { + secret_name = kubernetes_secret.goldmane_client_tls.metadata[0].name + } + } + volume { + name = "tigera-ca" + config_map { + name = kubernetes_config_map.tigera_ca_bundle.metadata[0].name + } + } + } + } + } + lifecycle { + ignore_changes = [ + # CI pipeline owns the image tag (kubectl set image from GHA/Woodpecker). + spec[0].template[0].spec[0].container[0].image, + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 + metadata[0].annotations["keel.sh/policy"], + metadata[0].annotations["keel.sh/trigger"], + metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2 + metadata[0].annotations["keel.sh/match-tag"], + metadata[0].annotations["kubernetes.io/change-cause"], + metadata[0].annotations["deployment.kubernetes.io/revision"], + spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1 + ] + } +} + +# ----------------------------------------------------------------------------- +# 6. digest — daily CronJob (first-seen edges -> Slack) +# ----------------------------------------------------------------------------- +resource "kubernetes_cron_job_v1" "digest" { + depends_on = [ + kubernetes_job.db_init, + kubernetes_manifest.db_external_secret, + kubernetes_manifest.slack_external_secret, + ] + metadata { + name = "goldmane-edges-digest" + namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name + labels = { + app = "goldmane-edge-aggregator" + tier = local.tiers.aux + } + } + spec { + # Daily 08:00 Europe/London — aligns with the alert-digest cadence. 
+ schedule = "0 8 * * *" + timezone = "Europe/London" + concurrency_policy = "Forbid" + successful_jobs_history_limit = 3 + failed_jobs_history_limit = 3 + starting_deadline_seconds = 600 + + job_template { + metadata { + labels = { + app = "goldmane-edge-aggregator" + } + annotations = { + # 7-day DB rotation: bounce the Job pod's stale env (Reloader rule). + "secret.reloader.stakater.com/reload" = "goldmane-edges-db-creds" + } + } + spec { + backoff_limit = 2 + active_deadline_seconds = 300 + ttl_seconds_after_finished = 86400 + + template { + metadata { + labels = { + app = "goldmane-edge-aggregator" + } + } + spec { + restart_policy = "OnFailure" + image_pull_secrets { + name = "ghcr-credentials" + } + container { + name = "digest" + # CronJobs track :latest + imagePullPolicy: Always (fleet + # convention) so the daily run picks up the current image. + image = "ghcr.io/viktorbarzin/goldmane-edge-aggregator:latest" + image_pull_policy = "Always" + args = ["digest"] + + env { + name = "DATABASE_URL" + value_from { + secret_key_ref { + name = "goldmane-edges-db-creds" + key = "DATABASE_URL" + } + } + } + env { + name = "SLACK_WEBHOOK_URL" + value_from { + secret_key_ref { + name = "goldmane-edges-slack" + key = "SLACK_WEBHOOK_URL" + } + } + } + env { + name = "SLACK_CHANNEL" + # Posts to #alerts. The dedicated #security channel was abandoned + # 2026-06-25 — the shared alertmanager_slack_api_url webhook's + # Slack app isn't a member of it (channel override 404s), so all + # Slack (incl. alertmanager's security-lane alerts) consolidated + # to #alerts. See docs/runbooks/goldmane-flow-trail.md. + value = "#alerts" + } + + resources { + requests = { + cpu = "10m" + memory = "64Mi" + } + limits = { + memory = "64Mi" + } + } + } + } + } + } + } + } + lifecycle { + # KYVERNO_LIFECYCLE_V1 (CronJob path): Kyverno mutates dns_config with ndots=2. 
+ ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] + } +} + +# ----------------------------------------------------------------------------- +# 7. Egress (default-deny consideration) +# ----------------------------------------------------------------------------- +# Goldmane's own NetworkPolicy already allows INGRESS on 7443 from anywhere, so +# nothing is needed on the Goldmane side. No egress policy is declared here: +# this namespace is default-allow egress today. IF/WHEN it is brought under the +# wave-1 default-deny egress enforcement (per-namespace allowlists), add +# (Global)NetworkPolicy egress rules permitting: +# - goldmane.calico-system.svc.cluster.local:7443 (the flow stream) +# - pg-cluster-rw.dbaas.svc.cluster.local:5432 (Postgres) +# - hooks.slack.com:443 (digest -> Slack, internet) +# - kube-dns / CoreDNS :53 (DNS, every namespace) diff --git a/stacks/goldmane-edge-aggregator/terragrunt.hcl b/stacks/goldmane-edge-aggregator/terragrunt.hcl new file mode 100644 index 00000000..a1889d9e --- /dev/null +++ b/stacks/goldmane-edge-aggregator/terragrunt.hcl @@ -0,0 +1,24 @@ +include "root" { + path = find_in_parent_folders() +} + +# Tier-1 stack (PG state backend). The root terragrunt.hcl generates backend.tf +# (pg backend, schema_name = "goldmane-edge-aggregator"), providers.tf, +# cloudflare_provider.tf and tiers.tf automatically; do NOT hand-write those. +# This stack needs no extra provider: main.tf reuses the operator-minted +# Goldmane client cert instead of minting one with hashicorp/tls. + +dependency "platform" { + config_path = "../platform" + skip_outputs = true +} + +dependency "vault" { + config_path = "../vault" + skip_outputs = true +} + +# The Vault DB static role pg-goldmane-edges (7-day rotation) and the CNPG +# connection allowlist entry live in the vault stack (stacks/vault/main.tf). +# The vault dependency above orders this stack after it so the ExternalSecret +# can materialize the rotated credential on first apply.
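Reviewer sketch (not part of this change): the egress allowlist described in section 7 of `stacks/goldmane-edge-aggregator/main.tf` might look roughly like the resource below if the namespace is ever moved under default-deny egress. Everything here is an assumption for illustration: the resource name `egress_allowlist`, the choice of a plain `kubernetes_network_policy_v1` rather than a Calico (Global)NetworkPolicy, selecting destinations by the `kubernetes.io/metadata.name` namespace label, and `kube-system` as the DNS namespace. A standard NetworkPolicy cannot match `hooks.slack.com` by hostname, so that rule is left as a comment; it would need an `ip_block` or a policy engine that supports domain matching.

```hcl
# Hypothetical sketch only; names and selectors are assumptions, not repo code.
resource "kubernetes_network_policy_v1" "egress_allowlist" {
  metadata {
    name      = "egress-allowlist"
    namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
  }

  spec {
    # Empty selector: applies to every pod in the namespace.
    pod_selector {}
    policy_types = ["Egress"]

    # Goldmane flow stream (goldmane.calico-system.svc.cluster.local:7443).
    egress {
      to {
        namespace_selector {
          match_labels = {
            "kubernetes.io/metadata.name" = "calico-system"
          }
        }
      }
      ports {
        port     = "7443"
        protocol = "TCP"
      }
    }

    # Postgres (pg-cluster-rw.dbaas.svc.cluster.local:5432).
    egress {
      to {
        namespace_selector {
          match_labels = {
            "kubernetes.io/metadata.name" = "dbaas"
          }
        }
      }
      ports {
        port     = "5432"
        protocol = "TCP"
      }
    }

    # DNS; assumes the cluster DNS service runs in kube-system.
    egress {
      to {
        namespace_selector {
          match_labels = {
            "kubernetes.io/metadata.name" = "kube-system"
          }
        }
      }
      ports {
        port     = "53"
        protocol = "UDP"
      }
      ports {
        port     = "53"
        protocol = "TCP"
      }
    }

    # hooks.slack.com:443 is intentionally omitted: hostname-based egress is
    # not expressible in a standard NetworkPolicy.
  }
}
```

Selecting by namespace label keeps the rules stable across pod restarts and IP changes, at the cost of allowing the listed port to any pod in the target namespace.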
diff --git a/stacks/grampsweb/main.tf b/stacks/grampsweb/main.tf index 2d434ec7..139c6595 100644 --- a/stacks/grampsweb/main.tf +++ b/stacks/grampsweb/main.tf @@ -5,6 +5,9 @@ variable "tls_secret_name" { variable "nfs_server" { type = string } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/hackmd/main.tf b/stacks/hackmd/main.tf index bbe6db40..2e065c99 100644 --- a/stacks/hackmd/main.tf +++ b/stacks/hackmd/main.tf @@ -208,6 +208,9 @@ module "ingress" { } resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/health/main.tf b/stacks/health/main.tf index 36fd17d6..7baf5f9c 100644 --- a/stacks/health/main.tf +++ b/stacks/health/main.tf @@ -250,6 +250,9 @@ module "ingress_test" { } resource "kubernetes_manifest" "external_secret_db" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -284,6 +287,9 @@ resource "kubernetes_manifest" "external_secret_db" { } resource "kubernetes_manifest" "external_secret_kv" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/hermes-agent/main.tf b/stacks/hermes-agent/main.tf index 1293d7a5..fff8578b 100644 --- a/stacks/hermes-agent/main.tf +++ b/stacks/hermes-agent/main.tf @@ -37,6 +37,9 @@ module "tls_secret" { # --- Secrets (ESO from Vault) --- resource "kubernetes_manifest" "external_secret" { + field_manager { + force_conflicts = true + } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/immich/frame-emo.tf b/stacks/immich/frame-emo.tf new file mode 100644 index 00000000..577d84af --- /dev/null +++ b/stacks/immich/frame-emo.tf @@ -0,0 +1,155 @@ 
+# Immich photo-frame for Emo (emil.barzin@gmail.com) — a second instance cloned +# from the London frame in frame.tf, scoped to Emo's Immich account + Sofia +# weather. Served at highlights-immich-emo.viktorbarzin.me and shown on Emo's +# Portal Mini (Sofia) via the portal-immich-frame app. +# API key: Vault secret/immich -> frame_api_key_emo (minted on Emo's account). + +resource "kubernetes_config_map" "frame_config_emo" { + metadata { + name = "config-emo" + namespace = "immich" + + labels = { + app = "frame-config-emo" + } + annotations = { + "reloader.stakater.com/match" = "true" + } + } + + data = { + "Settings.yml" = <<-EOF + General: + Layout: single + Interval: 45 + ImageZoom: true + ShowAlbumName: false + ShowProgressBar: false + ClockFormat: "HH:mm" + PhotoDateFormat: "dd/MM/yyyy" + WeatherApiKey: ${data.vault_kv_secret_v2.secrets.data["frame_weather_api_key"]} + UnitSystem: metric + WeatherLatLong: "42.6977,23.3219" + Language: en + Accounts: + - ImmichServerUrl: http://immich.viktorbarzin.me + ApiKey: ${data.vault_kv_secret_v2.secrets.data["frame_api_key_emo"]} + ImagesFromDays: 730 + EOF + } +} + + +resource "kubernetes_deployment" "immich-frame-emo" { + metadata { + name = "immich-frame-emo" + namespace = "immich" + annotations = { + "reloader.stakater.com/search" = "true" + } + labels = { + tier = local.tiers.gpu + } + } + + spec { + replicas = 1 + selector { + match_labels = { + app = "immich-frame-emo" + } + } + strategy { + type = "RollingUpdate" + } + template { + metadata { + labels = { + app = "immich-frame-emo" + } + annotations = { + "dependency.kyverno.io/wait-for" = "immich-server.immich:2283" + } + } + spec { + container { + image = "ghcr.io/immichframe/immichframe:v1.0.32.0" + name = "immich-frame-emo" + resources { + requests = { + cpu = "10m" + memory = "64Mi" + } + limits = { + memory = "128Mi" + } + } + port { + container_port = 8080 + protocol = "TCP" + name = "http" + } + volume_mount { + name = "config" + mount_path = 
"/app/Config" + read_only = true + } + } + volume { + name = "config" + config_map { + name = "config-emo" + } + } + } + } + } + lifecycle { + ignore_changes = [ + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 + metadata[0].annotations["keel.sh/policy"], + metadata[0].annotations["keel.sh/trigger"], + metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2 + metadata[0].annotations["keel.sh/match-tag"], + metadata[0].annotations["kubernetes.io/change-cause"], + metadata[0].annotations["deployment.kubernetes.io/revision"], + spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1 + spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE + ] + } +} + + +resource "kubernetes_service" "immich-frame-emo" { + metadata { + name = "immich-frame-emo" + namespace = "immich" + labels = { + "app" = "immich-frame-emo" + } + } + + spec { + selector = { + app = "immich-frame-emo" + } + port { + port = 80 + target_port = 8080 + } + } +} + +module "ingress_emo" { + source = "../../modules/kubernetes/ingress_factory" + # Photo-frame kiosk display on Emo's Portal — headless browser pulling images + # via an Immich API key (no user login). Forward-auth would 302 the device to + # Authentik with no way to complete login. + # auth = "none": photo-frame kiosk; headless browser with API key; no user login. 
+  auth = "none"
+  dns_type = "proxied"
+  namespace = "immich"
+  name = "highlights-immich-emo"
+  tls_secret_name = var.tls_secret_name
+  service_name = "immich-frame-emo"
+}
diff --git a/stacks/immich/main.tf b/stacks/immich/main.tf
index 3009be5e..809d6a2e 100644
--- a/stacks/immich/main.tf
+++ b/stacks/immich/main.tf
@@ -162,6 +162,9 @@ resource "kubernetes_resource_quota" "immich" {
 }
 
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/insta2spotify/main.tf b/stacks/insta2spotify/main.tf
index 9770afd3..5e1cc4ef 100644
--- a/stacks/insta2spotify/main.tf
+++ b/stacks/insta2spotify/main.tf
@@ -20,6 +20,9 @@ resource "kubernetes_namespace" "insta2spotify" {
 }
 
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/instagram-poster/modules/instagram-poster/main.tf b/stacks/instagram-poster/modules/instagram-poster/main.tf
index 65714739..7dc3f846 100644
--- a/stacks/instagram-poster/modules/instagram-poster/main.tf
+++ b/stacks/instagram-poster/modules/instagram-poster/main.tf
@@ -35,6 +35,14 @@ resource "kubernetes_namespace" "instagram_poster" {
 # - immich_tag_instagram (optional — auto-resolved if missing)
 # - immich_tag_posted (optional — auto-resolved if missing)
 resource "kubernetes_manifest" "external_secret" {
+  # The external-secrets controller takes server-side-apply ownership of
+  # .spec.refreshInterval, so a plain TF apply conflicts. force_conflicts lets
+  # TF win (values match, so it's stable) — same pattern as grafana/woodpecker/
+  # traefik/k8s-version-upgrade. Surfaced 2026-06-24 by the first IG apply since
+  # the ESO v1 migration (the scale-to-0 push).
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -139,6 +147,11 @@ resource "kubernetes_manifest" "external_secret" {
 # ESO refreshes the K8s Secret every 15m. `reloader.stakater.com/match`
 # bounces the pod when the password changes.
 resource "kubernetes_manifest" "benchmark_db_external_secret" {
+  # See external_secret above — ESO owns .spec.refreshInterval; force_conflicts
+  # lets the TF apply win instead of erroring on the field-manager conflict.
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -227,7 +240,11 @@ resource "kubernetes_deployment" "instagram_poster" {
   }
 
   spec {
-    replicas = 1
+    # Scaled to 0 (2026-06-24): Instagram Graph integration is unused and its
+    # ExternalSecret is dead (missing ig_graph_long_lived_token /
+    # ig_business_account_id in Vault secret/instagram-poster). Set back to 1
+    # after minting a Meta long-lived token and populating those keys.
+    replicas = 0
     # RWO PVC — cannot rolling-update.
     strategy {
       type = "Recreate"
diff --git a/stacks/job-hunter/main.tf b/stacks/job-hunter/main.tf
index a008e83c..94927bf6 100644
--- a/stacks/job-hunter/main.tf
+++ b/stacks/job-hunter/main.tf
@@ -41,6 +41,9 @@ resource "kubernetes_namespace" "job_hunter" {
 # digest_to_address — where the weekly digest goes
 # digest_from_address — From: header for the digest
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -105,6 +108,9 @@ resource "kubernetes_manifest" "external_secret" {
 # DB credentials from Vault database engine (7-day rotation).
 # Template builds the asyncpg DSN consumed by the FastAPI app as DB_CONNECTION_STRING.
 resource "kubernetes_manifest" "db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -325,6 +331,9 @@ resource "kubernetes_service" "job_hunter" {
 # references it as $__env{JOB_HUNTER_PG_PASSWORD}. Reloader restarts
 # Grafana whenever ESO updates this secret (every 7d on rotation).
 resource "kubernetes_manifest" "grafana_job_hunter_db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/k8s-dashboard/oauth2_proxy.tf b/stacks/k8s-dashboard/oauth2_proxy.tf
index 5ed73793..032d5057 100644
--- a/stacks/k8s-dashboard/oauth2_proxy.tf
+++ b/stacks/k8s-dashboard/oauth2_proxy.tf
@@ -5,6 +5,9 @@
 # -----------------------------------------------------------------------------
 
 resource "kubernetes_manifest" "oauth2_proxy_externalsecret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/+page.svelte b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/+page.svelte
index 2d13fa39..7b617fd0 100644
--- a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/+page.svelte
+++ b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/+page.svelte
@@ -5,9 +5,11 @@

Kubernetes Access Portal

-
- VPN Required — The cluster is on a private network. You need Headscale VPN access before kubectl will work.
- See the Getting Started guide for VPN setup instructions.
+
+ Fastest way in: open the web terminal or the
+ dashboard and sign in — no install, no VPN needed. Prefer your
+ own machine? The local-setup guide covers VPN + kubectl, and the
+ Getting Started page compares all three access paths.
@@ -26,6 +28,7 @@

Assigned namespaces: {data.namespaces.join(', ')}

Quick Commands

+

Run these as-is in the web terminal — it's already signed in as you.

 # Check your pods
 kubectl get pods -n {data.namespaces[0]}
@@ -47,16 +50,23 @@ vault write kubernetes/creds/{data.namespaces[0]}-deployer \
 
 	

Get Started

+

No setup — start now

+
    +
  1. Open the web terminal — a ready shell with kubectl, Vault and your repos already set up
  2. +
  3. Open the dashboard — point-and-click view of your workloads
  4. +
+

On your own machine

    {#if data.role === 'namespace-owner'} -
  1. Complete the namespace-owner onboarding guide
  2. +
  3. Follow the namespace-owner setup (VPN, kubectl, Vault, encrypted state)
  4. {:else} -
  5. Complete the onboarding guide (VPN, kubectl, git)
  6. +
  7. Follow the local setup (VPN, kubectl, git)
  8. {/if}
  9. Install kubectl and kubelogin
  10. Download your kubeconfig
  11. Run kubectl get namespaces to verify access
+

Compare all three access paths →

@@ -91,12 +101,12 @@ vault write kubernetes/creds/{data.namespaces[0]}-deployer \
     border-radius: 6px;
     margin: 1rem 0;
   }
-  .callout.warning {
-    background: #fff3cd;
-    border-left: 4px solid #ffc107;
+  .callout.info {
+    background: #e8f4fd;
+    border-left: 4px solid #2196f3;
   }
   .callout a {
-    color: #856404;
+    color: #0d47a1;
     font-weight: 600;
   }
diff --git a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/onboarding/+page.svelte b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/onboarding/+page.svelte
index d6ec35b9..6b2d73dd 100644
--- a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/onboarding/+page.svelte
+++ b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/onboarding/+page.svelte
@@ -5,87 +5,175 @@

Getting Started

-

Welcome! Follow these steps to get access to the home Kubernetes cluster.

- - +

+ Welcome! There are three ways to reach the home Kubernetes cluster. Pick the one that fits —
+ the first two need zero setup and open right in your browser.
+

-

Step 0 — Join the VPN

-

The cluster is on a private network (10.0.20.0/24). You need VPN access first.

+

Three ways in

+ + + + + + + + + + + + + + + + + + + +
Path | Best for | Setup
A — Web terminal | Just want to start working now | None — opens in your browser
B — Web dashboard | Click around, watch your app, read logs | None — opens in your browser
C — Your own machine | kubectl / Terraform locally, full control | VPN + one-line installer
+
+ Not sure? Start with the web terminal (Path A).
+ Everything is already installed and your repos are already cloned — you can run your first
+ kubectl command within a minute, from any device.
+
+
+ +
+

Path A — Web terminal Recommended No setup

+

+ A full terminal that runs in your browser — nothing to install, works from any device
+ (even a tablet). It drops you into your own account on the shared workstation, with every
+ tool already set up.
+

    -
  1. Install Tailscale for your OS
  2. -
  3. Run this in your terminal: -
    tailscale login --login-server https://headscale.viktorbarzin.me
    +
  4. Open t3.viktorbarzin.me
  5. +
  6. Sign in with your Authentik account (the same SSO login as this portal)
  7. +
  8. You land in a ready-to-use shell. Try it: +
    kubectl get pods -n YOUR_NAMESPACE
  9. -
  10. A browser window will open with a registration URL
  11. -
  12. Send that URL to Viktor via email (vbarzin@gmail.com) or Slack
  13. -
  14. Wait for approval (usually within a few hours)
  15. -
  16. Once approved, test:
    ping 10.0.20.100
+
+ Already done for you on the workstation: +
    +
  • kubectl + your kubeconfig, scoped to your namespaces (no login dance)
  • +
  • vault, terragrunt, terraform, sops, kubeseal
  • +
  • Your repos cloned under ~/code — the infra repo plus your own project repos
  • +
  • Claude Code, ready to pair with you on changes
  • +
+
+
+ No access yet? The workstation is provisioned per person. If
+ t3.viktorbarzin.me says you're not authorized, ask Viktor to add you
+ (vbarzin@gmail.com or Slack).
+
-
-

Step 1 — Log in to the portal

-

Visit k8s-portal.viktorbarzin.me and sign in with your Authentik account.

-

If you don't have an account yet, ask Viktor to create one.

+
+

Path B — Web dashboard No setup

+

+ A point-and-click view of the cluster — browse your pods, read logs, restart a deployment,
+ check events. Nothing to install.
+

+
    +
  1. Open k8s.viktorbarzin.me
  2. +
  3. Sign in with your Authentik account
  4. +
+ You're dropped straight into the Kubernetes Dashboard, already authenticated as you —
+ no token to paste. The portal injects your personal access token for you.
+
  6. +
+
+ Scoped to your namespace(s): you can see and manage your own workloads, but not other
+ tenants'. This path uses a per-user token that does not depend on CLI login, so it
+ keeps working even if kubectl OIDC login is having a bad day — making it the
+ reliable fallback for Path C.
+
-
-

Step 2 — Set up kubectl

-

Run one of these commands in your terminal to install everything automatically:

-

macOS

-

Requires Homebrew. Install it first if you don't have it.

-
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=mac)
-

Linux

-
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=linux)
-

Windows

-

Use WSL2 and follow the Linux instructions.

-
+
+

Path C — From your own machine

+

+ For running kubectl, vault and Terraform locally. This is the most
+ powerful path and the one to use for infrastructure changes — it just needs a bit more setup
+ because the cluster API lives on a private network.
+

+ + +

+ {#if showNamespaceOwner}
+ Namespace owner — you'll also set up Vault and encrypted Terraform state so you can deploy
+ your own app stacks.
+ {:else}
+ General user — VPN, kubectl and git access. (Managing your own app stack? Switch to the
+ Namespace Owner tab above.)
+ {/if}
+

- {#if showNamespaceOwner}
-

Step 3 — Log into Vault

-

Vault manages your secrets and issues dynamic Kubernetes credentials.

-
vault login -method=oidc
-

This opens your browser for Authentik SSO. After login, your token is saved to ~/.vault-token.

+

Step 1 — Join the VPN

+

The cluster API is on a private network (10.0.20.0/24), so you need VPN access first.

+
    +
  1. Install Tailscale for your OS
  2. +
  3. Run this in your terminal: +
    tailscale login --login-server https://headscale.viktorbarzin.me
    +
  4. +
  5. A browser window opens with a registration URL
  6. +
  7. Send that URL to Viktor via email (vbarzin@gmail.com) or Slack
  8. +
  9. Wait for approval (usually within a few hours)
  10. +
  11. Once approved, test:
    ping 10.0.20.100
  12. +
-

Step 4 — Verify kubectl access

-

Run this command. It will open your browser for OIDC login the first time:

-
kubectl get pods -n YOUR_NAMESPACE
-

You should see an empty list (no resources) or your running pods.

+

Step 2 — Install the tools

+

Run one of these to install everything automatically (kubectl, kubelogin, vault, terragrunt, terraform, kubeseal) and write your kubeconfig to ~/.kube/config-home:

+

macOS

+

Requires Homebrew. Install it first if you don't have it.

+
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=mac)
+

Linux

+
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=linux)
+

Windows

+

Use WSL2 and follow the Linux instructions.

-

Step 5 — Clone the infra repo

-
git clone https://github.com/ViktorBarzin/infra.git
+			

Step 3 — Verify access

+

Run this. The first time, it opens your browser for SSO login:

+
kubectl get {showNamespaceOwner ? 'pods -n YOUR_NAMESPACE' : 'namespaces'}
+

You should see your resources (or an empty list if you haven't deployed anything yet).

+
+ Browser login loops, or kubectl says "Unauthorized"? Command-line SSO
+ (OIDC) can occasionally be unavailable. When that happens, use the
+ web dashboard (Path B) or the
+ web terminal (Path A) — both authenticate a different way and
+ keep working — and let Viktor know.
+
+

Connection error instead? Make sure the VPN is up: tailscale status.

+
+ + {#if showNamespaceOwner} +
+

Step 4 — Log into Vault

+

Vault manages your secrets and issues dynamic Kubernetes credentials.

+
vault login -method=oidc
+

This opens your browser for Authentik SSO. After login, your token is saved to ~/.vault-token.

+
+ +
+

Step 5 — Clone the infra repo

+
git clone https://github.com/ViktorBarzin/infra.git
 cd infra
-

This is where all the infrastructure configuration lives. Terraform state is committed as encrypted files.

-
+

This is where all the infrastructure configuration lives. Terraform state is committed as encrypted files.

+
-
-

Step 6 — Install tools

-

You need sops and terragrunt to work with infrastructure state:

-

macOS

-
brew install sops terragrunt
-

Linux

-
# sops
-curl -LO https://github.com/getsops/sops/releases/latest/download/sops-v3.9.4.linux.amd64
-sudo mv sops-*.linux.amd64 /usr/local/bin/sops && sudo chmod +x /usr/local/bin/sops
-
-# terragrunt
-curl -LO https://github.com/gruntwork-io/terragrunt/releases/latest/download/terragrunt_linux_amd64
-sudo mv terragrunt_linux_amd64 /usr/local/bin/terragrunt && sudo chmod +x /usr/local/bin/terragrunt
-
- -
-

Step 7 — Decrypt your state

-

Terraform state is encrypted with SOPS. Your Vault login gives you access to only your stacks.

-
# Make sure you're logged into Vault
+			
+

Step 6 — Decrypt your state

+

Terraform state is encrypted with SOPS. Your Vault login gives you access to only your stacks.

+
# Make sure you're logged into Vault
 vault login -method=oidc
 
 # Decrypt your stack's state
@@ -95,160 +183,157 @@ scripts/state-sync decrypt YOUR_NAMESPACE
 cd stacks/YOUR_NAMESPACE
 ../../scripts/tg plan
-
-

How state encryption works

-
-
-
vault login -method=oidc
-
-
Authentik SSO
-
-
~/.vault-token
-
-
-
-
scripts/tg plan
-
-
state-sync decrypt
-
-
Vault Transit
sops-state-YOUR_NS
-
-
-
-
terragrunt plan/apply
-
-
state-sync encrypt
-
-
git commit + push
+
+

How state encryption works

+
+
+
vault login -method=oidc
+
+
Authentik SSO
+
+
~/.vault-token
+
+
+
+
scripts/tg plan
+
+
state-sync decrypt
+
+
Vault Transit
sops-state-YOUR_NS
+
+
+
+
terragrunt plan/apply
+
+
state-sync encrypt
+
+
git commit + push
+
-
-
- Access control: You can only decrypt state for your own namespaces.
- Each namespace has its own Vault Transit encryption key. Your Vault policy
- (sops-user-YOUR_USERNAME) only grants access to your keys.
-
-
+
+ Access control: You can only decrypt state for your own namespaces.
+ Each namespace has its own Vault Transit encryption key. Your Vault policy
+ (sops-user-YOUR_USERNAME) only grants access to your keys.
+
+
-
-

Step 8 — Create your first app stack

-
    -
  1. Copy the template:
    cp -r stacks/_template stacks/myapp
    +			
    +

    Step 7 — Create your first app stack

    +
      +
    1. Copy the template:
      cp -r stacks/_template stacks/myapp
       mv stacks/myapp/main.tf.example stacks/myapp/main.tf
    2. -
    3. Edit stacks/myapp/main.tf — replace all <placeholders>
    4. -
    5. Store secrets in Vault: -
      vault kv put secret/YOUR_USERNAME/myapp DB_PASSWORD=secret123
      -
    6. -
    7. Apply your stack: -
      cd stacks/myapp && ../../scripts/tg apply
      -
    8. -
    9. Commit encrypted state: -
      cd ../..
      +					
    10. Edit stacks/myapp/main.tf — replace all <placeholders>
    11. +
    12. Store secrets in Vault: +
      vault kv put secret/YOUR_USERNAME/myapp DB_PASSWORD=secret123
      +
    13. +
    14. Apply your stack: +
      cd stacks/myapp && ../../scripts/tg apply
      +
    15. +
    16. Commit encrypted state: +
      cd ../..
       git add stacks/myapp/ state/stacks/myapp/terraform.tfstate.enc
       git commit -m "add myapp stack"
       git push
      -
    17. -
    -
    +
  2. +
+
-
-

Architecture Overview

-

Here's how your changes flow through the system:

+
+

Architecture Overview

+

Here's how your changes flow through the system:

-
-

Apply workflow

-
-
-
Your Machine
-
git pull
-
-
scripts/tg plan
-
auto-decrypt
-
scripts/tg apply
-
auto-encrypt
-
git push
-
-
-
Vault
-
OIDC auth
Authentik SSO
-
-
Transit decrypt
sops-state-*
-
-
Transit encrypt
per-stack key
-
-
-
Cluster
-
K8s API
-
-
Your namespace
pods, services
-
-
Traefik ingress
*.viktorbarzin.me
+
+

Apply workflow

+
+
+
Your Machine
+
git pull
+
+
scripts/tg plan
+
auto-decrypt
+
scripts/tg apply
+
auto-encrypt
+
git push
+
+
+
Vault
+
OIDC auth
Authentik SSO
+
+
Transit decrypt
sops-state-*
+
+
Transit encrypt
per-stack key
+
+
+
Cluster
+
K8s API
+
+
Your namespace
pods, services
+
+
Traefik ingress
*.viktorbarzin.me
+
-
-
-

Security model

- - - - - - - - - -
Layer | What | How
Authentication | Who are you? | Authentik SSO (OIDC) → Vault token
Authorization | What can you access? | Vault policy (sops-user-*) scoped to your namespaces
Encryption at rest | State in git | SOPS + Vault Transit (per-stack key)
Encryption fallback | Bootstrap / DR | age keys (admin only)
Network | Cluster access | Headscale VPN (private 10.0.20.0/24)
-
-
- {:else} -
-

Step 3 — Verify access

-

Run this command. It will open your browser for login the first time:

-
kubectl get namespaces
-

You should see output like:

-
NAME              STATUS   AGE
-default           Active   200d
-kube-system       Active   200d
-monitoring        Active   200d
-...
-

If you get a connection error, make sure your VPN is connected (tailscale status).

-
- -
-

Step 4 — Clone the repo

-
git clone https://github.com/ViktorBarzin/infra.git
+				
+

Security model

+ + + + + + + + + +
Layer | What | How
Authentication | Who are you? | Authentik SSO (OIDC) → Vault token
Authorization | What can you access? | Vault policy (sops-user-*) scoped to your namespaces
Encryption at rest | State in git | SOPS + Vault Transit (per-stack key)
Encryption fallback | Bootstrap / DR | age keys (admin only)
Network | Cluster access | Headscale VPN (private 10.0.20.0/24)
+
+
+ {:else} +
+

Step 4 — Clone the repo

+
git clone https://github.com/ViktorBarzin/infra.git
 cd infra
-

This is where all the infrastructure configuration lives.

-
+

This is where all the infrastructure configuration lives.

+
-
-

Step 5 — Your first change

-
    -
  1. Create a branch:
    git checkout -b my-first-change
  2. -
  3. Edit a service file (e.g., change an image tag in stacks/echo/main.tf)
  4. -
  5. Commit and push:
    git add . && git commit -m "my first change" && git push -u origin my-first-change
  6. -
  7. Open a Pull Request on GitHub
  8. -
  9. Viktor reviews and merges
  10. -
  11. Woodpecker CI automatically applies the change to the cluster
  12. -
  13. Slack notification confirms it worked
  14. -
-
- {/if} +
+

Step 5 — Your first change

+
    +
  1. Create a branch:
    git checkout -b my-first-change
  2. +
  3. Edit a service file (e.g., change an image tag in stacks/echo/main.tf)
  4. +
  5. Commit and push:
    git add . && git commit -m "my first change" && git push -u origin my-first-change
  6. +
  7. Open a Pull Request on GitHub
  8. +
  9. Viktor reviews and merges
  10. +
  11. Woodpecker CI automatically applies the change to the cluster
  12. +
  13. Slack notification confirms it worked
  14. +
+
+ {/if} +