diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index 496f30d7..9c873a07 100755
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -16,7 +16,6 @@
 **ALL infrastructure changes MUST go through Terraform/Terragrunt.** Never use `kubectl apply/edit/patch/set`, `helm install/upgrade`, or any manual cluster mutation as the final state.
 
 - **No exceptions for "quick fixes"** — even one-line changes must be in `.tf` files and applied via `scripts/tg apply`
-- **Apply locally OR let CI do it — but ALWAYS commit.** You don't have to wait for CI: with apply access you MAY run the apply yourself (`scripts/tg apply ` / `homelab tf apply `), but **from the main checkout, never a worktree** (git-crypt'd `*.tfvars` come through as ciphertext under the worktree filter-bypass, so a worktree apply reads garbage). **Every applied change MUST be committed and pushed to `master` the same session** — the repo is the source of truth, so applied-but-uncommitted HCL is drift that the next CI apply / daily drift-detection will try to revert. Order either way: apply locally then commit + push (CI's changed-stack apply then no-ops), or commit + push and let CI apply. Never apply an uncommitted edit; never leave a committed change unapplied.
 - **kubectl is for read-only operations and temporary debugging only** (get, describe, logs, exec, port-forward)
 - **If a resource isn't in Terraform yet**, evaluate whether it can be added before making manual changes. If manual change is unavoidable (e.g., emergency), document it immediately and create the Terraform resource in the same session
 - **kubectl scale/patch during migrations is acceptable** as a transient step, but the final state must be in Terraform and applied via `scripts/tg apply`
@@ -204,7 +203,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
 - **PDBs**: minAvailable=2 on Traefik and Authentik.
 - **Fallback proxies**: basicAuth when Authentik is down, fail-open when poison-fountain is down.
 - **CrowdSec enforcement is out-of-band** (no Traefik plugin/middleware — the dead Yaegi `crowdsec-bouncer-traefik-plugin` was removed on Traefik 3.7.5): banned IPs are dropped **in-kernel via nftables** by the `cs-firewall-bouncer` DaemonSet on **direct** hosts (drops in BOTH the `input` and `forward` hooks — Traefik is ETP=Local so client traffic is DNAT'd to the pod via `forward`; pulls ALL decisions incl. the ~31k CAPI blocklist), and **blocked at the Cloudflare edge** for **proxied** hosts (one `crowdsec_ban` Rules List + a zone WAF block rule, fed by the `crowdsec-cf-sync` CronJob in `rybbit` ns every 2 min — excludes CAPI). Zero per-request latency; **fails open** (LAPI down → no new bans, existing drops persist, legit traffic never blocked). Whitelist covers RFC1918 + tailnet + internal CIDRs. Full as-built: `docs/architecture/security.md`.
-- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations), authentik 100/1000 on `/`+`/static` (login SPA cold-loads ~70 flow chunks from `/static`; default burst 429'd them → blank login screen).
+- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations).
 - **Retry middleware**: 2 attempts, 100ms — in default ingress chain.
 - **Entrypoint transport timeouts** (`websecure` `respondingTimeouts`): `writeTimeout=0` (unlimited download duration), `readTimeout=3600s` (uploads ≤1h), `idleTimeout=600s`. These are **HARD total-duration caps**, not nginx-style per-read idle timeouts — a finite `writeTimeout` truncates *any* large download at that wall-clock mark (a prior `writeTimeout=60s` silently cut Immich videos at 60s). **Do NOT re-tighten `writeTimeout`**; keep `readTimeout` finite (slow-loris backstop) but ≥ longest expected upload. Full rationale: `docs/architecture/networking.md` → "Entrypoint Transport Timeouts".
 - **HTTP/3 (QUIC)**: Enabled on Traefik. Works for **direct (non-proxied) apps** via the dedicated LB IP below (ETP=Local). Proxied apps get QUIC at the Cloudflare edge.
@@ -219,7 +218,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
 | Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
 | CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob. **Enforcement is out-of-band, NOT a Traefik plugin** (the Yaegi `crowdsec-bouncer-traefik-plugin` was dead on Traefik 3.7.5 and removed): `cs-firewall-bouncer` DaemonSet drops in-kernel via nftables on direct hosts (bouncer key `firewall`, v0.0.34 binary fetched at runtime, hostNetwork+NET_ADMIN, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`); `crowdsec-cf-sync` CronJob blocks at the CF edge for proxied hosts (bouncer key `kvsync`, `stacks/rybbit/crowdsec_edge.tf`). Both fail open. See `docs/architecture/security.md` |
 | Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
-| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control; `/`+`/static` use a dedicated `authentik-rate-limit` (100/1000) so the cold-load chunk burst isn't 429'd into a blank screen. **Reliability (2026-06-28): the chart key is `deploymentStrategy`, NOT `strategy`** — the old `strategy:` key was inert, so live ran the chart default 25%/25% and dropped a server pod out of rotation on every roll; now `maxSurge:1/maxUnavailable:0`. Readiness `failureThreshold:8` (~80s, was 30s): the DB-coupled `/-/health/ready/` returns 503 on a PG/pgbouncer blip, and with too-tight tolerance all 3 server pods left the Service at once → Traefik 502/504 (the episodic blank-screen + 30s-hang). gunicorn `max_requests=10000`/jitter=1000 decorrelates worker recycles from DB blips. Redis is GONE since 2026.2 (sessions+cache+channels on PostgreSQL, no external-cache option) — a short PG transient is now survived, but a TOTAL CNPG outage still takes authentik down. **Custom overlay image (2026-06-28):** server+worker run `ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3` (built by `.github/workflows/build-authentik.yml` from `stacks/authentik/Dockerfile` + `patch-compat-sfe.py`) with TWO guarded patches: **#1 SLOW-1a** — narrows the identification-stage `select_subclasses()` query (~1.4s→~14ms; bare upstream call LEFT-JOINs every source subtype); **#2 old-browser blank login** — `patch-compat-sfe.py` (a) extends `compat_needs_sfe()` to serve authentik's built-in no-JS **SFE** login to old Safari/WebKit AND **any iOS browser** (Chrome/CriOS, Firefox/FxiOS — all share the system WebKit) on iOS≤16.3, and (b) **injects static social-login `` links into the SFE shell** (`flow-sfe.html`) since the SFE can't render Identification-stage sources — required for password-less accounts (e.g. emo = Google-only). The modern flow SPA is ES2022 (needs Safari 16.4+) and renders BLANK on older WebKit; every iOS browser shares that WebKit, so it's not browser-choice (emo's iPadOS-15.8 iPad hit this). SFE = the *real* authentik login (password + MFA + reputation, no auth downgrade) — chosen over a Traefik basic-auth fallback which would have put a spoofable-UA single password in front of `vbarzin→wizard` passwordless-root. Social link = plain redirect to `/source/oauth/login//` (works on any browser); slugs (google/github/facebook) are static — re-verify on source changes. **Keel un-enrolled** for the ns → image pinned in `global.image` (repo+tag), **upgraded manually**: bump the Dockerfile `FROM` + the values tag (+ re-verify both patches) together, GHA rebuilds, then apply. |
+| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control. |
 | Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
 | MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
 | phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
@@ -232,10 +231,9 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
 - Alertmanager is now scraped (`extraScrapeConfigs` job `alertmanager`) → `alertmanager_notifications_total`/`_alerts`/`_notifications_failed_total` available; it had no `prometheus.io/scrape` annotation so notification volume was previously unmeasurable.
 - Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
 - **External monitoring**: `[External] ` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
-- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction), AuthentikRootRouter5xxHigh (all-3-server-pods-NotReady cascade → 502/503/504 on the authentik `/` router). **The Traefik scrape keeps `traefik_router_requests_total`** (per-router `code` label) — the drop-regex in the `traefik` scrape job drops only the high-cardinality `*_duration_seconds_bucket` histogram, NOT the request counter, so per-router 429/5xx is queryable + alertable.
+- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).
 - **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
-- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → posts to `#alerts` via the `slack-security` receiver, which keeps its `[SECURITY]` styling; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI ''` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
-- **pfSense egress / WAN monitoring** (added 2026-06-28 after the 2026-06-27 egress-only incident — pfSense VMID 101 stopped passing internet egress for ~20 min while internal routing + Unbound stayed up, and NOTHING alerted: no egress probe existed and the cloudflared replica metric stayed green): `blackbox-exporter` gained `icmp_egress` + `dns_external` modules (+ `NET_RAW` on the pod) in `authentik_walloff_probe.tf`. Three in-cluster probe jobs (`wan-gateway-icmp` → 192.168.1.1, `internet-egress-icmp` → 9.9.9.9/1.1.1.1, `internet-egress-dns` → cloudflare.com via both) traverse the pod→node→pfSense-NAT path that fails. Alerts (group `Egress / pfSense` in `alerting_rules.yml`): `WANGatewayUnreachable`, `InternetEgressDown` (`max()==0` = both providers dead, not a single-provider blip), `ExternalDNSResolutionDown`, `EgressOnlyDivergence` (t3-probe `cloudflare` leg down WHILE `internal` leg up — the incident signature, reuses the existing t3-probe), `PfSenseVMDown` (`pve_up{id="qemu/101"}==0` while host up — does NOT catch a guest-internal reboot, `pve_up` tracks the qemu process). Plus Loki ruler `CloudflaredTunnelConnLoss` (>20 edge-conn failures/5m; calibrated live: steady-state ~2/6h vs 37-85/5m in-incident; the cloudflared replica metric is blind to tunnel-connection loss). `WANGatewayUnreachable`/`InternetEgressDown` **inhibit** the downstream egress symptoms (ExternalDNSResolutionDown/EgressOnlyDivergence/CloudflaredTunnelConnLoss/Email*/ExternalAccessDivergence). Runbook: `docs/runbooks/pfsense-egress.md`. **Deferred (needs a live-pfSense change, not in this monitoring-only change):** point dpinger's monitor at the local gateway + widen thresholds, disable `gw_down_kill_states`, add a failover gateway group + auto-recovery watchdog, and ship pfSense system/gateway/routing syslog to Loki (today only filterlog → CrowdSec; those logs are NOT centrally queryable — id #6717). No Uptime-Kuma egress monitor was added (the `external-monitor-sync` is purpose-built for `*.viktorbarzin.me` Cloudflare-path discovery; the blackbox probes cover egress directly).
+- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → `#security` Slack). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI ''` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
 
 ## Security Posture (Wave 1 — locked 2026-05-18)
 
@@ -243,10 +241,9 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se
 
 - **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming.
 - **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.)
-- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts` (it keeps its `[SECURITY/]` title styling so security-lane alerts stand out). Severity labels carried in the alert (critical/warning/info). No paging. The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`.
+- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
 - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
-- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
-- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. **WHISKER-WEDGE GOTCHA** (2026-06-28): the operator's `whisker` NetworkPolicy allows DNS egress only to kube-dns *pods*, but whisker-backend resolves goldmane via the kube-dns *ClusterIP* — Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule, so when whisker-backend's gRPC stream breaks and it re-resolves, it wedges and the UI goes **empty** (the aggregator, a separate pod, is unaffected). FIX = additive egress NP `whisker-allow-dns-clusterip` (`stacks/calico`, allows whisker→10.96.0.10/32:53); the `whisker-watchdog` CronJob is a backstop. Manual heal `kubectl -n calico-system delete pod -l k8s-app=whisker`. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
+- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
 - **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
 
 ## Storage & Backup Architecture
diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md
index ca1ee262..cd7b5274 100644
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@@ -13,8 +13,6 @@
 | authentik | Identity provider (SSO) | authentik |
 | cloudflared | Cloudflare tunnel | cloudflared |
 | authelia | Auth middleware (may be merged into ebooks or removed) | platform |
-| goldmane | Calico 3.30 OSS flow aggregator (`goldmane.calico-system.svc:7443`, gRPC/mTLS). Stamps identity (ns/pod/workload/labels + allow-deny) on every flow from Felix into a ~60-min in-memory ring buffer — no etcd/API writes. East-west "who-talks-to-whom" source (ADR-0014). Enabled via operator CR (`kubectl_manifest.goldmane`). | calico |
-| whisker | Calico 3.30 OSS live flow-observability UI (`whisker.calico-system.svc:8081`) at `whisker.viktorbarzin.me` (Authentik-gated, `auth=required` — no own login; additive NP ORs Traefik past the operator default-deny). ~60-min live view of Goldmane flows, NOT history. Enabled via operator CR (`kubectl_manifest.whisker`). | calico |
 | monitoring | Prometheus/Grafana/Loki stack | monitoring |
 
 ## Storage & Security (Tier: cluster)
@@ -39,7 +37,6 @@
 ## Active Use
 | Service | Description | Stack |
 |---------|-------------|-------|
-| goldmane-edge-aggregator | Durable who-talks-to-whom audit trail (ADR-0014 / #58). Go service: `aggregate` Deployment streams Goldmane's gRPC `Flows.Stream` (mTLS) and upserts the low-cardinality namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`) into CNPG DB `goldmane_edges`; `goldmane-edges-digest` CronJob posts first-seen edges daily to `#alerts` (the `#security` channel was abandoned 2026-06-25 — shared webhook's app isn't a member of it). mTLS client cert REUSES the operator's `whisker-backend-key-pair` (re-apply if rotated). Tier-4-aux. Image `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (private). Runbook: [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md). | goldmane-edge-aggregator |
 | mailserver | Email (docker-mailserver) | mailserver |
 | shadowsocks | Proxy | shadowsocks |
 | webhook_handler | Webhook processing | webhook_handler |
@@ -164,4 +161,3 @@ procedures) are documented in `infra/docs/runbooks/`:
 | pfSense + Unbound DNS | [pfsense-unbound.md](../../docs/runbooks/pfsense-unbound.md) |
 | Mailserver PROXY-protocol / HAProxy | [mailserver-pfsense-haproxy.md](../../docs/runbooks/mailserver-pfsense-haproxy.md) |
 | Technitium apply flow | [technitium-apply.md](../../docs/runbooks/technitium-apply.md) |
-| Goldmane flow trail (east-west who-talks-to-whom) | [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md) |
diff --git a/.claude/skills/home-assistant/SKILL.md b/.claude/skills/home-assistant/SKILL.md
index ab07a27f..61aaa6af 100644
--- a/.claude/skills/home-assistant/SKILL.md
+++ b/.claude/skills/home-assistant/SKILL.md
@@ -11,8 +11,8 @@ description: |
   There are TWO Home Assistant deployments: ha-london (default) and
   ha-sofia. Always use Home Assistant for smart home control.
 author: Claude Code
-version: 2.1.0
-date: 2026-06-24
+version: 2.0.0
+date: 2026-02-07
 ---
 
 # Home Assistant Control
@@ -395,27 +395,14 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
 ## ha-london Knowledge Map
 
 ### Overview
-- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
+- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
 - **Location**: London, UK
-- **Platform**: Raspberry Pi 4, HA OS
-- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
-- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
-- **Config path**: `/config/`
+- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
+- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
+- **Config path**: `/config/` (requires `sudo` for file access)
 - **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
 - **Zone**: London (home)
 
-### Dashboards (redesigned 2026-06-24)
-**Glossary** (HA terms — keep distinct):
-- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
-- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
-- **Card** = a widget inside a view.
-
-- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
- - **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night). - - **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*. -- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.) -- Built via the WS `lovelace/config/save` API (london is remote — no SSH path). - ### Key Systems #### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring @@ -437,15 +424,10 @@ Named plugs with power/energy tracking: - PM1.0/2.5/4.0/10 particulate sensors - VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors -#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`) -Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration). -- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`) -- `sensor.classic_performance_remaining_range`: Range km -- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`) -- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`) -- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc. -- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless. -- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`). +#### 3. 
Cowboy E-Bike +- `sensor.bike_state_of_charge`: Battery % +- `sensor.bike_total_distance`: Total km +- `sensor.bike_total_co2_saved`: CO2 saved (grams) #### 4. Uptime Monitoring (UptimeRobot) - `sensor.blog`: blog uptime @@ -464,17 +446,12 @@ Bike named **"Classic Performance"** → entities are `sensor.classic_performanc - Scripts: `script.start_netflix`, `script.start_stremio` - Scene: `scene.night` (turns off Livia + Michelle plugs) -### Custom Components (HACS integrations) -- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it. -- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken. - -### HACS frontend cards (plugins) -- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode. +### Custom Components +- **cowboy**: Cowboy e-bike integration (HACS) +- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS) ### Integrations -ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB. -- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy). -- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is. 
+ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB ### AI / Voice Assistants - 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air @@ -489,8 +466,15 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ook - Anca arrival/departure notifications - Night scene: turns off Livia + Michelle -### Platform (HAOS — ignore any legacy `docker run` snippet) -ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker). +### Docker Setup +```bash +docker run -d --name homeassistant --privileged \ + -e TZ=Europe/London \ + -v /home/pi/docker/homeAssistant:/config \ + -v /run/dbus:/run/dbus:ro \ + --network=host --restart=unless-stopped \ + homeassistant/home-assistant:2025.9 +``` ### SSH Access ```bash diff --git a/.github/workflows/build-authentik.yml b/.github/workflows/build-authentik.yml deleted file mode 100644 index bb43502f..00000000 --- a/.github/workflows/build-authentik.yml +++ /dev/null @@ -1,39 +0,0 @@ -name: Build Custom Authentik Image - -# ADR-0002: infra-owned image built off-infra on GHA → ghcr. -# Thin SLOW-1a overlay over the official authentik server (narrows the login -# identification stage's select_subclasses() to the login-capable source subtypes; -# see stacks/authentik/Dockerfile). 
Rebuild only when the Dockerfile changes — on -# every authentik bump, edit the FROM tag + the patchN suffix here + the image tag -# in modules/authentik/values.yaml together. -on: - push: - branches: [master] - paths: - - 'stacks/authentik/Dockerfile' - workflow_dispatch: {} - -permissions: - contents: read - packages: write - -jobs: - build: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - uses: docker/setup-buildx-action@v3 - - uses: docker/login-action@v3 - with: - registry: ghcr.io - username: ${{ github.actor }} - password: ${{ secrets.GITHUB_TOKEN }} - - uses: docker/build-push-action@v6 - with: - context: stacks/authentik - platforms: linux/amd64 - provenance: false - push: true - tags: | - ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3 - ghcr.io/viktorbarzin/authentik-server:latest diff --git a/.woodpecker/default.yml b/.woodpecker/default.yml index d46f5ae1..ef94ccee 100644 --- a/.woodpecker/default.yml +++ b/.woodpecker/default.yml @@ -65,21 +65,6 @@ steps: # don't need explicit token propagation. VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200 commands: - # ── Forge guard: apply ONLY on the canonical Forgejo forge ── - # infra is registered in Woodpecker on BOTH the Forgejo canonical repo and - # the legacy GitHub mirror, and BOTH fire this push pipeline. Without this - # guard both run `terragrunt apply` on every push and race each other for - # the per-stack PG state lock — the dominant cause of the "Error acquiring - # the state lock" failures + push-supersede "killed" runs. The GitHub-mirror - # registration keeps running the CRONS (drift-detection, renew-tls, …) — only - # its duplicate push-apply no-ops here. Fail-open: an unknown forge (neither - # env var set) still applies, preserving prior behaviour. 
- - | - if echo "${CI_REPO_URL:-}${CI_FORGE_URL:-}" | grep -qi 'github\.com'; then - echo "[forge-guard] GitHub-mirror push — apply runs only on the Forgejo canonical repo (avoids double-apply + state-lock races). Skipping." - exit 0 - fi - # ── Skip CI commits ── - | if echo "$CI_COMMIT_MESSAGE" | grep -q '\[CI SKIP\]\|\[ci skip\]'; then @@ -228,40 +213,23 @@ steps: if [ -s .platform_apply ]; then echo "=== Applying platform stacks (serial, locked) ===" while read -r stack; do - # Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role - # lacks Vault-admin perms (sys/mounts + sys/policies/acl), so a CI - # apply always 403s and fails the pipeline. Kept in PLATFORM_STACKS - # (so the app-stack detector still excludes it) but skipped here. - # (2026-06-27 — see docs/architecture/ci-cd.md) - if [ "$stack" = "vault" ]; then echo "[vault] SKIPPED (Tier-0, human-applied via OIDC)"; continue; fi echo "[$stack] Starting apply..." - ATTEMPT=0 - while :; do - ATTEMPT=$((ATTEMPT + 1)) - set +e - OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1) - EXIT=$? - set -e - if [ $EXIT -eq 0 ]; then - echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break + set +e + OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1) + EXIT=$? + set -e + if [ $EXIT -ne 0 ]; then + if echo "$OUTPUT" | grep -q "is locked by"; then + echo "[$stack] SKIPPED (locked by another session)" + else + echo "$OUTPUT" | tail -50 + echo "[$stack] FAILED (exit $EXIT)" + FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack" fi - # Lock contention → SKIP, not fail. Match BOTH the Tier-0 Vault lock - # ("is locked by", from scripts/tg) AND the Tier-1 PG-backend lock - # ("Error acquiring the state lock" / "already locked"). The PG case - # was previously counted as a failure — the #1 source of false reds. 
- if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then - echo "[$stack] SKIPPED (locked by another session/run)"; break - fi - # Transient: provider-registry download timeout / Vault 5xx → bounded - # retry. Deliberately NOT helm atomic-timeouts or config errors - # (missing arg, invalid index) — those must fail fast, retry can't fix - # them and can worsen a stuck helm release. - if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then - echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue - fi - echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)" - FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"; break - done + else + echo "$OUTPUT" | tail -3 + echo "[$stack] OK" + fi done < .platform_apply fi # Deferred until after app stacks so both lists get a chance to run. @@ -274,27 +242,22 @@ steps: echo "=== Applying app stacks (serial, locked) ===" while read -r stack; do echo "[$stack] Starting apply..." - ATTEMPT=0 - while :; do - ATTEMPT=$((ATTEMPT + 1)) - set +e - OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1) - EXIT=$? - set -e - if [ $EXIT -eq 0 ]; then - echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break + set +e + OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1) + EXIT=$? + set -e + if [ $EXIT -ne 0 ]; then + if echo "$OUTPUT" | grep -q "is locked by"; then + echo "[$stack] SKIPPED (locked by another session)" + else + echo "$OUTPUT" | tail -50 + echo "[$stack] FAILED (exit $EXIT)" + FAILED_APP_STACKS="$FAILED_APP_STACKS $stack" fi - # Lock contention → SKIP, not fail (Tier-0 Vault + Tier-1 PG; see platform loop). 
- if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then - echo "[$stack] SKIPPED (locked by another session/run)"; break - fi - # Transient provider-download / Vault 5xx → bounded retry (see platform loop). - if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then - echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue - fi - echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)" - FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"; break - done + else + echo "$OUTPUT" | tail -3 + echo "[$stack] OK" + fi done < .app_apply fi # Fail the step loudly so the pipeline `default` workflow state diff --git a/.woodpecker/drift-detection.yml b/.woodpecker/drift-detection.yml index b2a552f4..b2e303ff 100644 --- a/.woodpecker/drift-detection.yml +++ b/.woodpecker/drift-detection.yml @@ -85,13 +85,6 @@ steps: stack=$(basename "$stack_dir") [ -f "$stack_dir/terragrunt.hcl" ] || continue - # Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role lacks - # Vault-admin perms (sys/mounts + sys/policies/acl), so `terragrunt plan` - # on it ERRORs (detailed-exitcode 1) and fails the whole nightly drift - # run. Skip it — drift on Tier-0 vault is caught at human apply time. - # (2026-06-27) - [ "$stack" = "vault" ] && continue - echo -n "[$stack] planning... " OUTPUT=$(cd "$stack_dir" && terragrunt plan -detailed-exitcode -input=false 2>&1) EXIT=$? diff --git a/AGENTS.md b/AGENTS.md index 4e3ea2de..7fbc838d 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -273,11 +273,8 @@ To land a finished change from such a clone: Slack audit feed; a no-op CI apply on a docs-only commit is harmless. 4. Leave the clone on clean `master` so auto-refresh keeps working. 5. Tell the user in plain language what happened. 
Stack changes are - auto-applied by CI on push — or, with apply access, applied locally yourself - (`scripts/tg apply`, from the main checkout, not a worktree); either path is - fine, but the change must always be committed here, never applied - uncommitted. Verify the live result with the user's read-only kubectl before - saying "it's live". + auto-applied by CI — verify the live result with the user's read-only + kubectl before saying "it's live". If a push to `master` is rejected by branch protection (user not on the whitelist — e.g. new users before Viktor grants it), fall back to a diff --git a/CONTEXT.md b/CONTEXT.md index 548fa40d..2b9bb8b3 100644 --- a/CONTEXT.md +++ b/CONTEXT.md @@ -125,7 +125,7 @@ How a **Service** is named in flow/audit data — its **namespace** is the prima _Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object. **Goldmane / Whisker**: -Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#alerts` digest (the `#security` channel was abandoned 2026-06-25). As-built: `docs/runbooks/goldmane-flow-trail.md`. 
+Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). Durable history requires emitting Goldmane flows to **Loki**; the in-memory buffer alone is not an audit trail. _Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh). ### Storage diff --git a/cli/README.md b/cli/README.md index fa9ff3ec..186c1ee5 100644 --- a/cli/README.md +++ b/cli/README.md @@ -202,69 +202,6 @@ runs on the devvm, `setInputFiles` streams local files to the remote browser ove CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md` and `docs/adr/0013`. -### v0.9 verbs — edges (east-west "who-talks-to-whom" trail) - -Read-only investigation helper over the `goldmane_edges` CNPG trail (ADR-0014): -filters render to a single safe `SELECT` (namespace values validated to the k8s -name charset) run via the dbaas primary pod — the same exec path as `k8s db`. 
- -| Command | Tier | What it does | -| --- | --- | --- | -| `edges --ns ` | read | edges touching `` (either direction) | -| `edges --src ` / `--dst ` | read | directional: ``'s egress / ingress peers | -| `edges --peers-of ` | read | distinct peer namespaces of `` (both directions) | -| `edges --new-since <24h\|7d\|YYYY-MM-DD>` | read | edges first seen since a duration or date | -| `edges --denied` | read | only `action='deny'` edges (blocked / lateral-movement) | -| `edges --json` / `--limit N` | read | JSON array output / row cap (default 200) | - -### v0.10 — `vault get --all` (browse every field) - -`vault get --all` returns the **whole item** as a normalized JSON object, -so an agent can discover and read fields the single-field `--field` allowlist -can't reach — notably arbitrary **custom fields**. - -| Command | Tier | What it does | -| --- | --- | --- | -| `vault get --all` | read | all fields as JSON: `{name, username?, password?, uris?, totp?, notes?, fields?}` | - -Shape notes: present standard fields only (empty ones omitted); `fields` is a -custom `name→value` map (duplicate names → last-wins; `linked` fields skipped). -The TOTP **seed is never emitted** — `totp` is a presence flag (`true`), so the -only seed-derived path stays the specially-audited `vault code`. Like -`get --json`, the dump is all secret values, so it **refuses a terminal** — pipe -it (`homelab vault get --all | jq`). - -### v0.10.1 — reads `bw sync` first (always fresh) - -Every vault read (`get`, `get --all`, `list`, `code`, `status`) now runs `bw -sync` when opening its session, so it reflects the latest server-side values. -`bw unlock` only decrypts the *local* cache, so without this a persisted -(already-logged-in) session served stale data — a password changed in the web -vault wouldn't show up until the next login. The sync is **best-effort**: a -transient failure warns on stderr and falls back to the cached vault rather than -failing the read. 
- -### v0.11 — `vault kv` (HashiCorp Vault / OpenBao infra secrets) - -`homelab vault` now fronts **two unrelated stores**, made explicit in the bare -`homelab vault` help and via `[vaultwarden]` / `[hashicorp-vault]` summary tags: - -- **Vaultwarden** — your personal password manager (`vault get/list/code/…`, unchanged). -- **HashiCorp Vault / OpenBao** — homelab infra secrets, the `secret/…` KV store, under `vault kv`. - -| Command | Tier | What it does | -| --- | --- | --- | -| `vault kv get [--field K]` | read | read a secret: `--field K` → one value (TTY-aware clipboard/stdout); no field → all fields as JSON (refuses a bare TTY) | -| `vault kv list ` | read | list sub-paths under `` (no values) | -| `vault kv put ` | write | write one key; **value via stdin** (piped or no-echo prompt, never argv); creates the path or **merges** (never clobbers siblings) | - -**Different credentials:** the Vaultwarden verbs use the per-user *scoped* token -(bound to `claude-users/`); `vault kv` uses your **own** Vault token -(`vault login -method=oidc` → `~/.vault-token`, or `$VAULT_TOKEN`) — the kv -handlers set `VAULT_ADDR` but never inject the scoped token (which would 403 off -its own path). Access is whatever your policy grants. Writes are merge-only; -`put` (replace) / `delete` are out of scope — use the raw `vault` CLI. 
- ## Build / install Built from source to `/usr/local/bin/homelab` during devvm provisioning diff --git a/cli/VERSION b/cli/VERSION index fd2726c9..85f7059b 100644 --- a/cli/VERSION +++ b/cli/VERSION @@ -1 +1 @@ -v0.11.0 +v0.8.1 diff --git a/cli/cmd_edges.go b/cli/cmd_edges.go deleted file mode 100644 index 7ee528fd..00000000 --- a/cli/cmd_edges.go +++ /dev/null @@ -1,69 +0,0 @@ -package main - -import "fmt" - -func edgesCommands() []Command { - return []Command{ - {Path: []string{"edges"}, Tier: TierRead, - Summary: "who-talks-to-whom trail: edges [--ns|--src|--dst|--peers-of N] [--new-since 24h] [--denied] [--json] [--limit N]", - Run: edgesRun}, - } -} - -// edgesRun renders the filter flags to SQL and runs it read-only against the -// goldmane_edges CNPG DB via the dbaas primary pod (same exec path as `k8s db`). -func edgesRun(args []string) error { - for _, a := range args { - if a == "-h" || a == "--help" { - fmt.Print(edgesUsage()) - return nil - } - } - o, err := parseEdgesArgs(args) - if err != nil { - return fmt.Errorf("%w\n\n%s", err, edgesUsage()) - } - sql, err := buildEdgesQuery(o) - if err != nil { - return err - } - // pg-cluster-rw is a Service (not exec-able); resolve the primary POD. - pod, err := kubectlCapture("dbaas", "get", "pod", "-l", "cnpg.io/instanceRole=primary", - "-o", "jsonpath={.items[0].metadata.name}") - if err != nil || pod == "" { - return fmt.Errorf("could not resolve CNPG primary pod in dbaas: %v", err) - } - exec := []string{"exec", pod, "-c", "postgres", "--", "psql", "-U", "postgres", "-d", "goldmane_edges"} - if o.asJSON { - exec = append(exec, "-tAc", sql) // raw tuple → the JSON array - } else { - exec = append(exec, "-P", "pager=off", "-c", sql) // aligned table for humans - } - return kubectlStream("dbaas", exec...) 
-} - -func edgesUsage() string { - return `homelab edges — query the who-talks-to-whom trail (goldmane_edges, ADR-0014) - -Usage: homelab edges [filters] - -Filters (AND-combined; namespace values are validated to the k8s name charset): - --ns NAME edges touching NAME (either direction) - --src NAME edges where source namespace = NAME - --dst NAME edges where destination namespace = NAME - --peers-of NAME distinct peer namespaces of NAME (both directions) - --new-since SPEC first seen since SPEC: a duration (24h, 7d, 30m, 90s) or a date (YYYY-MM-DD) - --denied only denied (action='deny') edges — blocked / lateral-movement attempts - --json output a JSON array (for agents/pipelines) - --limit N cap rows (default 200) - -Examples: - homelab edges --ns immich # everything immich talks to / is talked to by - homelab edges --peers-of authentik # authentik's peer namespaces - homelab edges --src recruiter-responder # that namespace's egress peers - homelab edges --new-since 24h # edges first seen in the last day - homelab edges --denied --json # blocked flows, machine-readable - -Read-only SELECT against CNPG DB goldmane_edges via the dbaas primary pod. -` -} diff --git a/cli/cmd_memory.go b/cli/cmd_memory.go index 7ae11ea0..94f3a482 100644 --- a/cli/cmd_memory.go +++ b/cli/cmd_memory.go @@ -54,7 +54,10 @@ func printMemories(raw []byte, jsonOut bool) error { return nil } for _, m := range r.Memories { - c := truncatePreview(strings.ReplaceAll(m.Content, "\n", " "), 240) + c := strings.ReplaceAll(m.Content, "\n", " ") + if len(c) > 240 { + c = c[:240] + "…" + } fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c) if m.Tags != "" { fmt.Printf(" tags: %s\n", m.Tags) @@ -63,21 +66,6 @@ func printMemories(raw []byte, jsonOut bool) error { return nil } -// truncatePreview shortens s to at most maxRunes RUNES, appending "…" when it -// trims. 
Counting runes (not bytes) is load-bearing: a byte slice like s[:240] -// can cut through the middle of a multibyte UTF-8 character (e.g. 2-byte -// Cyrillic), leaving a dangling lead byte = invalid UTF-8. That crashed strict -// decoders downstream — notably the homelab-memory-recall.py UserPromptSubmit -// hook (subprocess text=True), which surfaced as a recurring "UserPromptSubmit -// hook error" for Cyrillic-language users. -func truncatePreview(s string, maxRunes int) string { - r := []rune(s) - if len(r) <= maxRunes { - return s - } - return string(r[:maxRunes]) + "…" -} - func memoryRecall(args []string) error { req := memRecallReq{} jsonOut := false diff --git a/cli/cmd_vault.go b/cli/cmd_vault.go index 1a28ff14..bf270886 100644 --- a/cli/cmd_vault.go +++ b/cli/cmd_vault.go @@ -4,7 +4,6 @@ import ( "bufio" "encoding/base64" "encoding/json" - "errors" "fmt" "os" "os/exec" @@ -16,60 +15,43 @@ import ( // Identity is the kernel UID; per-user creds live in that user's isolated Vault // path (secret/workstation/claude-users/) read via their scoped token, and // decryption is done by the official `bw` CLI. See -// docs/runbooks/homelab-vault-onboarding.md. +// docs/superpowers/specs/2026-06-24-homelab-vault-design.md. func vaultCommands() []Command { - cmds := []Command{ - // Vaultwarden — your personal password manager (logins/passwords/TOTP). 
+ return []Command{ {Path: []string{"vault", "setup"}, Tier: TierWrite, - Summary: "[vaultwarden] one-time: store your master password + API key in your Vault path", Run: vaultSetup}, + Summary: "one-time: store your Vaultwarden master password + API key in your Vault path", Run: vaultSetup}, {Path: []string{"vault", "status"}, Tier: TierRead, - Summary: "[vaultwarden] show whether your vault is configured/reachable (no secrets)", Run: vaultStatus}, + Summary: "show whether your vault is configured/reachable (no secrets)", Run: vaultStatus}, {Path: []string{"vault", "list"}, Tier: TierRead, - Summary: "[vaultwarden] list your item names: vault list [--search Q]", Run: vaultList}, + Summary: "list your item names: vault list [--search Q]", Run: vaultList}, {Path: []string{"vault", "get"}, Tier: TierRead, - Summary: "[vaultwarden] fetch one login: vault get [--field password|username|uri|notes|totp] [--json] [--all]", Run: vaultGet}, + Summary: "fetch one item: vault get [--field password|username|uri|notes|totp] [--json]", Run: vaultGet}, {Path: []string{"vault", "search"}, Tier: TierRead, - Summary: "[vaultwarden] search your item names: vault search ", Run: vaultSearch}, + Summary: "search your item names: vault search ", Run: vaultSearch}, {Path: []string{"vault", "code"}, Tier: TierRead, - Summary: "[vaultwarden] current TOTP code for an item: vault code ", Run: vaultCode}, + Summary: "current TOTP code for an item: vault code ", Run: vaultCode}, {Path: []string{"vault", "lock"}, Tier: TierWrite, - Summary: "[vaultwarden] lock/log out the local bw session", Run: vaultLock}, + Summary: "lock/log out the local bw session", Run: vaultLock}, {Path: []string{"vault"}, Tier: TierRead, - Summary: "two stores: Vaultwarden (logins) + HashiCorp Vault/OpenBao kv (infra secrets) — run `homelab vault` for help", + Summary: "Vaultwarden access for your own vault (run `homelab vault` for help)", Run: func([]string) error { fmt.Print(vaultHelp()); return nil }}, } - // 
HashiCorp Vault / OpenBao — homelab INFRA secrets (the secret/… KV store). - return append(cmds, vaultKVCommands()...) } -// vaultHelp is shown for bare `homelab vault`. It LEADS with the distinction -// between the two unrelated "vaults" this command fronts, because the name -// collides: Vaultwarden (a password manager) vs HashiCorp Vault / OpenBao (the -// infra secrets store). +// vaultHelp is shown for bare `homelab vault`. func vaultHelp() string { - return `homelab vault — two different secret stores under one command: + return `homelab vault — read YOUR OWN Vaultwarden logins (no-HITL after one-time setup) - • Vaultwarden your personal PASSWORD MANAGER (logins / passwords / TOTP) - • HashiCorp Vault / OpenBao homelab INFRA secrets (the secret/… KV store) → 'vault kv …' - -── Vaultwarden (reads YOUR OWN vault; no-HITL after one-time setup) ── homelab vault setup one-time: store your master password + API key in your Vault path homelab vault status configured / unlocked / reachable (no secrets) homelab vault list [--search Q] list your item names (no secrets) homelab vault get [--field password|username|uri|notes|totp] [--json] TTY → clipboard (auto-clears); piped → stdout - homelab vault get --all all fields (incl. custom) as JSON; piped only. - TOTP shown as presence flag — use 'vault code' for a code. homelab vault code current TOTP code homelab vault lock lock / log out the local bw session -── HashiCorp Vault / OpenBao (infra secrets; uses your own OIDC vault token) ── - homelab vault kv get [--field K] read an infra KV secret - homelab vault kv list list sub-paths - homelab vault kv put write one key (value via stdin) - -Vaultwarden creds live only in your own Vault path; the admin never sees them. -Security model: docs/runbooks/homelab-vault-onboarding.md +Creds live only in your own Vault path; the admin never sees them. Identity is +your unix UID. 
Security model: docs/superpowers/specs/2026-06-24-homelab-vault-design.md (note: anything running as your user can decrypt your vault — the accepted no-HITL trade). ` } @@ -97,33 +79,7 @@ func realRunner(name string, argv, envv []string) (string, error) { out, err := cmd.Output() // Trim only the trailing newline the tool appends — NOT all whitespace, so a // fetched secret with significant leading/trailing spaces is preserved. - return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err)) -} - -// exitStderr returns the stderr captured by cmd.Output() on a failed exec (it -// stows it on *exec.ExitError), or nil. The tools we shell out to (vault, bw) -// write the actionable message there — "connection refused", "permission -// denied" — which the caller would otherwise never see behind a bare -// "exit status N". -func exitStderr(err error) []byte { - var ee *exec.ExitError - if errors.As(err, &ee) { - return ee.Stderr - } - return nil -} - -// augmentErr appends captured stderr to an error so failures are diagnosable -// (not just "exit status 2"). Returns nil when err is nil, and err unchanged -// when there's no stderr; preserves the wrapped error for errors.Is/As. 
-func augmentErr(err error, stderr []byte) error { - if err == nil { - return nil - } - if s := strings.TrimSpace(string(stderr)); s != "" { - return fmt.Errorf("%w: %s", err, s) - } - return err + return strings.TrimRight(string(out), "\r\n"), err } // realRunnerStdin runs a command feeding `stdin` to it, for secret values that @@ -136,7 +92,7 @@ func realRunnerStdin(name string, argv, envv []string, stdin string) (string, er } cmd.Stdin = strings.NewReader(stdin) out, err := cmd.Output() - return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err)) + return strings.TrimRight(string(out), "\r\n"), err } func vwCredsPath(user string) string { return vwUserPathPrefix + user } @@ -172,89 +128,6 @@ func loadCreds(run cmdRunner, user string) (vwCreds, error) { var vaultCurrentUser = func() string { return os.Getenv("USER") } var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) } -// scopedTokenPath is where claude-auth-sync keeps the user's scoped Vault token. -// MUST match CAS_VAULT_TOKEN_FILE in scripts/workstation/claude-auth-sync.sh. -func scopedTokenPath(home string) string { - return home + "/.config/claude-auth-sync/vault-token" -} - -// vaultTokenSource decides which Vault token the `vault` child processes should -// use. Precedence: an explicit $VAULT_TOKEN (deliberate override), then the -// per-user scoped token claude-auth-sync maintains at scopedTokenPath(HOME) -// (policy workstation-claude-, which grants exactly the create/read/update -// this tool needs on the user's own path), then a native ~/.vault-token. -// -// The scoped token MUST beat ~/.vault-token: this tool only ever touches the -// caller's own secret/workstation/claude-users/ path, and a power-user who -// ran `vault login -method=oidc` carries a read-only ~/.vault-token whose -// capability on that path is `deny` — letting it win shadows the scoped token -// and every op fails 403/deny (emo, 2026-06-28). 
~/.vault-token is only the -// right credential when there is no scoped token (admins). Returns the token to -// export — "" when the vault CLI should read the ambient/native credential — -// plus a source tag for tests/logging. -func vaultTokenSource(envToken string, haveVaultTokenFile bool, scopedToken string) (token, source string) { - switch { - case envToken != "": - return "", "env" - case strings.TrimSpace(scopedToken) != "": - return strings.TrimSpace(scopedToken), "scoped" - case haveVaultTokenFile: - return "", "file" - default: - return "", "none" - } -} - -// vaultAddrDefault is the cluster Vault the workstation talks to. The bw server -// is likewise hardcoded (openSession), so a sane default here is consistent. -const vaultAddrDefault = "https://vault.viktorbarzin.me" - -// vaultAddrToSet returns the VAULT_ADDR to export when the caller's environment -// doesn't already set one, else "". homelab vault is invoked by AFK agent -// sessions — frequently non-login shells (tmux panes, agent subprocesses) that -// never sourced /etc/environment — so, like claude-auth-sync, the CLI must NOT -// depend on an ambient VAULT_ADDR; otherwise every `vault` child falls back to -// the 127.0.0.1:8200 default and fails "connection refused" (exit 2). -func vaultAddrToSet(envAddr string) string { - if strings.TrimSpace(envAddr) == "" { - return vaultAddrDefault - } - return "" -} - -// ensureVaultAddr exports the default VAULT_ADDR when none is set, so the vault -// child processes reach the cluster Vault regardless of the caller's shell. An -// explicit VAULT_ADDR (admins, CI) is left untouched. -func ensureVaultAddr() { - if a := vaultAddrToSet(os.Getenv("VAULT_ADDR")); a != "" { - os.Setenv("VAULT_ADDR", a) - } -} - -// fileNonEmpty reports whether path exists and has content. 
-func fileNonEmpty(path string) bool { - fi, err := os.Stat(path) - return err == nil && fi.Size() > 0 -} - -// ensureVaultToken wires vaultTokenSource to the real environment: when the user -// has no ambient Vault credential, it exports the claude-auth-sync scoped token -// so the `vault` child processes authenticate as workstation-claude-. It -// is idempotent and safe for admins, whose explicit $VAULT_TOKEN / ~/.vault-token -// take precedence and are left untouched. -func ensureVaultToken() { - // Every vault verb funnels through here, so this is the one place that also - // guarantees VAULT_ADDR is set (see vaultAddrToSet for why it can't be - // assumed from the caller's shell). - ensureVaultAddr() - home := os.Getenv("HOME") - scoped, _ := os.ReadFile(scopedTokenPath(home)) - tok, src := vaultTokenSource(os.Getenv("VAULT_TOKEN"), home != "" && fileNonEmpty(home+"/.vault-token"), string(scoped)) - if src == "scoped" { - os.Setenv("VAULT_TOKEN", tok) - } -} - // bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately // do NOT inherit the full parent env (keeps stray secrets out of the child). 
func bwBaseEnv(appdata string) []string { @@ -284,12 +157,10 @@ func bwSecretEnv(appdata string, c vwCreds, session string) []string { return env } -func bwLoginArgs() []string { return []string{"login", "--apikey"} } -func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} } +func bwLoginArgs() []string { return []string{"login", "--apikey"} } +func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} } func bwGetArgs(field, name string) []string { return []string{"get", field, name} } -func bwItemArgs(name string) []string { return []string{"get", "item", name} } -func bwStatusArgs() []string { return []string{"status"} } -func bwSyncArgs() []string { return []string{"sync"} } +func bwStatusArgs() []string { return []string{"status"} } // bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is // required. Unparseable/empty output → true (safer to attempt login). @@ -456,23 +327,13 @@ func openSession(run cmdRunner, user, uid string) (session, error) { if err != nil { return session{}, err } - sessEnv := bwSecretEnv(appdata, creds, sess) - // Pull the latest server-side state so reads reflect current values. `bw - // unlock` only decrypts the LOCAL cache, so a persisted (already-logged-in) - // session would otherwise serve stale data until the next login. Best-effort: - // a transient sync failure must not break a read — fall back to the cached - // vault and warn (status reports reachability separately). - if _, err := run("bw", bwSyncArgs(), sessEnv); err != nil { - fmt.Fprintln(os.Stderr, "homelab vault: warning: bw sync failed; using cached vault (values may be stale): "+err.Error()) - } - return session{env: sessEnv}, nil + return session{env: bwSecretEnv(appdata, creds, sess)}, nil } type getOpts struct { name string field string json bool - all bool // dump every field (incl. 
custom) as normalized JSON } var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true} @@ -484,8 +345,6 @@ func parseGetArgs(args []string) (getOpts, error) { switch { case a == "--json": o.json = true - case a == "--all": - o.all = true case a == "--field" && i+1 < len(args): o.field = args[i+1] i++ @@ -496,10 +355,9 @@ func parseGetArgs(args []string) (getOpts, error) { } } if o.name == "" { - return o, fmt.Errorf("usage: homelab vault get [--field password|username|uri|notes|totp] [--json] [--all]") + return o, fmt.Errorf("usage: homelab vault get [--field password|username|uri|notes|totp] [--json]") } - // --all dumps the whole item, so --field is irrelevant — skip its allowlist. - if !o.all && !validGetFields[o.field] { + if !validGetFields[o.field] { return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field) } return o, nil @@ -515,81 +373,6 @@ func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) { return bwGet(run, s.env, o.field, o.name) } -// getItem opens a session and returns the whole item as raw `bw get item` JSON. -// Used by `get --all`; normalization is a separate, pure step (normalizeItem). -func getItem(run cmdRunner, user, uid, name string) (string, error) { - s, err := openSession(run, user, uid) - if err != nil { - return "", err - } - return run("bw", bwItemArgs(name), s.env) -} - -// normalizedItem is the browse-all-fields projection of a Vaultwarden item: the -// standard login fields that are present, notes, and a flat map of custom field -// name→value. bw internals (id, object, reprompt, passwordHistory) are dropped, -// and the TOTP *seed* is reduced to a presence flag — the only seed-derived path -// stays the specially-audited `vault code` (see the design §10/§16). 
-type normalizedItem struct { - Name string `json:"name"` - Username string `json:"username,omitempty"` - Password string `json:"password,omitempty"` - URIs []string `json:"uris,omitempty"` - TOTP bool `json:"totp,omitempty"` // presence only, never the seed - Notes string `json:"notes,omitempty"` - Fields map[string]string `json:"fields,omitempty"` // custom field name→value -} - -// bwFieldLinked is the Bitwarden custom-field type for a "linked" field: it -// references another field and carries a null value, so it is not real data. -const bwFieldLinked = 3 - -// normalizeItem parses a `bw get item` payload into the browse projection. It is -// pure (no I/O), so it is the unit-tested heart of `get --all`. -func normalizeItem(raw string) (normalizedItem, error) { - var it struct { - Name string `json:"name"` - Notes string `json:"notes"` - Login *struct { - Username string `json:"username"` - Password string `json:"password"` - Totp string `json:"totp"` - URIs []struct { - URI string `json:"uri"` - } `json:"uris"` - } `json:"login"` - Fields []struct { - Name string `json:"name"` - Value string `json:"value"` - Type int `json:"type"` - } `json:"fields"` - } - if err := json.Unmarshal([]byte(raw), &it); err != nil { - return normalizedItem{}, fmt.Errorf("parse bw item: %w", err) - } - n := normalizedItem{Name: it.Name, Notes: it.Notes} - if it.Login != nil { - n.Username = it.Login.Username - n.Password = it.Login.Password - n.TOTP = it.Login.Totp != "" - for _, u := range it.Login.URIs { - if u.URI != "" { - n.URIs = append(n.URIs, u.URI) - } - } - } - for _, f := range it.Fields { - if f.Type == bwFieldLinked { - continue // references another field, no value of its own - } - if n.Fields == nil { - n.Fields = map[string]string{} - } - n.Fields[f.Name] = f.Value // duplicate names: last-wins (rare; documented) - } - return n, nil -} - // clipboardDecision picks how to return a secret value. 
"stdout" prints it (a // pipe/agent — the intended machine path); "clipboard" copies via OSC52; // "refuse" emits nothing sensitive (would otherwise risk dumping the secret's @@ -660,7 +443,6 @@ func runList(run cmdRunner, user, uid, search string) ([]string, error) { func vaultList(args []string) error { hardenProcess() - ensureVaultToken() search := "" for i := 0; i < len(args); i++ { if args[i] == "--search" && i+1 < len(args) { @@ -695,7 +477,6 @@ func vaultSearch(args []string) error { func vaultCode(args []string) error { hardenProcess() - ensureVaultToken() if len(args) == 0 { return fmt.Errorf("usage: homelab vault code ") } @@ -727,9 +508,7 @@ func statusSummary(run cmdRunner, user, uid string) string { if err != nil { return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error() } - // openSession already did a best-effort sync; status re-runs it explicitly so - // a reachability failure surfaces in this report rather than only on stderr. - if _, err := run("bw", bwSyncArgs(), s.env); err != nil { + if _, err := run("bw", []string{"sync"}, s.env); err != nil { return "vault: configured + unlocked, but sync/reachability failed: " + err.Error() } return "vault: configured, unlocked, reachable ✓" @@ -737,7 +516,6 @@ func statusSummary(run cmdRunner, user, uid string) string { func vaultStatus(args []string) error { hardenProcess() - ensureVaultToken() uid := vaultCurrentUID() unlock, err := withUserLock(uid) if err != nil { @@ -764,61 +542,32 @@ func vaultLock(args []string) error { return nil // lock/logout best-effort; never error the caller } -// kvWriteVerb selects the KV write semantics. merge=true → `kv patch -method=rw` -// (read-modify-write: needs only read+update, NOT the `patch` capability the -// scoped workstation-claude- policy lacks, and preserves co-located keys -// such as claude-auth-sync's claude_ai_oauth_json). 
merge=false → `kv put` -// (creates the path on first use, before any sibling keys exist). -func kvWriteVerb(merge bool) []string { - if merge { - return []string{"kv", "patch", "-method=rw"} - } - return []string{"kv", "put"} -} - -// vaultWritePublicArgs writes the non-secret identifiers via argv. Neither the +// vaultPatchPublicArgs writes the non-secret identifiers via argv. Neither the // email nor the API client_id is a usable credential on its own. -func vaultWritePublicArgs(merge bool, user, email, clientID string) []string { - return append(kvWriteVerb(merge), vwCredsPath(user), - "vaultwarden_email="+email, - "vaultwarden_client_id="+clientID, - ) +func vaultPatchPublicArgs(user, email, clientID string) []string { + return []string{"kv", "patch", vwCredsPath(user), + "vaultwarden_email=" + email, + "vaultwarden_client_id=" + clientID, + } } -// vaultWriteSecretArgs writes ONE secret value via the `key=-` stdin form, so the -// value never appears in argv (ps / /proc//cmdline). Fed on stdin by -// realRunnerStdin. -func vaultWriteSecretArgs(merge bool, user, key string) []string { - return append(kvWriteVerb(merge), vwCredsPath(user), key+"=-") +// vaultPatchSecretArgs writes ONE secret value via the `key=-` stdin form, so +// the value never appears in argv (ps / /proc//cmdline). The value is fed +// on stdin by realRunnerStdin. +func vaultPatchSecretArgs(user, key string) []string { + return []string{"kv", "patch", vwCredsPath(user), key + "=-"} } -// credsPathExists reports whether the user's KV path already holds data. Used to -// pick create (`kv put`) vs merge (`kv patch -method=rw`) for the first write: -// claude-auth-sync usually creates the path first (Claude OAuth backup), but a -// user could run `homelab vault setup` before that ever happens. 
-func credsPathExists(run cmdRunner, user string) bool { - _, err := run("vault", []string{"kv", "get", "-format=json", vwCredsPath(user)}, nil) - return err == nil -} - -// cmdRunnerStdin is realRunnerStdin's shape, injected so writeCreds is testable. -type cmdRunnerStdin func(name string, argv, envv []string, stdin string) (string, error) - -// writeCreds stores all four fields in the user's Vault path using only the -// capabilities the scoped policy grants (create/read/update — NOT `patch`). The -// first (public) write creates the path when absent; the two real secrets then -// merge in via read-modify-write so the public keys — and any claude-auth-sync -// keys already present — survive. Secret values travel on stdin, never argv. -func writeCreds(run cmdRunner, runStdin cmdRunnerStdin, user string, c vwCreds) error { - merge := credsPathExists(run, user) - if _, err := run("vault", vaultWritePublicArgs(merge, user, c.Email, c.ClientID), nil); err != nil { +// writeCreds stores all four fields in the user's Vault path. The two real +// secrets (master password, API client_secret) go via stdin — never argv. +func writeCreds(user string, c vwCreds) error { + if _, err := realRunner("vault", vaultPatchPublicArgs(user, c.Email, c.ClientID), nil); err != nil { return err } - // The path now exists regardless of the branch above → merge the secrets in. 
- if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil { + if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil { return err } - if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil { + if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil { return err } return nil @@ -844,7 +593,6 @@ func promptLine(prompt string) (string, error) { func vaultSetup(args []string) error { hardenProcess() - ensureVaultToken() fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.") fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.") email, err := promptLine("Vaultwarden email: ") @@ -867,7 +615,7 @@ func vaultSetup(args []string) error { return fmt.Errorf("all fields are required") } c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret} - if err := writeCreds(realRunner, realRunnerStdin, vaultCurrentUser(), c); err != nil { + if err := writeCreds(vaultCurrentUser(), c); err != nil { return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err) } fmt.Fprintln(os.Stderr, "Stored. 
Verifying unlock…") @@ -886,7 +634,6 @@ func vaultSetup(args []string) error { func vaultGet(args []string) error { hardenProcess() - ensureVaultToken() o, err := parseGetArgs(args) if err != nil { return err @@ -898,9 +645,6 @@ func vaultGet(args []string) error { } defer unlock() user := vaultCurrentUser() - if o.all { - return getAllFields(user, uid, o.name) - } val, err := getValue(realRunner, user, uid, o) if err != nil { return err @@ -917,28 +661,3 @@ func vaultGet(args []string) error { return nil } -// getAllFields prints every field of one item as normalized JSON. Like -// `get --json`, the payload is all secret values, so it refuses a terminal -// (pipe it). The TOTP seed is never emitted — only a presence flag — so no extra -// TOTP audit is needed; the op-log uses a distinct verb so a bulk dump is -// distinguishable from a single-field get (the item name is still never logged). -func getAllFields(user, uid, name string) error { - if !jsonToStdoutOK(stdoutIsTTY()) { - return fmt.Errorf("refusing to print all fields as JSON to a terminal; pipe it (e.g. | jq)") - } - raw, err := getItem(realRunner, user, uid, name) - if err != nil { - return err - } - item, err := normalizeItem(raw) - if err != nil { - return err - } - out, err := json.Marshal(item) - if err != nil { - return err - } - writeOpLog(opRecord{User: user, Verb: "get-all", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name}) - fmt.Println(string(out)) - return nil -} diff --git a/cli/cmd_vault_kv.go b/cli/cmd_vault_kv.go deleted file mode 100644 index 5f70e6b5..00000000 --- a/cli/cmd_vault_kv.go +++ /dev/null @@ -1,248 +0,0 @@ -package main - -import ( - "encoding/json" - "fmt" - "io" - "os" - "strings" -) - -// The `vault kv` verbs talk to HashiCorp Vault / OpenBao — the homelab INFRA -// secrets store (the `secret/…` KV-v2 mount at vault.viktorbarzin.me) — NOT -// Vaultwarden. 
They are a thin, TTY-aware wrapper over the `vault` CLI that adds -// the same conveniences as the Vaultwarden verbs: a self-defaulted VAULT_ADDR -// (so non-login agent shells work) and clipboard/refuse-on-TTY secret handling. -// -// CREDENTIALS DIFFER FROM THE VAULTWARDEN VERBS. Those use the per-user *scoped* -// token (bound only to secret/workstation/claude-users/). A general kv read -// of e.g. secret/viktor must use the caller's OWN Vault token (the OIDC -// ~/.vault-token or an explicit $VAULT_TOKEN) — the scoped token has `deny` -// everywhere else and would 403. So the kv handlers call ensureVaultAddr() to -// guarantee VAULT_ADDR but deliberately do NOT call ensureVaultToken() (which -// injects the scoped token). Access is then whatever the caller's policy grants. -func vaultKVCommands() []Command { - return []Command{ - {Path: []string{"vault", "kv", "get"}, Tier: TierRead, - Summary: "[hashicorp-vault] read an infra KV secret: vault kv get [--field K]", Run: vaultKVGet}, - {Path: []string{"vault", "kv", "list"}, Tier: TierRead, - Summary: "[hashicorp-vault] list infra KV sub-paths: vault kv list ", Run: vaultKVList}, - {Path: []string{"vault", "kv", "put"}, Tier: TierWrite, - Summary: "[hashicorp-vault] write one KV key (value via stdin): vault kv put ", Run: vaultKVPut}, - {Path: []string{"vault", "kv"}, Tier: TierRead, - Summary: "[hashicorp-vault] infra secrets (run `homelab vault kv` for help)", - Run: func([]string) error { fmt.Print(vaultKVHelp()); return nil }}, - } -} - -func vaultKVHelp() string { - return `homelab vault kv — HashiCorp Vault / OpenBao (homelab INFRA secrets, the secret/… KV store) - - homelab vault kv get [--field K] read a secret - --field K → one value (TTY → clipboard; piped → stdout) - no --field → all fields as JSON (piped only) - homelab vault kv list list sub-paths under (no values) - homelab vault kv put write one key; value read from stdin - (piped, or no-echo prompt); merges — never clobbers siblings - -Uses YOUR 
Vault token (vault login -method=oidc → ~/.vault-token); access is -whatever your policy grants. This is NOT Vaultwarden — for your personal logins -use 'homelab vault get' (see 'homelab vault'). -` -} - -// --- arg builders (pure; values never travel via argv) -------------------- - -func vaultKVGetFieldArgs(path, field string) []string { - return []string{"kv", "get", "-field=" + field, path} -} -func vaultKVGetJSONArgs(path string) []string { return []string{"kv", "get", "-format=json", path} } -func vaultKVListArgs(path string) []string { return []string{"kv", "list", "-format=json", path} } - -// vaultKVPutArgs builds the write argv. merge=true → `kv patch -method=rw` -// (read-modify-write: merges, needs only read+update — not the `patch` capability -// — and preserves sibling keys); merge=false → `kv put` (creates the path on -// first write). The value is ALWAYS read from stdin via the `=-` form, so it -// never appears in argv (visible via ps / /proc//cmdline to same-UID procs). -func vaultKVPutArgs(merge bool, path, key string) []string { - return append(kvWriteVerb(merge), path, key+"=-") -} - -// --- pure parsers ---------------------------------------------------------- - -// extractKVData returns the inner secret object from a `vault kv get -format=json` -// envelope (`{"data":{"data":{…},"metadata":{…}}}`), dropping the metadata/request -// wrapper so only the secret's own key→value data is emitted. -func extractKVData(jsonOut string) (string, error) { - var env struct { - Data struct { - Data json.RawMessage `json:"data"` - } `json:"data"` - } - if err := json.Unmarshal([]byte(jsonOut), &env); err != nil { - return "", fmt.Errorf("parse vault kv json: %w", err) - } - if len(env.Data.Data) == 0 { - return "", fmt.Errorf("no secret data at that path") - } - return string(env.Data.Data), nil -} - -// parseKVList parses the JSON array `vault kv list -format=json` prints. 
-func parseKVList(jsonOut string) ([]string, error) { - var keys []string - if err := json.Unmarshal([]byte(jsonOut), &keys); err != nil { - return nil, fmt.Errorf("parse vault kv list json: %w", err) - } - return keys, nil -} - -// --- testable cores (injected cmdRunner) ----------------------------------- - -func kvGetField(run cmdRunner, path, field string) (string, error) { - return run("vault", vaultKVGetFieldArgs(path, field), nil) -} - -func kvGetJSON(run cmdRunner, path string) (string, error) { - out, err := run("vault", vaultKVGetJSONArgs(path), nil) - if err != nil { - return "", err - } - return extractKVData(out) -} - -func kvList(run cmdRunner, path string) ([]string, error) { - out, err := run("vault", vaultKVListArgs(path), nil) - if err != nil { - return nil, err - } - return parseKVList(out) -} - -// kvPathExists reports whether the KV path already holds data, to pick create -// (`kv put`) vs merge (`kv patch -method=rw`) — so a write never clobbers -// sibling keys on an existing path. -func kvPathExists(run cmdRunner, path string) bool { - _, err := run("vault", vaultKVGetJSONArgs(path), nil) - return err == nil -} - -// kvPut writes one key, creating the path when absent and merging when present. -// The value travels on stdin only (never argv). 
-func kvPut(run cmdRunner, runStdin cmdRunnerStdin, path, key, value string) error { - merge := kvPathExists(run, path) - _, err := runStdin("vault", vaultKVPutArgs(merge, path, key), nil, value) - return err -} - -// --- handlers -------------------------------------------------------------- - -func vaultKVGet(args []string) error { - hardenProcess() - ensureVaultAddr() // own token, NOT the scoped one (see file header) - var path, field string - for i := 0; i < len(args); i++ { - a := args[i] - switch { - case a == "--field" && i+1 < len(args): - field = args[i+1] - i++ - case strings.HasPrefix(a, "--field="): - field = strings.TrimPrefix(a, "--field=") - case !strings.HasPrefix(a, "-") && path == "": - path = a - } - } - if path == "" { - return fmt.Errorf("usage: homelab vault kv get [--field ]") - } - if field != "" { - val, err := kvGetField(realRunner, path, field) - if err != nil { - return err - } - emitSecret(val) // TTY-aware: clipboard on a terminal, stdout when piped - return nil - } - // No --field → the whole secret. All values, so refuse a bare TTY (like - // `vault get --json`): pick a --field for the clipboard path, or pipe it. - if !jsonToStdoutOK(stdoutIsTTY()) { - return fmt.Errorf("refusing to print all KV fields as JSON to a terminal; use --field , or pipe it (e.g. 
| jq)") - } - out, err := kvGetJSON(realRunner, path) - if err != nil { - return err - } - fmt.Println(out) - return nil -} - -func vaultKVList(args []string) error { - ensureVaultAddr() - var path string - for _, a := range args { - if !strings.HasPrefix(a, "-") { - path = a - break - } - } - if path == "" { - return fmt.Errorf("usage: homelab vault kv list ") - } - keys, err := kvList(realRunner, path) - if err != nil { - return err - } - for _, k := range keys { - fmt.Println(k) - } - return nil -} - -func vaultKVPut(args []string) error { - hardenProcess() - ensureVaultAddr() - var path, key string - for _, a := range args { - if strings.HasPrefix(a, "-") { - continue - } - switch { - case path == "": - path = a - case key == "": - key = a - } - } - if path == "" || key == "" { - return fmt.Errorf("usage: homelab vault kv put (value read from stdin)") - } - value, err := readSecretValue("Value for " + key + ": ") - if err != nil { - return err - } - if value == "" { - return fmt.Errorf("empty value; aborting (nothing written)") - } - if err := kvPut(realRunner, realRunnerStdin, path, key, value); err != nil { - return fmt.Errorf("writing %q to %s failed (does your token have write access? path correct?): %w", key, path, err) - } - fmt.Fprintln(os.Stderr, "wrote "+key+" to "+path) - return nil -} - -// readSecretValue obtains a secret value WITHOUT putting it in argv: piped stdin -// is read verbatim (trailing newline trimmed, internal newlines preserved so -// multi-line values like PEM keys survive); an interactive TTY is prompted -// without echo. 
-func readSecretValue(prompt string) (string, error) { - fi, err := os.Stdin.Stat() - if err == nil && fi.Mode()&os.ModeCharDevice == 0 { - b, rerr := io.ReadAll(os.Stdin) - if rerr != nil { - return "", rerr - } - return strings.TrimRight(string(b), "\r\n"), nil - } - return promptNoEcho(prompt) -} diff --git a/cli/cmd_vault_test.go b/cli/cmd_vault_test.go index fbfd876d..36aab1f4 100644 --- a/cli/cmd_vault_test.go +++ b/cli/cmd_vault_test.go @@ -2,8 +2,6 @@ package main import ( "encoding/base64" - "encoding/json" - "errors" "fmt" "os" "reflect" @@ -72,7 +70,7 @@ func (f *fakeRunner) run(name string, argv, envv []string) (string, error) { func TestLoadCredsReadsFourFields(t *testing.T) { f := &fakeRunner{out: map[string]string{ - "vault kv get -field=vaultwarden_email secret/workstation/claude-users/emo": "emo@x.me", + "vault kv get -field=vaultwarden_email secret/workstation/claude-users/emo": "emo@x.me", "vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "hunter2", "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.abc", "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "sek", @@ -235,181 +233,12 @@ func TestStatusSummaryUnconfigured(t *testing.T) { } } -func TestEnsureVaultTokenSetsScopedFallback(t *testing.T) { - dir := t.TempDir() - cfg := dir + "/.config/claude-auth-sync" - if err := os.MkdirAll(cfg, 0o700); err != nil { - t.Fatal(err) - } - if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK\n"), 0o600); err != nil { - t.Fatal(err) - } - t.Setenv("HOME", dir) - t.Setenv("VAULT_TOKEN", "") // no ambient token - - ensureVaultToken() - if got := os.Getenv("VAULT_TOKEN"); got != "SCOPED-TOK" { - t.Fatalf("VAULT_TOKEN = %q, want scoped fallback to be exported", got) - } -} - -func TestEnsureVaultTokenKeepsExplicitEnv(t *testing.T) { - dir := t.TempDir() - cfg := dir + "/.config/claude-auth-sync" - if err := os.MkdirAll(cfg, 0o700); err != nil { - 
t.Fatal(err) - } - if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK"), 0o600); err != nil { - t.Fatal(err) - } - t.Setenv("HOME", dir) - t.Setenv("VAULT_TOKEN", "ADMIN-TOK") - - ensureVaultToken() - if got := os.Getenv("VAULT_TOKEN"); got != "ADMIN-TOK" { - t.Fatalf("VAULT_TOKEN = %q, must not override an explicit token", got) - } -} - -func TestEnsureVaultTokenPrefersScopedOverFile(t *testing.T) { - // Regression: a power-user's read-only OIDC ~/.vault-token must NOT shadow the - // purpose-built scoped token (emo's setup hit 403 because it did, 2026-06-28). - dir := t.TempDir() - cfg := dir + "/.config/claude-auth-sync" - if err := os.MkdirAll(cfg, 0o700); err != nil { - t.Fatal(err) - } - if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK"), 0o600); err != nil { - t.Fatal(err) - } - if err := os.WriteFile(dir+"/.vault-token", []byte("STALE-OIDC-TOK"), 0o600); err != nil { - t.Fatal(err) - } - t.Setenv("HOME", dir) - t.Setenv("VAULT_TOKEN", "") - - ensureVaultToken() - if got := os.Getenv("VAULT_TOKEN"); got != "SCOPED-TOK" { - t.Fatalf("VAULT_TOKEN = %q, want the scoped token to win over a stale ~/.vault-token", got) - } -} - -func TestScopedTokenPath(t *testing.T) { - if got := scopedTokenPath("/home/emo"); got != "/home/emo/.config/claude-auth-sync/vault-token" { - t.Fatalf("scopedTokenPath = %q", got) - } -} - -func TestVaultTokenSource(t *testing.T) { - // Precedence: explicit $VAULT_TOKEN > the claude-auth-sync per-user scoped - // token > a native ~/.vault-token. Scoped beats the file so a power-user's - // read-only OIDC ~/.vault-token can't shadow the scoped token on the user's - // own path (emo, 2026-06-28). 
- cases := []struct { - name string - env string - haveVaultToken bool - scoped string - wantTok, wantSrc string - }{ - {"explicit env wins", "abc", true, "S", "", "env"}, - {"scoped beats a stale ~/.vault-token", "", true, "S-TOK", "S-TOK", "scoped"}, - {"scoped used when no file", "", false, "S-TOK", "S-TOK", "scoped"}, - {"native ~/.vault-token only when no scoped", "", true, "", "", "file"}, - {"scoped value is trimmed", "", false, " S-TOK\n", "S-TOK", "scoped"}, - {"whitespace-only scoped falls back to file", "", true, " \n", "", "file"}, - {"nothing configured", "", false, "", "", "none"}, - } - for _, c := range cases { - tok, src := vaultTokenSource(c.env, c.haveVaultToken, c.scoped) - if tok != c.wantTok || src != c.wantSrc { - t.Errorf("%s: vaultTokenSource(%q,%v,%q) = (%q,%q), want (%q,%q)", - c.name, c.env, c.haveVaultToken, c.scoped, tok, src, c.wantTok, c.wantSrc) - } - } -} - -func TestVaultAddrToSet(t *testing.T) { - // homelab vault is invoked by AFK agent sessions (non-login shells that - // never sourced /etc/environment), so the CLI must self-default VAULT_ADDR - // rather than rely on the ambient env — else every `vault` child hits the - // 127.0.0.1:8200 default and fails "connection refused" (exit 2). 
- cases := []struct { - name, env, want string - }{ - {"unset -> default", "", vaultAddrDefault}, - {"whitespace-only -> default", " \n", vaultAddrDefault}, - {"explicit kept (empty = leave alone)", "https://vault.example.com", ""}, - } - for _, c := range cases { - if got := vaultAddrToSet(c.env); got != c.want { - t.Errorf("%s: vaultAddrToSet(%q) = %q, want %q", c.name, c.env, got, c.want) - } - } -} - -func TestEnsureVaultTokenSetsDefaultAddr(t *testing.T) { - dir := t.TempDir() // no scoped token, no ~/.vault-token - t.Setenv("HOME", dir) - t.Setenv("VAULT_TOKEN", "") - t.Setenv("VAULT_ADDR", "") // emo's non-login-shell situation - - ensureVaultToken() - if got := os.Getenv("VAULT_ADDR"); got != vaultAddrDefault { - t.Fatalf("VAULT_ADDR = %q, want default %q to be exported", got, vaultAddrDefault) - } -} - -func TestEnsureVaultTokenKeepsExplicitAddr(t *testing.T) { - dir := t.TempDir() - t.Setenv("HOME", dir) - t.Setenv("VAULT_TOKEN", "") - t.Setenv("VAULT_ADDR", "https://vault.example.com") - - ensureVaultToken() - if got := os.Getenv("VAULT_ADDR"); got != "https://vault.example.com" { - t.Fatalf("VAULT_ADDR = %q, must not override an explicit addr", got) - } -} - -func TestAugmentErrSurfacesStderr(t *testing.T) { - if got := augmentErr(nil, []byte("ignored")); got != nil { - t.Fatalf("augmentErr(nil, …) = %v, want nil", got) - } - base := errors.New("exit status 2") - got := augmentErr(base, []byte(" dial tcp 127.0.0.1:8200: connect: connection refused\n")) - if got == nil || !strings.Contains(got.Error(), "connection refused") || !strings.Contains(got.Error(), "exit status 2") { - t.Fatalf("augmentErr did not surface stderr: %v", got) - } - if !errors.Is(got, base) { - t.Fatal("augmentErr lost the wrapped error (errors.Is failed)") - } - if got := augmentErr(base, []byte(" ")); got != base { - t.Fatalf("augmentErr with blank stderr = %v, want the original error unchanged", got) - } -} - -func TestKvWriteVerb(t *testing.T) { - // merge=true → 
read-modify-write patch (needs only read+update, NOT the - // `patch` capability the scoped workstation policy lacks). - if got := kvWriteVerb(true); !reflect.DeepEqual(got, []string{"kv", "patch", "-method=rw"}) { - t.Fatalf("kvWriteVerb(true) = %v", got) - } - // merge=false → put (creates the path on first use) - if got := kvWriteVerb(false); !reflect.DeepEqual(got, []string{"kv", "put"}) { - t.Fatalf("kvWriteVerb(false) = %v", got) - } -} - -func TestVaultWritePublicArgs(t *testing.T) { - got := vaultWritePublicArgs(true, "emo", "e@x.me", "user.ci") - want := []string{"kv", "patch", "-method=rw", "secret/workstation/claude-users/emo", +func TestVaultPatchPublicArgs(t *testing.T) { + got := vaultPatchPublicArgs("emo", "e@x.me", "user.ci") + want := []string{"kv", "patch", "secret/workstation/claude-users/emo", "vaultwarden_email=e@x.me", "vaultwarden_client_id=user.ci"} if !reflect.DeepEqual(got, want) { - t.Fatalf("vaultWritePublicArgs(merge) = %v", got) - } - if got := vaultWritePublicArgs(false, "emo", "e@x.me", "user.ci"); got[0] != "kv" || got[1] != "put" { - t.Fatalf("vaultWritePublicArgs(create) must use `kv put`, got %v", got) + t.Fatalf("vaultPatchPublicArgs = %v", got) } for _, a := range got { if strings.Contains(a, "master_password") || strings.Contains(a, "client_secret") { @@ -418,12 +247,12 @@ func TestVaultWritePublicArgs(t *testing.T) { } } -func TestVaultWriteSecretArgsNoValueInArgv(t *testing.T) { +func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) { for _, key := range []string{"vaultwarden_master_password", "vaultwarden_client_secret"} { - got := vaultWriteSecretArgs(true, "emo", key) - want := []string{"kv", "patch", "-method=rw", "secret/workstation/claude-users/emo", key + "=-"} + got := vaultPatchSecretArgs("emo", key) + want := []string{"kv", "patch", "secret/workstation/claude-users/emo", key + "=-"} if !reflect.DeepEqual(got, want) { - t.Fatalf("vaultWriteSecretArgs(%q) = %v", key, got) + t.Fatalf("vaultPatchSecretArgs(%q) = 
%v", key, got) } if got[len(got)-1] != key+"=-" { t.Fatalf("secret value must be read from stdin (`%s=-`), got %v", key, got) @@ -431,90 +260,6 @@ func TestVaultWriteSecretArgsNoValueInArgv(t *testing.T) { } } -// recStdin records a stdin-bearing call for assertions. -type recStdin struct { - argv []string - stdin string -} - -// TestWriteCredsCreatesThenMerges: when the path is ABSENT the first (public) -// write must `kv put` (create), and the two secrets must merge via patch -rw -// with values on stdin only — never the buggy plain `kv patch` (needs `patch`). -func TestWriteCredsCreatesThenMerges(t *testing.T) { - var calls [][]string - var stdinCalls []recStdin - run := func(name string, argv, envv []string) (string, error) { - calls = append(calls, append([]string{name}, argv...)) - if len(argv) >= 2 && argv[0] == "kv" && argv[1] == "get" { - return "", fmt.Errorf("no value found") // path absent - } - return "", nil - } - runStdin := func(name string, argv, envv []string, stdin string) (string, error) { - stdinCalls = append(stdinCalls, recStdin{append([]string{name}, argv...), stdin}) - return "", nil - } - c := vwCreds{Email: "e@x.me", MasterPassword: "PW", ClientID: "user.ci", ClientSecret: "CS"} - if err := writeCreds(run, runStdin, "emo", c); err != nil { - t.Fatalf("writeCreds: %v", err) - } - var sawPut, sawPlainPatch bool - for _, cl := range calls { - j := strings.Join(cl, " ") - if strings.Contains(j, "kv put") { - sawPut = true - } - if strings.Contains(j, "kv patch") && !strings.Contains(j, "-method=rw") { - sawPlainPatch = true - } - } - if !sawPut { - t.Fatalf("path absent → public write must be `kv put`; calls=%v", calls) - } - if sawPlainPatch { - t.Fatalf("must never use plain `kv patch` (needs `patch` capability); calls=%v", calls) - } - if len(stdinCalls) != 2 { - t.Fatalf("want 2 stdin secret writes, got %d", len(stdinCalls)) - } - for _, sc := range stdinCalls { - if !strings.Contains(strings.Join(sc.argv, " "), "kv patch -method=rw") { - 
t.Errorf("secret write must use patch -method=rw: %v", sc.argv) - } - for _, a := range sc.argv { - if strings.Contains(a, "PW") || strings.Contains(a, "CS") { - t.Errorf("secret leaked into argv: %v", sc.argv) - } - } - } - if stdinCalls[0].stdin != "PW" || stdinCalls[1].stdin != "CS" { - t.Errorf("stdin values wrong: %q,%q", stdinCalls[0].stdin, stdinCalls[1].stdin) - } -} - -// TestWriteCredsMergesWhenPresent: when the path EXISTS, every write must merge -// (patch -rw) — a `kv put` would wipe sibling keys (e.g. claude_ai_oauth_json). -func TestWriteCredsMergesWhenPresent(t *testing.T) { - var calls [][]string - run := func(name string, argv, envv []string) (string, error) { - calls = append(calls, append([]string{name}, argv...)) - return "{}", nil // get succeeds → path exists - } - runStdin := func(name string, argv, envv []string, stdin string) (string, error) { - calls = append(calls, append([]string{name}, argv...)) - return "", nil - } - c := vwCreds{Email: "e@x.me", MasterPassword: "PW", ClientID: "user.ci", ClientSecret: "CS"} - if err := writeCreds(run, runStdin, "emo", c); err != nil { - t.Fatalf("writeCreds: %v", err) - } - for _, cl := range calls { - if strings.Contains(strings.Join(cl, " "), "kv put") { - t.Fatalf("path exists → must NOT `kv put` (wipes siblings): %v", cl) - } - } -} - // TestNoSecretInArgvAcrossFlow is the load-bearing security test: across the // whole get flow (vault reads, bw config/status/login/unlock/get) NO secret // value may appear in any command's argv — secrets travel via env/stdin only. 
@@ -522,8 +267,8 @@ func TestNoSecretInArgvAcrossFlow(t *testing.T) { uid := fmt.Sprintf("%d", os.Getuid()) f := &fakeRunner{out: map[string]string{ "vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "SUPERSECRETPW", - "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x", - "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "CLIENTSEKRET", + "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x", + "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "CLIENTSEKRET", "bw status": `{"status":"locked"}`, "bw unlock": "SESSIONXYZ", "bw get password github": "p@ss", @@ -608,8 +353,8 @@ func TestVaultBareGroupRegistered(t *testing.T) { func TestGetValueFlow(t *testing.T) { f := &fakeRunner{out: map[string]string{ "vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw", - "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x", - "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs", + "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x", + "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs", "bw status": `{"status":"locked"}`, "bw unlock": "SESS", "bw get password github": "p@ss", @@ -621,437 +366,3 @@ func TestGetValueFlow(t *testing.T) { t.Fatalf("getValue = %q, %v", val, err) } } - -// --- vault get --all (browse all fields) ---------------------------------- - -func TestParseGetArgsAll(t *testing.T) { - o, err := parseGetArgs([]string{"github", "--all"}) - if err != nil || o.name != "github" || !o.all { - t.Fatalf("parseGetArgs(--all) = %+v err=%v", o, err) - } - // --all must skip --field validation (field is irrelevant for a full dump). 
- if _, err := parseGetArgs([]string{"github", "--all", "--field", "evil"}); err != nil { - t.Fatalf("--all must ignore an otherwise-invalid --field, got err=%v", err) - } - // A name is still required. - if _, err := parseGetArgs([]string{"--all"}); err == nil { - t.Fatal("get --all with no name must error") - } - // Without --all, the field allowlist still applies. - if _, err := parseGetArgs([]string{"github", "--field", "evil"}); err == nil { - t.Fatal("invalid --field without --all must still error") - } -} - -func TestBwItemArgs(t *testing.T) { - argv := bwItemArgs("github") - if !reflect.DeepEqual(argv, []string{"get", "item", "github"}) { - t.Fatalf("bwItemArgs = %v", argv) - } - for _, a := range argv { - if strings.Contains(a, "SESSION") || a == "--session" { - t.Fatalf("session must travel via env, not argv: %v", argv) - } - } -} - -// a representative `bw get item` payload: login fields, multiple URIs, a TOTP -// seed, notes, custom fields (text/hidden/boolean), plus bw internals that MUST -// be dropped (id/object/reprompt/passwordHistory). 
-const sampleLoginItemJSON = `{ - "object":"item","id":"abc-123","folderId":null,"type":1,"reprompt":0, - "name":"GitHub","notes":"my notes","favorite":false, - "fields":[ - {"name":"PIN","value":"1234","type":1}, - {"name":"endpoint","value":"https://api.gh","type":0}, - {"name":"enabled","value":"true","type":2} - ], - "login":{ - "username":"octocat","password":"hunter2", - "totp":"otpauth://totp/GitHub:octocat?secret=SEEDSEEDSEED", - "uris":[{"match":null,"uri":"https://github.com"},{"match":null,"uri":"https://gist.github.com"}] - }, - "passwordHistory":[{"password":"OLD-PASSWORD-XYZ"}] -}` - -func TestNormalizeItemLogin(t *testing.T) { - n, err := normalizeItem(sampleLoginItemJSON) - if err != nil { - t.Fatalf("normalizeItem: %v", err) - } - if n.Name != "GitHub" || n.Username != "octocat" || n.Password != "hunter2" || n.Notes != "my notes" { - t.Fatalf("standard fields wrong: %+v", n) - } - if !n.TOTP { - t.Fatal("TOTP presence flag must be true when a seed exists") - } - if !reflect.DeepEqual(n.URIs, []string{"https://github.com", "https://gist.github.com"}) { - t.Fatalf("URIs = %v", n.URIs) - } - want := map[string]string{"PIN": "1234", "endpoint": "https://api.gh", "enabled": "true"} - if !reflect.DeepEqual(n.Fields, want) { - t.Fatalf("custom fields = %v want %v", n.Fields, want) - } -} - -// The load-bearing security test: the raw TOTP seed (more powerful than a -// one-time code) and the password history must NEVER appear in the dump. 
-func TestNormalizeItemNeverLeaksSeedOrHistory(t *testing.T) { - n, err := normalizeItem(sampleLoginItemJSON) - if err != nil { - t.Fatalf("normalizeItem: %v", err) - } - out, err := json.Marshal(n) - if err != nil { - t.Fatalf("marshal: %v", err) - } - for _, leak := range []string{"SEEDSEEDSEED", "otpauth", "OLD-PASSWORD-XYZ", "passwordHistory", "abc-123"} { - if strings.Contains(string(out), leak) { - t.Fatalf("dump leaked %q: %s", leak, out) - } - } -} - -func TestNormalizeItemNoTOTP(t *testing.T) { - n, err := normalizeItem(`{"name":"X","type":1,"login":{"username":"u","password":"p"}}`) - if err != nil { - t.Fatalf("normalizeItem: %v", err) - } - if n.TOTP { - t.Fatal("TOTP must be false when no seed present") - } - out, _ := json.Marshal(n) - if strings.Contains(string(out), "totp") { - t.Fatalf("no-totp item must omit the totp key entirely: %s", out) - } -} - -func TestNormalizeItemEmptyStandardFieldsOmitted(t *testing.T) { - n, err := normalizeItem(`{"name":"Bare","type":1,"login":{"username":"","password":"","totp":"","uris":[]},"fields":[{"name":"only","value":"x","type":0}]}`) - if err != nil { - t.Fatalf("normalizeItem: %v", err) - } - out, _ := json.Marshal(n) - for _, k := range []string{"username", "password", "uris", "notes", "totp"} { - if strings.Contains(string(out), `"`+k+`"`) { - t.Fatalf("empty standard field %q must be omitted: %s", k, out) - } - } - if !strings.Contains(string(out), `"name":"Bare"`) || !strings.Contains(string(out), `"only":"x"`) { - t.Fatalf("name + custom field must survive: %s", out) - } -} - -func TestNormalizeItemSecureNoteNullLogin(t *testing.T) { - // type 2 (secure note): login is null — must not panic; notes + custom fields survive. 
- n, err := normalizeItem(`{"name":"SN","type":2,"notes":"secret note","login":null,"fields":[{"name":"k","value":"v","type":1}]}`) - if err != nil { - t.Fatalf("normalizeItem(null login): %v", err) - } - if n.Name != "SN" || n.Notes != "secret note" || n.Fields["k"] != "v" { - t.Fatalf("secure-note normalize wrong: %+v", n) - } - if n.Username != "" || n.Password != "" || n.TOTP { - t.Fatalf("login fields must be empty for a login-less item: %+v", n) - } -} - -func TestNormalizeItemDuplicateCustomNames(t *testing.T) { - // Bitwarden permits duplicate custom-field names; a JSON object can't hold - // dups, so last-wins (documented). - n, err := normalizeItem(`{"name":"D","fields":[{"name":"k","value":"first","type":0},{"name":"k","value":"second","type":0}]}`) - if err != nil { - t.Fatalf("normalizeItem: %v", err) - } - if n.Fields["k"] != "second" { - t.Fatalf("duplicate custom names must be last-wins, got %q", n.Fields["k"]) - } -} - -func TestNormalizeItemLinkedFieldSkipped(t *testing.T) { - // type 3 (linked) fields reference another field and carry a null value — - // they are not real data and must be skipped. - n, err := normalizeItem(`{"name":"L","login":{"username":"u"},"fields":[{"name":"linked","value":null,"type":3},{"name":"real","value":"r","type":0}]}`) - if err != nil { - t.Fatalf("normalizeItem: %v", err) - } - if _, ok := n.Fields["linked"]; ok { - t.Fatalf("linked field must be skipped: %v", n.Fields) - } - if n.Fields["real"] != "r" { - t.Fatalf("real custom field dropped: %v", n.Fields) - } -} - -func TestNormalizeItemMalformed(t *testing.T) { - if _, err := normalizeItem("not json"); err == nil { - t.Fatal("malformed item JSON must error") - } -} - -// getItem opens a session and runs `bw get item `, returning raw JSON. 
-func TestGetItemFlow(t *testing.T) { - f := &fakeRunner{out: map[string]string{ - "vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw", - "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x", - "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs", - "bw status": `{"status":"locked"}`, - "bw unlock": "SESS", - "bw get item github": sampleLoginItemJSON, - }} - uid := fmt.Sprintf("%d", os.Getuid()) - raw, err := getItem(f.run, "emo", uid, "github") - if err != nil || !strings.Contains(raw, `"name":"GitHub"`) { - t.Fatalf("getItem = %q, %v", raw, err) - } - // The session key must reach bw via env, never argv. - for _, call := range f.calls { - for _, arg := range call { - if strings.Contains(arg, "SESS") { - t.Errorf("session leaked into argv: %v", call) - } - } - } -} - -func TestVaultHelpMentionsAll(t *testing.T) { - if !strings.Contains(vaultHelp(), "--all") { - t.Error("vault help must document --all") - } -} - -// --- bw sync on read (freshness) ------------------------------------------ - -func TestBwSyncArgs(t *testing.T) { - if got := bwSyncArgs(); !reflect.DeepEqual(got, []string{"sync"}) { - t.Fatalf("bwSyncArgs = %v", got) - } -} - -// Every read opens a session that first `bw sync`s, so reads reflect the latest -// server-side values: `bw unlock` is local-only, so without a sync a persisted -// (already-logged-in) session serves a stale local cache. 
-func TestOpenSessionSyncsBeforeRead(t *testing.T) { - f := &fakeRunner{out: map[string]string{ - "vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw", - "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x", - "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs", - "bw status": `{"status":"locked"}`, - "bw unlock": "SESS", - "bw sync": "Syncing complete.", - "bw get password github": "p@ss", - }} - uid := fmt.Sprintf("%d", os.Getuid()) - if _, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"}); err != nil { - t.Fatalf("getValue: %v", err) - } - idx := func(prefix string) int { - for i, c := range f.calls { - if strings.HasPrefix(strings.Join(c, " "), prefix) { - return i - } - } - return -1 - } - syncAt, unlockAt, getAt := idx("bw sync"), idx("bw unlock"), idx("bw get password github") - if syncAt < 0 { - t.Fatal("expected a `bw sync` before the read") - } - if !(unlockAt < syncAt && syncAt < getAt) { - t.Fatalf("order wrong: unlock=%d sync=%d get=%d (want unlock= 2 && argv[0] == "kv" && argv[1] == "get" { - if tc.exists { - return `{"data":{"data":{}}}`, nil - } - return "", fmt.Errorf("No value found at secret/x") - } - return "", nil - } - runStdin := func(name string, argv, envv []string, stdin string) (string, error) { - stdinCalls = append(stdinCalls, recStdin{append([]string{name}, argv...), stdin}) - return "", nil - } - if err := kvPut(run, runStdin, "secret/x", "api_key", "SECRETVALUE"); err != nil { - t.Fatalf("kvPut: %v", err) - } - if len(stdinCalls) != 1 { - t.Fatalf("want exactly 1 stdin write, got %d", len(stdinCalls)) - } - sc := stdinCalls[0] - joined := strings.Join(sc.argv, " ") - if tc.wantCreate && !strings.Contains(joined, "kv put") { - t.Fatalf("absent path must use `kv put`: %v", sc.argv) - } - if !tc.wantCreate && !strings.Contains(joined, "kv patch -method=rw") { - t.Fatalf("present path must merge 
via `kv patch -method=rw`: %v", sc.argv) - } - if strings.Contains(joined, "kv patch") && !strings.Contains(joined, "-method=rw") { - t.Fatalf("must never use plain `kv patch`: %v", sc.argv) - } - if sc.stdin != "SECRETVALUE" { - t.Fatalf("value must travel via stdin, got %q", sc.stdin) - } - for _, a := range sc.argv { - if strings.Contains(a, "SECRETVALUE") { - t.Fatalf("value leaked into argv: %v", sc.argv) - } - } - }) - } -} - -func TestVaultHelpMentionsBothSystems(t *testing.T) { - h := vaultHelp() - for _, want := range []string{"Vaultwarden", "vault kv"} { - if !strings.Contains(h, want) { - t.Errorf("vault help must mention %q (distinguish the two systems)", want) - } - } - // Must name the infra-secrets system so the distinction is unambiguous. - if !strings.Contains(h, "HashiCorp") && !strings.Contains(h, "OpenBao") { - t.Error("vault help must name HashiCorp Vault / OpenBao (the infra secrets store)") - } -} diff --git a/cli/edges.go b/cli/edges.go deleted file mode 100644 index 396cc5b9..00000000 --- a/cli/edges.go +++ /dev/null @@ -1,164 +0,0 @@ -package main - -import ( - "fmt" - "regexp" - "strconv" - "strings" -) - -// edgesOpts is the parsed filter set for `homelab edges` (the who-talks-to-whom -// investigation helper over the goldmane_edges trail; see ADR-0014). -type edgesOpts struct { - ns string // edges touching this namespace (either direction) - src string // edges where src_ns = this - dst string // edges where dst_ns = this - peersOf string // distinct peers of this namespace (both directions) - newSince string // first_seen >= duration (24h/7d/30m) or date (YYYY-MM-DD) - denied bool // action = 'deny' only - asJSON bool // wrap result as a JSON array - limit int // row cap (default 200) -} - -// parseEdgesArgs parses the edges flag surface. Unknown flags error out so a -// typo surfaces instead of silently dumping the whole table. 
-func parseEdgesArgs(args []string) (edgesOpts, error) { - o := edgesOpts{limit: 200} - i := 0 - for i < len(args) { - a := args[i] - key, inline, hasInline := a, "", false - if eq := strings.IndexByte(a, '='); eq >= 0 { - key, inline, hasInline = a[:eq], a[eq+1:], true - } - needVal := func() (string, error) { - if hasInline { - return inline, nil - } - if i+1 < len(args) { - i++ - return args[i], nil - } - return "", fmt.Errorf("flag %s needs a value", key) - } - var err error - switch key { - case "--ns": - o.ns, err = needVal() - case "--src": - o.src, err = needVal() - case "--dst": - o.dst, err = needVal() - case "--peers-of": - o.peersOf, err = needVal() - case "--new-since": - o.newSince, err = needVal() - case "--denied": - o.denied = true - case "--json": - o.asJSON = true - case "--limit": - var v string - if v, err = needVal(); err == nil { - if o.limit, err = strconv.Atoi(v); err != nil { - err = fmt.Errorf("--limit must be an integer: %q", v) - } - } - default: - return o, fmt.Errorf("unknown flag: %s", a) - } - if err != nil { - return o, err - } - i++ - } - return o, nil -} - -// nsRE is the safe namespace-token charset (k8s names + "Global"). Used as the -// injection guard — anything else is rejected rather than quoted-and-hoped. -var nsRE = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`) - -func validateNS(s string) error { - if s == "" || len(s) > 63 || !nsRE.MatchString(s) { - return fmt.Errorf("invalid namespace name: %q", s) - } - return nil -} - -// sqlStr renders a SQL string literal (belt-and-suspenders on top of validateNS). -func sqlStr(s string) string { return "'" + strings.ReplaceAll(s, "'", "''") + "'" } - -var ( - durRE = regexp.MustCompile(`^(\d+)([smhd])$`) - dateRE = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}([ T]\d{2}:\d{2}(:\d{2})?)?$`) -) - -// newSinceCond turns a duration (24h/7d/30m/90s) or a date (YYYY-MM-DD[ HH:MM]) -// into a first_seen predicate. 
-func newSinceCond(v string) (string, error) { - if m := durRE.FindStringSubmatch(v); m != nil { - unit := map[string]string{"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}[m[2]] - return fmt.Sprintf("first_seen >= now() - interval '%s %s'", m[1], unit), nil - } - if dateRE.MatchString(v) { - return "first_seen >= " + sqlStr(v), nil - } - return "", fmt.Errorf("--new-since must be a duration (e.g. 24h, 7d, 30m) or a date (YYYY-MM-DD): %q", v) -} - -// buildEdgesQuery renders the SQL for the given filters against the `edge` table. -func buildEdgesQuery(o edgesOpts) (string, error) { - limit := o.limit - if limit <= 0 { - limit = 200 - } - - // peers-of is a distinct-peer summary, a different shape from the row list. - if o.peersOf != "" { - if err := validateNS(o.peersOf); err != nil { - return "", err - } - p := sqlStr(o.peersOf) - return fmt.Sprintf("SELECT DISTINCT peer, action FROM ("+ - "SELECT dst_ns AS peer, action FROM edge WHERE src_ns = %s "+ - "UNION SELECT src_ns AS peer, action FROM edge WHERE dst_ns = %s"+ - ") t ORDER BY peer LIMIT %d", p, p, limit), nil - } - - var conds []string - for _, f := range []struct{ val, tmpl string }{ - {o.ns, "(src_ns = %[1]s OR dst_ns = %[1]s)"}, - {o.src, "src_ns = %s"}, - {o.dst, "dst_ns = %s"}, - } { - if f.val == "" { - continue - } - if err := validateNS(f.val); err != nil { - return "", err - } - conds = append(conds, fmt.Sprintf(f.tmpl, sqlStr(f.val))) - } - if o.denied { - conds = append(conds, "action = 'deny'") - } - if o.newSince != "" { - c, err := newSinceCond(o.newSince) - if err != nil { - return "", err - } - conds = append(conds, c) - } - - q := "SELECT src_ns, dst_ns, action, flow_count, first_seen, last_seen FROM edge" - if len(conds) > 0 { - q += " WHERE " + strings.Join(conds, " AND ") - } - q += fmt.Sprintf(" ORDER BY first_seen DESC LIMIT %d", limit) - - if o.asJSON { - q = "SELECT coalesce(json_agg(row_to_json(t)), '[]') FROM (" + q + ") t" - } - return q, nil -} diff --git 
a/cli/edges_test.go b/cli/edges_test.go deleted file mode 100644 index c8ead29d..00000000 --- a/cli/edges_test.go +++ /dev/null @@ -1,163 +0,0 @@ -package main - -import ( - "strings" - "testing" -) - -func TestParseEdgesArgs(t *testing.T) { - cases := []struct { - name string - args []string - want edgesOpts - }{ - {"defaults", nil, edgesOpts{limit: 200}}, - {"ns", []string{"--ns", "immich"}, edgesOpts{ns: "immich", limit: 200}}, - {"ns equals", []string{"--ns=immich"}, edgesOpts{ns: "immich", limit: 200}}, - {"src dst", []string{"--src", "a", "--dst", "b"}, edgesOpts{src: "a", dst: "b", limit: 200}}, - {"peers-of", []string{"--peers-of", "authentik"}, edgesOpts{peersOf: "authentik", limit: 200}}, - {"denied json", []string{"--denied", "--json"}, edgesOpts{denied: true, asJSON: true, limit: 200}}, - {"new-since", []string{"--new-since", "24h"}, edgesOpts{newSince: "24h", limit: 200}}, - {"limit", []string{"--limit", "50"}, edgesOpts{limit: 50}}, - } - for _, c := range cases { - t.Run(c.name, func(t *testing.T) { - got, err := parseEdgesArgs(c.args) - if err != nil { - t.Fatalf("parseEdgesArgs(%v) error: %v", c.args, err) - } - if got != c.want { - t.Fatalf("parseEdgesArgs(%v) = %+v, want %+v", c.args, got, c.want) - } - }) - } -} - -func TestParseEdgesArgsErrors(t *testing.T) { - for _, args := range [][]string{ - {"--limit", "abc"}, - {"--bogus"}, - } { - if _, err := parseEdgesArgs(args); err == nil { - t.Errorf("parseEdgesArgs(%v) expected error, got nil", args) - } - } -} - -func TestBuildEdgesQueryDefaults(t *testing.T) { - q, err := buildEdgesQuery(edgesOpts{limit: 200}) - if err != nil { - t.Fatal(err) - } - for _, want := range []string{"FROM edge", "ORDER BY first_seen DESC", "LIMIT 200"} { - if !strings.Contains(q, want) { - t.Errorf("query %q missing %q", q, want) - } - } - if strings.Contains(q, "WHERE") { - t.Errorf("no-filter query should have no WHERE: %q", q) - } -} - -func TestBuildEdgesQueryFilters(t *testing.T) { - cases := []struct { - name 
string - o edgesOpts - want string - }{ - {"ns both directions", edgesOpts{ns: "immich", limit: 10}, "(src_ns = 'immich' OR dst_ns = 'immich')"}, - {"src only", edgesOpts{src: "authentik", limit: 10}, "src_ns = 'authentik'"}, - {"dst only", edgesOpts{dst: "dbaas", limit: 10}, "dst_ns = 'dbaas'"}, - {"denied", edgesOpts{denied: true, limit: 10}, "action = 'deny'"}, - } - for _, c := range cases { - t.Run(c.name, func(t *testing.T) { - q, err := buildEdgesQuery(c.o) - if err != nil { - t.Fatal(err) - } - if !strings.Contains(q, "WHERE") || !strings.Contains(q, c.want) { - t.Errorf("query %q missing WHERE/%q", q, c.want) - } - }) - } -} - -func TestBuildEdgesQueryCombinedFiltersAnded(t *testing.T) { - q, err := buildEdgesQuery(edgesOpts{src: "a", denied: true, limit: 5}) - if err != nil { - t.Fatal(err) - } - if !strings.Contains(q, "src_ns = 'a' AND action = 'deny'") { - t.Errorf("combined filters not AND'd: %q", q) - } -} - -func TestBuildEdgesQueryPeersOf(t *testing.T) { - q, err := buildEdgesQuery(edgesOpts{peersOf: "authentik", limit: 100}) - if err != nil { - t.Fatal(err) - } - for _, want := range []string{"DISTINCT", "src_ns = 'authentik'", "dst_ns = 'authentik'", "UNION"} { - if !strings.Contains(q, want) { - t.Errorf("peers-of query %q missing %q", q, want) - } - } -} - -func TestBuildEdgesQueryJSON(t *testing.T) { - q, err := buildEdgesQuery(edgesOpts{asJSON: true, limit: 200}) - if err != nil { - t.Fatal(err) - } - if !strings.Contains(q, "json_agg") || !strings.Contains(q, "row_to_json") { - t.Errorf("json query missing json_agg wrapper: %q", q) - } -} - -func TestBuildEdgesQueryRejectsInjection(t *testing.T) { - for _, bad := range []string{"a'; DROP TABLE edge;--", "a b", "a;b", "a\"b"} { - if _, err := buildEdgesQuery(edgesOpts{ns: bad, limit: 10}); err == nil { - t.Errorf("buildEdgesQuery(ns=%q) expected validation error, got nil", bad) - } - } -} - -func TestNewSinceCond(t *testing.T) { - cases := []struct { - in string - want string - }{ - {"24h", 
"first_seen >= now() - interval '24 hours'"}, - {"7d", "first_seen >= now() - interval '7 days'"}, - {"30m", "first_seen >= now() - interval '30 minutes'"}, - {"2026-06-28", "first_seen >= '2026-06-28'"}, - } - for _, c := range cases { - got, err := newSinceCond(c.in) - if err != nil { - t.Fatalf("newSinceCond(%q) error: %v", c.in, err) - } - if got != c.want { - t.Errorf("newSinceCond(%q) = %q, want %q", c.in, got, c.want) - } - } - for _, bad := range []string{"yesterday", "1y", "'; DROP", ""} { - if _, err := newSinceCond(bad); err == nil { - t.Errorf("newSinceCond(%q) expected error, got nil", bad) - } - } -} - -func TestValidateNS(t *testing.T) { - for _, ok := range []string{"immich", "calico-system", "kube-system", "Global", "pg-cluster-rw"} { - if err := validateNS(ok); err != nil { - t.Errorf("validateNS(%q) unexpected error: %v", ok, err) - } - } - for _, bad := range []string{"", "a b", "a'b", "a;b", "../x", "a$b"} { - if err := validateNS(bad); err == nil { - t.Errorf("validateNS(%q) expected error, got nil", bad) - } - } -} diff --git a/cli/homelab.go b/cli/homelab.go index 14b0afd4..62c0c8aa 100644 --- a/cli/homelab.go +++ b/cli/homelab.go @@ -20,7 +20,6 @@ func buildRegistry() []Command { reg = append(reg, deployCommands()...) reg = append(reg, netCommands()...) reg = append(reg, obsCommands()...) - reg = append(reg, edgesCommands()...) reg = append(reg, usageCommands()...) reg = append(reg, haCommands()...) reg = append(reg, browserCommands()...) diff --git a/cli/memory_test.go b/cli/memory_test.go index 1c673c7b..7b14ef20 100644 --- a/cli/memory_test.go +++ b/cli/memory_test.go @@ -5,31 +5,8 @@ import ( "os" "strings" "testing" - "unicode/utf8" ) -func TestTruncatePreviewKeepsValidUTF8(t *testing.T) { - // Byte-slicing a long Cyrillic string at 240 splits a 2-byte rune and emits - // invalid UTF-8 — the bug that crashed the recall hook. truncatePreview must - // cut on a rune boundary and always stay valid UTF-8. 
- long := strings.Repeat("я", 300) // 300 runes / 600 bytes - got := truncatePreview(long, 240) - if !utf8.ValidString(got) { - t.Fatalf("truncatePreview produced invalid UTF-8: %q", got) - } - if r := []rune(got); len(r) != 241 || string(r[:240]) != strings.Repeat("я", 240) || r[240] != '…' { - t.Fatalf("truncatePreview = %d runes, want 240 Cyrillic + ellipsis", len(r)) - } - // Short multibyte strings pass through untouched (no ellipsis). - if got := truncatePreview("кратко", 240); got != "кратко" { - t.Fatalf("short string altered: %q", got) - } - // ASCII boundary still works. - if got := truncatePreview(strings.Repeat("a", 500), 240); got != strings.Repeat("a", 240)+"…" { - t.Fatalf("ascii truncation wrong: %q", got) - } -} - func TestResolveMemoryBase(t *testing.T) { old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL") defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }() diff --git a/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md b/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md index 67022732..9e0e2192 100644 --- a/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md +++ b/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md @@ -13,7 +13,7 @@ The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`: - Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.** -- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`. 
`Plotting-Your-Dream-Book` (owned by Anca, dev in her org) keeps its GHA build in-place and pushes the image to **its own org's ghcr** (`ghcr.io/passionprojectsanca/book-plotter`, private) via the workflow's built-in `GITHUB_TOKEN` — no Forgejo mirror, no `viktorbarzin`-namespace push, no shared PAT in her repo (2026-06-27, migrated off DockerHub). +- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`. - `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge. - Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror." diff --git a/docs/adr/0011-homelab-usage-telemetry.md b/docs/adr/0011-homelab-usage-telemetry.md index fc0c4e76..c383211b 100644 --- a/docs/adr/0011-homelab-usage-telemetry.md +++ b/docs/adr/0011-homelab-usage-telemetry.md @@ -5,14 +5,6 @@ exists to answer the question that drove the whole CLI — *which verbs are wort adding next* — with data instead of one maintainer's habits (the earlier mining covered a single user's ~51k commands, so the surface is shaped to that user). 
-> **Update (2026-06-26) — the cross-user privacy *norm* below is superseded by -> [ADR-0015](0015-os-is-the-authorization-boundary.md).** The prohibition this -> ADR leaned on ("reading another user's `~/.claude` is off-limits even for an -> owner in-session") no longer holds: the managed-settings policy now **defers -> to OS/sudo authorization**. The `usage top` telemetry design itself is -> unchanged and still current — only the "never read homes" framing in the -> third decision below is overtaken. - ## Decisions - **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows diff --git a/docs/adr/0014-service-identity-and-east-west-observability.md b/docs/adr/0014-service-identity-and-east-west-observability.md index cdccac4f..5eb1c83a 100644 --- a/docs/adr/0014-service-identity-and-east-west-observability.md +++ b/docs/adr/0014-service-identity-and-east-west-observability.md @@ -27,9 +27,3 @@ As the Service count grows we want an audit-grade record of which Service talks - **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency. - **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**. - **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary. - -## As-built (2026-06-25) - -Implemented across infra issues #57–#63. 
**One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (all Slack consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48. - -Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`. 
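
The edge-set upsert described above can be sketched in a few lines. This is an illustrative model only, not the `goldmane-edge-aggregator` code: the flow dictionary shape (`src_ns`, `dst_ns`, `action`, `ts`), the in-memory dict standing in for the `goldmane_edges` table, and the names `Edge` and `upsert_edges` are all assumptions made for the example. It also assumes `flow_count` is incremented once per observed flow.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

# Hypothetical key: one row per unique (src_ns, dst_ns, action) triple.
EdgeKey = Tuple[str, str, str]


@dataclass
class Edge:
    first_seen: int
    last_seen: int
    flow_count: int


def upsert_edges(edges: Dict[EdgeKey, Edge], flows: Iterable[dict]) -> Dict[EdgeKey, Edge]:
    """Fold a stream of flows into the unique namespace-pair edge set."""
    for flow in flows:
        src, dst = flow.get("src_ns", ""), flow.get("dst_ns", "")
        # Drop empty-namespace flows and self-edges, as the text describes.
        if not src or not dst or src == dst:
            continue
        key = (src, dst, flow["action"])
        ts = flow["ts"]
        edge = edges.get(key)
        if edge is None:
            edges[key] = Edge(first_seen=ts, last_seen=ts, flow_count=1)
        else:
            # Upsert: widen the time window and bump the counter.
            edge.first_seen = min(edge.first_seen, ts)
            edge.last_seen = max(edge.last_seen, ts)
            edge.flow_count += 1
    return edges
```

Keying on the namespace pair plus action keeps the table bounded by the number of distinct edges rather than the number of flows, which is presumably why a plain relational table is enough for the durable trail.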
diff --git a/docs/adr/0015-os-is-the-authorization-boundary.md b/docs/adr/0015-os-is-the-authorization-boundary.md deleted file mode 100644 index 8999682b..00000000 --- a/docs/adr/0015-os-is-the-authorization-boundary.md +++ /dev/null @@ -1,57 +0,0 @@ -# OS is the authorization boundary: agents defer to Unix/sudo, not a stricter in-policy rule - -Supersedes the cross-user privacy *norm* that the devvm managed-settings policy -carried and that ADR-0011 leaned on ("never read another user's home / -`~/.claude`, off-limits even for an owner in-session"). ADR-0011's actual -subject — `usage top` telemetry and its emit design — is unchanged and still -current; only the privacy prohibition it referenced is superseded here. - -## Context - -The devvm managed-settings policy (`/etc/claude-code/managed-settings.json`, -`claudeMd`) carried two rules that were, in practice, *stricter than the OS*: -"you are not the admin, do not escalate privileges" and "never read another -user's home directory, credentials, tokens, or `~/.claude`." The OS told a -different story: `wizard` holds `(ALL) NOPASSWD: ALL` — full passwordless root. -The kernel had already granted total read access; the policy was layering an -artificial refusal on top of an authorization the OS already permits, and the -"not the admin" framing was factually wrong for a NOPASSWD-root user. - -Two honest ways to resolve the inconsistency: tighten sudo to match the policy, -or loosen the policy to match the OS. The owner chose the latter on 2026-06-26, -for analytics/debugging across the shared box. - -## Decision - -- **Authorization follows the OS, not this policy.** Agents may access whatever - their OS user can access — directly or via `sudo` where they hold sudo rights - — and must not impose restrictions stricter than the OS. On this box that - includes other users' home directories and `~/.claude` for users who hold - broad sudo. -- **No separate prompt or carve-out** for OS-authorized access. 
The Unix - permission model + sudoers is the single source of truth for who may read - what. Other homes are `0750`-owned, so a cross-home read necessarily transits - `sudo` and is therefore captured in the sudo/auth audit log. -- **Cluster/infra RBAC tiering is unchanged.** kubectl / Vault / infra access - stays scoped to each user's RBAC tier; "defer to the OS" is about OS-level - file access, not a licence to exceed cluster RBAC. -- **Scope is symmetric and multi-user.** The rule lives in the *shared* - managed-settings, so every user's agents defer to that user's own sudo grant. - Any user with broad sudo gets the same cross-home read capability over other - users' files. Accepted by the owner with that understanding; emo's and - ancamilea's `~/.claude` is now agent-readable by sudo-holders. -- **Takes effect in a fresh session.** managed-settings loads at session start; - the session that made the change keeps running under the old policy. - -## Consequences - -- The privacy-preserving telemetry rationale in ADR-0011 (`usage top` as the - "cross-user analytics without reading homes" answer) remains useful but is no - longer the *only* sanctioned path; direct reads via `sudo` are now permitted. -- Larger blast radius: if an agent session running as a sudo-holder is - prompt-injected or otherwise compromised, it can now read every user's secrets - with no in-agent friction (sudo here is passwordless). The sudo/auth audit log - is the remaining accountability control. -- Reversible: restore the prior `claudeMd` bullets (backup kept at - `/etc/claude-code/managed-settings.json.bak-2026-06-26`) and start a fresh - session. 
diff --git a/docs/architecture/authentication.md b/docs/architecture/authentication.md index 620bcf6b..9decc8dc 100644 --- a/docs/architecture/authentication.md +++ b/docs/architecture/authentication.md @@ -86,56 +86,10 @@ Signin latency is dominated by screen count and round trips, not server time use the explicit-consent flow (it re-prompted every 4 weeks per app). - **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache, - 15m policy cache, gunicorn `max_requests=10000`/jitter=1000 (recycle - hardening — decorrelates the 9 workers' recycles from PG blips). **No - `CONN_MAX_AGE`** — persistent Django connections pin a PgBouncer server conn - 1:1 and saturate the session-mode pool (reverted 2026-06-10). + 15m policy cache, 60s persistent DB connections. - **Static assets cached immutable**: `/static` ingress carve-out adds `Cache-Control: public, max-age=31536000, immutable` (assets are version-fingerprinted; authentik itself sends no max-age). -- **Rate-limit carve-out** (2026-06-28): `/` and `/static` use a dedicated - `authentik-rate-limit` (100/1000) instead of the shared 10/50 default — the - login SPA cold-loads ~70 flow-executor chunks from `/static`; the default - burst 429'd the tail and a failed ES-module import left a blank login screen. -- **Readiness tolerance** (2026-06-28): server `readinessProbe.failureThreshold:8` - (~80s, was the chart-default ~30s). The probe (`/-/health/ready/`) queries the - DB; too-tight tolerance let a sub-60s PG/pgbouncer transient return 503 on all - 3 server pods at once → Traefik had no healthy backend → 502/503/504 (episodic - blank login + 30s hangs). 80s absorbs a full CNPG failover reconnect. Sessions - + cache are PostgreSQL-only since Redis was removed in 2026.2 (no external-cache - option), so request-serving is coupled to PG — this survives a short transient, - not a total CNPG outage. 
-- **Rolling-update strategy** (2026-06-28): the chart key is `deploymentStrategy` - (the repo's old `strategy:` key was silently inert → live ran the chart-default - 25%/25% and dropped a server pod out of rotation on every roll). Now - `maxSurge:1/maxUnavailable:0` keeps all 3 ready throughout a roll. -- **Old-browser login (SFE)** (2026-06-28): authentik's modern flow SPA is ES2022 - and renders a **blank login** on Safari/WebKit ≤16.3 (every iOS browser shares - the system WebKit, so it's not browser-choice — e.g. iPadOS ≤15). The overlay - image patches `flows/views/interface.py::compat_needs_sfe()` to also serve - authentik's built-in no-JS **Simplified Flow Executor** (SFE, ES5) to old Safari - **and any iOS browser** (Chrome/Firefox on iOS are WebKit skins) on iOS ≤16.3, - so those clients get the *real* authentik login (password + MFA + reputation — - no auth downgrade). The SFE can't render Identification-stage **sources** - (authentik limitation), so the patch also injects static social-login `` - links into `flow-sfe.html` (→ `/source/oauth/login//`, plain redirects) — - required for password-less accounts (e.g. Google-only users). A Traefik - basic-auth fallback was rejected: it would have put a single spoofable-UA - password in front of `vbarzin→wizard` (passwordless root on the devvm). See - `stacks/authentik/patch-compat-sfe.py`. -- **SFE + forced-WebAuthn MFA gotcha** (2026-06-28): the `default-authentication-flow` - MFA stage (`not_configured_action=configure`, `conf_stages=[webauthn]`) force-enrols - a WebAuthn passkey for any **password**-path user with no MFA device — but the SFE - **cannot render WebAuthn** (enrol *or* validate), so that user gets - `unsupported state: ak-stage-authenticator-webauthn`. 
Two escape hatches, **no MFA - downgrade**: (1) **social login** — sources run `default-source-authentication` - (UserLoginStage only, **no MFA stage**), so the SFE's "Continue with " - button always completes; (2) **enrol TOTP** — the SFE *can* validate TOTP codes, and - ≥1 confirmed device flips the stage from force-enrol to validate. User MFA devices are - runtime data (not Terraform): enrol via `ak shell` - (`TOTPDevice.objects.create(user=…, confirmed=True)`) and store the secret in the - user's own Vaultwarden item. (Done for emo — the Google-only iPadOS-15 case: TOTP in - his `authentik.viktorbarzin.me` Bitwarden item; e2e-verified the BW code is accepted.) - **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`). - **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request TCP setup on the forward-auth subrequest path. diff --git a/docs/architecture/chrome-service.md b/docs/architecture/chrome-service.md index 118c0895..6f9c1ee4 100644 --- a/docs/architecture/chrome-service.md +++ b/docs/architecture/chrome-service.md @@ -205,43 +205,6 @@ healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts* wrapper in `main.tf` (so it applies deterministically even though the image is `:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix as the android-emulator stack. - -### noVNC black after a browser-container restart (x11vnc supervision) - -A **distinct** failure from the fd-sweep gotcha above: the noVNC client *connects* -but the view is **black**, and the novnc container logs spew -`connecting to: localhost:5900` → `Failed to connect ... [Errno 111] Connection -refused` (x11vnc is **down**, not slow). Cause: `x11vnc` and `websockify` both run -in the **novnc** container, but x11vnc attaches to the **chrome-service** (browser) -container's Xvfb over `localhost:6099` (shared pod network). 
When the browser -container restarts — Chrome exits cleanly (exit 0, "Completed") or crashes — its -Xvfb vanishes and x11vnc loses its X connection and exits. - -`entrypoint.sh` **supervises** x11vnc: it launches x11vnc and websockify as -background children and `wait -n`s on them, exiting non-zero if **either** dies, so -the kubelet restarts the novnc container, which re-waits for Xvfb on `:6099` and -relaunches x11vnc — the bridge **self-heals** across browser-container restarts. -(Before 2026-06-27, x11vnc was an unsupervised background child of an `exec`ed -websockify; a dead x11vnc was never relaunched, leaving `:5900` dead — a -`` zombie — and the view black until a manual pod restart. Same -supervision pattern as the android-emulator stack's entrypoint.) - -**Diagnose:** `kubectl exec -c novnc -- ps aux | grep x11vnc` (a ``/Z -entry = the bug); or the RFB-banner probe from a sibling container (`python3 -c -"import socket;s=socket.socket();s.settimeout(2);s.connect(('127.0.0.1',5900));print(s.recv(12))"` -— healthy returns `b'RFB 003.008\n'`, broken = `ConnectionRefused`). **Immediate -recovery** (no image change): restart just the novnc container with `kubectl exec --n chrome-service deploy/chrome-service -c novnc -- kill 1` — re-runs its entrypoint -and relaunches x11vnc **without** touching the browser session/in-flight CDP jobs. - -> **Deploying a rebuilt novnc entrypoint:** Keel is **off** for this deployment -> (`keel.sh/policy=never`, because the browser container's playwright image is -> version-pinned to f1-stream) and the image is `:latest`/`IfNotPresent`, so a -> rebuilt `:latest` will **not** redeploy on its own. After the -> `build-chrome-service-novnc.yml` GHA build pushes `:latest` + `:`, -> **SHA-pin** the novnc `image` in `main.tf` to the new `:` to force the pull -> and rollout (the novnc image is TF-managed — not in the deployment's -> `lifecycle.ignore_changes`). 
- **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`) serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`, bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088 @@ -293,42 +256,6 @@ Key facts: byte-identical copy of `files/stealth.js`, guarded by a drift test — so the CLI's stealth never diverges from the in-cluster callers'. -## Multi-user access (sharing the browser) - -There is ONE chrome-service browser with ONE persistent profile, warmed with -**Viktor's** logged-in sessions. CDP has no per-context auth, so anyone who can -drive the browser — over the noVNC view OR the CDP/`homelab browser` path — can -reach the persistent profile (`browser.contexts[0]`) and therefore Viktor's -sessions. Access is gated accordingly, per user. - -**Decision (2026-06-28):** emo (`emil.barzin` / `emil.barzin@gmail.com`) SHARES -Viktor's browser for form-filling + captcha solving, rather than getting an -isolated instance. The session-exposure trade-off above was explicitly accepted. - -Two independent grants make up "browser access" for a user: - -1. **noVNC (interactive view, `chrome.viktorbarzin.me`)** — gated by the Authentik - `admin-services-restriction` policy: the `CHROME_ALLOWED` set - (`stacks/authentik/admin-services-restriction.tf`) matches the user's Authentik - username OR email. Add the user there. No kubeconfig/RBAC needed. -2. **CLI (`homelab browser`, CDP over port-forward)** — needs `pods/portforward` - in `chrome-service` PLUS a non-interactive credential (a normal devvm user's - kubeconfig is interactive-OIDC-only and can't authenticate a headless agent - session). Provided by a per-user **ServiceAccount** with a long-lived token - (`stacks/chrome-service/rbac.tf`, e.g. `emo-browser`): `pods/portforward` in - this namespace + cluster read-only (`oidc-power-user-readonly`, so it can also - resolve the Service and doesn't regress the user's normal read). 
The devvm - provisioner (`scripts/t3-provision-users.sh` → `install_browser_kubeconfig`) - reads that token and installs it as the user's DEFAULT kubeconfig context - (`-browser@homelab`), keeping their personal OIDC login as the - `oidc@homelab` named context. The SA's existence is the source of truth for who - gets the CLI — the provisioner no-ops for users without a `-browser` SA. - -**To grant another user:** add them to `CHROME_ALLOWED` (noVNC) and/or add a -`-browser` SA + bindings mirroring `emo-browser` in `rbac.tf` (CLI), then run -the provisioner. To revoke: remove from `CHROME_ALLOWED` and delete the SA (rotate -a token by deleting its `-browser-token` Secret). - ## Limits + risks - **Anti-bot vs stealth arms race** — when an upstream beats us (DRM diff --git a/docs/architecture/ci-cd.md b/docs/architecture/ci-cd.md index 5a9c3722..35e041e6 100644 --- a/docs/architecture/ci-cd.md +++ b/docs/architecture/ci-cd.md @@ -115,67 +115,9 @@ claude-agent-service, claude-memory-mcp, kms-website, Freedify, instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`), fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website, -k8s-portal, apple-health-data, audiblez-web, insta2spotify, +k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search) now also land on ghcr. -**plotting-book** is a special case (a GitHub-first repo owned by Anca, -ADR-0003): the build runs in *her* GitHub repo -(`PassionProjectsAnca/Plotting-Your-Dream-Book`) and pushes to **private -`ghcr.io/passionprojectsanca/book-plotter`** — under her org's ghcr namespace, -not `viktorbarzin`, using the workflow's built-in `GITHUB_TOKEN` (no shared -PAT). The cluster pulls it via the Kyverno-synced `ghcr-credentials` secret (the -`plotting-book` namespace is on the allowlist; the shared `ghcr_pull_token` has -read access). 
Migrated off public DockerHub (`viktorbarzin/book-plotter`) on -2026-06-27. The Woodpecker deploy hook (repo 43, registered to Anca's repo) is -unchanged. Flow: - -```text - DEVELOP ─────────────────────────────────────────────────────────────────────── - Anca (Codex / t3 web agent) - │ git push → main - ▼ - ┌──────────────────────────────────────────────────────────────┐ - │ GitHub: PassionProjectsAnca/Plotting-Your-Dream-Book (private)│ ← canonical - │ .github/workflows/build-and-deploy.yml on: push → main │ - └───────────────────────────┬──────────────────────────────────┘ - │ GitHub Actions runner (off-infra build · ADR-0002) - ┌────────────────────┴─────────────────────────────────┐ - ▼ ▼ - ┌─────────────────────────────────────────────┐ ╔═══════════════════════════════════════╗ - │ build job │ push ║ GHCR · PRIVATE package ║ - │ • svu next --always → tag vX.Y.Z (→ repo) │═════▶║ ghcr.io/passionprojectsanca/ ║ - │ • buildx linux/amd64, provenance:false │ tags ║ book-plotter :vX.Y.Z :latest ║ - │ • login ghcr (GITHUB_TOKEN, packages:write)│ ╚═══════════════════╤═══════════════════╝ - │ • delete-package-versions (keep newest 10) │ │ - └───────────────────────┬─────────────────────┘ │ pull (private, - ▼ deploy job [gate: repo var DEPLOY_ENABLED ≠ "false"] via secret) - POST ci.viktorbarzin.me/api/repos/43/pipelines {IMAGE_TAG, IMAGE_NAME} │ - ▼ │ - ┌─────────────────────────────────────────────────────────────┐ │ - │ Woodpecker repo 43 · .woodpecker/deploy.yml (event: manual) │ │ - │ kubectl set image deployment/plotting-book = :vX.Y.Z │ │ - │ kubectl rollout status │ │ - └───────────────────────────┬─────────────────────────────────┘ │ - ▼ │ - ═══════════════ Kubernetes · ns: plotting-book ════════════════════════════ │ - ┌─────────────────────────────────────────────────────────────┐ │ - │ Deployment plotting-book (Recreate · image = ignore_changes)│ │ - │ imagePullSecrets: ghcr-credentials ────────pull───────────┼─────────────────┘ - │ Pod → Express :3001 + 
SQLite on PVC (proxmox-lvm) │ - └─────────────────────────────────────────────────────────────┘ - guards / supporting: - • Kyverno require-trusted-registries [Enforce] → ghcr.io/* ALLOWED (admission) - • Keel policy=patch @1h → watches GHCR via ghcr-credentials (backstop) - • ghcr-credentials ⇐ Kyverno generate-clone ⇐ Vault secret/viktor/ghcr_pull_token - - ═══════════════ Serving path (unchanged) ══════════════════════════════════ - Browser ─▶ plotting-book.viktorbarzin.me (non-proxied DNS → Traefik .203) - ─▶ Authentik forward-auth (gate) ─▶ Service :80 ─▶ Pod :3001 -``` - -Governance: the Deployment + Kyverno allowlist are Terraform (`stacks/plotting-book`, -`stacks/kyverno`); the live image *tag* is CI-owned (`ignore_changes`). - ### Infra-owned images (issues #29 / #30) Images owned by the infra repo build on GHA workflows **in the infra repo's own @@ -221,9 +163,9 @@ Woodpecker is **deploy + cluster-touching steps only**: | Pipeline | File | Purpose | |----------|------|---------| | per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) | -| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`). **Skips Tier-0 `vault`** — it's human-applied via OIDC; the CI `ci` role lacks Vault-admin perms (`sys/mounts`, `sys/policies/acl`) so a CI apply 403s | +| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) | | certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron | -| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`). 
**Skips Tier-0 `vault`** (its `plan` 403s under the `ci` role and would fail the whole run) | +| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) | | provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec | | registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change | | pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE | @@ -234,38 +176,6 @@ Woodpecker is **deploy + cluster-touching steps only**: **No build/test pipeline exists on any repo.** Do not (re)introduce one. -### `default.yml` apply: dual-registration de-dup + reliability (2026-06-28) - -infra is registered in Woodpecker on **both** the canonical Forgejo repo (id 82) -and the legacy GitHub mirror (id 1), and **both fire `default.yml` on every -push**. Left unguarded, two `terragrunt apply` runs race each other for the -per-stack PG state lock — historically the #1 source of `Error acquiring the -state lock` failures and push-supersede "killed" runs. - -- **Forge guard** (first command in the `apply` step): the push-apply runs **only - on the canonical Forgejo forge**; on the GitHub mirror it logs `[forge-guard]` - and `exit 0`s. Detection: `CI_REPO_URL`/`CI_FORGE_URL` contains `github.com` → - skip. Fail-open (unknown forge still applies). The mirror keeps running the - **crons** (drift-detection, renew-tls, …), which live on repo 1 — only its - duplicate push-apply no-ops. (Crons were NOT moved; deactivating repo 1 would - have killed them.) -- **Lock-skip matches both tiers**: a stack whose apply hits a lock is SKIPPED, - not failed. The grep now matches the Tier-0 Vault message (`is locked by`) **and** - the Tier-1 PG-backend message (`Error acquiring the state lock` / `already - locked`) — the PG case was previously miscounted as a hard failure. 
-- **Transient retry** (bounded, 3 attempts): only provider-registry download - timeouts (`Failed to install provider` / `Client.Timeout`) and Vault 5xx are - retried. Config errors (missing arg, invalid index) and helm `atomic` timeouts - are NOT retried — they fail fast. - -A pre-apply off-infra validate gate was evaluated and rejected: `terraform -validate` runs without state but catches ~0 of the observed failures (they are -provider-config-from-Vault-data, server-side-apply conflicts, helm installs, and -lock contention — all invisible to static validate), and `plan` cannot run -off-infra (no Vault/PG access). `terragrunt apply` already fails at its plan -phase without mutating on config errors, so a separate in-pipeline plan-gate was -also dropped as redundant. - ### Woodpecker API Uses **numeric repo IDs** (`/api/repos//pipelines`), NOT owner/name paths diff --git a/docs/architecture/monitoring.md b/docs/architecture/monitoring.md index 06ee943f..3c75a345 100644 --- a/docs/architecture/monitoring.md +++ b/docs/architecture/monitoring.md @@ -286,7 +286,7 @@ Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (por #### Security Alerts (Wave 1 — planned, beads `code-8ywc`) -Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/]` title styling so security-lane alerts stand out there). Same handling path as infra alerts; severity labels carried in the alert (critical/warning/info). The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only). +Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. 
Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only). | # | Source | Event | Severity | |---|---|---|---| @@ -318,20 +318,9 @@ IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-poli Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out. - **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m). -- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → posts to **`#alerts`** via the `slack-security` receiver, which keeps its `[SECURITY]` styling (Slack-only, no paging; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). 
`probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages). +- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages). - **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' ''`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.) -#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014) - -Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both route the default `slack-warning` receiver → **`#alerts`**. 
- -| Alert | Expr (abridged) | For | Severity | -|---|---|---|---| -| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning | -| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning | - -The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`). - #### Backup Alerts - **PostgreSQLBackupStale**: >36h since last backup - **MySQLBackupStale**: >36h since last backup diff --git a/docs/architecture/multi-tenancy.md b/docs/architecture/multi-tenancy.md index 2cabf9e7..c64a146c 100644 --- a/docs/architecture/multi-tenancy.md +++ b/docs/architecture/multi-tenancy.md @@ -541,7 +541,7 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1 **RBAC tiers:** `admin` (Viktor — cluster-admin, unlocked tree, secrets) · `power-user` (cluster-wide read-only, NO Secrets, via a dedicated `oidc-power-user-readonly` ClusterRole) · `namespace-owner` (admin in own namespace only). Each session acts as the user's **own** OIDC identity (kubelogin), never the admin's. -**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). 
Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **(2026-06-26: the managed `claudeMd` now defers OS-level file access to the OS/sudo — a user holding broad `sudo` may read other users' files incl. `~/.claude`; the mode-600 / no-symlink posture is unchanged but is no longer reinforced by an agent "never read other homes" rule. See [ADR-0015](../adr/0015-os-is-the-authorization-boundary.md).)** **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour. +**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. 
**The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour. **Memory — homelab CLI hooks (rolled out 2026-06-21, deploy-fixed 2026-06-22):** the per-user `claude_memory` MCP was retired for the **homelab-memory hooks** — the reconcile's `install_memory` (re)installs four scripts into `~/.claude/hooks/` each run (`homelab-memory-recall.py` UserPromptSubmit recall, `auto-learn.py` Stop-hook extraction, `pre-compact-backup.sh`/`post-compact-recovery.sh`), wires them into `settings.json` if-absent + additive, and removes the old `claude_memory` MCP. **The provisioner binary itself now self-deploys from the repo** (step 0: `bash -n`-gated `install` + re-exec when `scripts/t3-provision-users.sh` differs from `/usr/local/bin/t3-provision-users`, guarded against re-exec loops / DRY_RUN mutation) — added after this very rollout sat committed-but-undeployed for a day (only the manual `setup-devvm.sh` had ever deployed the binary), so the hourly reconcile kept running the pre-memory version and emo/anca silently lost memory (recall + auto-learn never wired). A latent `set -e` abort in `install_memory` (a bare `[[ -d plugin-dir ]] && …` returning non-zero) was also fixed; it had killed the reconcile after the first user the first time it actually ran. 
The hooks need a `MEMORY_API_KEY` (or `CLAUDE_MEMORY_API_KEY`) in the user's `settings.json` env — the `homelab` CLI defaults the API URL, so **the key is the only hard requirement**; `install_memory` reuses an existing key and only WARNs if absent (it does NOT mint one — that's an admin Vault step, see Remaining). wizard + emo carry a key from their original MCP setup; **ancamilea is keyless → her memory no-ops until a key is minted.** (`auto-learn.py`'s passive store calls the API directly, so it additionally needs `*_API_URL` in env to avoid its local-SQLite fallback; recall + manual `homelab memory store` go through the URL-defaulting CLI and need only the key.) diff --git a/docs/architecture/networking.md b/docs/architecture/networking.md index 070cc59e..4659038a 100644 --- a/docs/architecture/networking.md +++ b/docs/architecture/networking.md @@ -261,7 +261,7 @@ Traefik chain: 1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`). 2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend. -3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). 
Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), and authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients). +3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads) and ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load). 4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors). Additional middleware: @@ -550,7 +550,7 @@ chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Che **Diagnosis**: Check Traefik middleware config for the affected IngressRoute. -**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik--rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, authentik 100/1000 (login SPA `/static` chunk burst → blank screen). 
+**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik--rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300. ### Large Downloads or Uploads Truncate / Fail Partway diff --git a/docs/architecture/security.md b/docs/architecture/security.md index 1cec0de6..7d3043ea 100644 --- a/docs/architecture/security.md +++ b/docs/architecture/security.md @@ -132,13 +132,6 @@ for the supersession history — there is no longer an inline Traefik bouncer.) account hard-limits to **one** list), and CAPI is already covered in-kernel on direct hosts and by Cloudflare's own managed protections on proxied hosts. Registered bouncer key: **`kvsync`**. -- **Rate-limit resilient (2026-06-27):** Cloudflare's Lists-API *write* endpoint - is throttled (~per-60s; `429 retry-after`). The CronJob runs `backoff_limit=0` - (one POST per cycle — the `*/2` schedule IS the retry cadence) and treats a CF - `429` as a soft-skip (exit 0, retry next cycle), the same fail-safe pattern it - uses for LAPI. An earlier `backoff_limit=2` fired 3 rapid POSTs/cycle and - escalated the throttle into a stuck state that left the list empty — a - self-inflicted DoS that this change prevents. - **Block-only**: the single-list limit precludes a separate captcha/managed-challenge list, so both ban and captcha decisions are enforced as a plain block at the edge. @@ -279,7 +272,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.** The block below documents the locked design. -Response model: **(I) Slack-only, daily skim.** All security alerts post to **`#alerts`** via Alertmanager (the `slack-security` receiver keeps its distinct `[SECURITY/]` title styling so security-lane alerts still stand out). 
The dedicated `#security` channel was abandoned (2026-06-25) — the shared `alertmanager_slack_api_url` incoming webhook's Slack app isn't a member of it, so a channel override there returns HTTP `404 channel_not_found`; everything consolidated to `#alerts`. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection. +Response model: **(I) Slack-only, daily skim.** All security alerts land in a new `#security` Slack channel via Alertmanager. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection. #### Detection sources @@ -292,7 +285,7 @@ Response model: **(I) Slack-only, daily skim.** All security alerts post to **`# #### Alert rules (16 total) -Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/]` title styling so security-lane alerts stand out there; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it). Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) carried in the alert. +Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) inside the single `#security` channel. **K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):** @@ -371,69 +364,6 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.** - Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs. 
- Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972). -#### Deriving the per-namespace egress allowlist from the edge trail (Wave 1 W1.7) - -The durable **east-west flow trail** (below) is now the preferred data source for -the *internal* (namespace-to-namespace) half of each Wave-1 egress allowlist — -faster and identity-stamped vs the original iptables-`LOG`→journald→Loki path -(ADR-0014: "Enforcement gains a better data source"). The unique observed -namespace pairs live in CNPG DB `goldmane_edges`, table `edge`. To derive the -namespaces a source is observed talking to (the `allow` set that seeds its -NetworkPolicy): - -```sql -SELECT DISTINCT dst_ns FROM edge WHERE src_ns='' AND action='allow' ORDER BY dst_ns; -``` - -The full SQL recipe (whole-cluster matrix, deny sanity-checks, the ≥7-day -observation caveat) is in -[runbooks/goldmane-flow-trail.md → Deriving the Wave-1 egress allowlist](../runbooks/goldmane-flow-trail.md#deriving-the-wave-1-egress-allowlist-from-the-edge-table-infra-62). -**External / public-internet egress is NOT in this table** (empty-namespace flows -are dropped) — for those destinations keep using the Calico flow-log observation -(the W1.6 snapshot, `wave1-egress-observation-2026-05-22.md`). This feeds the -existing observe-then-enforce effort (beads `code-8ywc`); **enforce-flips remain -out of scope** of the trail — it is observe-and-derive only. - -### East-west flow observability (Goldmane / Whisker + edge trail) (ADR-0014) - -The "who-talks-to-whom" data plane that succeeds raw iptables-`LOG` lines (which -carried no identity). **Service identity = the workload's namespace** (primary), -refined by a `service-identity` label in the few multi-Service namespaces -(`monitoring`, `kube-system`, `dbaas`). End-to-end trail, three layers: - -1. 
**Calico Goldmane + Whisker** (`calico-system`) — Goldmane aggregates - identity-stamped flows (ns/pod/workload/labels + allow-deny + policy-trace) - streamed from Felix over gRPC into a **~60-min in-memory ring buffer** (no - etcd/API writes — the etcd-cost constraint that drove the design). **Whisker** - is its live web UI at `whisker.viktorbarzin.me` (Authentik-gated, - `auth = "required"` — Whisker has no own login; an additive NetworkPolicy ORs - Traefik past the operator's default-deny `whisker` NP). The ring buffer is - **not** a trail (lost on Goldmane restart). Enabled via operator CRs in - `stacks/calico/main.tf`; reversible toggle (Goldmane is OSS tech-preview). -2. **`goldmane-edge-aggregator`** (`stacks/goldmane-edge-aggregator`) — streams - Goldmane's gRPC `Flows.Stream` over **mTLS** and upserts the low-cardinality - namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen, - flow_count)`) into CNPG DB `goldmane_edges`. Self-edges and empty-namespace - (public-internet) flows are dropped — in-cluster relationships only. The mTLS - client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** - (Goldmane verifies CA-chain only, not identity) rather than copying the CA - private key into TF state — **re-apply the stack if the operator rotates that - Secret**. -3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to - **`#alerts`** (reuses the alert-digest webhook). All Slack now consolidates to - `#alerts`; the `#security` channel was abandoned 2026-06-25 because that - webhook's Slack app isn't a member of it (a `#security` override 404s). See - runbook. - -The trail is **attribution-grade, not cryptographic** (reconstructs events in a -trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model -limit; east-west stays plaintext, no mTLS between app pods). 
Health is covered by -the **`AggregatorDown`** + **`DigestFailing`** alerts and cluster-health check #48 -(see monitoring.md). Full as-built, query recipes, and troubleshooting: -[runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md). Decision: -[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md); glossary -`CONTEXT.md` → **Service identity**, **Goldmane / Whisker**. - ### TLS & HTTP/3 **Traefik** handles TLS termination: diff --git a/docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md b/docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md deleted file mode 100644 index eaa24286..00000000 --- a/docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md +++ /dev/null @@ -1,117 +0,0 @@ -# k8s-upgrade compat-gate: classify "actionable" vs "held" blocks - -**Date:** 2026-06-28 -**Status:** design → implementation -**Stack:** `stacks/k8s-version-upgrade` (+ `stacks/monitoring` alert rules) - -## Problem - -The cluster is on k8s 1.35.6. The nightly `k8s-version-check` chain detects the -next minor (1.36.2), runs the preflight compat-gate, and the gate **refuses** -it — because no released kyverno/ESO supports k8s 1.36 yet, and gpu-operator is -deliberately pinned (its 26.3 bump needs a newer NVIDIA driver image + Ubuntu -release we're not ready for). The result, **every single night**: - -- a **Failed** preflight Job (`block()` exits 1), and -- `k8s_upgrade_blocked=1` → the **K8sUpgradeBlocked** alert. - -But this block is **not actionable** — there's nothing we can upgrade to clear -it; we can only wait for upstream (kyverno/ESO) and, separately, do the -gpu-operator/Ubuntu work. The gate is crying wolf: a "blocked, needs attention" -signal that's indistinguishable from a block we could actually fix. 
- -## Goal - -Make the gate **classify** each blocker and behave accordingly: - -| Class | Definition | Behaviour | -|-------|-----------|-----------| -| **actionable** | the compat matrix has a newer version of the addon whose `max_k8s >= target`, and the running version is older — upgrading it would clear the block | **alert** (`k8s_upgrade_blocked=1` → K8sUpgradeBlocked), with the specific "upgrade X → Y" remediation in the nightly report | -| **waiting-upstream** | **no** matrix version of the addon supports the target yet (kyverno/ESO for 1.36) | **quiet** (`k8s_upgrade_held=1`, no alert) — nightly report only | -| **pinned** | a supporting version exists but the addon carries `"pinned": true` in the matrix (gpu-operator) | **quiet** (held) | - -Removed-API and containerd blocks are always **actionable**. **Held wins:** if -*any* blocker is waiting-or-pinned, the whole target is **HELD** (quiet) — -acting on the actionable blockers wouldn't unblock it yet. The nightly report -still lists everything so the full eventual scope is visible. - -Also (scope decision: "tidy the block path"): deliberate gate decisions -(actionable-block **and** held) now make the preflight Job **Complete cleanly** -(exit 0) instead of Failing. Chain progression is gated on the verdict, not the -exit code. Real failures (unhealthy nodes, kubeadm errors, crashes) still exit -1 → `K8sUpgradeChainJobFailed`. - -## Design - -### `compat-gate.py` -- New exit codes: `0` safe · `2` actionable-block · `3` gate-error (fail-safe) · **`4` held**. -- Each stdout reason line is tagged `[ACTIONABLE]` / `[WAITING]` / `[PINNED]`. -- `check_addons`: when an addon blocks, decide its class: - - `pinned: true` in its matrix entry → `[PINNED]`. - - else a higher matrix version with `max_k8s >= target` exists → `[ACTIONABLE]` (`upgrade X to >= V`). - - else → `[WAITING]` (`no released X version supports k8s T yet`). - - unreadable image / below-matrix → `[ACTIONABLE]` (fail-safe — a human must look). 
-- `check_removed_apis`, `check_containerd`: tag `[ACTIONABLE]`. -- `exit_code(reasons)`: `0` if none; `4` if any `held_reason` (WAITING/PINNED); else `2`. - -### `upgrade-step.sh` -- New global `HALT_CHAIN=0`; `spawn_next()` returns early (no next Job) when set. -- Replace `block()` with `record_blocked()` / `record_held()` — push the gauge, - set `HALT_CHAIN=1`, **do not exit**. -- `phase_preflight` gate handling routes on the gate's exit code: - - `0` → push `blocked=0`+`held=0`, proceed. - - `2`/`3` → `record_blocked`, `return 0` (Job Completes, K8sUpgradeBlocked fires). - - `4` → `record_held`, `return 0` (Job Completes, **no alert**). -- Push the gauge **definitively once** per run (remove the pre-reset `blocked=0` - at gate start) so a standing block doesn't flap 1→0→1 and re-notify. -- postflight also clears `held=0` alongside the existing gauge resets. - -### detector (`main.tf`, the `k8s-version-check` CronJob) -- Consequence of the tidy change: refusals now **Complete** instead of Failing, - so the old "re-spawn only a *Failed* preflight" idempotency would skip a - refused-but-Complete preflight until its 7d TTL. Fix: re-spawn nightly when the - preflight is **Complete but no `k8s-upgrade-master-` Job exists** (the - gate refused — chain never advanced) — **silently** (no Slack), so a standing - hold re-evaluates each night without noise. -- The per-night `slack "K8s upgrade available…"` becomes an `echo`; the spawn - Slack fires only for a genuinely new spawn or a Failed-respawn (`ANNOUNCE` - flag), not for silent re-evaluations — killing the last nightly-noise source. - -### `addon-compat.json` -- Add `"pinned": true` + `"pin_reason"` to the gpu-operator entry (its - `26.3 → 1.36` row stays; `pinned` overrides classification to held). Document - the `pinned` flag in `_comment`. Unpinning later = delete two keys. 
- -### `stacks/monitoring` alert rules (`prometheus_chart_values.tpl`) -- `K8sUpgradeBlocked` (`k8s_upgrade_blocked == 1`): unchanged trigger, now - actionable-only; reword annotation (reasons are in the nightly report, not a - per-run chain Slack). -- `K8sUpgradeChainJobFailed`: **drop** the `unless on() (k8s_upgrade_blocked == 1)` - clause — deliberate blocks no longer create Failed Jobs, so the alert again - means a genuine wedge. -- **No alert** for `k8s_upgrade_held` (intentional — nothing to action; the - nightly report surfaces it). Add a comment recording this. - -### `nightly-report.py` -- Read `k8s_upgrade_held`. New `⏸️ HELD — not yet upgradable` headline. -- Group reasons by tag: *Action needed* / *Waiting on upstream* / *Pinned (held by us)* - (fallback bullets for untagged lines, so older reason strings still render). -- Fetch reasons when avail AND (blocked OR held). - -## Net effect on 1.36 today -**HELD, quiet** — waiting on kyverno + ESO (upstream) + gpu-operator (pinned); -Calico listed as the lone actionable piece. No nightly Failed Job, no alert — -just the nightly report's ⏸️ line. Flips to actionable (→ alert) only once -kyverno/ESO ship support **and** gpu-operator is unpinned. - -## Tests (TDD) -- `compat-gate`: waiting / actionable / pinned-is-held / mixed-held-wins, - removed-API & containerd are actionable, exit_code mapping, + existing - patch/safe cases stay green. -- `nightly-report`: held headline + grouped reasons; existing tests stay green. -- `upgrade-step.sh`: shellcheck; manual review of the HALT_CHAIN + gauge flow - (bash, not unit-tested). - -## Out of scope (separate follow-up) -Auto-refreshing the matrix when upstream ships 1.36 support (a periodic -addon-readiness probe). This change only *consumes* the matrix. 
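The classification rules in the plan's `compat-gate.py` section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script the plan describes: the matrix entry shape (a `versions` list of `version` / `max_k8s` rows), the helper names `vtuple` and `classify_addon`, and the wording of the reason strings are hypothetical. Only the `pinned` key, the three tags, `exit_code(reasons)`, and the exit codes 0 / 2 / 4 come from the plan text.

```python
"""Sketch of the actionable / waiting / pinned classification (assumed data shapes)."""


def vtuple(version):
    """Turn '1.36' or 'v26.3' into a tuple of ints so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def classify_addon(name, running, entry, target):
    """Return a tagged reason line if the addon blocks `target`, else None.

    `entry` is a hypothetical matrix entry:
    {"pinned": bool, "versions": [{"version": "1.0", "max_k8s": "1.35"}, ...]}
    """
    rows = entry["versions"]
    want = vtuple(target)
    max_by_version = {row["version"]: vtuple(row["max_k8s"]) for row in rows}

    running_max = max_by_version.get(running)
    if running_max is None:
        # Running version is not in the matrix: fail safe, a human must look.
        return f"[ACTIONABLE] {name} {running} is not in the compat matrix"
    if running_max >= want:
        return None  # the running version already supports the target

    if entry.get("pinned"):
        return f"[PINNED] {name} is pinned at {running}"

    # Newer matrix versions that support the target would clear the block.
    newer = [
        row["version"]
        for row in rows
        if vtuple(row["version"]) > vtuple(running) and vtuple(row["max_k8s"]) >= want
    ]
    if newer:
        return f"[ACTIONABLE] upgrade {name} to >= {min(newer, key=vtuple)}"
    return f"[WAITING] no released {name} version supports k8s {target} yet"


def exit_code(reasons):
    """0 = safe, 4 = held (any WAITING/PINNED reason wins), 2 = actionable block."""
    if not reasons:
        return 0
    if any(r.startswith(("[WAITING]", "[PINNED]")) for r in reasons):
        return 4
    return 2
```

Checking `pinned` before looking for a newer supporting version mirrors the order in the plan's `check_addons` bullet, and the "held wins" rule lives entirely in `exit_code`, so a mixed reason list still yields 4.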
diff --git a/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md b/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md
deleted file mode 100644
index daf5006a..00000000
--- a/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md
+++ /dev/null
@@ -1,128 +0,0 @@
-# Post-Mortem: MetalLB ServiceL2Status Stuck Immutable → PG LB VIP Flap → Woodpecker CI Tier 1 Applies Broken
-
-| Field | Value |
-|-------|-------|
-| **Date** | 2026-05-16 (mitigated) / 2026-05-26 (closed) |
-| **Duration** | ~25 days of degraded CI (2026-04-21 first observed → 2026-05-16 mitigated). Symptom-only; no human-visible service downtime. |
-| **Severity** | SEV3 — Woodpecker CI default.yml apply step failed on Tier 1 (PG-backend) stacks. Drift-detection ran silently broken. Manual `scripts/tg apply` continued to work. No data loss, no app downtime. |
-| **Affected Services** | Woodpecker CI pipelines applying any of the 28+ Tier 1 stacks (monitoring, crowdsec, authentik, headscale, etc.). PostgreSQL backend itself was healthy. |
-| **Issue** | Beads `code-aoxk` (closed 2026-05-26). |
-| **Status** | Closed |
-
-## Summary
-
-The failure surfaced in Woodpecker CI as `ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc` from `scripts/tg` whenever a pipeline tried to apply a Tier 1 stack. The error was misleading on two counts:
-
-1. **Vault was healthy.** A direct `vault read database/static-creds/pg-terraform-state` from inside a Woodpecker pipeline pod (using K8s SA JWT → `auth/kubernetes/login role=ci`) succeeded every time when run in isolation.
-2. **The "Cannot read PG credentials" message in `scripts/tg` was a catch-all** that fired for *any* Terraform/Terragrunt failure during PG state-lock acquire-release, including TCP RSTs against the PG LoadBalancer VIP.
- -Actual root cause: the MetalLB `ServiceL2Status` CR for the `postgresql-lb` service (`dbaas` namespace, VIP `10.0.20.200`) had a stuck `status.node` field that the controller treated as immutable. The L2 speaker kept failing to update it with `Invalid value: "k8s-nodeX": Value is immutable`, so the leader-elected announcer flapped between k8s-node3 and k8s-node4 every few seconds. Each flap dropped open TCP connections (RST). Terraform's state-lock acquire → operation → release sequence straddled flaps and failed mid-operation. `scripts/tg` surfaced this as the misleading "Cannot read PG credentials" message. - -Manual `scripts/tg apply` from the DevVM kept working because the developer's session happened to land on whichever node currently held the VIP and complete fast enough to not straddle a flap. CI pipelines, being slower (full stack walk), reliably straddled at least one flap. - -## Impact - -- **CI degradation**: Tier 1 stack changes pushed to master were NOT auto-applied. Required manual `scripts/tg apply` from DevVM after every push touching one of 28+ stacks. -- **Drift-detection broken**: The daily `drift-detection.yml` Woodpecker pipeline silently failed on every Tier 1 stack — meaning unannounced manual changes to those stacks could have persisted undetected for the duration. -- **No user-facing outage**: PG cluster itself, all apps that use PG, and all in-cluster traffic to `10.0.20.200` worked normally. Only the very specific `acquire-state-lock → run operation → release-state-lock` round-trip pattern from CI was unreliable. - -## Timeline (UTC) - -| Time | Event | -|------|-------| -| 2026-04-21 | First broken CI pipelines (#411, #412, #413). Drift-detection failures noticed. `code-aoxk` filed. Initial hypothesis: Vault auth/role mismatch. | -| 2026-04-22 — 2026-05-15 | Multiple investigation attempts. Verified Vault K8s `auth/kubernetes/role/ci` has correct policies (`terraform-state`, `ci`). 
Verified `database/static-creds/pg-terraform-state` exists, rotates on schedule, credentials valid. Could not reproduce the failure in isolated `vault read` from Woodpecker pods. | -| 2026-05-16 (~12:14 UTC) | `pg-cluster-3` came up (third CNPG replica); endpoint set churn likely triggered MetalLB L2 announcer to attempt to update the existing `ServiceL2Status` CR (was `l2-rgt9d`). Update was rejected as immutable. Speaker kept retrying. VIP flapped. | -| 2026-05-16 | RCA breakthrough: noticed `kubectl logs -n metallb-system -l component=speaker` was full of `Invalid value: "k8s-node…": Value is immutable` on the postgresql-lb ServiceL2Status. Correlated with `kubectl get servicel2status` returning multiple stale entries for the same service. | -| 2026-05-16 | **Mitigation**: `kubectl delete servicel2status.metallb.io l2-rgt9d -n metallb-system`. Speaker recreated the CR cleanly (became `l2-zj9ss`). Flap stopped. PG connections stable. Manual CI re-runs of `monitoring` stack apply succeeded immediately. | -| 2026-05-17 | Audit: acceptance criteria 1 + 2 met implicitly. #3 (post-mortem) remained pending. Beads task reverted from `in_progress` → `open`. | -| 2026-05-25 | Node2 SCSI LUN remap → encrypted PVC emergency_ro → containerd boltdb corruption outage. Unrelated, but pulled Woodpecker server off node2. Subsequent server pod restart on k8s-node4. | -| 2026-05-26 | Verification: from a live Woodpecker pipeline pod (`wp-01kshph6pa0w6ch0zf5x9bfqgr`), `vault write auth/kubernetes/login role=ci jwt=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)` succeeded. `vault read database/static-creds/pg-terraform-state` returned valid creds (`username=terraform_state`, last_vault_rotation 2026-05-21, TTL 58h). Live `default.yml` pipeline confirmed applying Tier 1 stacks: dbaas, authentik, crowdsec, monitoring, nvidia, cloudflared, kyverno, metallb — all `OK`. `postgresql-lb` ServiceL2Status currently single allocation (`l2-sv9vv` on k8s-node3, no flap). 
Beads task closed. | - -## Root Cause - -`metallb-speaker` reconciler in the deployed MetalLB version treats `ServiceL2Status.status.node` as immutable after first set. When the L2 announcer's leader-election picks a different node to announce a given VIP (which happens on speaker pod restart, node loss, endpoint set churn, or pod-anti-affinity reshuffles), the reconciler fails to patch the existing CR and gets stuck in a retry loop. Without manual deletion, the reconciler will not progress. - -Why it manifested as Vault credential errors: - -1. CI's `scripts/tg` pre-flight runs `vault read database/static-creds/pg-terraform-state` (line 83 in current code) to get PG credentials. That call succeeds. -2. CI then runs `terragrunt apply` against the Tier 1 stack. Terragrunt connects to `10.0.20.200:5432` for state-lock acquire (via `pg_advisory_lock`). The TCP connection lands on whichever node MetalLB last announced the VIP from. -3. Mid-operation, MetalLB tries to re-announce from a different node, sends gratuitous ARPs, and the upstream switch updates its MAC table. Open TCP sessions on the previous announcer's node are immediately RST. -4. Terragrunt's state-lock release (or any subsequent PG operation) fails with broken pipe / connection refused. -5. `scripts/tg` interpreted the wrapper-level failure as "PG creds bad" because that's the most common failure mode it handles. The actual error from terragrunt was buried in `2>/dev/null` suppression (since fixed — see Fix #1 below). - -## Detection - -We did not have any of: -- A direct alert for "MetalLB ServiceL2Status reconciler errors". -- An alert for "PG LB VIP node changed N times in M minutes". -- An end-to-end probe for the CI state-lock pattern (terragrunt against `10.0.20.200`). - -Detection mechanism was a human reading `kubectl logs -n metallb-system` for unrelated reasons. Took 25 days from first observed symptom to RCA. - -## Fixes & Mitigations - -### 1. 
Surface real error from `scripts/tg` (DONE) - -The original `scripts/tg` swallowed the real `vault read` / terragrunt error behind `2>/dev/null` and printed a static "Cannot read PG credentials from Vault" message. Fixed in the script: - -```sh -# scripts/tg lines 79-89 (current) -if ! command -v vault >/dev/null 2>&1; then - echo "ERROR: vault CLI not found on PATH. Install it or use an image that includes it (ci/Dockerfile)." >&2 - exit 1 -fi -VAULT_OUT=$(vault read -format=json database/static-creds/pg-terraform-state 2>&1) || { - echo "ERROR: Cannot read PG credentials from Vault. Vault output follows:" >&2 - echo "$VAULT_OUT" >&2 - echo "" >&2 - echo "Hint: humans run 'vault login -method=oidc'; CI auths via K8s SA (role=ci)." >&2 - exit 1 -} -``` - -Comment in the code explicitly references this incident. - -### 2. Stuck-CR cleanup procedure (DOCUMENTED) - -Reproduction check for future sessions (also in `code-aoxk` beads notes): - -```sh -kubectl logs -n metallb-system -l component=speaker --tail=200 | grep -iE 'Invalid value.*immutable' -# If matches found → same root cause. Delete the stuck CR: -kubectl get servicel2status -n metallb-system -kubectl delete servicel2status.metallb.io -n metallb-system -``` - -Speaker recreates the CR cleanly within seconds. - -### 3. Long-term MetalLB controller fix (DEFERRED) - -The underlying bug — speaker not recreating the CR when the immutable field needs to change — is upstream MetalLB behaviour. Two paths possible: - -- **Upgrade MetalLB** to a version where this is fixed (needs research — check changelogs). -- **File upstream issue / patch** with reproducer. - -Not done as part of this post-mortem; tracked separately. Risk acceptance: until then, the manual `delete servicel2status` workaround is the playbook, and is fast (<10s). - -### 4. Alerting (DEFERRED) - -Suggested but not implemented: -- Prometheus alert on `metallb_speaker_reconcile_errors_total{kind="ServiceL2Status"}` rate. 
-- Synthetic probe: a CronJob that does `pg_advisory_lock` + release against the PG VIP every 5min from the CI namespace, alerting if it ever fails.
-
-Tracked as future hardening (no beads task yet — only worth filing if recurrence happens).
-
-## Lessons
-
-1. **`2>/dev/null` is a time-bomb.** It hid the real error for weeks. Fix #1 already lands the principle; audit other places in `scripts/` for the same anti-pattern next time we touch them.
-2. **CRD `status.*` immutability is a non-obvious failure mode.** When debugging weird LB / VIP / endpoint behaviour, always grep speaker logs for `immutable`, `cannot update`, and reconciler errors. Add to cluster-health checks.
-3. **Misleading wrapper errors cost weeks.** `scripts/tg` claimed "Cannot read PG credentials" — that's what the operator believed. The actual `vault read` step worked. The real failure was three steps later in a completely different subsystem. When a wrapper script makes a definitive claim about which subsystem failed, distrust it; reproduce the subsystem in isolation before chasing the claim.
-4. **CNPG primary changes / endpoint churn can trigger L2 announcer flap.** The trigger (within the timeline) was likely the `pg-cluster-3` pod coming up. Worth flagging for any future CNPG topology changes.
-
-## References
-
-- Beads: `code-aoxk` — closed 2026-05-26.
-- `scripts/tg` lines 65-95 — current pre-flight with explicit error surfacing.
-- `kubectl get servicel2status -A` — current state, single allocation per service.
-- This file: `infra/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md`.
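The reproduction check in this post-mortem (grep the speaker logs for `Invalid value ... immutable`) could be turned into a small reusable filter, for example as a cluster-health check. The sketch below is a hedged illustration: the function names are hypothetical, and the regex assumes the error text has the shape quoted in the post-mortem (`Invalid value: "k8s-nodeX": Value is immutable`), optionally with backslash-escaped quotes as they would appear inside a JSON log field.

```python
import re
from collections import Counter

# Matches the rejection quoted in the post-mortem; \\? tolerates escaped quotes (\")
# in case the speaker emits the message inside a JSON string (an assumption).
IMMUTABLE_RE = re.compile(
    r'Invalid value: \\?"([^"\\]+)\\?": .*immutable', re.IGNORECASE
)


def immutable_rejections(log_lines):
    """Count 'Value is immutable' rejections per node name found in speaker log lines."""
    counts = Counter()
    for line in log_lines:
        match = IMMUTABLE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def looks_stuck(counts):
    """Per the post-mortem, any match at all points to the same stuck-CR root cause."""
    return bool(counts)
```

Fed with the output of the `kubectl logs -n metallb-system -l component=speaker --tail=200` command from the cleanup procedure, a non-empty result would be the cue to list and delete the stuck `ServiceL2Status` CR as documented there.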
diff --git a/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md deleted file mode 100644 index e6b11816..00000000 --- a/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md +++ /dev/null @@ -1,97 +0,0 @@ -# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24) - -> Filename kept for inbound links. The originally-suspected cause (kubeadm-config -> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC -> drift was a real *separate* latent bug fixed in the same change. - -**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached -the master control-plane phase for the first time — preflight passed, etcd -snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the -kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute -static-pod-hash window across all internal retries, then auto-rolled-back to -v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but -the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**. -No data loss; no user-facing outage (the master carries control-plane taints, so -no workloads were displaced). - -**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the -first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane -static pods, i.e. the first time the upgrade pushes real write-IO at etcd. - -## Root cause — etcd IO starvation on the shared HDD - -The new kube-apiserver could not establish/keep a working connection to etcd -during the upgrade because **etcd was IO-starved**. 
etcd's surviving container log -from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows: - -- **1,180** `apply request took too long` warnings in 16 minutes; -- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms), - clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying - to bring the new apiserver up. - -A reproduced 1.35.6 apiserver with no etcd dies with -`F instance.go:233 Error creating leases: error creating storage factory: context -deadline exceeded` — the same failure mode a multi-second etcd produces. etcd -lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on -shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto -that spindle: - -1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected); -2. kubeadm dumping a full **~400MB etcd DB backup** to - `/etc/kubernetes/tmp/kubeadm-backup-etcd-/` (on the same HDD) before the - etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never - cleans them up), pushing master root fs to **73%**, above the 70% kubelet - image-GC threshold, so image GC churned during the drain too; -3. master-drain pod evictions. - -### Correction — it was NOT the OIDC flag swap - -`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps -`--authentication-config` (structured multi-issuer OIDC) back to legacy -single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That -was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with -those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly -(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test -etcd. So the auth swap does **not** crash the apiserver; it was a red herring for -the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full -were also ruled out. 
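The slow-apply evidence cited under the root cause (a count of `apply request took too long` warnings plus the multi-second outliers) can be extracted from a saved etcd container log with a short script. This is a sketch under assumptions rather than the method used during the investigation: it assumes each line is a CRI-style prefix followed by etcd's JSON log record, that the record carries the duration in a `took` field as a Go-style string such as `4.3s`, and the function names are hypothetical.

```python
import json
import re

# Go-style duration units mapped to seconds ("ms" must be tried before "m").
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}
_PART = re.compile(r"(\d+(?:\.\d+)?)(h|ms|m|s|us|µs|ns)")

SLOW_APPLY_MSG = "apply request took too long"


def parse_go_duration(text):
    """Convert strings like '4.3s', '250ms' or '1m2.5s' to seconds."""
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in _PART.findall(text))


def slow_applies(log_lines, threshold_s=1.0):
    """Return (total slow-apply warnings, durations in seconds at or above threshold_s)."""
    count = 0
    slow = []
    for line in log_lines:
        start = line.find("{")  # skip the assumed CRI prefix before the JSON record
        if start < 0:
            continue
        try:
            record = json.loads(line[start:])
        except ValueError:
            continue  # not a JSON record; ignore
        if not isinstance(record, dict) or record.get("msg") != SLOW_APPLY_MSG:
            continue
        count += 1
        took = parse_go_duration(str(record.get("took", "")))
        if took >= threshold_s:
            slow.append(took)
    return count, slow
```

Run over the surviving `0.log`, the first value would correspond to the warning count and the second to the multi-second applies; comparing their timestamps against the kubeadm retry window is still a manual step.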
- -## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift - -apiserver auth is configured in three places that must agree: -(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes` -+ `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest -(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM — -which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates -the manifest from (3), so it would have reverted structured auth → **dashboard + -kubectl SSO break after a successful upgrade** (recoverable: the chain's -post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash. - -## Resolution - -1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%. -2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps. -3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run). - -## Prevention (landed in this change) - -| Gap | Fix | -|-----|-----| -| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. 
| -| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. | -| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. | - -## Lessons - -- **Capture the failing component's own logs before concluding.** The `kubeadm - upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second - applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is - "what config changes," not "why it crashed." -- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm - 2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB - backup copy + drain) onto that spindle. code-oflt is the real fix. -- **Tools that leave per-operation scratch must be reaped.** kubeadm's - `/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never - GC'd; 28GB had silently accumulated. -- **Out-of-band control-plane edits must be written back to kubeadm-config** — else - `kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags). 
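The preflight prune listed under Prevention can be sketched as follows. This is not the code in `upgrade-step.sh`; it is a Python illustration that assumes "older than 3 days" is judged by modification time, and `prune_kubeadm_scratch` is a hypothetical name.

```python
import shutil
import time
from pathlib import Path

PATTERNS = ("kubeadm-backup-*", "kubeadm-upgraded-manifests*")


def prune_kubeadm_scratch(tmp_dir, max_age_days=3, now=None):
    """Delete matching scratch entries older than max_age_days; return removed names.

    Best-effort: filesystem errors are swallowed so a failed prune never
    aborts the caller.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for pattern in PATTERNS:
        for entry in Path(tmp_dir).glob(pattern):
            try:
                if entry.stat().st_mtime >= cutoff:
                    continue  # recent enough to keep
                if entry.is_dir():
                    shutil.rmtree(entry)
                else:
                    entry.unlink()
                removed.append(entry.name)
            except OSError:
                continue  # best-effort: skip this entry and carry on
    return sorted(removed)
```

Swallowing `OSError` per entry is what keeps the step "best-effort, never aborts": one unreadable directory does not stop the remaining entries from being reclaimed.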
diff --git a/docs/runbooks/claude-auth-renew-workstation.md b/docs/runbooks/claude-auth-renew-workstation.md index 8156530e..f5ce6625 100644 --- a/docs/runbooks/claude-auth-renew-workstation.md +++ b/docs/runbooks/claude-auth-renew-workstation.md @@ -11,11 +11,6 @@ inference every six hours and backs up only the `claudeAiOauth` object to: secret/workstation/claude-users/ ``` -The backup **merges** into that path (`vault kv patch -method=rw`, falling back to -`kv put` only when the path does not exist yet), so keys that other tools -co-locate there — notably `homelab vault`'s `vaultwarden_*` credentials — survive. -A blind `kv put` here silently wiped them on every six-hourly run (fixed 2026-06-26). - The user's unrelated `mcpOAuth` credentials never leave their home directory. Each renewal service has a distinct 32-day periodic Vault token, mode `0600`, at `~/.config/claude-auth-sync/vault-token`. Its policy can access only that user's @@ -80,64 +75,8 @@ sudo --preserve-env=VAULT_ADDR,VAULT_TOKEN /usr/local/bin/t3-provision-users ``` Never copy another user's `.credentials.json` or scoped Vault token. Never restore -a **shared** `CLAUDE_CODE_OAUTH_TOKEN` across users; environment credentials -outrank per-user login and would silently collapse all users onto one identity. -(A **per-user**, non-rotating setup-token tied to the user's OWN Enterprise -identity is a different, sanctioned thing — see "Long-lived per-user token" below.) - -## Long-lived per-user token (heavy concurrent-agent users) - -The six-hourly renewal above assumes Claude owns refresh-token rotation in a -single `~/.claude/.credentials.json`. 
A user who runs **many concurrent Claude -sessions** (interactive tmux panes + their `t3-serve` instance + always-on -`start-claude.sh` agents) breaks that assumption: when the shared access token -expires, the processes refresh **simultaneously**, the OAuth server rotates the -refresh token, and the losing writer persists an **empty** refresh token — -logging the user out roughly every access-token lifetime (~8h). Re-issuing the -credential does not help; the race recurs. - -The fix is a **per-user, long-lived setup-token** (`sk-ant-oat01-…`, ~1y, -**non-rotating**). With `CLAUDE_CODE_OAUTH_TOKEN` set, Claude uses it directly and -never touches `.credentials.json` — so there is nothing to race on. This is the -user's OWN Enterprise identity (scope `user:inference`; local MCP servers are -client-side and unaffected), stored only in their OWN Vault path — **NOT** the -forbidden shared token, and it never crosses OS users. - -**Enable it (one-time, per user):** - -1. The user mints their own token (interactive Enterprise SSO): - - ```bash - claude setup-token # opens an SSO URL; paste the code back -> prints sk-ant-oat01-… - ``` - -2. An admin stores it in that user's Vault path (MERGE, never `kv put` — siblings - like `claude_ai_oauth_json` / `vaultwarden_*` must survive): - - ```bash - vault kv patch -method=rw secret/workstation/claude-users/ \ - setup_token=sk-ant-oat01-… - ``` - -3. Materialize + activate (or just wait ≤6h for the timer): - - ```bash - systemctl start claude-auth-sync@.service - ``` - - `claude-auth-sync` writes `~/.config/claude-auth-sync/claude-oauth.env` - (`CLAUDE_CODE_OAUTH_TOKEN=…`, mode 0600) and, while a token is present, **skips** - the rotating-credential validate/backup/restore (so no false - `WorkstationClaudeAuthInvalid`). `start-claude.sh` and `t3-serve@.service` load - that env file. **Sessions started before activation keep the old credential - until relaunched** — the user must restart their agents / `t3-serve` to cut over. 
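Step 2's "MERGE, never `kv put`" requirement comes down to read-modify-write versus blind overwrite. The sketch below models a KV path as a plain dictionary to show the difference; it does not call Vault, and `kv_put` / `kv_merge` are illustrative names rather than Vault API functions.

```python
def kv_put(store, path, new_fields):
    """Blind write: the path's contents become exactly new_fields."""
    store[path] = dict(new_fields)


def kv_merge(store, path, new_fields):
    """Read-modify-write: keep existing keys, overwrite only those supplied."""
    current = store.get(path)
    if current is None:
        kv_put(store, path, new_fields)  # path absent: nothing to preserve
        return
    merged = dict(current)
    merged.update(new_fields)
    store[path] = merged
```

With `kv_merge`, storing `setup_token` leaves sibling keys such as `claude_ai_oauth_json` in place; with `kv_put` they would be gone after the write.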
- -**Disable it:** clear the field (`vault kv patch -method=rw -secret/workstation/claude-users/ setup_token=""`) — the next sync removes -the env file and the user reverts to the per-user SSO credential flow. - -**Rotate before expiry:** setup-tokens expire 1y after mint. Re-mint (step 1) and -re-store (step 2); the env file refreshes on the next sync. +the old shared `CLAUDE_CODE_OAUTH_TOKEN`; environment credentials outrank per-user +login and would silently collapse all users onto one identity. ## Verification diff --git a/docs/runbooks/goldmane-flow-trail.md b/docs/runbooks/goldmane-flow-trail.md deleted file mode 100644 index dbf6f6d4..00000000 --- a/docs/runbooks/goldmane-flow-trail.md +++ /dev/null @@ -1,346 +0,0 @@ -# Goldmane Flow Trail — east-west "who-talks-to-whom" observability - -> As-built runbook for the Calico Goldmane + Whisker flow plane and the -> `goldmane-edge-aggregator` durable audit trail. Design + rationale: -> [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md). -> Glossary: `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**. -> Implements infra issues #57 (Whisker ingress), #58 (aggregator), #61 -> (monitoring), #62 (egress allowlist queries), #63 (these docs). - -## What the trail is - -Three layers turn raw east-west traffic into a queryable, durable record of -which Service talks to which. **Service identity = the workload's namespace** -(primary), refined by a `service-identity` label in the few multi-Service -namespaces (`monitoring`, `kube-system`, `dbaas`) — see ADR-0014. 
- -| Layer | Component | Lifetime | Where it lives | -|---|---|---|---| -| **Live map** | Calico **Goldmane** + **Whisker** | ~60-min in-memory ring buffer (lost on Goldmane restart) | `calico-system`; Whisker UI at `whisker.viktorbarzin.me` | -| **Durable trail** | `goldmane-edge-aggregator` (`aggregate` mode) | persistent | CNPG Postgres DB `goldmane_edges`, table `edge` | -| **Notification** | `goldmane-edges-digest` CronJob (`digest` mode) | daily | Slack `#alerts` | - -**Goldmane** aggregates identity-stamped flows (namespace / pod / workload / -labels + allow-deny + policy-trace) streamed from Felix (the existing -`calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer — -**nothing is written to etcd or the K8s API** (the etcd-cost constraint that -drove the whole design). **Whisker** is its live web UI. Because the ring -buffer is *not* a trail (a Goldmane restart loses the window), the -`goldmane-edge-aggregator` consumes Goldmane's gRPC `Flows.Stream` API over -mTLS and upserts the unique **namespace-pair edge set** into Postgres; a daily -CronJob posts first-seen edges to Slack. - -The edge set is deliberately **low-cardinality** — one row per -`(src_ns, dst_ns, action)`, *not* per-pod or per-port — so the table stays -small no matter how much traffic flows. - -## Where the data lives - -### Whisker UI — live, ~60 min -- `https://whisker.viktorbarzin.me` (Authentik-gated — Whisker ships no own - login; `auth = "required"`). Shows the live flow stream + a service graph for - roughly the last hour. Use it for "what is talking right now"; it is **not** - history. -- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081` - (HTTP), both in `calico-system`. -- **DNS fix + self-heal:** whisker's egress to the kube-dns ClusterIP is allowed - by `whisker-allow-dns-clusterip` (`stacks/calico`) — without it the UI goes - empty after any gRPC-stream break (see Troubleshooting → "Whisker UI empty"). 
- The `whisker-watchdog` CronJob (every 10 min) is a backstop that restarts - whisker if its backend ever wedges for another reason. - -### CNPG `goldmane_edges` — durable -- Postgres DB `goldmane_edges` on the CNPG cluster - (`pg-cluster-rw.dbaas.svc.cluster.local:5432`). One table: - - ``` - edge(src_ns text, dst_ns text, action text, - first_seen timestamptz, last_seen timestamptz, flow_count bigint, - PRIMARY KEY (src_ns, dst_ns, action)) - ``` - - - `action` ∈ `allow` / `deny` / `pass` / `unspecified` (normalised Goldmane - action). - - **Self-edges (`src_ns == dst_ns`) and empty-namespace flows** (host-endpoint - / public-internet) are **dropped** — the trail is about in-cluster service - relationships only. (Egress to the public internet is therefore NOT in this - table; it lives in the Wave-1 Calico flow-log path — see security.md.) - - A **"new edge"** = a row whose `first_seen` falls inside the digest window. - - Role `goldmane_edges` (Vault-rotated, 7-day) owns the DB. The `edge` table - is created idempotently by the aggregator at startup (canonical DDL also in - the repo at `migrations/0001_edge.sql`). - -### Slack `#alerts` — daily digest - -> **Channel note (2026-06-25):** posts to **`#alerts`**. The dedicated `#security` channel was abandoned — the shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of it, so a channel override there returns HTTP `404 channel_not_found`. Everything now posts to `#alerts` (this digest plus alertmanager's `slack-security` receiver, which keeps its `[SECURITY]` styling so security-lane alerts still stand out there). - -- CronJob `goldmane-edges-digest` (08:00 Europe/London) posts edges first seen - in the last 24h. Quiet when there are none. Reuses the existing alert-digest - Slack incoming webhook (Vault `secret/viktor` → `alertmanager_slack_api_url`) - — no new webhook was created. 
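The `edge` table semantics described above (one row per `(src_ns, dst_ns, action)`, self-edges and empty-namespace flows dropped) can be sketched with an upsert. This is an illustration only: it uses Python's bundled SQLite in place of the CNPG Postgres database, stores timestamps as text, and assumes `flow_count` grows by one per observed flow, which the runbook does not state. `record_flow` is a hypothetical helper, not the aggregator's code.

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS edge (
    src_ns TEXT, dst_ns TEXT, action TEXT,
    first_seen TEXT, last_seen TEXT, flow_count INTEGER,
    PRIMARY KEY (src_ns, dst_ns, action)
)"""

UPSERT = """
INSERT INTO edge (src_ns, dst_ns, action, first_seen, last_seen, flow_count)
VALUES (?, ?, ?, ?, ?, 1)
ON CONFLICT (src_ns, dst_ns, action) DO UPDATE SET
    last_seen = excluded.last_seen,
    flow_count = flow_count + 1
"""


def record_flow(conn, src_ns, dst_ns, action, seen_at):
    """Upsert one observed flow; return False if the flow is dropped."""
    if not src_ns or not dst_ns or src_ns == dst_ns:
        return False  # empty-namespace flows and self-edges are not recorded
    conn.execute(UPSERT, (src_ns, dst_ns, action, seen_at, seen_at))
    return True
```

Because the conflict branch never touches `first_seen`, that column keeps the moment the pair was first observed, which is what makes the "new edge" definition workable.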
- -## How to enable / disable - -### Goldmane + Whisker (the flow plane) -Operator CRs in **`stacks/calico/main.tf`** — NOT the Helm `goldmane`/`whisker` -flags (those stay `false`; the operator's own `installation`/`apiServer` are -operator-managed via the `goldmanes`/`whiskers.operator.tigera.io` CRDs): - -- `kubectl_manifest.goldmane` (kind `Goldmane`) — creating it makes the operator - re-render `calico-node` with the `FELIX_FLOWLOGSGOLDMANESERVER` env (the - operator auto-wires Felix — **do NOT patch FelixConfiguration**), triggering a - supervised `calico-node` DaemonSet roll. Yields `Deployment` + `Service - goldmane:7443`. -- `kubectl_manifest.whisker` (kind `Whisker`, `depends_on` goldmane; - `notifications = Disabled`). Yields `Deployment` + `Service whisker:8081`. - -**To disable:** delete those two CRs and re-apply `stacks/calico`. Reversible -toggle (Goldmane is tech-preview in OSS Calico 3.30 — the main standing risk per -ADR-0014). - -### Whisker public ingress (infra #57) -Also in `stacks/calico/main.tf`: -- `module "ingress_whisker"` (`ingress_factory`, `auth = "required"`, - `dns_type = "proxied"`) → `whisker.viktorbarzin.me`. -- `kubernetes_network_policy_v1.whisker_allow_traefik` — **required alongside the - ingress**: the operator's own `whisker` NetworkPolicy (owned by the Whisker CR) - is `policyTypes: [Ingress]` with no rules = default-deny ingress to the pod. - This additive NP ORs in an allow for `namespaceSelector - kubernetes.io/metadata.name=traefik` on TCP 8081. Without it Traefik 502s. - -### The aggregator + digest (the durable trail) — `stacks/goldmane-edge-aggregator` -A Tier-1 stack (PG state) mirroring the claude-memory pattern. `scripts/tg -apply` from `stacks/goldmane-edge-aggregator/`. It provisions: the namespace, -the mTLS client material, the Postgres DB-init Job, the `DATABASE_URL` -ExternalSecret (Vault static role `pg-goldmane-edges`), the Slack ExternalSecret, -the `aggregate` Deployment, and the `digest` CronJob. 
**To disable the trail -without touching the flow plane:** scale `deployment/goldmane-edge-aggregator` to -0 (transient) or remove the stack (permanent) — Goldmane/Whisker keep running. - -Image: `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (PRIVATE) — the -`goldmane-edge-aggregator` namespace must be in the `ghcr-credentials` Kyverno -allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`, -`local.ghcr_private_namespaces`) or pulls 401. Code repo: -`~/code/goldmane-edge-aggregator` (see its `README.md` + `DEPLOY.md`). - -## mTLS cert — the REUSE decision (cert-reuse gotcha) - -The aggregator dials `goldmane:7443` over **mutual TLS**. Goldmane requires the -client cert to chain to the **Tigera CA**, but it does **NOT authorize by client -identity** — any Tigera-CA-signed cert is accepted. - -Rather than copy the Tigera CA **private key** into Terraform state to mint our -own cert (a needless CA-key exposure; the `hashicorp/tls` provider also clashes -with this repo's global generate-providers/lockfile pattern), the stack -**REUSES the operator-minted, Tigera-CA-signed `whisker-backend-key-pair` -Secret** (`calico-system`), copying its `tls.crt`/`tls.key` into the -`goldmane-client-tls` Secret in the aggregator namespace. The CA *bundle* that -verifies Goldmane's serving cert (`tigera-ca-bundle` ConfigMap, key -`tigera-ca-bundle.crt`) is likewise copied verbatim (a ConfigMap can't be -cross-namespace-mounted). - -> **GOTCHA — if the operator rotates `whisker-backend-key-pair`, re-apply -> `stacks/goldmane-edge-aggregator`** to re-sync the copied cert. Symptom of a -> stale copy: the `aggregate` pod logs TLS handshake / `Flows.Stream` failures -> and no `last_seen` updates land in the `edge` table. Hardening follow-up -> (noted in the stack): mint an own-identity cert in-namespace if Whisker is ever -> removed (which would delete the reused source Secret). 
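One way to notice the stale-copy condition before the aggregator starts failing handshakes is to compare the source and copied certificates. The sketch below assumes the two inputs are the base64 `tls.crt` values as the Kubernetes API returns them for `whisker-backend-key-pair` and `goldmane-client-tls`; how they are fetched is left out, and both function names are hypothetical.

```python
import base64
import hashlib


def cert_fingerprint(b64_pem: str) -> str:
    """SHA-256 hex digest of a base64-encoded Secret value (e.g. its tls.crt)."""
    return hashlib.sha256(base64.b64decode(b64_pem)).hexdigest()


def copy_is_stale(source_b64: str, copy_b64: str) -> bool:
    """True when the copied cert no longer matches the operator-minted source."""
    return cert_fingerprint(source_b64) != cert_fingerprint(copy_b64)
```

The digest is taken over the PEM bytes rather than the DER certificate, which is sufficient for an equality check on a verbatim copy but is not a standard certificate fingerprint.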
- -The Deployment leaves `GOLDMANE_HOST=goldmane.calico-system.svc.cluster.local:7443` -and the default cert/CA paths; the default ServerName (host sans port) is a SAN -on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` / -`GOLDMANE_TLS_INSECURE` override is needed. - -## How to query who-talks-to-whom - -**Quickest — the `homelab edges` CLI** (the investigation helper; read-only -SELECT against the DB via the dbaas primary pod, no creds/SQL to remember): - -``` -homelab edges --ns # edges touching (either direction) -homelab edges --peers-of # 's distinct peer namespaces -homelab edges --src # 's egress peers (--dst for ingress) -homelab edges --new-since 24h # edges first seen in the last day (or a date) -homelab edges --denied # blocked / lateral-movement attempts -homelab edges --json [...] # machine-readable, for agents/pipelines -homelab edges --help # full flag list -``` - -For ad-hoc SQL, `psql` into the DB (creds: Vault static role -`static-creds/pg-goldmane-edges`, or exec a CNPG pod). All queries are against -the single `edge` table. 
- -```sql --- Everything talking to a namespace (inbound), most-active first -SELECT src_ns, action, flow_count, first_seen, last_seen -FROM edge WHERE dst_ns = '' ORDER BY flow_count DESC; - --- Everything a namespace talks TO (outbound) -SELECT dst_ns, action, flow_count, first_seen, last_seen -FROM edge WHERE src_ns = '' ORDER BY last_seen DESC; - --- New edges in the last 24h (what the digest reports) -SELECT src_ns, dst_ns, action, flow_count, first_seen -FROM edge WHERE first_seen > now() - interval '24 hours' -ORDER BY first_seen DESC; - --- Any DENIED edges (policy is dropping this pair) -SELECT src_ns, dst_ns, flow_count, last_seen -FROM edge WHERE action = 'deny' ORDER BY last_seen DESC; - --- Full edge set as a graph adjacency list -SELECT src_ns, dst_ns, action, flow_count FROM edge ORDER BY src_ns, dst_ns; -``` - -For the **live** (sub-hour) view including pod/port detail, use the Whisker UI — -the `edge` table intentionally aggregates that away. - -## Deriving the Wave-1 egress allowlist from the edge table (infra #62) - -The durable edge set is a faster, identity-stamped data source for the existing -**observe-then-enforce** egress effort (beads `code-8ywc`; snapshot -`docs/architecture/wave1-egress-observation-2026-05-22.md`) than the original -iptables-`LOG` → journald → Loki path (ADR-0014 consequence: "Enforcement gains -a better data source"). It replaces the *internal* (namespace-to-namespace) leg -of the allowlist; **external/public-internet egress is NOT in this table** (empty -dst namespace, dropped) — for those destinations keep using the Calico flow-log -path described in security.md. 
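As a sketch of what deriving the internal leg means once the rows are in hand, the function below groups `allow` edges by source namespace. It assumes rows arrive as `(src_ns, dst_ns, action)` tuples read from the `edge` table; `internal_egress_allowlist` is a hypothetical name and the namespaces in any example are placeholders.

```python
from collections import defaultdict


def internal_egress_allowlist(edges):
    """Map each source namespace to its sorted, de-duplicated allowed peers.

    edges is an iterable of (src_ns, dst_ns, action) tuples.
    """
    matrix = defaultdict(set)
    for src_ns, dst_ns, action in edges:
        if action == "allow":
            matrix[src_ns].add(dst_ns)
    return {src: sorted(dsts) for src, dsts in sorted(matrix.items())}
```

External destinations never appear in the result, since flows with an empty destination namespace are dropped before they reach the table.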
- -**Per-namespace internal egress allowlist** — the set of in-cluster namespaces a -given source is *observed* talking to with `action='allow'`: - -```sql --- Internal egress allowlist for one namespace (feeds its NetworkPolicy) -SELECT DISTINCT dst_ns -FROM edge -WHERE src_ns = '' AND action = 'allow' -ORDER BY dst_ns; -``` - -```sql --- Full internal egress matrix for all namespaces at once -SELECT src_ns, array_agg(DISTINCT dst_ns ORDER BY dst_ns) AS allowed_dst_ns -FROM edge -WHERE action = 'allow' -GROUP BY src_ns -ORDER BY src_ns; -``` - -```sql --- Sanity: namespaces with a DENY edge already (policy is biting; investigate --- before tightening further) -SELECT DISTINCT src_ns, dst_ns FROM edge WHERE action = 'deny'; -``` - -**How this feeds enforcement (scope):** the derived `dst_ns` set is the -*internal* half of a namespace's egress allowlist — it tells you which -in-cluster namespaces to permit before flipping that namespace to default-deny. -The universal baseline (kube-dns :53, often dbaas :3306/:5432, redis :6379) and -the external destinations still come from the Wave-1 observation snapshot. -**Enforce-flips remain OUT OF SCOPE** here — this is observe-and-derive only; -the phased per-namespace default-deny rollout (starting `recruiter-responder`) -is tracked under `code-8ywc`. Cross-links: -[security.md → NetworkPolicy Default-Deny Egress](../architecture/security.md#networkpolicy-default-deny-egress-wave-1--observe-then-enforce-tier-34), -[wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md), -[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md). - -> **Caveat (same as the Wave-1 snapshot):** an edge only exists if it was -> *observed*. A weekly CronJob or a 7-day Vault rotation may not have fired yet — -> collect ≥7 days of edges before treating a namespace's `allow` set as -> complete. 
The `first_seen` column tells you how long an edge has been known; -> the digest surfaces brand-new ones daily. - -## Monitoring & health (infra #61) - -The aggregator pod has **no `/metrics` endpoint** — health is inferred from -kube-state-metrics. Three complementary signals (memory ids 6598, 6599; -see also [monitoring.md → Security Alerts](../architecture/monitoring.md#security-alerts-wave-1--planned-beads-code-8ywc)): - -| Signal | What | Where | -|---|---|---| -| **`AggregatorDown`** | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` for 15m → warning | Prometheus alert group `Network Observability (Goldmane)` in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`; routes `slack-warning` → `#alerts` | -| **`DigestFailing`** | `kube_job_status_failed{...job_name=~"goldmane-edges-digest.*"} > 0` within 24h, for 30m → warning | same alert group → `#alerts` | -| **cluster-health #48** | `check_goldmane_aggregator` reads the Deployment's `Available` condition (missing or not-Available → FAIL) | `scripts/cluster_healthcheck.sh` (human / `--quiet` / `--json` modes; emits `goldmane_aggregator`) | - -The two alert layers are deliberately complementary: `AggregatorDown` → -**no new edges land** in the DB; `DigestFailing` → **edges still land but nobody -is told**. A freshness probe (#61b) was intentionally skipped — `AggregatorDown` -is the agreed floor. - -## Troubleshooting - -**Whisker UI 502 / unreachable.** The additive -`kubernetes_network_policy_v1.whisker_allow_traefik` is missing or the -operator's default-deny `whisker` NP regenerated — re-apply `stacks/calico`. A -brand-new ingress host is also invisible to LAN split-horizon until the hourly -`technitium-ingress-dns-sync` runs (memory #5349); test meanwhile with -`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me` -(expect a 302 to Authentik — the gate working). 
- -**Whisker UI empty (but reachable — 302s to Authentik fine).** ROOT CAUSE (the -2026-06-28 incident): the operator's own `whisker` NetworkPolicy is -policyTypes:[Ingress,**Egress**], and its egress allows DNS only to the kube-dns -*pods* (podSelector `k8s-app=kube-dns`). But whisker-backend resolves -`goldmane.calico-system.svc` via the kube-dns **ClusterIP** (10.96.0.10), and -**Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule**. -Verified: from the whisker pod's netns, ClusterIP DNS = 100% timeout while direct -kube-dns *pod-IP* DNS = OK, and a pod with no egress policy resolves fine. -whisker-backend resolves goldmane ONCE in the brief startup window before the -policy programs, holds its long-lived gRPC stream, and only re-resolves when that -stream breaks (e.g. a node-reboot blip) — at which point the blocked ClusterIP -DNS wedges its Go resolver (`failed to stream flows` / `code = Unavailable: dns -... i/o timeout` forever) and the UI goes blank. The durable **aggregator is a -SEPARATE pod in its own (unrestricted) namespace** and is unaffected. - -FIX (applied 2026-06-28): `kubernetes_network_policy_v1.whisker_allow_dns_clusterip` -(`stacks/calico`) — an additive egress NP allowing whisker → the kube-dns -ClusterIP (`10.96.0.10/32`) on 53/UDP+TCP; k8s egress policies are additive so -the operator NP is untouched. Backstop: the `whisker-watchdog` CronJob restarts -the pod if it ever wedges for another reason. Immediate manual heal: -`kubectl -n calico-system delete pod -l k8s-app=whisker`. Diagnose by comparing, -from the whisker pod's netns, `nslookup goldmane.calico-system.svc.cluster.local -10.96.0.10` (the ClusterIP — times out if the NP fix is missing) against the same -query aimed at a kube-dns *pod IP* (always works). - -**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate` -pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`). -Common causes, in order: -1. 
**Stale mTLS cert** — the operator rotated `whisker-backend-key-pair`; re-apply - `stacks/goldmane-edge-aggregator` (see cert-reuse gotcha above). Symptom: TLS - handshake / `Flows.Stream` errors. -2. **Stale DB password** — the 7-day Vault rotation bounced the credential but - the pod kept the old one. The Deployment carries - `secret.reloader.stakater.com/reload: goldmane-edges-db-creds`; if it's not - restarting on rotation, verify the Reloader annotation and the ExternalSecret. -3. **Goldmane restarted** — the in-memory window was lost (expected); the stream - reconnects automatically and resumes upserting. No data loss in the DB - (only the sub-hour live window in Whisker is gone). - -**Digest never posts / `DigestFailing` firing.** Inspect the most recent -`goldmane-edges-digest-*` Job (`kubectl get jobs -n goldmane-edge-aggregator`; -`kubectl logs job/`). The CronJob's `ttl_seconds_after_finished=86400` GCs -pods after a day, so check soon after a failed run. With `SLACK_WEBHOOK_URL` -empty the binary forces a dry-run (no post) — verify the `goldmane-edges-slack` -ExternalSecret resolved. A dry run / smoke test: run the image with `args: -["digest"]` + `DRY_RUN=1` to print the message instead of POSTing. -> Resolved (2026-06-28): the digest posts cleanly to `#alerts` -> (`lastSuccessfulTime` current, `DigestFailing` clear; e.g. the 2026-06-28 08:00 -> London run reported "8 new edges in last 24h"). The 2026-06-25 failures were -> the `#security` channel override returning HTTP 404 — the shared -> `alertmanager_slack_api_url` webhook's Slack app isn't a member of `#security`; -> consolidating all Slack output to `#alerts` fixed it. - -**No edges at all in the table.** Confirm Goldmane is enabled -(`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the -`FELIX_FLOWLOGSGOLDMANESERVER` env; confirm the `goldmane-edges-db-init` Job -completed; confirm the aggregator pod is `Running` and not `ImagePullBackOff` -(ghcr allowlist). 
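When debugging the digest, it can help to keep its decision flow in mind. The sketch below is an assumption-laden model, not the binary's code: it takes "new edge" as `first_seen` inside the last 24h, stays quiet when there are none, and treats either `DRY_RUN=1` or an empty `SLACK_WEBHOOK_URL` as a dry run. The ordering of those checks and the `post` callback are illustrative; the message wording follows the example quoted above.

```python
from datetime import datetime, timedelta


def new_edges(edges, now, window=timedelta(hours=24)):
    """Edges whose first_seen falls inside the digest window ending at `now`."""
    cutoff = now - window
    return [e for e in edges if cutoff < e["first_seen"] <= now]


def is_dry_run(env):
    """Dry-run when DRY_RUN=1 is set or no webhook URL is configured."""
    return env.get("DRY_RUN") == "1" or not env.get("SLACK_WEBHOOK_URL")


def run_digest(edges, env, now, post):
    """Return the action taken: 'quiet', 'dry-run' or 'posted'."""
    fresh = new_edges(edges, now)
    if not fresh:
        return "quiet"  # nothing new: stay silent
    text = f"{len(fresh)} new edges in last 24h"
    if is_dry_run(env):
        print(text)  # print instead of POSTing
        return "dry-run"
    post(env["SLACK_WEBHOOK_URL"], text)
    return "posted"
```

A run that returns `dry-run` in production points at the `goldmane-edges-slack` ExternalSecret rather than at Slack itself.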
- -## Related -- [ADR-0014 — Service identity & east-west observability](../adr/0014-service-identity-and-east-west-observability.md) -- [security.md — NetworkPolicy Default-Deny Egress + east-west flow observability](../architecture/security.md) -- [monitoring.md — east-west flow observability + alerts](../architecture/monitoring.md) -- [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md) -- `CONTEXT.md` glossary — **Service identity**, **Goldmane / Whisker** -- Code: `~/code/goldmane-edge-aggregator` (`README.md`, `DEPLOY.md`); stacks - `stacks/goldmane-edge-aggregator`, `stacks/calico` diff --git a/docs/runbooks/homelab-vault-onboarding.md b/docs/runbooks/homelab-vault-onboarding.md deleted file mode 100644 index b4bacced..00000000 --- a/docs/runbooks/homelab-vault-onboarding.md +++ /dev/null @@ -1,164 +0,0 @@ -# `homelab vault` onboarding (Vaultwarden access + `vault kv` infra secrets) - -## Scope - -`homelab vault` fronts **two unrelated secret stores** — the name collides, so -the command keeps them clearly separated: - -- **Vaultwarden** — your personal *password manager* (logins/passwords/TOTP). - The verbs below give each devvm roster user no-HITL access to **their own** - Vaultwarden vault (and any Organization Collection shared with their account). - It shells out to the official `bw` CLI; the user's Vaultwarden credentials live - only in their isolated Vault path `secret/workstation/claude-users/` - and are decrypted as that OS user — the admin never sees them. -- **HashiCorp Vault / OpenBao** — the homelab *infra* secrets store (the - `secret/…` KV mount at `vault.viktorbarzin.me`), under `homelab vault kv`. - These use the caller's **own** Vault token (`vault login -method=oidc` → - `~/.vault-token`), **not** the scoped Vaultwarden token (which only reads the - `claude-users/` path); access is whatever your Vault policy grants. 
- -```text -# Vaultwarden (password manager) -homelab vault setup one-time: store VW email + master password + API key -homelab vault status configured / unlocked / reachable (no secrets) -homelab vault list [--search Q] item names (no secrets) -homelab vault get [--field password|username|uri|notes|totp] [--json] -homelab vault get --all all fields (incl. custom) as JSON; pipe it (| jq) -homelab vault code current TOTP code -homelab vault lock lock / log out the local bw session - -# HashiCorp Vault / OpenBao (infra secrets; uses your own OIDC token) -homelab vault kv get [--field K] read an infra KV secret -homelab vault kv list list sub-paths -homelab vault kv put write one key (value via stdin; merges) -``` - -## How auth works (why a non-admin can use it) - -`homelab vault` runs `vault` as the calling user. It resolves a Vault token in -this order (`ensureVaultToken`, `cli/cmd_vault.go`): - -1. an explicit `$VAULT_TOKEN` (a deliberate override), then -2. the per-user **scoped token** that `claude-auth-sync` maintains at - `~/.config/claude-auth-sync/vault-token` (policy `workstation-claude-`), then -3. a native `~/.vault-token` (admins who carry one; non-admins usually don't). - -**The scoped token deliberately beats `~/.vault-token`.** This tool only touches -your own `secret/workstation/claude-users/` path, and a power-user who ran -`vault login -method=oidc` carries a read-only `~/.vault-token` (capability -`deny` on that path); letting it win would shadow the scoped token and fail every -op with `403 permission denied` (this is exactly what bit emo, 2026-06-28). The -CLI also **self-defaults `VAULT_ADDR`** to `https://vault.viktorbarzin.me` when -unset, so it works from non-login shells (tmux panes, AFK agent subprocesses) -that never sourced `/etc/environment` — otherwise every `vault` child hits the -`127.0.0.1:8200` default and fails `connection refused` (exit 2). 
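The token precedence and the `VAULT_ADDR` default described above can be sketched as below. This is a Python rendering of the documented order, not the Go code in `cli/cmd_vault.go`; `resolve_vault_env` and `_read_token` are hypothetical names, and treating an empty token file as absent is an assumption.

```python
from pathlib import Path

DEFAULT_VAULT_ADDR = "https://vault.viktorbarzin.me"


def _read_token(path: Path):
    """Return the stripped file contents, or None if missing or empty."""
    try:
        return path.read_text().strip() or None
    except OSError:
        return None


def resolve_vault_env(env, home):
    """Return (token, addr) following the documented precedence."""
    home = Path(home)
    token = (
        env.get("VAULT_TOKEN")  # 1. explicit override
        or _read_token(home / ".config" / "claude-auth-sync" / "vault-token")  # 2. scoped token
        or _read_token(home / ".vault-token")  # 3. native token
    )
    addr = env.get("VAULT_ADDR") or DEFAULT_VAULT_ADDR
    return token, addr
```

Called as `resolve_vault_env(os.environ, Path.home())`, the scoped token wins over a read-only `~/.vault-token`, which is the ordering that avoids the `403 permission denied` case.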
- -That scoped policy grants exactly `create`/`read`/`update` on the user's own -`secret/workstation/claude-users/` path — no `patch` capability — so the -tool writes with `vault kv patch -method=rw` (read-modify-write), falling back to -`kv put` only when the path does not exist yet. This preserves the -`claude_ai_oauth_json` key that [claude-auth-sync](claude-auth-renew-workstation.md) -co-locates there. (The admin-only bugs were fixed 2026-06-27; the -`VAULT_ADDR`/token-precedence bugs above were fixed 2026-06-28.) - -## Prerequisites (per user) - -- The user is in `scripts/workstation/roster.yaml` and the **vault** stack has - been applied → their `workstation-claude-` policy exists. -- The user's workstation was provisioned (`setup-devvm.sh`) → their scoped Vault - token exists at `~/.config/claude-auth-sync/vault-token`. -- `bw` is installed **system-wide** at `/usr/bin/bw` (see below). -- The user has a Vaultwarden account at `https://vaultwarden.viktorbarzin.me` - (self-service signup is open; admin panel is disabled). - -## One-time admin steps (devvm) - -`bw` must be system-wide so every user resolves it (it is a Node script, and -`node` is already system-wide at `/usr/bin/node`). `setup-devvm.sh` installs it -to the npm `/usr` prefix; the guard checks the **system** path, not -`command -v bw` (an admin's own `~/.local/bin/bw` used to mask the system -install, leaving non-admins with no backend). To install on a running box: - -```bash -sudo npm install -g --prefix /usr "@bitwarden/cli@^2024" -bw --version # confirm /usr/bin/bw resolves -``` - -After landing a `cli/` change, rebuild the binary so users pick it up: - -```bash -# version is stamped from cli/VERSION, exactly as setup-devvm.sh does it -sudo bash -c 'cd /home/wizard/code/infra/cli && \ - go build -ldflags "-X main.version=$(cat VERSION 2>/dev/null || echo dev)" \ - -o /usr/local/bin/homelab .' -``` - -(or just re-run `scripts/workstation/setup-devvm.sh` as root, which rebuilds it.) 
- -## User onboarding - -The user runs these as themselves. The master password / API key are entered -interactively (never on the command line) and stored only in the user's Vault -path. - -1. In the Vaultwarden web vault → **Settings → Security → Keys → View API key**, - copy the `client_id` (`user.xxxx`) and `client_secret`. -2. Configure: - - ```bash - homelab vault setup # prompts: VW email, API client_id/secret, master password - homelab vault status # → "vault: configured, unlocked, reachable ✓" - homelab vault list # item names (own vault + any shared Collections) - ``` - -## Shared-Collection access (sharing passwords with a user) - -`homelab vault` surfaces Organization Collection items automatically once the -user's Vaultwarden account is a confirmed member. These steps are done by the -vault owner in the **Vaultwarden web UI** (they need the owner's master -password — not an infra/Terraform operation): - -1. Create or reuse an **Organization** and a **Collection** of shared logins. -2. **Invite** the user's Vaultwarden account to the Organization, granting - **"Can view"** on that Collection (least privilege). -3. The user accepts the email invite and confirms membership. -4. The user runs `homelab vault list` — the shared items now appear alongside - their own (a `homelab vault status` sync picks them up). - -## Security model (the no-HITL trade) - -Identity is the kernel UID. Anything running as the user can decrypt the user's -vault — this is the accepted trade for no-human-in-the-loop fetches. Secrets -never appear in `argv` (passed via env or stdin), core dumps are disabled, TOTP -fetches are logged to syslog/Loki, and on a TTY values go to the clipboard -(auto-clearing) rather than scrollback. The admin's Vault token is never used by -a non-admin: each user authenticates with their own scoped token. 
- -## Verification - -```bash -# the scoped token carries the right policy -VAULT_TOKEN="$(sudo cat /home//.config/claude-auth-sync/vault-token)" \ - vault token lookup -format=json | jq '.data.display_name, .data.policies' -# → "token-devvm-claude-auth-", [..., "workstation-claude-"] - -sudo -u -i bw --version # /usr/bin/bw resolves for the user -sudo -u -i homelab vault status -``` - -## Troubleshooting - -**`homelab vault setup` (or any verb) fails with `exit status 2`** — older -binaries swallowed the underlying `vault` error; the message now includes it. -Two historical causes (both fixed in-CLI 2026-06-28, kept here for diagnosis): - -- `... connection refused` to `127.0.0.1:8200` → `VAULT_ADDR` wasn't set in the - caller's shell. The CLI now self-defaults it, but if you see this on an old - binary: `export VAULT_ADDR=https://vault.viktorbarzin.me`. -- `403 permission denied` on `PUT .../secret/data/workstation/claude-users/` - → a stale read-only `~/.vault-token` (e.g. from `vault login -method=oidc`, - policy `default`, capability `deny` on that path) was shadowing the scoped - token. The CLI now prefers the scoped token; on an old binary, `rm - ~/.vault-token` (or `unset VAULT_TOKEN`) and retry. Confirm with - `VAULT_TOKEN="$(sudo cat /home//.config/claude-auth-sync/vault-token)" vault token capabilities secret/data/workstation/claude-users/` - → must be `create, read, update`. 
diff --git a/docs/runbooks/k8s-version-upgrade.md b/docs/runbooks/k8s-version-upgrade.md
index 4b4b42b0..08d43926 100644
--- a/docs/runbooks/k8s-version-upgrade.md
+++ b/docs/runbooks/k8s-version-upgrade.md
@@ -36,13 +36,11 @@ envsubst on /template/job-template.yaml | kubectl apply -f -
 ▼
 Job 0 — preflight (pinned: k8s-node1)
- ├── compat-gate: addon/API/containerd support for target (else BLOCK-actionable+alert / HOLD-quiet)
+ ├── compat-gate: addon/API/containerd support for target (else BLOCK+alert)
 ├── All nodes Ready + no Mem/Disk pressure
 ├── halt-on-alert (kured-style ignore-list)
 ├── 24h-quiet baseline (no Ready transitions <24h ago)
 ├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume)
- ├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block)
- ├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups)
 ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
 ├── Trigger backup-etcd Job, wait, verify snapshot byte count
 ├── SSH master: containerd skew fix (if master < workers)
@@ -114,36 +112,18 @@ inert for a patch (no API removal or containerd floor occurs inside a minor).
 
 This is the **"auto-upgrade when we can, halt + alert when we can't"** contract.
 
-**The gate classifies each refusal** (2026-06-28) so it only cries wolf when
-there's something to do — `compat-gate.py` exit code + a `[TAG]` on every reason:
+**On a block**, the gate:
+- pushes `k8s_upgrade_blocked=1` to Pushgateway (→ the `K8sUpgradeBlocked`
+ Prometheus alert),
+- Slacks the **specific reasons** (which addon/API/node, current vs required), and
+- **halts the chain** — it exits **non-fatal** (the upgrade simply isn't safe yet,
+ this is not a failure). Because the block happens **before any mutation, no
+ rollback is involved**; nothing was changed.
-- **`[ACTIONABLE]`** (exit 2) — a newer version of the lagging addon **exists in - the compat matrix** and upgrading it would clear the block (or an in-use - deprecated API must be migrated / a node's containerd bumped). -- **`[WAITING]`** (exit 4 = held) — **no released addon version supports the - target yet** (e.g. kyverno/ESO behind a brand-new k8s minor). Only an upstream - release can clear it. -- **`[PINNED]`** (exit 4 = held) — a supporting version exists but the addon is - **deliberately pinned** in the matrix (`"pinned": true`, e.g. gpu-operator, - whose bump is coupled to a newer NVIDIA driver image + Ubuntu/kernel). -- **Held wins on a mix**: if any blocker is waiting/pinned the whole target is - held — acting on the actionable ones wouldn't unblock it yet. - -**On any refusal** the preflight pushes the verdict gauge (`k8s_upgrade_blocked=1` -for actionable, `k8s_upgrade_held=1` for held), sets `HALT_CHAIN` so the chain -doesn't advance, and **exits 0 — the Job Completes cleanly** (a refusal is a -decision, not a failure: no Failed Job, no `K8sUpgradeChainJobFailed`). It's -before any mutation, so no rollback. Reasons (grouped by class) appear in the -**morning nightly report**, not a per-run Slack. - -- **Actionable** → `K8sUpgradeBlocked` fires (once, via alert-on-change). Clear - it by doing the named upgrade/migration; the next nightly run proceeds. -- **Held** → **deliberately NO alert** — only the nightly report's `⏸️ HELD` - line, because it can't be actioned now (a nightly alert would cry wolf). It - clears itself once upstream ships support (refresh `addon-compat.json`) or the - pin is lifted (delete `pinned`+`pin_reason`). The detector re-evaluates every - night, silently re-spawning the refused-but-Complete preflight (so a cleared - block is picked up next run, not after the 7d Job TTL). 
+**To clear a block**: upgrade the named addon (or migrate the API caller off the
+deprecated group/version, or bump containerd on the named node) so the offending
+condition no longer holds. The **next nightly run then proceeds automatically** —
+no manual chain restart needed.
 
 The **compat matrix** lives in
 `stacks/k8s-version-upgrade/scripts/addon-compat.json` — a map of `addon → highest
@@ -183,8 +163,6 @@ Pushed by upgrade-step.sh during phase execution; observed by the
 | `k8s_upgrade_in_flight` (1/0) | preflight Job (set to 1) | postflight Job (set to 0) |
 | `k8s_upgrade_started_timestamp` (epoch s) | preflight Job | postflight Job (set to 0) |
 | `k8s_upgrade_snapshot_taken` (1/0) | preflight Job (set to 1 after Job=`pre-upgrade-etcd-*` completes with `Backup done:` log of ≥1 KiB) | postflight Job (0) |
-| `k8s_upgrade_blocked` (1/0) | preflight Job — set 1 on an **actionable** compat refusal (→ `K8sUpgradeBlocked`) | preflight (definitive each run; 0 when safe) / postflight (0) |
-| `k8s_upgrade_held` (1/0) | preflight Job — set 1 on a **held** (waiting-upstream/pinned) refusal; **no alert** | preflight (definitive each run; 0 when safe) / postflight (0) |
 | `k8s_upgrade_available{kind,running,target}` | detection CronJob | next detection run (overwrite) |
 | `k8s_version_check_last_run_timestamp` | detection CronJob | (cumulative) |
 
@@ -193,8 +171,8 @@ Pushed by upgrade-step.sh during phase execution; observed by the
 - **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout.
 - **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
 - **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
-- **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The old `unless on() (k8s_upgrade_blocked == 1)` clause was **dropped 2026-06-28**: compat-gate refusals now Complete cleanly (exit 0) instead of Failing, so a terminally-Failed chain Job again means a genuine wedge with nothing to exclude. -- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). An **ACTIONABLE** compat-gate refusal — a newer version of the lagging addon exists and upgrading it would clear the block (or an in-use deprecated API must be migrated / a node's containerd bumped). Reasons (grouped by class) are in the **morning nightly report**; clear it by doing the named upgrade/migration, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert. **There is deliberately NO companion alert for the held verdict** (`k8s_upgrade_held=1` — waiting-on-upstream / pinned): nothing can be actioned now, so it is surfaced only by the nightly report's `⏸️ HELD` line. +- **`K8sUpgradeChainJobFailed`** — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). 
Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The `unless k8s_upgrade_blocked == 1` clause (added 2026-06-21) excludes a preflight that failed because the **compat gate deliberately refused** the target — that's owned by `K8sUpgradeBlocked` and was double-firing here; a genuine wedge exits without setting the blocked gauge, so it still fires. +- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). A k8s **auto-upgrade was refused** by the compat gate because a critical addon, an in-use deprecated API, or a node's containerd is too old for the detected target. The **specific reasons are in Slack**; clear it by upgrading the named addon / migrating the API caller / bumping containerd, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert. - The first four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade. ### Nightly upgrade report (Slack) @@ -203,8 +181,8 @@ CronJob `k8s-upgrade-nightly-report` (k8s-upgrade ns, `var.report_schedule`, default `7 6 * * *` = 06:07 UTC — after the 23:00 chain, before the 08:00 London alert-digest) posts ONE Slack summary each morning of the previous night's run: running version, detector freshness, detected target + kind, the outcome -(⚪ no upgrade needed / 🔴 blocked-actionable + reasons / ⏸️ held = waiting-upstream/pinned / -🟢 upgraded / 🟡 in progress / ⚠️ detector stale), and recent chain jobs. 
Read-only — it reads +(⚪ no upgrade needed / 🔴 blocked + live blocker reasons / 🟢 upgraded / +🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads the Pushgateway gauges + live nodes/jobs and re-runs `compat-gate.py` for fresh blocker reasons; reuses the chain's SA + `slack_webhook` + scripts ConfigMap. Logic + unit tests: `scripts/nightly-report.py`, `scripts/test_nightly_report.py`. @@ -244,34 +222,22 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names ## Common Operations -### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24) +### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19) `kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml` -from kubeadm-config**. apiserver auth uses a structured multi-issuer -`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to -still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade -reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does -NOT crash on this — verified by isolated repro; it's recoverable via the restore -script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue — -etcd IO starvation**, not this drift; post-mortem: -`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`. +and drops the `--authentication-config` flag**, silently disabling apiserver +OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get +401). This used to require a manual re-apply after **every** control-plane bump. -**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now -**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting -`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of -its remote script. 
So kubeadm regenerates a **correct** manifest and the apiserver -upgrades with a pure image bump — `kubeadm upgrade diff ` shows only the -image change. Zero live impact (the CM is read only during an upgrade). - -**Backstops:** -- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does - NOT block — the drift only breaks SSO, which is recoverable) if - `--authentication-config` would still be dropped. -- The `rbac` stack still publishes its restore script to the - `kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on - master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with - auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also* - re-reconciles kubeadm-config. Self-skips when master is already at target. +**Now automated:** the `rbac` stack publishes its OIDC restore script to the +`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's +`phase_master` re-runs it on master immediately after `kubeadm upgrade apply` +(while tigera-operator is still quiesced, so the flag-add apiserver restart can't +crashloop the operator). It's idempotent, health-gates `/livez` with +auto-rollback, and is **non-fatal** — a failure only lags SSO until the next rbac +apply (the version upgrade itself already succeeded). So a chain-driven +control-plane bump no longer breaks SSO. The master phase self-skips when master +is already at target, so this only runs when master was actually upgraded. 
**Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the chain logged `WARN: --authentication-config absent after re-apply`: diff --git a/docs/runbooks/pfsense-egress.md b/docs/runbooks/pfsense-egress.md deleted file mode 100644 index 39bca116..00000000 --- a/docs/runbooks/pfsense-egress.md +++ /dev/null @@ -1,72 +0,0 @@ -# Runbook: pfSense WAN / egress outage - -**Scope:** the cluster (and home) loses **internet egress** while pfSense is -otherwise alive — internal VLAN routing and DNS keep working. This is the -**2026-06-27 incident class**: pfSense (Proxmox **VMID 101**) stopped passing -IPv4 egress for ~20 min (00:02→00:23 UTC) while LAN/OPT1 routing + Unbound -stayed up; recovery required a manual reboot, and **nothing alerted** (no egress -probe existed; the cloudflared replica metric stayed green). The alerts + -probes below close that gap. Incident detail: memory ids #6715–#6723. - -pfSense is a **single point of failure** (no HA): it is the k8s default gateway -(`10.0.20.1`), Kea DHCP, Unbound DNS, NAT, and the WireGuard hub. WAN is -**static** `192.168.1.2/24`, upstream gateway `WANGW = 192.168.1.1` (the TP-Link -Archer AX6000). The sole IPv4 default gateway, no gateway-group/failover. 
- -## Alerts (all in `stacks/monitoring/modules/monitoring/`) - -| Alert | Signal | Means | -|-------|--------|-------| -| `WANGatewayUnreachable` (critical) | in-cluster ICMP to `192.168.1.1` fails >3m | pfSense's upstream gateway is unreachable from the cluster | -| `InternetEgressDown` (critical) | in-cluster ICMP to **both** `9.9.9.9` and `1.1.1.1` fails >2m | internet egress through pfSense NAT is black-holed | -| `ExternalDNSResolutionDown` (warning) | UDP/53 to both public resolvers fails >3m | egress or external-DNS path broken | -| `EgressOnlyDivergence` (critical) | t3-probe `cloudflare` leg down **while** `internal` leg up >3m | egress-specific failure, internal healthy (the exact 2026-06-27 signature) | -| `PfSenseVMDown` (critical) | `pve_up{id="qemu/101"}==0` while host up >2m | the pfSense VM stopped/crashed (host fine) | -| `CloudflaredTunnelConnLoss` (warning, Loki) | >20 cloudflared edge-conn failures/5m | tunnel/egress trouble (canary that fires first; replica metric is blind) | - -Probes run **from inside the cluster** (blackbox-exporter, pod → node → pfSense -NAT), so they exercise the exact egress path that fails. `WANGatewayUnreachable` -/ `InternetEgressDown` **inhibit** the downstream egress symptoms so one root -alert pages, not a storm. - -`PfSenseVMDown` **does not** catch a *guest-internal* reboot — `pve_up` tracks -the qemu process, which survives an in-guest reboot (this is why 2026-06-27 was -metric-invisible). `CloudflaredTunnelConnLoss` + the probe alerts cover that case. - -## Diagnose (read-only first) - -1. **Confirm scope** — is it egress-only or total? - - `kubectl -n monitoring` Grafana → `probe_success{job=~"wan-gateway-icmp|internet-egress-icmp"}` and `t3probe_connected` by `leg`. - - Internal still up? `pve_up{id="qemu/101"}` should be `1`; internal k8s DNS (`10.0.20.1`) still resolving = pfSense alive, egress-only. -2. 
**Capture pfSense on-box logs BEFORE rebooting** (they persist on disk — no RAM-disk — and are the only source that proves the mechanism; they are NOT shipped to Loki): - ``` - ssh -i ~/.ssh/id_ed25519 admin@10.0.20.1 # devvm wizard key (id #6784) - clog /var/log/gateways.log | grep -iE 'WANGW|down|up|delay|loss' # dpinger gateway alarms - clog /var/log/routing.log | grep -iE 'default|route' # default-route add/delete - clog /var/log/system.log | tail -200 - netstat -rn | head # is the default route present? - ls -la /var/crash/ # panic/textdump? - ``` - (If SSH is rejected post-reboot, the reboot regenerated `authorized_keys` from - config.xml — re-add the key via console or WebGUI; see id #6718.) -3. **Upstream check** — is the TP-Link / ISP up? It held the same public IP with - clean DHCP renewals through the 2026-06-27 event, so a *sustained* upstream - fault is unlikely; a reboot fixing it points at **pfSense-side state**. - -## Recover - -- **Fast path (known fix):** reboot pfSense — re-adds the default route, re-arms - dpinger, flushes pf state. **Capture the logs above FIRST** (a reboot wipes - the volatile evidence needed to find the real mechanism). -- Targeted (if logs show a dpinger gateway-down): System → Routing → Gateways → - WANGW; check the monitor IP + dpinger state; re-enable the gateway / let it - re-eval. Confirm `netstat -rn` shows the default route restored. - -## Prevent / harden (deferred, needs a live-pfSense change) - -Not done in this monitoring change — tracked for a follow-up with hands-on -pfSense access: point dpinger's monitor at the local gateway (`192.168.1.1`) -instead of an external IP + widen thresholds; disable `gw_down_kill_states` for -the single WAN; add a failover gateway group; a 60s auto-recovery watchdog; -ship pfSense system/gateway/routing syslog to the cluster so these logs become -centrally queryable. 
diff --git a/scripts/cluster_healthcheck.sh b/scripts/cluster_healthcheck.sh index a5088137..51a13b5d 100755 --- a/scripts/cluster_healthcheck.sh +++ b/scripts/cluster_healthcheck.sh @@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}" [[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config" KUBECTL="" JSON_RESULTS=() -TOTAL_CHECKS=48 +TOTAL_CHECKS=47 # Parallel execution settings. Each check function is self-contained — it # only reads cluster state and mutates the in-memory counters / JSON_RESULTS @@ -3156,44 +3156,6 @@ PYEOF esac } -# --- 48. Goldmane edge-aggregator availability --- -# -# The goldmane-edge-aggregator Deployment (ADR-0014 / infra #58) streams Calico -# Goldmane flows into the goldmane_edges CNPG DB — the durable who-talks-to-whom -# trail. The pod has NO /metrics endpoint, so its liveness can't be scraped; -# this check reads the Deployment's Available condition directly so the trail -# silently dying surfaces in the health board (mirrors the AggregatorDown -# Prometheus alert). Missing Deployment / not-Available -> FAIL. -check_goldmane_aggregator() { - section 48 "Goldmane Edge-Aggregator" - local ns="goldmane-edge-aggregator" dep="goldmane-edge-aggregator" - local avail desired ready - - # One get; absent Deployment is a hard fail (the trail isn't deployed). - if ! 
$KUBECTL get deploy "$dep" -n "$ns" >/dev/null 2>&1; then - [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator" - fail "Deployment $ns/$dep not found — who-talks-to-whom edge trail is not running" - json_add "goldmane_aggregator" "FAIL" "deployment missing" - return 0 - fi - - avail=$($KUBECTL get deploy "$dep" -n "$ns" \ - -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null) - ready=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.status.readyReplicas}' 2>/dev/null) - desired=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.spec.replicas}' 2>/dev/null) - ready=${ready:-0} - desired=${desired:-0} - - if [[ "$avail" == "True" ]]; then - pass "Edge-aggregator Available ($ready/$desired ready)" - json_add "goldmane_aggregator" "PASS" "${ready}/${desired} ready" - else - [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator" - fail "Edge-aggregator NOT Available ($ready/$desired ready) — edge trail has stopped recording" - json_add "goldmane_aggregator" "FAIL" "${ready}/${desired} ready; Available=${avail:-unknown}" - fi -} - # --- Summary --- print_summary() { if [[ "$JSON" == true ]]; then @@ -3262,7 +3224,7 @@ main() { check_monitoring_prom_am check_monitoring_vault check_monitoring_css check_external_replicas check_external_divergence check_pve_thermals check_pve_load check_external_traefik_5xx check_ha_status_dashboard - check_immich_search check_csi_ghost_drift check_goldmane_aggregator + check_immich_search check_csi_ghost_drift ) # Auto-fix mutates cluster state inside individual checks — keep that diff --git a/scripts/t3-provision-users.sh b/scripts/t3-provision-users.sh index 1714596a..9cbc6c1e 100644 --- a/scripts/t3-provision-users.sh +++ b/scripts/t3-provision-users.sh @@ -240,79 +240,6 @@ EOF log "wrote OIDC kubeconfig -> $user:~/.kube/config" } -# Hands-off chrome-service browser credential. 
For a user who has a -# `-browser` ServiceAccount in the chrome-service namespace (created in -# stacks/chrome-service/rbac.tf), install a DUAL-CONTEXT kubeconfig whose DEFAULT -# context authenticates with that SA's long-lived token — so `homelab browser` -# (which shells out to `kubectl port-forward -n chrome-service`) works -# non-interactively, even from a headless agent session (the user's interactive -# OIDC login can't authenticate a headless kubectl). The user's personal OIDC -# identity is retained as the `oidc@homelab` named context -# (`kubectl --context oidc@homelab`). TF (the SA's existence) is the source of -# truth for WHO gets this — there is no roster flag. Idempotent (cmp-guarded; SA -# tokens are stable) + best-effort (cluster/secret unreachable -> WARN, never aborts). -install_browser_kubeconfig() { - local user="$1" home kc sa secret token server ca tmp - home="$(getent passwd "$user" | cut -d: -f6)" - [[ -z "$home" ]] && return 0 - sa="${user}-browser" - secret="${sa}-token" - [[ -r "$ADMIN_KUBECONFIG" ]] || return 0 - # Gate: only users with a chrome-service browser SA (TF-driven). Best-effort read. 
- KUBECONFIG="$ADMIN_KUBECONFIG" kubectl --request-timeout=10s -n chrome-service get serviceaccount "$sa" >/dev/null 2>&1 || return 0 - token="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl --request-timeout=10s -n chrome-service get secret "$secret" -o jsonpath='{.data.token}' 2>/dev/null | base64 -d 2>/dev/null || true)" - [[ -n "$token" ]] || { log "WARN: browser SA token not ready for $user (secret chrome-service/$secret) — skipped"; return 0; } - server="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl config view --raw --minify -o jsonpath='{.clusters[0].cluster.server}')" - ca="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl config view --raw --minify -o jsonpath='{.clusters[0].cluster.certificate-authority-data}')" - [[ -n "$server" && -n "$ca" ]] || { log "WARN: could not read cluster server/CA -> skip browser kubeconfig for $user"; return 0; } - kc="$home/.kube/config" - tmp="$(mktemp)" - cat > "$tmp" </dev/null; then rm -f "$tmp"; return 0; fi # already current -> no churn - if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] dual-context (SA default + OIDC) browser kubeconfig -> $user:$kc"; rm -f "$tmp"; return 0; fi - install -d -o "$user" -g "$user" -m 0700 "$home/.kube" - install -o "$user" -g "$user" -m 0600 "$tmp" "$kc" || { log "WARN: failed to write browser kubeconfig for $user"; rm -f "$tmp"; return 0; } - rm -f "$tmp" - log "wrote dual-context browser kubeconfig (SA default + OIDC) -> $user:~/.kube/config" - return 0 -} - # Idempotently set KEY=VALUE in a t3-serve env file, PRESERVING other lines — so writing # T3_PORT never clobbers an injected CLAUDE_CODE_OAUTH_TOKEN, and vice-versa. Mode 0600. 
env_set() { @@ -667,7 +594,6 @@ while IFS=$'\t' read -r os_user tier shell groups_csv code_layout repos_csv; do refresh_user_clone "$os_user" code fi install_user_kubeconfig "$os_user" - install_browser_kubeconfig "$os_user" # hands-off chrome-service CLI cred (no-op unless the user has a browser SA) deploy_user_launcher "$os_user" # keep ~/start-claude.sh current (skel only seeds new accounts) fi refresh_codex_mirror "$os_user" # all tiers — mirror of the managed claudeMd diff --git a/scripts/t3-serve@.service b/scripts/t3-serve@.service index 7f3d765d..4109b36b 100644 --- a/scripts/t3-serve@.service +++ b/scripts/t3-serve@.service @@ -11,12 +11,6 @@ Environment=HOME=/home/%i Environment=PATH=/usr/local/bin:/usr/bin:/bin:/home/%i/.local/bin Environment=NODE_ENV=production EnvironmentFile=/etc/t3-serve/%i.env -# Optional per-user long-lived CLAUDE_CODE_OAUTH_TOKEN, materialized by -# claude-auth-sync from the user's own Vault path. Non-rotating, so t3's -# concurrent agent sessions can't race on OAuth refresh-token rotation and wipe -# the shared ~/.claude/.credentials.json. Leading '-' = optional (absent for -# users on the normal per-user Enterprise-SSO credential flow). 
-EnvironmentFile=-/home/%i/.config/claude-auth-sync/claude-oauth.env WorkingDirectory=/home/%i ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3 Restart=on-failure diff --git a/scripts/test-claude-auth-sync.sh b/scripts/test-claude-auth-sync.sh index 62c54e8b..10f07746 100755 --- a/scripts/test-claude-auth-sync.sh +++ b/scripts/test-claude-auth-sync.sh @@ -28,61 +28,5 @@ ok "accept own scoped Vault token" cas_vault_identity_ok token-devvm-claude-auth no "reject another user's token" cas_vault_identity_ok token-devvm-claude-auth-anca default,workstation-claude-anca no "reject wrong policy" cas_vault_identity_ok token-devvm-claude-auth-emo default,workstation-claude-anca -# --- Regression: cas_backup must MERGE into the shared Vault path, preserving -# sibling keys that other tools co-locate there (e.g. `homelab vault`'s -# vaultwarden_* creds) — NOT overwrite the whole KV document. A blind `kv put` -# wiped them every 6h (claude-auth-sync clobber, 2026-06-26). -fakebin="$tmp/bin"; mkdir -p "$fakebin" -store="$tmp/vault-store.json" -cat > "$fakebin/vault" <<'FAKE' -#!/usr/bin/env bash -# Minimal KV-v2 fake backed by $VAULT_FAKE_STORE (a flat JSON object). -[[ "$1" == kv ]] || { echo '{}'; exit 0; } # token lookup etc. 
-> ignore
-op="$2"; shift 2
-store="$VAULT_FAKE_STORE"
-case "$op" in
-  get)
-    for a in "$@"; do [[ "$a" == -field=* ]] && field="${a#-field=}"; done
-    if [[ "$*" == *-format=json* ]]; then
-      [[ -f "$store" ]] || { echo "No value found"; exit 2; }
-      jq -n --argjson d "$(cat "$store")" '{data:{data:$d}}'; exit 0
-    fi
-    [[ -f "$store" ]] || exit 2 # bare get == existence check
-    if [[ -n "${field:-}" ]]; then
-      v="$(jq -r --arg k "$field" '.[$k] // empty' "$store")"; [[ -n "$v" ]] || exit 1
-      printf '%s' "$v"; exit 0
-    fi
-    exit 0 ;;
-  put) echo '{}' > "$store" ;; # full replace
-  patch) [[ -f "$store" ]] || { echo "No value found"; exit 2; } ;; # merge (rw)
-  *) exit 1 ;;
-esac
-for a in "$@"; do
-  case "$a" in
-    -*|secret/*) continue ;; # flags + the path arg
-    *=*) k="${a%%=*}"; v="${a#*=}"
-      t="$(mktemp)"; jq --arg k "$k" --arg v "$v" '.[$k]=$v' "$store" > "$t" && mv "$t" "$store" ;;
-  esac
-done
-exit 0
-FAKE
-chmod +x "$fakebin/vault"
-
-CAS_VAULT_PATH="secret/workstation/claude-users/test"
-CAS_CREDENTIALS="$tmp/credentials.json"
-CAS_STATE_DIR="$tmp/state"
-_oldpath="$PATH"; PATH="$fakebin:$PATH"; export VAULT_FAKE_STORE="$store"
-
-printf '{"vaultwarden_master_password":"keep-me"}\n' > "$store" # pretend `homelab vault setup` ran
-ok "backup succeeds (existing doc)" cas_backup
-eq "merge preserves sibling key" keep-me "$(jq -r '.vaultwarden_master_password' "$store")"
-eq "merge writes claude oauth" access "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")"
-
-rm -f "$store" # fresh user: no doc yet
-ok "backup succeeds (creates doc)" cas_backup
-eq "create writes claude oauth" access "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")"
-
-PATH="$_oldpath"; unset VAULT_FAKE_STORE
-
 printf '\n%d passed, %d failed\n' "$pass" "$fail"
 (( fail == 0 ))
diff --git a/scripts/workstation/claude-auth-sync.sh b/scripts/workstation/claude-auth-sync.sh
index b9676df9..dc3d780d 100755
--- a/scripts/workstation/claude-auth-sync.sh
+++ b/scripts/workstation/claude-auth-sync.sh
@@ -13,10 +13,6 @@ CAS_VAULT_TOKEN_FILE="${CLAUDE_AUTH_VAULT_TOKEN_FILE:-$CAS_CONFIG_DIR/vault-toke
 CAS_VAULT_PATH="${CLAUDE_AUTH_VAULT_PATH:-secret/workstation/claude-users/$CAS_USER}"
 CAS_STATE_DIR="${CLAUDE_AUTH_STATE_DIR:-$CAS_HOME/.local/state/claude-auth-sync}"
 CAS_LOG="$CAS_STATE_DIR/sync.log"
-# Where a long-lived per-user setup-token is materialized as an env file
-# (KEY=VALUE) for start-claude.sh + t3-serve@.service to load. Lives under the
-# already-ReadWritePaths config dir so the sandboxed service may write it.
-CAS_TOKEN_ENV_FILE="${CLAUDE_AUTH_TOKEN_ENV_FILE:-$CAS_CONFIG_DIR/claude-oauth.env}"
 cas_log() {
   mkdir -p "$CAS_STATE_DIR"
@@ -86,17 +82,7 @@ cas_backup() {
     return 1
   }
   expires="$(jq -r '.expiresAt' <<<"$oauth")"
-  # MERGE into the shared path so sibling keys other tools co-locate there
-  # (e.g. `homelab vault`'s vaultwarden_* creds) survive. `kv patch -method=rw`
-  # is read+update (needs no `patch` capability) but requires the secret to
-  # already exist, so create it with `kv put` on the very first backup only.
-  local -a write_cmd
-  if vault kv get "$CAS_VAULT_PATH" >/dev/null 2>&1; then
-    write_cmd=(vault kv patch -method=rw "$CAS_VAULT_PATH")
-  else
-    write_cmd=(vault kv put "$CAS_VAULT_PATH")
-  fi
-  "${write_cmd[@]}" \
+  vault kv put "$CAS_VAULT_PATH" \
     claude_ai_oauth_json="$oauth" \
     credential_expires_at_ms="$expires" \
     backed_up_at="$(date -Is)" >/dev/null || {
@@ -137,41 +123,6 @@ cas_restore() {
   cas_log "RECOVERED restored Claude OAuth state from Vault"
 }
-# A user-scoped, long-lived setup-token (`sk-ant-oat01-…`, ~1y, NON-rotating) may
-# be stored in this user's OWN Vault path (field `setup_token`). When present it
-# is the authoritative credential: it bypasses the shared
-# ~/.claude/.credentials.json OAuth refresh-token rotation entirely — the fix for
-# users running many concurrent Claude sessions (interactive + t3-serve + always-on
-# agents) that otherwise race on refresh and wipe each other's refresh token.
-# We materialize it to a user-owned env file that start-claude.sh and
-# t3-serve@.service load as CLAUDE_CODE_OAUTH_TOKEN. This is the user's OWN
-# Enterprise identity, NOT the forbidden legacy SHARED token — it never crosses
-# OS users. Returns 0 when a token is active, so the caller skips the
-# rotating-credential validate/backup/restore (probing the now-vestigial
-# credential would otherwise emit false WorkstationClaudeAuthInvalid alerts).
-cas_sync_setup_token() {
-  local token desired tmp
-  token="$(vault kv get -field=setup_token "$CAS_VAULT_PATH" 2>/dev/null)" || token=""
-  if [[ "$token" != sk-ant-oat01-* ]]; then
-    if [[ -e "$CAS_TOKEN_ENV_FILE" ]]; then
-      rm -f "$CAS_TOKEN_ENV_FILE"
-      cas_log "removed stale CLAUDE_CODE_OAUTH_TOKEN env (no setup-token in Vault)"
-    fi
-    return 1
-  fi
-  desired="CLAUDE_CODE_OAUTH_TOKEN=$token"
-  if [[ -r "$CAS_TOKEN_ENV_FILE" && "$(<"$CAS_TOKEN_ENV_FILE")" == "$desired" ]]; then
-    cas_log "OK long-lived setup-token active (CLAUDE_CODE_OAUTH_TOKEN current); credential checks skipped"
-    return 0
-  fi
-  tmp="$(mktemp "${CAS_TOKEN_ENV_FILE}.XXXXXX")" || { cas_log "FAIL could not stage token env file"; return 1; }
-  printf '%s\n' "$desired" > "$tmp"
-  chmod 0600 "$tmp"
-  mv "$tmp" "$CAS_TOKEN_ENV_FILE"
-  cas_log "OK long-lived setup-token active; CLAUDE_CODE_OAUTH_TOKEN materialized; credential checks skipped"
-  return 0
-}
-
 cas_main() {
   umask 077
   for bin in jq vault claude timeout flock; do
@@ -182,11 +133,6 @@ cas_main() {
   flock -n 9 || { cas_log "SKIP another sync is already running"; return 0; }
   cas_prepare_vault || return 1
-  # A long-lived per-user setup-token, if provisioned, is authoritative and
-  # non-rotating — materialize it and skip the rotating-credential dance.
-  if cas_sync_setup_token; then
-    return 0
-  fi
   if cas_live_auth_ok; then
     cas_backup
     return
diff --git a/scripts/workstation/claude-hooks/homelab-memory-recall.py b/scripts/workstation/claude-hooks/homelab-memory-recall.py
index c9e1d1c3..7315f116 100755
--- a/scripts/workstation/claude-hooks/homelab-memory-recall.py
+++ b/scripts/workstation/claude-hooks/homelab-memory-recall.py
@@ -45,15 +45,9 @@ def main() -> None:
     try:
         res = subprocess.run(
             [homelab, "memory", "recall", prompt, "--limit", "5"],
-            capture_output=True, text=True, errors="replace", timeout=4,
-            env=os.environ,
+            capture_output=True, text=True, timeout=4, env=os.environ,
         )
-    except Exception:
-        # Best-effort: ANY failure — timeout, OSError, or a UnicodeDecodeError on
-        # truncated multibyte (Cyrillic) output — must silently skip recall this
-        # turn, exactly like the MCP being unavailable. errors="replace" above
-        # also keeps a mid-rune-truncated payload from raising here at all. Never
-        # let this hook surface a "UserPromptSubmit hook error".
+    except (subprocess.TimeoutExpired, OSError):
         return
     out = (res.stdout or "").strip()
diff --git a/scripts/workstation/claude-skills/README.md b/scripts/workstation/claude-skills/README.md
index 1fa06d94..816cbcb7 100644
--- a/scripts/workstation/claude-skills/README.md
+++ b/scripts/workstation/claude-skills/README.md
@@ -19,29 +19,13 @@ unpinned-CLI dependencies out of the hourly **root** reconcile.
 - `mattpocock/skills` (https://github.com/mattpocock/skills) — all except `find-skills`
 - `vercel-labs/skills` (https://github.com/vercel-labs/skills) — `find-skills`
-- **homelab-local, emo-PERSONALIZED** — `cluster-health` here is an
-  **emo-specific variant**, not a copy of the canonical skill. It started as a
-  copy of this repo's `.claude/skills/cluster-health/` but was rewritten on
-  2026-06-26 to focus on ha-sofia + emo's Sofia devices (emo is the only entry
-  in `SKILL_USERS`, a read-only power-user). The canonical admin skill
-  (`.claude/skills/cluster-health/`) is the full 47-check version and is left
-  untouched. **Do NOT `cp -a` the canonical copy over this one** — that would
-  clobber the personalization. Maintain the two independently.
 ## Refreshing
-Re-snapshot the upstream skills from a current install and commit the diff:
+Re-snapshot from a current install and commit the diff:
 ```sh
 cp -a ~/.agents/skills/. scripts/workstation/claude-skills/
 ```
-`cluster-health` is hand-maintained (emo variant) — it is **not** covered by the
-`cp -a` above and must **not** be overwritten from `.claude/skills/`. Edit it in
-place here when emo's needs change, then refresh his live copy (the provisioner's
-`install_skills()` is if-absent, so it won't update an existing `~/.agents/skills`
-copy — `cp` the new `SKILL.md` to `/home/emo/.agents/skills/cluster-health/` and
-`chown emo:emo`, or remove emo's copy and re-run the reconcile).
-
-Snapshot taken 2026-06-23 (upstream); `cluster-health` vendored 2026-06-26,
-personalized for emo 2026-06-26.
+Snapshot taken 2026-06-23.
diff --git a/scripts/workstation/claude-skills/cluster-health/SKILL.md b/scripts/workstation/claude-skills/cluster-health/SKILL.md
deleted file mode 100644
index 20d13211..00000000
--- a/scripts/workstation/claude-skills/cluster-health/SKILL.md
+++ /dev/null
@@ -1,146 +0,0 @@
----
-name: cluster-health
-description: |
-  Personalized for emo. Check whether the homelab Kubernetes cluster is
-  affecting ha-sofia or the Sofia smart-home devices it runs (Tuya devices,
-  the MPPT ATS, lights, climate, security, irrigation). Use when:
-  (1) "is ha-sofia ok", "are my devices / the ATS / the lights down",
-  (2) "is the cluster affecting Sofia / my devices",
-  (3) "check the cluster", "cluster health", "is everything running",
-  (4) a device on the Барзини → Статус dashboard looks offline.
-  Runs the cluster-wide healthcheck read-only and triages it by what
-  ha-sofia actually depends on; the rest of the cluster is the admin's area.
-author: Claude Code
-version: 3.0.0-emo
-date: 2026-06-26
----
-
-# Cluster Health — personalized for emo (ha-sofia focus)
-
-## What you actually care about
-
-You care about **ha-sofia** and the **Sofia smart-home devices** it runs —
-the Tuya devices, the **MPPT ATS**, and the lights / climate / security /
-irrigation on your **Барзини → Статус** dashboard. The wider Kubernetes
-cluster matters to you **only when it's breaking something ha-sofia or your
-devices depend on.** Anything else is the admin's (wizard's) area — note it in
-one line and move on; don't chase it.
-
-You have **read-only** cluster access. You can SEE everything but change
-nothing — so when something on your chain is broken, the job is to confirm it
-and hand it off, not to repair it.
-
-## How ha-sofia depends on the cluster
-
-ha-sofia itself runs at the house (HAOS at https://ha-sofia.viktorbarzin.me) —
-**not** in the cluster. The cluster reaches it through exactly two things:
-
-1. **tuya-bridge** (namespace `tuya-bridge`) — the REST API ha-sofia calls for
-   every Tuya device **and the MPPT ATS**. If it's unhealthy, your Tuya devices
-   + ATS stop responding. **This is the #1 thing to check.**
-2. **The path that carries ha-sofia ⇄ tuya-bridge and keeps ha-sofia
-   reachable**: cloudflared (tunnel) → Traefik (LB) → the ingress + TLS cert
-   for `tuya-bridge.viktorbarzin.me` and `ha-sofia.viktorbarzin.me`, plus
-   Technitium DNS. If any of these break, ha-sofia can't reach tuya-bridge and
-   you can't reach ha-sofia remotely.
-
-Everything else in the cluster is unrelated to you unless it's hosting one of
-those pods.
-
-## Step 1 — run the healthcheck (read-only, with your HA token)
-
-Your account can't read Vault, so load your own ha-sofia token first (it was
-minted for you and lives at `~/.config/cluster-health/haos_token`). Then run
-the script from YOUR clone, read-only:
-
-```bash
-cd /home/emo/code
-export HOME_ASSISTANT_SOFIA_TOKEN="$(cat ~/.config/cluster-health/haos_token)"
-bash scripts/cluster_healthcheck.sh --no-fix --quiet
-# machine-readable instead:
-# bash scripts/cluster_healthcheck.sh --no-fix --quiet --json | tee /tmp/cluster-health.json
-```
-
-- **Never pass `--fix`** — it deletes pods (a write); you're read-only and it
-  will fail.
-- Exit codes: `0` healthy, `1` warnings, `2` failures.
-
-With the token exported, the **ha-sofia checks run for you**:
-26 Entity Availability · 27 Integration Health · 28 Automation Status ·
-29 System Resources · **45 Status Dashboard** — your Барзини → Статус view,
-classifying every device tile as OK / ⚠️ / Offline across Сигурност, Мрежа &
-IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна. Check 30 also
-covers the **tuya** exporter.
-
-## Step 2 — triage the output by relevance to YOU
-
-Read the PASS/WARN/FAIL summary, then split the WARN/FAIL items in two:
-
-- **On your chain → this is what matters.** Anything touching: `tuya-bridge`,
-  `cloudflared`, `traefik`, DNS (check 21), the TLS cert / ingress for your two
-  hosts (checks 12, 22, 31, 32), or a **node** hosting those pods — plus all the
-  **ha-sofia** checks (26–29, 45) and the **tuya** exporter (30).
-- **Not on your chain → one line, then drop it.** Summarise as "N unrelated
-  cluster issues (admin's area)" and don't investigate.
-
-## Step 3 — read-only checks for your chain
-
-All of these work with your read-only access:
-
-```bash
-# tuya-bridge — your devices + the ATS
-kubectl get pods -n tuya-bridge
-kubectl rollout status deploy/tuya-bridge -n tuya-bridge
-kubectl logs -n tuya-bridge deploy/tuya-bridge --tail=50
-
-# the reachability path ha-sofia uses
-kubectl get pods -n cloudflared
-kubectl get pods -n traefik
-kubectl get ingress -A | grep -Ei 'tuya-bridge|ha-sofia'
-
-# whole external path in one shot (DNS + tunnel + Traefik + cert):
-curl -sI --max-time 10 https://tuya-bridge.viktorbarzin.me | head -1
-# reachable -> HTTP/2 200 / 401 / 403 (any HTTP response = path is up)
-# broken -> curl: timeout / could not resolve host
-```
-
-The fastest **device-level** signal is your own dashboard: open
-**https://ha-sofia.viktorbarzin.me → Барзини → Статус**. If devices show
-Offline / Разкачен / ⚠️ **but tuya-bridge is healthy**, the problem is at the
-house (device power / Wi-Fi / the Sofia TP-Link network) — **not** the cluster.
-
-## Step 4 — if something on your chain is broken
-
-You can't fix the cluster (read-only), so **capture + hand off**:
-
-```bash
-kubectl describe pod -n tuya-bridge
-kubectl logs -n tuya-bridge --previous --tail=200
-```
-
-Then file it for the admin with the **`/file-issue`** skill — e.g. *"ha-sofia
-Tuya devices + ATS unresponsive; tuya-bridge pod CrashLooping"* with the output
-above. cloudflared / Traefik / DNS outages are cluster-wide — the admin's
-alerting is already firing, but file it so it's tracked from your side too.
-
-## What will skip for you (expected — not failures)
-
-A few checks need access your account doesn't have. They warn/skip — that's
-normal, and **none of them are on your ha-sofia chain**:
-
-- **Uptime Kuma (14)** — needs an admin password from Vault.
-- **PVE host checks** — 36 (LVM snapshots), 43 (host thermals), 44 (host load),
-  and the Proxmox CSI ghost-disk check — all need root SSH to the Proxmox host.
-- **`--fix`** — pod deletion (a write); not available to you.
-
-(The ha-sofia checks are **not** in this list — your token makes them work.)
-
-## Your ha-sofia token
-
-- Stored at `~/.config/cluster-health/haos_token` (yours, mode 600).
-- It's a **dedicated** long-lived token, named `emo-cluster-health` under
-  ha-sofia → your profile → **Long-Lived Access Tokens**. Revoking it there
-  affects only you.
-- It currently carries admin-level HA scope (Home Assistant only lets a token
-  be minted for the account that created it, and it was minted via the admin
-  account). If it ever stops working, tell wizard and a fresh one can be minted.
diff --git a/scripts/workstation/managed-settings.json b/scripts/workstation/managed-settings.json
index 6e8a13a5..de214a1b 100644
--- a/scripts/workstation/managed-settings.json
+++ b/scripts/workstation/managed-settings.json
@@ -1,4 +1,4 @@
 {
-  "claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier for cluster/infra. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner) — don't try to exceed it. At the OS level, authorization follows the OS: where your user holds `sudo`, using it is permitted and audited.\n- OS access follows OS permissions, not a separate rule here. You may read whatever your OS user can read — directly or via `sudo` where you hold it — including other users' home directories, credentials, tokens, and `~/.claude`. Do not impose restrictions stricter than the OS; no extra prompt is required for OS-authorized access. Protect your own secrets at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. Org red-lines on top:\n - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n - Push rejected by branch protection (user not whitelisted) → fall back to a / branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n - Keep every clone on a clean master when done; tell the user in plain words what happened.\n - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.\n",
+  "claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner). Do not attempt to escalate privileges or reach another user's resources.\n- Secrets are per-user. Never read another user's home directory, credentials, tokens, or ~/.claude secrets. Your own secrets live in your home at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. Org red-lines on top:\n - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n - Push rejected by branch protection (user not whitelisted) → fall back to a / branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n - Keep every clone on a clean master when done; tell the user in plain words what happened.\n - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.\n",
   "model": "claude-opus-4-8"
 }
diff --git a/scripts/workstation/setup-devvm.sh b/scripts/workstation/setup-devvm.sh
index 02bd9257..2969b803 100755
--- a/scripts/workstation/setup-devvm.sh
+++ b/scripts/workstation/setup-devvm.sh
@@ -72,14 +72,11 @@ if [[ -n "$want_t3" && "$(t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/
 fi
 # 2c) Bitwarden CLI — backs `homelab vault` (per-user no-HITL Vaultwarden access).
-# Install SYSTEM-WIDE (npm prefix /usr → /usr/bin/bw) so EVERY user's PATH
-# resolves it. The guard tests the SYSTEM path, NOT `command -v bw`: the
-# latter is satisfied by an admin's own ~/.local/bin/bw and would skip the
-# system install, leaving non-admins (emo, anca, …) with no backend. Pinned
-# major; best-effort (a failure only disables `homelab vault`).
-if [ ! -x /usr/bin/bw ] && [ ! -x /usr/local/bin/bw ]; then
-  log "npm: installing @bitwarden/cli system-wide (homelab vault backend)"
-  npm install -g --prefix /usr "@bitwarden/cli@^2024" >/dev/null 2>&1 || log "WARN: @bitwarden/cli install failed; homelab vault unavailable"
+# npm-global so every user's PATH resolves it. Pinned major; best-effort (a
+# failure only disables `homelab vault`, nothing else on the box).
+if ! command -v bw >/dev/null; then
+  log "npm: installing @bitwarden/cli (homelab vault backend)"
+  npm install -g "@bitwarden/cli@^2024" >/dev/null 2>&1 || log "WARN: @bitwarden/cli install failed; homelab vault unavailable"
 fi
 # 3) kubelogin (kubectl oidc-login) system-wide — NOT the apt 'kubelogin' (= Azure tool).
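Note on the `setup-devvm.sh` hunk above, as a minimal sketch rather than part of the patch: the restored guard, `command -v bw`, succeeds for any `bw` on the invoking user's `PATH`, while the removed guard tested the two system paths directly. The snippet below reproduces that difference against a throwaway directory tree. The function names and the `sysroot` argument are hypothetical and exist only so the check can run without touching `/usr`.

```bash
# Illustrative only: function names and the sysroot argument are hypothetical.

# Guard as restored by the hunk: true when ANY bw is resolvable on PATH.
guard_by_path_lookup() {
  command -v bw >/dev/null 2>&1
}

# Guard as removed by the hunk: true only when a system-wide bw exists.
# $1 is a root prefix so the check can be exercised against a temp tree.
guard_by_system_path() {
  local sysroot="${1:-}"
  [ -x "$sysroot/usr/bin/bw" ] || [ -x "$sysroot/usr/local/bin/bw" ]
}

demo_bw_guards() {
  local tmp
  tmp="$(mktemp -d)"
  mkdir -p "$tmp/home/.local/bin" "$tmp/sys/usr/bin"
  # Stand-in for a user-local bw; no system-wide copy is created.
  printf '#!/bin/sh\nexit 0\n' > "$tmp/home/.local/bin/bw"
  chmod +x "$tmp/home/.local/bin/bw"

  # The subshell keeps the PATH change from leaking into the caller.
  if ( PATH="$tmp/home/.local/bin:$PATH"; guard_by_path_lookup ); then
    echo "path lookup: bw found"
  else
    echo "path lookup: bw missing"
  fi
  if guard_by_system_path "$tmp/sys"; then
    echo "system path: bw found"
  else
    echo "system path: bw missing"
  fi
  rm -rf "$tmp"
}
```

Calling `demo_bw_guards` prints one line per guard. With only a user-local stand-in present, the `PATH` lookup reports found and the system-path check reports missing, which is the situation the removed comment described.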
diff --git a/scripts/workstation/skel/start-claude.sh b/scripts/workstation/skel/start-claude.sh
index 45ed9c4a..b3e25744 100755
--- a/scripts/workstation/skel/start-claude.sh
+++ b/scripts/workstation/skel/start-claude.sh
@@ -93,15 +93,6 @@ ensure_onboarding() {
 }
 ensure_onboarding
-# Load a per-user long-lived CLAUDE_CODE_OAUTH_TOKEN if claude-auth-sync has
-# materialized one from this user's own Vault path. A non-rotating setup-token
-# sidesteps the shared ~/.claude/.credentials.json OAuth refresh-token race that
-# logs out users running many concurrent agents (interactive + t3 + always-on).
-# Absent file -> no-op (normal per-user Enterprise-SSO flow). The user's OWN
-# token; never shared between OS users.
-_oauth_env="$HOME/.config/claude-auth-sync/claude-oauth.env"
-if [ -r "$_oauth_env" ]; then set -a; . "$_oauth_env"; set +a; fi
-
 # Deliberately not `exec` so we can branch on the exit code: clean quit ends the
 # pane (ttyd closes the terminal); a crash drops to a shell so the tmux session
 # isn't destroyed-and-recreated in a ttyd auto-reconnect loop.
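Note on the block removed from `start-claude.sh` above, as a hedged sketch: wrapping the `.` (source) call in `set -a` / `set +a` marks every assignment in the `KEY=VALUE` file for export, so child processes inherit it, and the `-r` test makes a missing file a no-op. The helper name `load_env_file` and the variable `DEMO_TOKEN` are hypothetical stand-ins; no real token or path from the repo is used.

```bash
# Illustrative only: load_env_file and DEMO_TOKEN are hypothetical names.

# Source a KEY=VALUE file with allexport enabled so each assignment it makes
# is exported; an absent or unreadable file is skipped silently.
load_env_file() {
  if [ -r "$1" ]; then
    set -a
    . "$1"
    set +a
  fi
}

demo_env_load() {
  local tmp
  tmp="$(mktemp)"
  printf '%s\n' 'DEMO_TOKEN=example-value' > "$tmp"
  load_env_file "$tmp"
  rm -f "$tmp"
  # A child shell sees the variable only because it was exported.
  sh -c 'printf "%s\n" "${DEMO_TOKEN:-unset}"'
}
```

Without the `set -a` wrapper the child shell in `demo_env_load` would print `unset`, since a plain assignment stays local to the sourcing shell.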
diff --git a/stacks/actualbudget/main.tf b/stacks/actualbudget/main.tf
index 13da68a8..33012033 100644
--- a/stacks/actualbudget/main.tf
+++ b/stacks/actualbudget/main.tf
@@ -5,9 +5,6 @@ variable "tls_secret_name" {
 variable "nfs_server" { type = string }
 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/affine/main.tf b/stacks/affine/main.tf
index 10a94ad7..bc63381c 100644
--- a/stacks/affine/main.tf
+++ b/stacks/affine/main.tf
@@ -5,9 +5,6 @@ variable "tls_secret_name" {
 variable "nfs_server" { type = string }
 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -45,9 +42,6 @@ data "kubernetes_secret" "eso_secrets" {
 # DB credentials from Vault database engine (rotated automatically)
 # Provides DATABASE_URL that auto-updates when password rotates
 resource "kubernetes_manifest" "db_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/authentik/Dockerfile b/stacks/authentik/Dockerfile
deleted file mode 100644
index e60c5319..00000000
--- a/stacks/authentik/Dockerfile
+++ /dev/null
@@ -1,46 +0,0 @@
-# SLOW-1a overlay over the official authentik server image.
-#
-# The login flow's identification stage renders each enabled source's UI login
-# button. Upstream authentik/stages/identification/stage.py does:
-# current_stage.sources.filter(enabled=True).order_by("name").select_subclasses()
-# The bare no-arg select_subclasses() (django-model-utils InheritanceManager)
-# LEFT-JOINs EVERY Source subtype table; on the cold-login hot path that is ~1.5s
-# (verified live on 2026.2.4: 1527ms vs 14ms). Passing only the subtypes that
-# actually render a UI login button — every concrete Source type that overrides
-# ui_login_button: oauth/saml/plex/telegram/kerberos, NOT the sync-only ldap/scim —
-# is ~100x faster and BYTE-IDENTICAL output (verified: concrete types + rendered
-# buttons match). django-model-utils accepts the lowercase subclass *accessor
-# names* as strings, so no new import is needed (no circular-import risk) — the
-# patch is a single, reviewable line edit.
-#
-# RE-VERIFY ON EVERY AUTHENTIK BUMP: bump the FROM tag below AND the image tag in
-# modules/authentik/values.yaml together. The grep guards fail the build LOUDLY if
-# the upstream target line moved. If a future authentik version adds a NEW
-# login-capable source type, add its lowercase accessor to the list below.
-# Upstream: the bare select_subclasses() is still present in main (no fix/PR as of
-# 2026-06-28) — drop this overlay once upstream narrows the query.
-FROM ghcr.io/goauthentik/server:2026.2.4
-
-USER root
-RUN set -eux; \
-    F=/authentik/stages/identification/stage.py; \
-    grep -q 'order_by("name").select_subclasses()' "$F"; \
-    sed -i 's/order_by("name")\.select_subclasses()/order_by("name").select_subclasses("oauthsource", "samlsource", "plexsource", "telegramsource", "kerberossource")/' "$F"; \
-    grep -q 'select_subclasses("oauthsource", "samlsource", "plexsource", "telegramsource", "kerberossource")' "$F"; \
-    PY="$(command -v python || command -v python3)"; "$PY" -c "import ast,sys; ast.parse(open('$F').read())"; \
-    rm -f /authentik/stages/identification/__pycache__/stage.*.pyc
-
-# PATCH #2 — old-browser BLANK LOGIN. authentik's modern flow SPA is ES2022 and
-# hard-fails (blank login) on Safari<=16.3 (e.g. iPadOS<=16.3). authentik already
-# ships a no-JS Simplified Flow Executor (SFE, ES5) but only serves it to
-# IE/old-Edge/PKeyAuth. patch-compat-sfe.py (a) extends compat_needs_sfe() to
-# serve the SFE to old Safari AND any iOS browser (Chrome/CriOS, Firefox/FxiOS —
-# all share the system WebKit) on iOS<=16.3, and (b) injects static social-login
-# links into the SFE shell (the SFE can't render Identification-stage sources;
-# needed for password-less Google-only accounts). Clients get the REAL authentik
-# login (password + MFA + reputation, NO auth downgrade) instead of a blank page.
-# The script is guarded (asserts both upstream anchors + ast-parses) so the build
-# fails loudly if upstream moves — re-verify on every authentik bump.
-COPY patch-compat-sfe.py /tmp/patch-compat-sfe.py
-RUN python3 /tmp/patch-compat-sfe.py && rm -f /tmp/patch-compat-sfe.py
-USER authentik
diff --git a/stacks/authentik/admin-services-restriction.tf b/stacks/authentik/admin-services-restriction.tf
index 293c78b5..806dd417 100644
--- a/stacks/authentik/admin-services-restriction.tf
+++ b/stacks/authentik/admin-services-restriction.tf
@@ -49,15 +49,14 @@ resource "authentik_policy_expression" "admin_services_restriction" {
     host = request.context.get("host", "")
-    # chrome-service noVNC (chrome.viktorbarzin.me) exposes LIVE logged-in browser
-    # sessions from the SHARED persistent profile. Originally Viktor-only.
-    # 2026-06-28 (Viktor's explicit decision): emo SHARES Viktor's browser, so emo
-    # (emil.barzin / emil.barzin@gmail.com) is allowed in for noVNC form-filling +
-    # captcha solving. Trade-off accepted: emo can therefore reach Viktor's warmed
-    # sessions (the CLI half is the emo-browser ServiceAccount in
-    # stacks/chrome-service/rbac.tf). akadmin kept as break-glass. Match username OR
-    # email so neither attribute alone can lock anyone out.
-    CHROME_ALLOWED = {"akadmin", "akadmin@viktorbarzin.me", "vbarzin@gmail.com", "emil.barzin", "emil.barzin@gmail.com"}
+    # chrome-service noVNC (chrome.viktorbarzin.me) exposes Viktor's LIVE
+    # logged-in browser sessions, so lock it to Viktor's own accounts ONLY.
+    # "Home Server Admins" is NOT sufficient — emo (emil.barzin@gmail.com) is a
+    # member. akadmin kept as break-glass. The homelab-browser CDP path is
+    # already RBAC-gated (emo = oidc-power-user-readonly, no pods/portforward),
+    # so this closes the only remaining, human, noVNC path. Match username OR
+    # email so neither attribute alone can lock Viktor out.
+    CHROME_ALLOWED = {"akadmin", "akadmin@viktorbarzin.me", "vbarzin@gmail.com"}
     if host == "chrome.viktorbarzin.me":
         return request.user.username in CHROME_ALLOWED or request.user.email in CHROME_ALLOWED
diff --git a/stacks/authentik/email-secret.tf b/stacks/authentik/email-secret.tf
index 87be65d4..b3a7f201 100644
--- a/stacks/authentik/email-secret.tf
+++ b/stacks/authentik/email-secret.tf
@@ -6,9 +6,6 @@
 # are non-secret and live in values.yaml. The reloader annotation rolls the
 # authentik pods if the password ever changes.
 resource "kubernetes_manifest" "authentik_email_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/authentik/modules/authentik/main.tf b/stacks/authentik/modules/authentik/main.tf
index 5c688452..3ae6d7c6 100644
--- a/stacks/authentik/modules/authentik/main.tf
+++ b/stacks/authentik/modules/authentik/main.tf
@@ -29,12 +29,7 @@ resource "kubernetes_namespace" "authentik" {
     labels = {
       tier = var.tier
       "resource-governance/custom-quota" = "true"
-      # Keel intentionally NOT enrolled: server+worker run our custom overlay image
-      # (ghcr.io/viktorbarzin/authentik-server — see values.yaml global.image +
-      # stacks/authentik/Dockerfile). The tag is pinned explicitly and bumped
-      # manually (rebuild the overlay FROM the new authentik version + repoint), so
-      # a Keel auto-bump would only risk re-introducing the upstream tag / the
-      # 2026-06-10 downgrade-boot-storm class. Re-enroll only if the overlay is dropped.
+      "keel.sh/enrolled" = "true"
     }
   }
   lifecycle {
@@ -87,11 +82,6 @@ module "ingress" {
   service_name = "goauthentik-server"
   tls_secret_name = var.tls_secret_name
   anti_ai_scraping = false
-  # Swap the shared 10/50 default limiter for a dedicated 100/1000 carve-out:
-  # the login SPA + flow-executor API burst on a cold load otherwise 429s into
-  # a blank screen (see traefik middleware "authentik-rate-limit").
-  skip_default_rate_limit = true
-  extra_middlewares = ["traefik-authentik-rate-limit@kubernetescrd"]
   extra_annotations = {
     "gethomepage.dev/enabled" = "true"
     "gethomepage.dev/name" = "Authentik"
@@ -150,21 +140,14 @@ module "ingress-static" {
   # Same-host path carve-out of the public authentik UI ingress above, only
   # adding the cache-headers middleware for the static asset prefix.
   # auth = "none": versioned static assets of the (already public) Authentik login UI.
-  auth = "none"
-  namespace = kubernetes_namespace.authentik.metadata[0].name
-  name = "authentik-static"
-  host = "authentik"
-  service_name = "goauthentik-server"
-  ingress_path = ["/static"]
-  tls_secret_name = var.tls_secret_name
-  anti_ai_scraping = false
-  homepage_enabled = false
-  # /static serves ALL the SPA JS/CSS chunks; the default 10/50 limiter 429s the
-  # cold-load fan-out → blank screen. Dedicated 100/1000 carve-out (note the two
-  # namespaces: cache-headers is in ns authentik, rate-limit is in ns traefik).
-  skip_default_rate_limit = true
-  extra_middlewares = [
-    "authentik-static-cache-headers@kubernetescrd",
-    "traefik-authentik-rate-limit@kubernetescrd",
-  ]
+  auth = "none"
+  namespace = kubernetes_namespace.authentik.metadata[0].name
+  name = "authentik-static"
+  host = "authentik"
+  service_name = "goauthentik-server"
+  ingress_path = ["/static"]
+  tls_secret_name = var.tls_secret_name
+  anti_ai_scraping = false
+  homepage_enabled = false
+  extra_middlewares = ["authentik-static-cache-headers@kubernetescrd"]
 }
diff --git a/stacks/authentik/modules/authentik/values.yaml b/stacks/authentik/modules/authentik/values.yaml
index f4b1b3f2..bfe755cd 100644
--- a/stacks/authentik/modules/authentik/values.yaml
+++ b/stacks/authentik/modules/authentik/values.yaml
@@ -39,16 +39,6 @@ server:
       value: "3"
     - name: AUTHENTIK_WEB__THREADS
       value: "4"
-    # Gunicorn worker recycle hardening (defaults max_requests=1000/jitter=50).
-    # A worker recycle that coincides with a transient PG/pgbouncer blip stalls
-    # in-flight requests (sessions+cache are on PostgreSQL since Redis was removed
-    # in 2026.2), and with 9 workers recycling on a tight 50-jitter window the
-    # recycles cluster — feeding the episodic all-pods-NotReady 502/504 cascade.
-    # 10x rarer recycles + 20x wider jitter (1000) decorrelate them from DB blips.
-    - name: AUTHENTIK_WEB__MAX_REQUESTS
-      value: "10000"
-    - name: AUTHENTIK_WEB__MAX_REQUESTS_JITTER
-      value: "1000"
     # Cache flow plans for 30m and policy evaluations for 15m (defaults 300s).
     # Authentik 2026.2 stores cache in Postgres, so a TTL hit is still a
     # SELECT — but a single indexed lookup beats re-planning the flow
@@ -97,28 +87,11 @@ server:
   livenessProbe:
     failureThreshold: 6
     timeoutSeconds: 5
-  # Readiness widened from the chart default (3x10s/3s ~= 30s) to ~80s. The
-  # readiness probe (/-/health/ready/) queries the DB, so a sub-~60s PG/pgbouncer
-  # transient otherwise returns 503 and drops ALL 3 server pods from the Service
-  # at once -> Traefik has no healthy backend -> 502/504 (the episodic blank
-  # screen + 30s hang). 80s absorbs a full CNPG failover reconnect; liveness
-  # still reaps a truly hung pod. Partial override — the chart deep-merges the
-  # httpGet path /-/health/ready/ (same as the livenessProbe override above).
-  readinessProbe:
-    failureThreshold: 8
-    periodSeconds: 10
-    timeoutSeconds: 5
-  # RollingUpdate strategy. The chart key is `deploymentStrategy`, NOT `strategy`
-  # (authentik.server reads .Values.server.deploymentStrategy) — the old
-  # `strategy:` key was silently ignored, so live ran the chart default 25%/25%
-  # and every rolling event dropped a server pod out of rotation, amplifying the
-  # NotReady cascade. maxSurge:1 + maxUnavailable:0 keeps all 3 ready throughout
-  # a roll (PDB minAvailable:2 + ResourceQuota headroom allow the transient pod).
-  deploymentStrategy:
+  strategy:
     type: RollingUpdate
     rollingUpdate:
-      maxSurge: 1
-      maxUnavailable: 0
+      maxSurge: 0
+      maxUnavailable: 1
   resources:
     requests:
       cpu: 100m
@@ -145,23 +118,15 @@ server:
 global:
   addPrometheusAnnotations: true
   image:
-    # CUSTOM OVERLAY: two thin patches over the official authentik server image
-    # (see stacks/authentik/Dockerfile): (1) SLOW-1a — narrows the login-flow
-    # select_subclasses() query, ~1.4s -> ~14ms; (2) serve authentik's no-JS SFE
-    # login to old Safari/WebKit AND any iOS browser (Chrome/Firefox = WebKit) on
-    # iOS<=16.3 so old devices (e.g. iPadOS<=15) get a working login instead of a
-    # blank page, and injects social-login links into the SFE (it can't render
-    # sources; needed for password-less Google-only accounts). Built by
-    # .github/workflows/build-authentik.yml to ghcr.io/viktorbarzin/authentik-server
-    # (public package, anonymous pull — no imagePullSecret needed, like the
-    # upstream goauthentik image). Keel is NO LONGER enrolled for this namespace
-    # (see main.tf) so it can't bump/downgrade the tag; helm also defaults the tag
-    # to the chart appVersion (2026.2.2) — so BOTH repository AND tag are pinned
-    # explicitly here to prevent the 2026-06-10 downgrade-boot-storm class.
-    # UPGRADE = bump the Dockerfile FROM tag + this tag together (e.g. ->
-    # 2026.3.0-patch1), let GHA rebuild, then apply.
-    repository: ghcr.io/viktorbarzin/authentik-server
-    tag: "2026.2.4-patch3"
+    # Pin to the Keel-managed live tag. Keel (diun-annotated, keel.sh/enrolled
+    # namespace) bumps the IMAGE between chart releases, while helm defaults
+    # the tag to the chart appVersion — so any helm upgrade silently
+    # DOWNGRADES the running pods to the chart pin (2026-06-10: a values-only
+    # apply rolled live 2026.2.4 back to 2026.2.2 against a 2026.2.4-migrated
+    # DB → boot storm, see docs/post-mortems/2026-06-10-authentik-downgrade-
+    # boot-storm.md). Keep this tag in sync with what Keel has deployed when
+    # touching this chart; clear it only when bumping the chart version itself.
+    tag: "2026.2.4"
 worker:
   # 2 replicas: workers handle background tasks (LDAP sync, email,
@@ -201,10 +166,7 @@ worker:
         secretKeyRef:
           name: authentik-email
           key: AUTHENTIK_EMAIL__PASSWORD
-  # Chart key is `deploymentStrategy`, not `strategy` (see server above). Workers
-  # serve no user traffic, so maxSurge:0/maxUnavailable:1 is fine — this is just
-  # the dead-key cleanup so the declared intent actually takes effect.
- deploymentStrategy: + strategy: type: RollingUpdate rollingUpdate: maxSurge: 0 diff --git a/stacks/authentik/patch-compat-sfe.py b/stacks/authentik/patch-compat-sfe.py deleted file mode 100644 index 014603b7..00000000 --- a/stacks/authentik/patch-compat-sfe.py +++ /dev/null @@ -1,96 +0,0 @@ -#!/usr/bin/env python3 -"""Overlay patch — make authentik usable on OLD browsers (no modern-JS SPA). - -authentik's modern flow SPA is ES2022 (static{} init blocks) that hard-fail on -Safari/WebKit <= 16.3 (e.g. iPadOS <= 16.3) and render a COMPLETELY BLANK login. -authentik ships a no-JS Simplified Flow Executor (SFE, ES5) but only serves it to -IE / old-Edge / PKeyAuth, and the SFE itself canNOT render Identification-stage -sources (social-login buttons) — authentik docs list "Sources" as unsupported. - -This patch does TWO things, both guarded (assert the upstream anchor + verify the -result) so the image build fails LOUDLY if upstream moves. RE-VERIFY on every -authentik upgrade. - - 1. flows/views/interface.py::compat_needs_sfe() -> also return True for old - Safari/WebKit: (a) Safari/Mobile Safari Version <= 16.3 (covers desktop-mode - iPadOS which reports as Mac Safari), and (b) ANY iOS browser (Chrome/CriOS, - Firefox/FxiOS, Edge — all share the system WebKit) on iOS <= 16.3. So old - iPads get the SFE on EVERY browser, not just Safari. - - 2. flows/templates/if/flow-sfe.html -> inject static social-login links - (plain redirects to /source/oauth/login//, work on ANY browser) so SFE - users (who otherwise see only username/password) can use social login — - required for accounts with no password (e.g. Google-only users like emo). 
-""" -import ast -import glob -import os - -# --- Patch 1: compat_needs_sfe() UA gate ------------------------------------- -INTERFACE = "/authentik/flows/views/interface.py" -ANCHOR = ( - ' if "PKeyAuth" in ua["string"]:\n' - " return True\n" - " return False" -) -REPLACEMENT = ( - ' if "PKeyAuth" in ua["string"]:\n' - " return True\n" - " # OVERLAY: old WebKit can't parse the modern ES2022 flow SPA (blank\n" - " # login) -> serve the SFE (real authentik login). (a) desktop-mode\n" - " # Safari/iPadOS reports as Mac Safari with Version<=16.3:\n" - ' if ua["user_agent"]["family"] in ("Safari", "Mobile Safari"):\n' - " try:\n" - ' _maj = int(ua["user_agent"]["major"] or 0)\n' - ' _min = int(ua["user_agent"]["minor"] or 0)\n' - " except (TypeError, ValueError):\n" - " _maj = _min = 0\n" - " if _maj and (_maj < 16 or (_maj == 16 and _min <= 3)):\n" - " return True\n" - " # (b) ANY iOS browser (Chrome/CriOS, Firefox/FxiOS, Edge) shares the\n" - " # system WebKit, so iOS<=16.3 fails regardless of the browser family:\n" - ' if ua["os"]["family"] == "iOS":\n' - " try:\n" - ' _omaj = int(ua["os"]["major"] or 0)\n' - ' _omin = int(ua["os"]["minor"] or 0)\n' - " except (TypeError, ValueError):\n" - " _omaj = _omin = 0\n" - " if _omaj and (_omaj < 16 or (_omaj == 16 and _omin <= 3)):\n" - " return True\n" - " return False" -) -src = open(INTERFACE).read() -assert "def compat_needs_sfe" in src, "compat_needs_sfe() not found — upstream changed" -assert src.count(ANCHOR) == 1, f"anchor not found exactly once in {INTERFACE}" -src = src.replace(ANCHOR, REPLACEMENT) -open(INTERFACE, "w").write(src) -ast.parse(src) -assert 'ua["os"]["family"] == "iOS"' in open(INTERFACE).read() -for pyc in glob.glob("/authentik/flows/views/__pycache__/interface.*.pyc"): - os.remove(pyc) - -# --- Patch 2: social-login links on the SFE shell ---------------------------- -SFE_HTML = "/authentik/flows/templates/if/flow-sfe.html" -HTML_ANCHOR = ( - " \n" - " {% trans 'Powered by authentik' %}" -) 
-HTML_REPLACEMENT = ( - " \n" - " \n" - ' \n" - " {% trans 'Powered by authentik' %}" -) -html = open(SFE_HTML).read() -assert html.count(HTML_ANCHOR) == 1, f"SFE html anchor not found exactly once in {SFE_HTML}" -html = html.replace(HTML_ANCHOR, HTML_REPLACEMENT) -open(SFE_HTML, "w").write(html) -assert "Continue with Google" in open(SFE_HTML).read() - -print("patch-compat-sfe: SFE for old Safari + all iOS<=16.3; social-login links added to SFE") diff --git a/stacks/beads-server/main.tf b/stacks/beads-server/main.tf index eebed876..5b71373e 100644 --- a/stacks/beads-server/main.tf +++ b/stacks/beads-server/main.tf @@ -601,9 +601,6 @@ resource "kubernetes_config_map" "beadboard_config" { # Pulls the claude-agent-service bearer token from Vault so BeadBoard can # dispatch agent jobs via the in-cluster HTTP API. resource "kubernetes_manifest" "beadboard_agent_service_secret" { - field_manager { - force_conflicts = true - } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/broker-sync/main.tf b/stacks/broker-sync/main.tf index 76d822d8..2de168a1 100644 --- a/stacks/broker-sync/main.tf +++ b/stacks/broker-sync/main.tf @@ -28,9 +28,6 @@ resource "kubernetes_namespace" "broker_sync" { # trading212_api_keys — JSON array of {account_id, account_type, api_key, name, currency} # imap_host, imap_user, imap_password, imap_directory — for InvestEngine + Schwab email ingest resource "kubernetes_manifest" "external_secret" { - field_manager { - force_conflicts = true - } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/calico/main.tf b/stacks/calico/main.tf index 956534fb..39550024 100644 --- a/stacks/calico/main.tf +++ b/stacks/calico/main.tf @@ -22,7 +22,7 @@ resource "kubernetes_namespace" "calico_system" { name = "calico-system" labels = { name = "calico-system" - # calico-system namespace is managed by tigera-operator — auto-update is +# calico-system namespace is managed by 
tigera-operator — auto-update is # incompatible (operator reverts DaemonSet image from its Installation CR). # "keel.sh/enrolled" = "true" } @@ -161,8 +161,8 @@ resource "helm_release" "tigera_operator" { # render before their crds/ (which helm skips on upgrade) -> "ensure CRDs # are installed first". We instead enable them via the operator CRs applied # directly below (kubectl_manifest) now that the CRDs exist — see ADR-0014. - goldmane = { enabled = false } - whisker = { enabled = false } + goldmane = { enabled = false } + whisker = { enabled = false } # 512Mi (was 256Mi): the operator idles at ~38Mi but its STARTUP spike # (re-listing resources to build informer caches) exceeded 256Mi and # OOM-crashlooped on 2026-06-23 the first time the pod restarted (a latent @@ -212,229 +212,3 @@ resource "kubectl_manifest" "whisker" { spec = { notifications = "Disabled" } }) } - -# --------------------------------------------------------------------------- -# Gated public ingress for the Whisker UI (infra #57 / ADR-0014). -# -# whisker.viktorbarzin.me -> whisker:8081, Authentik-gated (auth="required": -# Whisker ships NO own login — it's an admin observability UI, so Authentik -# forward-auth is the only gate between strangers and the flow view). The -# operator replicated `tls-secret` into calico-system already. -# -# TWO coupled pieces are required because the operator's own `whisker` -# NetworkPolicy (owned by the Whisker CR above) sets policyTypes:[Ingress] -# with NO ingress rules => default-deny on ingress to the whisker pod. The -# additive NP below ORs in a Traefik allow (k8s NetworkPolicies are additive -# across policies selecting the same pod), so we never edit the operator NP. 
-module "ingress_whisker" { - source = "../../modules/kubernetes/ingress_factory" - dns_type = "proxied" - namespace = "calico-system" - name = "whisker" - service_name = "whisker" - port = 8081 - auth = "required" - tls_secret_name = "tls-secret" - extra_annotations = { - "gethomepage.dev/enabled" = "true" - "gethomepage.dev/name" = "Whisker" - "gethomepage.dev/description" = "Calico flow observability (who-talks-to-whom)" - "gethomepage.dev/icon" = "calico.png" - "gethomepage.dev/group" = "Infrastructure" - } -} - -# Additive NetworkPolicy: permit Traefik -> whisker:8081. ORs with the -# operator's default-deny `whisker` NP (selecting the same pod) so Traefik -# can reach the UI without touching the operator-owned policy. -resource "kubernetes_network_policy_v1" "whisker_allow_traefik" { - metadata { - name = "whisker-allow-traefik" - namespace = "calico-system" - } - spec { - pod_selector { - match_labels = { - "app.kubernetes.io/name" = "whisker" - } - } - policy_types = ["Ingress"] - ingress { - from { - namespace_selector { - match_labels = { - "kubernetes.io/metadata.name" = "traefik" - } - } - } - ports { - port = "8081" - protocol = "TCP" - } - } - } -} - -# Additive egress NetworkPolicy: permit whisker -> the kube-dns ClusterIP for DNS. -# -# ROOT CAUSE of the 2026-06-28 "Whisker UI empty" incident: the operator's own -# `whisker` NetworkPolicy is policyTypes:[Ingress,Egress] and its egress allows -# DNS only to the kube-dns *pods* (podSelector k8s-app=kube-dns). But -# whisker-backend resolves `goldmane...svc` via the kube-dns *ClusterIP* -# (10.96.0.10), and Calico drops UDP DNS to a ClusterIP under a podSelector-only -# egress rule (verified: from whisker's netns, ClusterIP DNS = 100% timeout -# while direct kube-dns pod-IP DNS = OK; a pod with no egress policy resolves -# fine). 
whisker-backend resolves once in the brief startup window before the -# policy programs, establishes its long-lived gRPC stream, and only re-resolves -# when that stream breaks — at which point the blocked ClusterIP DNS wedges its -# Go resolver and the UI goes empty (the durable aggregator, in its own -# unrestricted namespace, is unaffected). k8s egress policies are additive, so -# this ORs in an allow for the ClusterIP; the operator NP is left untouched. -# (Empirically: adding this ipBlock rule flips ClusterIP DNS from 100% fail to -# 100% ok.) See docs/runbooks/goldmane-flow-trail.md. -resource "kubernetes_network_policy_v1" "whisker_allow_dns_clusterip" { - metadata { - name = "whisker-allow-dns-clusterip" - namespace = "calico-system" - } - spec { - pod_selector { - match_labels = { - "app.kubernetes.io/name" = "whisker" - } - } - policy_types = ["Egress"] - egress { - # 10.96.0.10 is the kube-dns ClusterIP (cluster invariant — service CIDR - # 10.96.0.0/12, DNS always .10; the same IP CoreDNS/Technitium configs pin). - to { - ip_block { - cidr = "10.96.0.10/32" - } - } - ports { - port = "53" - protocol = "UDP" - } - ports { - port = "53" - protocol = "TCP" - } - } - } -} - -# --------------------------------------------------------------------------- -# Whisker self-heal watchdog (ADR-0014; added 2026-06-28 after a live incident). -# -# BACKSTOP. The REAL fix is kubernetes_network_policy_v1.whisker_allow_dns_clusterip -# above (it unblocks the root-cause ClusterIP DNS). This watchdog stays as -# defense-in-depth: whisker-backend has NO operator liveness probe, so if its -# long-lived goldmane gRPC stream ever wedges for any OTHER reason (the Go -# resolver spams `failed to stream flows` / `code = Unavailable` and never -# reconnects -> empty UI, while the durable aggregator in its own namespace is -# unaffected), nothing else would restart it. 
Whisker is operator-managed -# (Whisker CR) so we can't inject a probe; this is the supported-pattern -# alternative. With the DNS fix in place it should rarely, if ever, fire. -# -# It restarts the pod ONLY when the wedged signature is present AND Goldmane is -# Ready (so a real Goldmane outage doesn't cause restart-thrash). A fresh pod -# reconnects cleanly. See docs/runbooks/goldmane-flow-trail.md. -resource "kubernetes_service_account" "whisker_watchdog" { - metadata { - name = "whisker-watchdog" - namespace = kubernetes_namespace.calico_system.metadata[0].name - } -} - -# Namespaced Role (least privilege — only calico-system): read pod logs to -# detect the wedge, delete the whisker pod to heal it. -resource "kubernetes_role" "whisker_watchdog" { - metadata { - name = "whisker-watchdog" - namespace = kubernetes_namespace.calico_system.metadata[0].name - } - rule { - api_groups = [""] - resources = ["pods"] - verbs = ["get", "list", "delete"] - } - rule { - api_groups = [""] - resources = ["pods/log"] - verbs = ["get"] - } -} - -resource "kubernetes_role_binding" "whisker_watchdog" { - metadata { - name = "whisker-watchdog" - namespace = kubernetes_namespace.calico_system.metadata[0].name - } - role_ref { - api_group = "rbac.authorization.k8s.io" - kind = "Role" - name = kubernetes_role.whisker_watchdog.metadata[0].name - } - subject { - kind = "ServiceAccount" - name = kubernetes_service_account.whisker_watchdog.metadata[0].name - namespace = kubernetes_namespace.calico_system.metadata[0].name - } -} - -resource "kubernetes_cron_job_v1" "whisker_watchdog" { - metadata { - name = "whisker-watchdog" - namespace = kubernetes_namespace.calico_system.metadata[0].name - } - spec { - schedule = "*/10 * * * *" - successful_jobs_history_limit = 1 - failed_jobs_history_limit = 1 - concurrency_policy = "Forbid" - job_template { - metadata { - name = "whisker-watchdog" - } - spec { - template { - metadata { - name = "whisker-watchdog" - } - spec { - service_account_name = 
kubernetes_service_account.whisker_watchdog.metadata[0].name - container { - name = "watchdog" - image = "bitnami/kubectl:latest" - command = ["/bin/sh", "-c", <<-EOT - set -eu - NS=calico-system - # Don't thrash if Goldmane itself is down — that's not a whisker bug. - if ! kubectl -n "$NS" get pod -l k8s-app=goldmane \ - -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}' 2>/dev/null | grep -q True; then - echo "goldmane not Ready — skipping (not a whisker problem)"; exit 0 - fi - ERRS=$(kubectl -n "$NS" logs -l k8s-app=whisker -c whisker-backend --since=11m --tail=500 2>/dev/null \ - | grep -cE 'failed to stream flows|failed to list filter hints|code = Unavailable|i/o timeout' || true) - ERRS=$${ERRS:-0} - if [ "$ERRS" -ge 10 ]; then - echo "whisker-backend WEDGED: $ERRS goldmane-connection errors in 11m — restarting whisker pod" - kubectl -n "$NS" delete pod -l k8s-app=whisker --ignore-not-found - else - echo "whisker-backend healthy: $ERRS goldmane-connection errors in 11m" - fi - EOT - ] - } - restart_policy = "Never" - } - } - } - } - } - lifecycle { - # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 - ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] - } -} diff --git a/stacks/changedetection/main.tf b/stacks/changedetection/main.tf index 319ebcf1..ee203e7b 100644 --- a/stacks/changedetection/main.tf +++ b/stacks/changedetection/main.tf @@ -19,9 +19,6 @@ resource "kubernetes_namespace" "changedetection" { } resource "kubernetes_manifest" "external_secret" { - field_manager { - force_conflicts = true - } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/chrome-service/files/novnc/entrypoint.sh b/stacks/chrome-service/files/novnc/entrypoint.sh index aeff9408..fae5c641 100644 --- a/stacks/chrome-service/files/novnc/entrypoint.sh +++ b/stacks/chrome-service/files/novnc/entrypoint.sh @@ -19,14 +19,14 @@ for i in 1 2 3 4 5 6 7 8 9 10 11 
12 13 14 15; do sleep 2 done -# Both x11vnc and websockify run as supervised children of this entrypoint (PID -# 1) so their logs land on container stdout and the `wait -n` at the end can catch -# either one dying. `-noshm` skips MIT-SHM probes that fail across container -# boundaries (each container has its own /dev/shm); `-noxdamage` skips XDAMAGE -# which Xvfb doesn't expose; `-quiet` keeps the polling chatter out of pod logs. +# websockify runs as PID 1; x11vnc is a child so its logs land on container stdout +# `-noshm` skips MIT-SHM probes that fail across container boundaries (each +# container has its own /dev/shm); `-noxdamage` skips XDAMAGE which Xvfb +# doesn't expose; `-quiet` keeps the polling chatter out of pod logs. echo "starting x11vnc -> :5900" x11vnc -display localhost:99 -nopw -listen 0.0.0.0 -rfbport 5900 \ -forever -shared -noshm -noxdamage -quiet 2>&1 & +X11VNC_PID=$! for i in 1 2 3 4 5 6 7 8 9 10; do if echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then @@ -43,18 +43,4 @@ if ! echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then fi echo "starting websockify -> :6080" -# Run websockify in the background (it was `exec`ed before) so BOTH it and x11vnc -# are supervised. x11vnc attaches to the chrome-service container's Xvfb over -# localhost:6099 (shared pod network); when that container restarts, x11vnc loses -# its X connection and exits. Previously websockify was PID 1 and x11vnc was an -# unsupervised child, so a dead x11vnc was never relaunched: :5900 stayed dead and -# the noVNC view went black until a manual pod restart. Now if EITHER process -# exits, `wait -n` returns and we exit non-zero so the kubelet restarts this -# container, which re-waits for Xvfb and relaunches x11vnc — the bridge self-heals -# across browser-container restarts. (Same supervision pattern as the -# android-emulator stack's entrypoint.) 
-websockify --web=/usr/share/novnc 6080 localhost:5900 & - -wait -n || true -echo "novnc: a supervised process (x11vnc or websockify) exited; exiting so the kubelet restarts this container." >&2 -exit 1 +exec websockify --web=/usr/share/novnc 6080 localhost:5900 diff --git a/stacks/chrome-service/main.tf b/stacks/chrome-service/main.tf index 82e8fe45..2f679c00 100644 --- a/stacks/chrome-service/main.tf +++ b/stacks/chrome-service/main.tf @@ -41,9 +41,6 @@ resource "kubernetes_namespace" "chrome_service" { # --- Secrets (single-key extract: api_bearer_token) --- resource "kubernetes_manifest" "external_secret" { - field_manager { - force_conflicts = true - } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -333,23 +330,15 @@ resource "kubernetes_deployment" "chrome_service" { container { name = "novnc" # Phase 3 cutover 2026-05-07 — Forgejo registry consolidation. - # SHA-pinned (not :latest): Keel is OFF for this deployment - # (keel.sh/policy=never, below) and :latest/IfNotPresent won't re-pull a - # rebuilt image, so a new noVNC entrypoint only deploys when this digest - # is bumped here. Bump after build-chrome-service-novnc.yml pushes a new - # SHA tag — then WAIT for that apply pipeline to finish before pushing - # anything else: Woodpecker cancel-previous SIGKILLs an in-flight apply - # mid-run (memory id=1957), which is exactly how the 2026-06-27 apply got - # killed. 2026-06-27: bumped to land the x11vnc-supervision self-heal fix - # (noVNC went black after a browser-container restart; see - # docs/architecture/chrome-service.md "x11vnc supervision"). - image = "ghcr.io/viktorbarzin/chrome-service-novnc:19d0f0933a8ec75be6cfa077db88e0f8c3760f40" + image = "ghcr.io/viktorbarzin/chrome-service-novnc:latest" image_pull_policy = "IfNotPresent" # Cap RLIMIT_NOFILE before the entrypoint runs. 
Containerd grants pods # nofile=2^31; x11vnc sweeps the whole fd table on each client connect, # so every VNC connection hangs on "Connecting" until it times out - # (fd-sweep bug, same as android-emulator). entrypoint.sh also sets this; - # the wrapper keeps the cap deterministic even off a cached image. + # (fd-sweep bug, same as android-emulator). entrypoint.sh now also sets + # this, but the image is :latest/IfNotPresent so a rebuilt entrypoint + # isn't guaranteed to be pulled — this wrapper applies the cap + # deterministically on every rollout off the cached image. command = ["bash", "-c", "ulimit -n 65536; exec /entrypoint.sh"] port { name = "http" @@ -359,13 +348,9 @@ resource "kubernetes_deployment" "chrome_service" { # x11vnc connects to the chrome-service container's Xvfb over # localhost TCP (shared pod network). Same uid 1000 as chrome # container so we can read MIT-MAGIC-COOKIE if Xvfb adds one. - # 256Mi (was 96Mi): the 96Mi cap OOMKilled (exit 137) the sidecar under - # ACTIVE VNC use — x11vnc + websockify framebuffer/encode buffers spike - # well past idle (~37Mi) when a client streams the 1280x720 screen, so the - # noVNC view froze/hung on connect. Bumped 2026-06-28. resources { - requests = { cpu = "10m", memory = "64Mi" } - limits = { memory = "256Mi" } + requests = { cpu = "10m", memory = "32Mi" } + limits = { memory = "96Mi" } } } diff --git a/stacks/chrome-service/rbac.tf b/stacks/chrome-service/rbac.tf deleted file mode 100644 index f0043f1a..00000000 --- a/stacks/chrome-service/rbac.tf +++ /dev/null @@ -1,95 +0,0 @@ -# emo's hands-off "homelab browser" credential + chrome-service port-forward RBAC. -# -# Access decision (2026-06-28, Viktor's explicit call): emo SHARES Viktor's single -# chrome-service browser rather than getting an isolated instance. 
The noVNC half of -# that grant is the Authentik allowlist in -# stacks/authentik/admin-services-restriction.tf (CHROME_ALLOWED); THIS file is the -# CLI half — it lets emo's `homelab browser` reach the headed Chrome over CDP. -# -# `homelab browser` shells out to `kubectl port-forward -n chrome-service svc/chrome-service` -# (cli/browser.go). emo's normal kubeconfig is interactive-OIDC-only (kubelogin) and -# can't authenticate a headless agent session, and his power-user tier has no -# pods/portforward. So we mint a dedicated ServiceAccount with a long-lived token -# (the dashboard-sa.tf pattern) that the devvm provisioner installs as emo's DEFAULT -# kubeconfig context (scripts/t3-provision-users.sh install_browser_kubeconfig); his -# personal OIDC login stays available as the `oidc@homelab` named context. -# -# TRADE-OFF (accepted): CDP access == full control of the shared browser, including -# the persistent profile (browser.contexts[0]) where Viktor's warmed logins live. -# CDP has no per-context auth, so this SA can reach Viktor's sessions. That is inherent -# to sharing one browser (the isolated per-user instance was declined). -# See docs/architecture/chrome-service.md "Multi-user access". - -resource "kubernetes_service_account" "emo_browser" { - metadata { - name = "emo-browser" - namespace = kubernetes_namespace.chrome_service.metadata[0].name - } -} - -# Long-lived (non-expiring) token for the SA — the devvm provisioner reads this and -# writes it into emo's kubeconfig. Same pattern as stacks/rbac/.../dashboard-sa.tf. 
-resource "kubernetes_secret" "emo_browser_token" { - metadata { - name = "emo-browser-token" - namespace = kubernetes_namespace.chrome_service.metadata[0].name - annotations = { - "kubernetes.io/service-account.name" = kubernetes_service_account.emo_browser.metadata[0].name - } - } - type = "kubernetes.io/service-account-token" - wait_for_service_account_token = true -} - -# The ONLY verb emo's SA lacks for `kubectl port-forward svc/chrome-service`: the -# port-forward subresource. (get/list of pods + services + endpoints comes from the -# cluster-read binding below.) Namespace-scoped to chrome-service. -resource "kubernetes_role" "browser_portforward" { - metadata { - name = "chrome-service-portforward" - namespace = kubernetes_namespace.chrome_service.metadata[0].name - } - rule { - api_groups = [""] - resources = ["pods/portforward"] - verbs = ["create"] - } -} - -resource "kubernetes_role_binding" "emo_browser_portforward" { - metadata { - name = "emo-browser-portforward" - namespace = kubernetes_namespace.chrome_service.metadata[0].name - } - role_ref { - api_group = "rbac.authorization.k8s.io" - kind = "Role" - name = kubernetes_role.browser_portforward.metadata[0].name - } - subject { - kind = "ServiceAccount" - name = kubernetes_service_account.emo_browser.metadata[0].name - namespace = kubernetes_namespace.chrome_service.metadata[0].name - } -} - -# Cluster-wide read-only (NO secrets), mirroring emo's power-user OIDC access, bound -# to the SA. Needed because the SA becomes emo's DEFAULT kubectl context, so without -# this his everyday `kubectl get ...` would regress — AND port-forward itself needs -# get/list on services + pods + endpoints (all covered by oidc-power-user-readonly). -# That ClusterRole is defined in stacks/rbac (modules/rbac/main.tf); referenced by name. 
-resource "kubernetes_cluster_role_binding" "emo_browser_readonly" { - metadata { - name = "emo-browser-readonly" - } - role_ref { - api_group = "rbac.authorization.k8s.io" - kind = "ClusterRole" - name = "oidc-power-user-readonly" - } - subject { - kind = "ServiceAccount" - name = kubernetes_service_account.emo_browser.metadata[0].name - namespace = kubernetes_namespace.chrome_service.metadata[0].name - } -} diff --git a/stacks/ci-pipeline-health/main.tf b/stacks/ci-pipeline-health/main.tf index 44aacbec..17378f84 100644 --- a/stacks/ci-pipeline-health/main.tf +++ b/stacks/ci-pipeline-health/main.tf @@ -49,9 +49,6 @@ resource "kubernetes_namespace" "ci_pipeline_health" { # billing on PRIVATE mirrors, which a future scoped read:packages rotation of # the alias could not do. Blast radius = this single-CronJob namespace. resource "kubernetes_manifest" "external_secret" { - field_manager { - force_conflicts = true - } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/claude-agent-service/main.tf b/stacks/claude-agent-service/main.tf index a039f699..9f8b6478 100644 --- a/stacks/claude-agent-service/main.tf +++ b/stacks/claude-agent-service/main.tf @@ -38,9 +38,6 @@ resource "kubernetes_namespace" "claude_agent" { # --- Secrets --- resource "kubernetes_manifest" "external_secret" { - field_manager { - force_conflicts = true - } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/claude-breakglass/main.tf b/stacks/claude-breakglass/main.tf index ca700945..6b996b9e 100644 --- a/stacks/claude-breakglass/main.tf +++ b/stacks/claude-breakglass/main.tf @@ -57,9 +57,6 @@ resource "kubernetes_service_account" "breakglass" { # DENIED this path (see stacks/vault/main.tf) so the shared, prompt-injectable # pod can never read it. 
resource "kubernetes_manifest" "external_secret_ssh" { - field_manager { - force_conflicts = true - } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -85,9 +82,6 @@ resource "kubernetes_manifest" "external_secret_ssh" { # Env secrets: the Anthropic OAuth token (shared with claude-agent-service — # same account) and the app bearer token (in-cluster/CLI fallback caller auth). resource "kubernetes_manifest" "external_secret_env" { - field_manager { - force_conflicts = true - } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/claude-memory/main.tf b/stacks/claude-memory/main.tf index fad08b42..18c21fe5 100644 --- a/stacks/claude-memory/main.tf +++ b/stacks/claude-memory/main.tf @@ -29,9 +29,6 @@ resource "kubernetes_namespace" "claude-memory" { } resource "kubernetes_manifest" "external_secret" { - field_manager { - force_conflicts = true - } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" @@ -60,9 +57,6 @@ resource "kubernetes_manifest" "external_secret" { # DB credentials from Vault database engine (rotated every 24h) resource "kubernetes_manifest" "db_external_secret" { - field_manager { - force_conflicts = true - } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/coturn/main.tf b/stacks/coturn/main.tf index 9ab23e5d..caeb9a66 100644 --- a/stacks/coturn/main.tf +++ b/stacks/coturn/main.tf @@ -5,9 +5,6 @@ variable "tls_secret_name" { variable "public_ip" { type = string } resource "kubernetes_manifest" "external_secret" { - field_manager { - force_conflicts = true - } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/dawarich/main.tf b/stacks/dawarich/main.tf index 3eeb1540..2432e9c3 100644 --- a/stacks/dawarich/main.tf +++ b/stacks/dawarich/main.tf @@ -23,9 +23,6 @@ resource "kubernetes_namespace" "dawarich" { } resource "kubernetes_manifest" "external_secret" { - 
field_manager { - force_conflicts = true - } manifest = { apiVersion = "external-secrets.io/v1" kind = "ExternalSecret" diff --git a/stacks/dbaas/modules/dbaas/main.tf b/stacks/dbaas/modules/dbaas/main.tf index d940f642..479263ed 100644 --- a/stacks/dbaas/modules/dbaas/main.tf +++ b/stacks/dbaas/modules/dbaas/main.tf @@ -745,10 +745,7 @@ resource "kubernetes_deployment" "phpmyadmin" { labels = { "app" = "phpmyadmin" tier = var.tier - # ADR-0014 service identity: dbaas is a multi-Service namespace, so the - # namespace alone can't attribute Goldmane flows. Value = the fronting - # Service name (kubernetes_service.phpmyadmin is named "pma"). - "service-identity" = "pma" + } annotations = { "reloader.stakater.com/search" = "true" @@ -765,10 +762,6 @@ resource "kubernetes_deployment" "phpmyadmin" { metadata { labels = { "app" = "phpmyadmin" - # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the - # disambiguating identity must live on the pod template (not just - # the Deployment metadata above). Not in selector → no replace. - "service-identity" = "pma" } } spec { @@ -819,19 +812,8 @@ resource "kubernetes_deployment" "phpmyadmin" { } } lifecycle { - ignore_changes = [ - spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 - # This Deployment is Keel-enrolled (keel.sh/policy=patch). Ignore the - # attributes Keel/Kyverno mutate at runtime so `terragrunt apply` (incl. - # the daily drift plan) doesn't fight them or revert the live image — - # canonical KEEL/KYVERNO lifecycle guard, matches linkwarden/chrome-service. 
- metadata[0].annotations["keel.sh/policy"], - metadata[0].annotations["keel.sh/trigger"], - metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2 - metadata[0].annotations["keel.sh/match-tag"], - spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates - spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1 - ] + # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 + ignore_changes = [spec[0].template[0].spec[0].dns_config] } } @@ -1517,10 +1499,6 @@ resource "kubernetes_deployment" "pgadmin" { } labels = { tier = var.tier - # ADR-0014 service identity: dbaas is a multi-Service namespace, so the - # namespace alone can't attribute Goldmane flows. Value = the fronting - # Service name (kubernetes_service.pgadmin is named "pgadmin"). - "service-identity" = "pgadmin" } } spec { @@ -1536,10 +1514,6 @@ resource "kubernetes_deployment" "pgadmin" { metadata { labels = { app = "pgadmin" - # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the - # disambiguating identity must live on the pod template (not just - # the Deployment metadata above). Not in selector → no replace. - "service-identity" = "pgadmin" } } spec { @@ -1594,20 +1568,8 @@ resource "kubernetes_deployment" "pgadmin" { } } lifecycle { - ignore_changes = [ - spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2 - # This Deployment is Keel-enrolled (keel.sh/policy=patch) and Keel has - # bumped the live image (dpage/pgadmin4:9.16). Ignore the Keel/Kyverno - # runtime-mutated attributes so `terragrunt apply` (incl. the daily drift - # plan) doesn't revert the image to bare `dpage/pgadmin4` or strip Keel's - # annotations — canonical guard, matches linkwarden/chrome-service. 
-      metadata[0].annotations["keel.sh/policy"],
-      metadata[0].annotations["keel.sh/trigger"],
-      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
-      metadata[0].annotations["keel.sh/match-tag"],
-      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
-      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
-    ]
+    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+    ignore_changes = [spec[0].template[0].spec[0].dns_config]
   }
 }
 resource "kubernetes_service" "pgadmin" {
diff --git a/stacks/diun/main.tf b/stacks/diun/main.tf
index 81294806..9933f064 100644
--- a/stacks/diun/main.tf
+++ b/stacks/diun/main.tf
@@ -20,9 +20,6 @@ resource "kubernetes_namespace" "diun" {
 }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/ebooks/main.tf b/stacks/ebooks/main.tf
index 0813b45a..a5754590 100644
--- a/stacks/ebooks/main.tf
+++ b/stacks/ebooks/main.tf
@@ -20,9 +20,6 @@ resource "kubernetes_namespace" "ebooks" {

 # ExternalSecrets for all three sources
 resource "kubernetes_manifest" "calibre_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -50,9 +47,6 @@ resource "kubernetes_manifest" "calibre_external_secret" {
 }

 resource "kubernetes_manifest" "audiobookshelf_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -80,9 +74,6 @@ resource "kubernetes_manifest" "audiobookshelf_external_secret" {
 }

 resource "kubernetes_manifest" "servarr_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/f1-stream/main.tf b/stacks/f1-stream/main.tf
index bcd66c7f..a62ad01a 100644
--- a/stacks/f1-stream/main.tf
+++ b/stacks/f1-stream/main.tf
@@ -33,9 +33,6 @@ resource "kubernetes_namespace" "f1-stream" {
 }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -65,9 +62,6 @@ resource "kubernetes_manifest" "external_secret" {
 # Pull the chrome-service bearer token into this namespace as a separate
 # Secret so the verifier can reach the in-cluster Playwright pool.
 resource "kubernetes_manifest" "chrome_service_client_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/fire-planner/main.tf b/stacks/fire-planner/main.tf
index be478699..21503a37 100644
--- a/stacks/fire-planner/main.tf
+++ b/stacks/fire-planner/main.tf
@@ -53,9 +53,6 @@ resource "kubernetes_namespace" "fire_planner" {
 # Seed before applying:
 # secret/fire-planner -> property `recompute_bearer_token`
 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -118,9 +115,6 @@ resource "kubernetes_manifest" "external_secret" {
 # Template builds the asyncpg DSN consumed by the FastAPI app + CronJob
 # as DB_CONNECTION_STRING.
 resource "kubernetes_manifest" "db_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -165,9 +159,6 @@ resource "kubernetes_manifest" "db_external_secret" {
 # pg-sync sidecar populates `daily_account_valuation` etc. hourly; the
 # fire-planner ingest reads those tables via this role.
 resource "kubernetes_manifest" "wealthfolio_sync_db_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -459,90 +450,6 @@ resource "kubernetes_cron_job_v1" "fire_planner_recompute" {
   ]
 }

-# Monthly FIRE-countdown target solve on the 2nd at 10:00 UTC (an hour after
-# recompute-all, so account_snapshot is fresh). Binary-searches each Case's FIRE
-# number per country at the 99% Guyton-Klinger bar and upserts fire_target, which
-# the wealth Grafana dashboard's "FIRE Countdown" section reads.
-resource "kubernetes_cron_job_v1" "fire_planner_fire_targets" {
-  metadata {
-    name = "fire-planner-fire-targets"
-    namespace = kubernetes_namespace.fire_planner.metadata[0].name
-  }
-  spec {
-    schedule = "0 10 2 * *"
-    concurrency_policy = "Forbid"
-    successful_jobs_history_limit = 3
-    failed_jobs_history_limit = 5
-    starting_deadline_seconds = 600
-
-    job_template {
-      metadata {
-        labels = local.labels
-      }
-      spec {
-        backoff_limit = 1
-        ttl_seconds_after_finished = 86400
-        # The full country sweep is CPU-bound (binary search × ~22 cities ×
-        # 3 cases). Give it room rather than letting it run forever.
-        active_deadline_seconds = 3600
-        template {
-          metadata {
-            labels = local.labels
-          }
-          spec {
-            restart_policy = "OnFailure"
-            image_pull_secrets {
-              name = "registry-credentials"
-            }
-            image_pull_secrets {
-              name = "ghcr-credentials"
-            }
-            container {
-              name = "fire-targets"
-              image = local.image
-              # --horizon 72: Viktor retires ~age 28 and plans to live to 100, so
-              # the portfolio must last 72 years (was the 60y default ≈ to age 88).
-              command = ["python", "-m", "fire_planner", "recompute-fire-targets",
-                "--countries", "all", "--horizon", "72"]
-
-              env_from {
-                secret_ref {
-                  name = "fire-planner-secrets"
-                }
-              }
-              env_from {
-                secret_ref {
-                  name = "fire-planner-db-creds"
-                }
-              }
-
-              resources {
-                requests = {
-                  cpu = "500m"
-                  memory = "1Gi"
-                }
-                limits = {
-                  memory = "2Gi"
-                }
-              }
-            }
-          }
-        }
-      }
-    }
-  }
-
-  lifecycle {
-    # KYVERNO_LIFECYCLE_V1
-    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
-  }
-
-  depends_on = [
-    kubernetes_manifest.external_secret,
-    kubernetes_manifest.db_external_secret,
-  ]
-}
-
 # Weekly refresh of the COL cache: walks col_snapshot for rows
 # expiring within 7 days, re-scrapes Numbeo + Expatistan, upserts. With
 # the user-chosen 1-year TTL, a healthy cache has 0 stale rows on most
@@ -662,53 +569,16 @@ module "ingress_api" {
   auth = "none"
 }

-# ExternalSecret in the monitoring namespace mirroring the rotating
-# fire_planner DB password. Grafana mounts this via envFromSecrets in
-# monitoring/grafana_chart_values.yaml; the datasource ConfigMap below
-# references it as $__env{FIRE_PLANNER_PG_PASSWORD}. Reloader restarts
-# Grafana whenever ESO updates this secret (on the 7d static-role
-# rotation), so the provisioned datasource never goes stale — replaces
-# the old plan-time `data.kubernetes_secret` bake that broke weekly.
-# Mirrors the wealth-pg / payslips-pg pattern.
-resource "kubernetes_manifest" "grafana_fire_planner_pg_creds" {
-  field_manager {
-    force_conflicts = true
-  }
-  manifest = {
-    apiVersion = "external-secrets.io/v1"
-    kind = "ExternalSecret"
-    metadata = {
-      name = "grafana-fire-planner-pg-creds"
-      namespace = "monitoring"
-    }
-    spec = {
-      refreshInterval = "15m"
-      secretStoreRef = {
-        name = "vault-database"
-        kind = "ClusterSecretStore"
-      }
-      target = {
-        name = "grafana-fire-planner-pg-creds"
-        template = {
-          metadata = {
-            annotations = {
-              "reloader.stakater.com/match" = "true"
-            }
-          }
-          data = {
-            FIRE_PLANNER_PG_PASSWORD = "{{ .password }}"
-          }
-        }
-      }
-      data = [{
-        secretKey = "password"
-        remoteRef = {
-          key = "static-creds/pg-fire-planner"
-          property = "password"
-        }
-      }]
-    }
+# Plan-time read of the ESO-created K8s Secret for Grafana datasource
+# password. First-apply gotcha: must
+# `terragrunt apply -target=kubernetes_manifest.db_external_secret` so
+# the Secret exists before this data source plans.
+data "kubernetes_secret" "fire_planner_db_creds" {
+  metadata {
+    name = "fire-planner-db-creds"
+    namespace = kubernetes_namespace.fire_planner.metadata[0].name
   }
+  depends_on = [kubernetes_manifest.db_external_secret]
 }

 # Grafana datasource for fire_planner PostgreSQL DB.
@@ -745,15 +615,12 @@ resource "kubernetes_config_map" "grafana_fire_planner_datasource" {
           timescaledb = false
         }
         secureJsonData = {
-          # Live env from grafana-fire-planner-pg-creds (above), injected into
-          # Grafana via envFromSecrets; reloader refreshes it on rotation.
-          password = "$__env{FIRE_PLANNER_PG_PASSWORD}"
+          password = data.kubernetes_secret.fire_planner_db_creds.data["DB_PASSWORD"]
         }
         editable = true
       }]
     })
   }
-  depends_on = [kubernetes_manifest.grafana_fire_planner_pg_creds]
 }

 # CI retrigger 2026-05-16T13:42:57+00:00 — bulk enrollment apply (pipeline #689 killed)
@@ -794,9 +661,6 @@ variable "run_examples_bulk_ingest" {

 # Reddit OAuth creds pulled from Vault secret/viktor.
 resource "kubernetes_manifest" "external_secret_examples_reddit" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -837,9 +701,6 @@ resource "kubernetes_manifest" "external_secret_examples_reddit" {
 # claude-agent-service bearer pulled separately so its rotation cadence
 # is decoupled from the Reddit creds.
 resource "kubernetes_manifest" "external_secret_examples_claude" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/forgejo/email-secret.tf b/stacks/forgejo/email-secret.tf
index d0e44c1c..034d45f2 100644
--- a/stacks/forgejo/email-secret.tf
+++ b/stacks/forgejo/email-secret.tf
@@ -6,9 +6,6 @@
 # (stacks/authentik/email-secret.tf) — one credential, one rotation point. The
 # reloader annotation rolls the Forgejo pod if the password is ever rotated.
 resource "kubernetes_manifest" "forgejo_email_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/freedify/main.tf b/stacks/freedify/main.tf
index 2f017003..3e2cf8b4 100644
--- a/stacks/freedify/main.tf
+++ b/stacks/freedify/main.tf
@@ -3,9 +3,6 @@ variable "tls_secret_name" {
   sensitive = true
 }
 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/freshrss/main.tf b/stacks/freshrss/main.tf
index 61e2122e..31c5d20e 100644
--- a/stacks/freshrss/main.tf
+++ b/stacks/freshrss/main.tf
@@ -18,9 +18,6 @@ resource "kubernetes_namespace" "immich" {
 }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/goldmane-edge-aggregator/main.tf b/stacks/goldmane-edge-aggregator/main.tf
index 1c6fa58a..f2da273d 100644
--- a/stacks/goldmane-edge-aggregator/main.tf
+++ b/stacks/goldmane-edge-aggregator/main.tf
@@ -57,19 +57,16 @@ resource "kubernetes_namespace" "goldmane_edge_aggregator" {
 # -----------------------------------------------------------------------------
 # The aggregator dials goldmane:7443 over mutual TLS. We mint a client cert
 # signed by the Tigera CA (the same CA that issues Goldmane's serving cert), so
-# Goldmane requires mutual TLS on :7443 and verifies the client cert chains to
-# the Tigera CA — it does NOT authorize by client identity, so ANY Tigera-CA-
-# signed cert is accepted. Rather than copy the Tigera CA PRIVATE KEY into TF
-# state to mint our own (a needless CA-key exposure; the hashicorp/tls provider
-# is also incompatible with this repo's global generate-providers/lockfile
-# pattern), we REUSE the operator-minted, Tigera-CA-signed client cert
-# `whisker-backend-key-pair` (calico-system). We never touch the CA key.
-# Trade-off: if the operator rotates that cert, re-apply to re-sync (hardening
-# follow-up: mint an own-identity cert in-namespace if Whisker is ever removed).
-data "kubernetes_secret" "whisker_backend" {
+# Goldmane trusts the client and the client trusts Goldmane's server cert via
+# the published CA bundle.
+#
+# The Tigera CA private key lives in the `tigera-ca-private` Secret in
+# tigera-operator (Opaque; verified keys: tls.crt + tls.key). The stack's apply
+# identity needs RBAC get on that secret — see the Role/RoleBinding below.
+data "kubernetes_secret" "tigera_ca" {
   metadata {
-    name = "whisker-backend-key-pair"
-    namespace = "calico-system"
+    name = "tigera-ca-private"
+    namespace = "tigera-operator"
   }
 }

@@ -96,11 +93,46 @@ resource "kubernetes_config_map" "tigera_ca_bundle" {
   data = data.kubernetes_config_map.tigera_ca_bundle.data
 }

-# Client cert + key for mTLS to goldmane:7443, mounted at TLS_CERT_PATH /
-# TLS_KEY_PATH defaults (/etc/goldmane-client-tls/tls.crt and .../tls.key).
-# Sourced verbatim from the operator's whisker-backend client key-pair (read
-# above) — already Tigera-CA-signed, which is all Goldmane verifies. No CA key
-# is touched and no cross-namespace CA RBAC is needed.
+# Client private key.
+resource "tls_private_key" "goldmane_client" {
+  algorithm = "RSA"
+  rsa_bits = 2048
+}
+
+# CSR for the client cert. CN identifies the client; the service-DNS SAN mirrors
+# how Felix/whisker-backend present a client identity to Goldmane.
+resource "tls_cert_request" "goldmane_client" {
+  private_key_pem = tls_private_key.goldmane_client.private_key_pem
+  subject {
+    common_name = "goldmane-edge-aggregator"
+    organization = "goldmane-edge-aggregator"
+  }
+  dns_names = [
+    "goldmane-edge-aggregator",
+    "goldmane-edge-aggregator.goldmane-edge-aggregator.svc.cluster.local",
+  ]
+}
+
+# Sign the CSR with the Tigera CA. 10-year validity (87600h): re-apply rotates
+# it well before expiry; a long horizon avoids surprise mTLS outages from an
+# unattended stack. The Tigera CA itself outlives this (operator-managed).
+resource "tls_locally_signed_cert" "goldmane_client" {
+  cert_request_pem = tls_cert_request.goldmane_client.cert_request_pem
+  ca_private_key_pem = data.kubernetes_secret.tigera_ca.data["tls.key"]
+  ca_cert_pem = data.kubernetes_secret.tigera_ca.data["tls.crt"]
+
+  validity_period_hours = 87600 # 10y
+  early_renewal_hours = 720 # re-sign on apply when <30d remain
+
+  allowed_uses = [
+    "client_auth",
+    "digital_signature",
+    "key_encipherment",
+  ]
+}
+
+# The minted client cert + key, mounted at TLS_CERT_PATH / TLS_KEY_PATH defaults
+# (/etc/goldmane-client-tls/tls.crt and .../tls.key).
 resource "kubernetes_secret" "goldmane_client_tls" {
   metadata {
     name = "goldmane-client-tls"
@@ -108,8 +140,47 @@ resource "kubernetes_secret" "goldmane_client_tls" {
   }
   type = "Opaque"
   data = {
-    "tls.crt" = data.kubernetes_secret.whisker_backend.data["tls.crt"]
-    "tls.key" = data.kubernetes_secret.whisker_backend.data["tls.key"]
+    "tls.crt" = tls_locally_signed_cert.goldmane_client.cert_pem
+    "tls.key" = tls_private_key.goldmane_client.private_key_pem
+  }
+}
+
+# Narrow RBAC so this stack's apply identity (and ESO/Reloader are unaffected)
+# can `get` the Tigera CA private key in tigera-operator. The data source above
+# reads it at apply time; this Role/RoleBinding documents + grants that access
+# rather than relying on cluster-admin. The subject is the same SA the other
+# Tier-1 stacks apply as (claude-agent/terraform-state for headless, the human
+# OIDC identity interactively) — both are cluster-admin today, so this is
+# belt-and-braces / least-privilege intent for when apply identities tighten.
+resource "kubernetes_role" "read_tigera_ca" {
+  metadata {
+    name = "goldmane-edge-aggregator-read-tigera-ca"
+    namespace = "tigera-operator"
+  }
+  rule {
+    api_groups = [""]
+    resources = ["secrets"]
+    resource_names = ["tigera-ca-private"]
+    verbs = ["get"]
+  }
+}
+
+resource "kubernetes_role_binding" "read_tigera_ca" {
+  metadata {
+    name = "goldmane-edge-aggregator-read-tigera-ca"
+    namespace = "tigera-operator"
+  }
+  role_ref {
+    api_group = "rbac.authorization.k8s.io"
+    kind = "Role"
+    name = kubernetes_role.read_tigera_ca.metadata[0].name
+  }
+  # The headless apply identity (claude-agent-service runs Tier-1 applies as the
+  # `terraform-state` Vault K8s role in the claude-agent namespace).
+  subject {
+    kind = "ServiceAccount"
+    name = "default"
+    namespace = "claude-agent"
   }
 }

@@ -156,11 +227,6 @@ resource "kubernetes_job" "db_init" {
   timeouts {
     create = "2m"
   }
-  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno injects dns_config (ndots=2); ignore it so
-    # this idempotent Job isn't replaced (Jobs are immutable) on every apply.
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
-  }
 }

 # ExternalSecret projecting the Vault-rotated (7-day) credential into a K8s
@@ -168,9 +234,6 @@ resource "kubernetes_job" "db_init" {
 # place in the CNPG connection allowlist are added in stacks/vault/main.tf
 # (see this stack's terragrunt.hcl note). remoteRef key: static-creds/pg-goldmane-edges.
 resource "kubernetes_manifest" "db_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -213,9 +276,6 @@ resource "kubernetes_manifest" "db_external_secret" {
 # into this namespace as SLACK_WEBHOOK_URL via an ExternalSecret (no new
 # webhook). The digest CronJob defaults to #security.
 resource "kubernetes_manifest" "slack_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -235,7 +295,7 @@ resource "kubernetes_manifest" "slack_external_secret" {
       data = [{
         secretKey = "SLACK_WEBHOOK_URL"
         remoteRef = {
-          key = "viktor"
+          key = "monitoring"
           property = "alertmanager_slack_api_url"
         }
       }]
@@ -455,13 +515,8 @@ resource "kubernetes_cron_job_v1" "digest" {
               }
             }
             env {
-              name = "SLACK_CHANNEL"
-              # Posts to #alerts. The dedicated #security channel was abandoned
-              # 2026-06-25 — the shared alertmanager_slack_api_url webhook's
-              # Slack app isn't a member of it (channel override 404s), so all
-              # Slack (incl. alertmanager's security-lane alerts) consolidated
-              # to #alerts. See docs/runbooks/goldmane-flow-trail.md.
-              value = "#alerts"
+              name = "SLACK_CHANNEL"
+              value = "#security"
             }

             resources {
diff --git a/stacks/grampsweb/main.tf b/stacks/grampsweb/main.tf
index 139c6595..2d434ec7 100644
--- a/stacks/grampsweb/main.tf
+++ b/stacks/grampsweb/main.tf
@@ -5,9 +5,6 @@ variable "tls_secret_name" {
 variable "nfs_server" { type = string }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/hackmd/main.tf b/stacks/hackmd/main.tf
index 2e065c99..bbe6db40 100644
--- a/stacks/hackmd/main.tf
+++ b/stacks/hackmd/main.tf
@@ -208,9 +208,6 @@ module "ingress" {
 }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/health/main.tf b/stacks/health/main.tf
index 7baf5f9c..36fd17d6 100644
--- a/stacks/health/main.tf
+++ b/stacks/health/main.tf
@@ -250,9 +250,6 @@ module "ingress_test" {
 }

 resource "kubernetes_manifest" "external_secret_db" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -287,9 +284,6 @@ resource "kubernetes_manifest" "external_secret_db" {
 }

 resource "kubernetes_manifest" "external_secret_kv" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/hermes-agent/main.tf b/stacks/hermes-agent/main.tf
index fff8578b..1293d7a5 100644
--- a/stacks/hermes-agent/main.tf
+++ b/stacks/hermes-agent/main.tf
@@ -37,9 +37,6 @@ module "tls_secret" {
 # --- Secrets (ESO from Vault) ---

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/immich/frame-emo.tf b/stacks/immich/frame-emo.tf
deleted file mode 100644
index 577d84af..00000000
--- a/stacks/immich/frame-emo.tf
+++ /dev/null
@@ -1,155 +0,0 @@
-# Immich photo-frame for Emo (emil.barzin@gmail.com) — a second instance cloned
-# from the London frame in frame.tf, scoped to Emo's Immich account + Sofia
-# weather. Served at highlights-immich-emo.viktorbarzin.me and shown on Emo's
-# Portal Mini (Sofia) via the portal-immich-frame app.
-# API key: Vault secret/immich -> frame_api_key_emo (minted on Emo's account).
-
-resource "kubernetes_config_map" "frame_config_emo" {
-  metadata {
-    name = "config-emo"
-    namespace = "immich"
-
-    labels = {
-      app = "frame-config-emo"
-    }
-    annotations = {
-      "reloader.stakater.com/match" = "true"
-    }
-  }
-
-  data = {
-    "Settings.yml" = <<-EOF
-    General:
-      Layout: single
-      Interval: 45
-      ImageZoom: true
-      ShowAlbumName: false
-      ShowProgressBar: false
-      ClockFormat: "HH:mm"
-      PhotoDateFormat: "dd/MM/yyyy"
-      WeatherApiKey: ${data.vault_kv_secret_v2.secrets.data["frame_weather_api_key"]}
-      UnitSystem: metric
-      WeatherLatLong: "42.6977,23.3219"
-      Language: en
-    Accounts:
-      - ImmichServerUrl: http://immich.viktorbarzin.me
-        ApiKey: ${data.vault_kv_secret_v2.secrets.data["frame_api_key_emo"]}
-        ImagesFromDays: 730
-    EOF
-  }
-}
-
-
-resource "kubernetes_deployment" "immich-frame-emo" {
-  metadata {
-    name = "immich-frame-emo"
-    namespace = "immich"
-    annotations = {
-      "reloader.stakater.com/search" = "true"
-    }
-    labels = {
-      tier = local.tiers.gpu
-    }
-  }
-
-  spec {
-    replicas = 1
-    selector {
-      match_labels = {
-        app = "immich-frame-emo"
-      }
-    }
-    strategy {
-      type = "RollingUpdate"
-    }
-    template {
-      metadata {
-        labels = {
-          app = "immich-frame-emo"
-        }
-        annotations = {
-          "dependency.kyverno.io/wait-for" = "immich-server.immich:2283"
-        }
-      }
-      spec {
-        container {
-          image = "ghcr.io/immichframe/immichframe:v1.0.32.0"
-          name = "immich-frame-emo"
-          resources {
-            requests = {
-              cpu = "10m"
-              memory = "64Mi"
-            }
-            limits = {
-              memory = "128Mi"
-            }
-          }
-          port {
-            container_port = 8080
-            protocol = "TCP"
-            name = "http"
-          }
-          volume_mount {
-            name = "config"
-            mount_path = "/app/Config"
-            read_only = true
-          }
-        }
-        volume {
-          name = "config"
-          config_map {
-            name = "config-emo"
-          }
-        }
-      }
-    }
-  }
-  lifecycle {
-    ignore_changes = [
-      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
-      metadata[0].annotations["keel.sh/policy"],
-      metadata[0].annotations["keel.sh/trigger"],
-      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
-      metadata[0].annotations["keel.sh/match-tag"],
-      metadata[0].annotations["kubernetes.io/change-cause"],
-      metadata[0].annotations["deployment.kubernetes.io/revision"],
-      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
-      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE
-    ]
-  }
-}
-
-
-resource "kubernetes_service" "immich-frame-emo" {
-  metadata {
-    name = "immich-frame-emo"
-    namespace = "immich"
-    labels = {
-      "app" = "immich-frame-emo"
-    }
-  }
-
-  spec {
-    selector = {
-      app = "immich-frame-emo"
-    }
-    port {
-      port = 80
-      target_port = 8080
-    }
-  }
-}
-
-module "ingress_emo" {
-  source = "../../modules/kubernetes/ingress_factory"
-  # Photo-frame kiosk display on Emo's Portal — headless browser pulling images
-  # via an Immich API key (no user login). Forward-auth would 302 the device to
-  # Authentik with no way to complete login.
-  # auth = "none": photo-frame kiosk; headless browser with API key; no user login.
-  auth = "none"
-  dns_type = "proxied"
-  namespace = "immich"
-  name = "highlights-immich-emo"
-  tls_secret_name = var.tls_secret_name
-  service_name = "immich-frame-emo"
-}
diff --git a/stacks/immich/main.tf b/stacks/immich/main.tf
index 809d6a2e..3009be5e 100644
--- a/stacks/immich/main.tf
+++ b/stacks/immich/main.tf
@@ -162,9 +162,6 @@ resource "kubernetes_resource_quota" "immich" {
 }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/insta2spotify/main.tf b/stacks/insta2spotify/main.tf
index 5e1cc4ef..9770afd3 100644
--- a/stacks/insta2spotify/main.tf
+++ b/stacks/insta2spotify/main.tf
@@ -20,9 +20,6 @@ resource "kubernetes_namespace" "insta2spotify" {
 }

 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/instagram-poster/modules/instagram-poster/main.tf b/stacks/instagram-poster/modules/instagram-poster/main.tf
index 7dc3f846..65714739 100644
--- a/stacks/instagram-poster/modules/instagram-poster/main.tf
+++ b/stacks/instagram-poster/modules/instagram-poster/main.tf
@@ -35,14 +35,6 @@ resource "kubernetes_namespace" "instagram_poster" {
 # - immich_tag_instagram (optional — auto-resolved if missing)
 # - immich_tag_posted (optional — auto-resolved if missing)
 resource "kubernetes_manifest" "external_secret" {
-  # The external-secrets controller takes server-side-apply ownership of
-  # .spec.refreshInterval, so a plain TF apply conflicts. force_conflicts lets
-  # TF win (values match, so it's stable) — same pattern as grafana/woodpecker/
-  # traefik/k8s-version-upgrade. Surfaced 2026-06-24 by the first IG apply since
-  # the ESO v1 migration (the scale-to-0 push).
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -147,11 +139,6 @@ resource "kubernetes_manifest" "external_secret" {
 # ESO refreshes the K8s Secret every 15m. `reloader.stakater.com/match`
 # bounces the pod when the password changes.
 resource "kubernetes_manifest" "benchmark_db_external_secret" {
-  # See external_secret above — ESO owns .spec.refreshInterval; force_conflicts
-  # lets the TF apply win instead of erroring on the field-manager conflict.
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -240,11 +227,7 @@ resource "kubernetes_deployment" "instagram_poster" {
   }

   spec {
-    # Scaled to 0 (2026-06-24): Instagram Graph integration is unused and its
-    # ExternalSecret is dead (missing ig_graph_long_lived_token /
-    # ig_business_account_id in Vault secret/instagram-poster). Set back to 1
-    # after minting a Meta long-lived token and populating those keys.
-    replicas = 0
+    replicas = 1
     # RWO PVC — cannot rolling-update.
     strategy {
       type = "Recreate"
diff --git a/stacks/job-hunter/main.tf b/stacks/job-hunter/main.tf
index 94927bf6..a008e83c 100644
--- a/stacks/job-hunter/main.tf
+++ b/stacks/job-hunter/main.tf
@@ -41,9 +41,6 @@ resource "kubernetes_namespace" "job_hunter" {
 # digest_to_address — where the weekly digest goes
 # digest_from_address — From: header for the digest
 resource "kubernetes_manifest" "external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -108,9 +105,6 @@ resource "kubernetes_manifest" "external_secret" {
 # DB credentials from Vault database engine (7-day rotation).
 # Template builds the asyncpg DSN consumed by the FastAPI app as DB_CONNECTION_STRING.
 resource "kubernetes_manifest" "db_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -331,9 +325,6 @@ resource "kubernetes_service" "job_hunter" {
 # references it as $__env{JOB_HUNTER_PG_PASSWORD}. Reloader restarts
 # Grafana whenever ESO updates this secret (every 7d on rotation).
 resource "kubernetes_manifest" "grafana_job_hunter_db_external_secret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/k8s-dashboard/oauth2_proxy.tf b/stacks/k8s-dashboard/oauth2_proxy.tf
index 032d5057..5ed73793 100644
--- a/stacks/k8s-dashboard/oauth2_proxy.tf
+++ b/stacks/k8s-dashboard/oauth2_proxy.tf
@@ -5,9 +5,6 @@
 # -----------------------------------------------------------------------------

 resource "kubernetes_manifest" "oauth2_proxy_externalsecret" {
-  field_manager {
-    force_conflicts = true
-  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
diff --git a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/+page.svelte b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/+page.svelte
index 7b617fd0..2d13fa39 100644
--- a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/+page.svelte
+++ b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/+page.svelte
@@ -5,11 +5,9 @@

Kubernetes Access Portal

- Fastest way in: open the web terminal or the
- dashboard and sign in — no install, no VPN needed. Prefer your
- own machine? The local-setup guide covers VPN + kubectl, and the
- Getting Started page compares all three access paths.
+ VPN Required — The cluster is on a private network. You need Headscale VPN access before kubectl will work.
+ See the Getting Started guide for VPN setup instructions.
@@ -28,7 +26,6 @@

Assigned namespaces: {data.namespaces.join(', ')}

Quick Commands

- Run these as-is in the web terminal — it's already signed in as you.

 # Check your pods
 kubectl get pods -n {data.namespaces[0]}
@@ -50,23 +47,16 @@ vault write kubernetes/creds/{data.namespaces[0]}-deployer \
 
 	

Get Started

- No setup — start now
- Open the web terminal — a ready shell with kubectl, Vault and your repos already set up
- Open the dashboard — point-and-click view of your workloads
- On your own machine
  {#if data.role === 'namespace-owner'}
- Follow the namespace-owner setup (VPN, kubectl, Vault, encrypted state)
+ Complete the namespace-owner onboarding guide
  {:else}
- Follow the local setup (VPN, kubectl, git)
+ Complete the onboarding guide (VPN, kubectl, git)
  {/if}
  Install kubectl and kubelogin
  Download your kubeconfig
  Run kubectl get namespaces to verify access
- Compare all three access paths →

@@ -101,12 +91,12 @@ vault write kubernetes/creds/{data.namespaces[0]}-deployer \
     border-radius: 6px;
     margin: 1rem 0;
   }
-  .callout.info {
-    background: #e8f4fd;
-    border-left: 4px solid #2196f3;
+  .callout.warning {
+    background: #fff3cd;
+    border-left: 4px solid #ffc107;
   }
   .callout a {
-    color: #0d47a1;
+    color: #856404;
     font-weight: 600;
   }

diff --git a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/onboarding/+page.svelte b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/onboarding/+page.svelte
index 6b2d73dd..d6ec35b9 100644
--- a/stacks/k8s-portal/modules/k8s-portal/files/src/routes/onboarding/+page.svelte
+++ b/stacks/k8s-portal/modules/k8s-portal/files/src/routes/onboarding/+page.svelte
@@ -5,175 +5,87 @@

Getting Started

- Welcome! There are three ways to reach the home Kubernetes cluster. Pick the one that fits —
- the first two need zero setup and open right in your browser.
+ Welcome! Follow these steps to get access to the home Kubernetes cluster.

+ +
- Three ways in
- Path | Best for | Setup
- A — Web terminal | Just want to start working now | None — opens in your browser
- B — Web dashboard | Click around, watch your app, read logs | None — opens in your browser
- C — Your own machine | kubectl / Terraform locally, full control | VPN + one-line installer
- Not sure? Start with the web terminal (Path A).
- Everything is already installed and your repos are already cloned — you can run your first
- kubectl command within a minute, from any device.
- Path A — Web terminal Recommended No setup
- A full terminal that runs in your browser — nothing to install, works from any device
- (even a tablet). It drops you into your own account on the shared workstation, with every
- tool already set up.
+ Step 0 — Join the VPN
+ The cluster is on a private network (10.0.20.0/24). You need VPN access first.
- Open t3.viktorbarzin.me
- Sign in with your Authentik account (the same SSO login as this portal)
- You land in a ready-to-use shell. Try it:
- kubectl get pods -n YOUR_NAMESPACE
+ Install Tailscale for your OS
+ Run this in your terminal:
+ tailscale login --login-server https://headscale.viktorbarzin.me
+ A browser window will open with a registration URL
+ Send that URL to Viktor via email (vbarzin@gmail.com) or Slack
+ Wait for approval (usually within a few hours)
+ Once approved, test: ping 10.0.20.100
-
- Already done for you on the workstation: -
    -
  • kubectl + your kubeconfig, scoped to your namespaces (no login dance)
  • -
  • vault, terragrunt, terraform, sops, kubeseal
  • -
  • Your repos cloned under ~/code — the infra repo plus your own project repos
  • -
  • Claude Code, ready to pair with you on changes
  • -
-
-
- No access yet? The workstation is provisioned per person. If - t3.viktorbarzin.me says you're not authorized, ask Viktor to add you - (vbarzin@gmail.com or Slack). -
-
-

Path B — Web dashboard No setup

-

- A point-and-click view of the cluster — browse your pods, read logs, restart a deployment, - check events. Nothing to install. -

-
    -
  1. Open k8s.viktorbarzin.me
  2. -
  3. Sign in with your Authentik account
  4. -
  5. - You're dropped straight into the Kubernetes Dashboard, already authenticated as you — - no token to paste. The portal injects your personal access token for you. -
  6. -
-
- Scoped to your namespace(s): you can see and manage your own workloads, but not other - tenants'. This path uses a per-user token that does not depend on CLI login, so it - keeps working even if kubectl OIDC login is having a bad day — making it the - reliable fallback for Path C. -
+
+

Step 1 — Log in to the portal

+

Visit k8s-portal.viktorbarzin.me and sign in with your Authentik account.

+

If you don't have an account yet, ask Viktor to create one.

-
-

Path C — From your own machine

-

- For running kubectl, vault and Terraform locally. This is the most - powerful path and the one to use for infrastructure changes — it just needs a bit more setup - because the cluster API lives on a private network. -

- - -

- {#if showNamespaceOwner} - Namespace owner — you'll also set up Vault and encrypted Terraform state so you can deploy - your own app stacks. - {:else} - General user — VPN, kubectl and git access. (Managing your own app stack? Switch to the - Namespace Owner tab above.) - {/if} -

+
+

Step 2 — Set up kubectl

+

Run one of these commands in your terminal to install everything automatically:

+

macOS

+

Requires Homebrew. Install it first if you don't have it.

+
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=mac)
+

Linux

+
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=linux)
+

Windows

+

Use WSL2 and follow the Linux instructions.

+
+ {#if showNamespaceOwner}
-

Step 1 — Join the VPN

-

The cluster API is on a private network (10.0.20.0/24), so you need VPN access first.

-
    -
  1. Install Tailscale for your OS
  2. -
  3. Run this in your terminal: -
    tailscale login --login-server https://headscale.viktorbarzin.me
    -
  4. -
  5. A browser window opens with a registration URL
  6. -
  7. Send that URL to Viktor via email (vbarzin@gmail.com) or Slack
  8. -
  9. Wait for approval (usually within a few hours)
  10. -
  11. Once approved, test:
    ping 10.0.20.100
  12. -
+

Step 3 — Log into Vault

+

Vault manages your secrets and issues dynamic Kubernetes credentials.

+
vault login -method=oidc
+

This opens your browser for Authentik SSO. After login, your token is saved to ~/.vault-token.

-

Step 2 — Install the tools

-

Run one of these to install everything automatically (kubectl, kubelogin, vault, terragrunt, terraform, kubeseal) and write your kubeconfig to ~/.kube/config-home:

-

macOS

-

Requires Homebrew. Install it first if you don't have it.

-
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=mac)
-

Linux

-
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=linux)
-

Windows

-

Use WSL2 and follow the Linux instructions.

+

Step 4 — Verify kubectl access

+

Run this command. It will open your browser for OIDC login the first time:

+
kubectl get pods -n YOUR_NAMESPACE
+

You should see an empty list (no resources) or your running pods.

-

Step 3 — Verify access

-

Run this. The first time, it opens your browser for SSO login:

-
kubectl get {showNamespaceOwner ? 'pods -n YOUR_NAMESPACE' : 'namespaces'}
-

You should see your resources (or an empty list if you haven't deployed anything yet).

-
- Browser login loops, or kubectl says "Unauthorized"? Command-line SSO - (OIDC) can occasionally be unavailable. When that happens, use the - web dashboard (Path B) or the - web terminal (Path A) — both authenticate a different way and - keep working — and let Viktor know. -
-

Connection error instead? Make sure the VPN is up: tailscale status.

-
- - {#if showNamespaceOwner} -
-

Step 4 — Log into Vault

-

Vault manages your secrets and issues dynamic Kubernetes credentials.

-
vault login -method=oidc
-

This opens your browser for Authentik SSO. After login, your token is saved to ~/.vault-token.

-
- -
-

Step 5 — Clone the infra repo

-
git clone https://github.com/ViktorBarzin/infra.git
+			

Step 5 — Clone the infra repo

+
git clone https://github.com/ViktorBarzin/infra.git
 cd infra
-

This is where all the infrastructure configuration lives. Terraform state is committed as encrypted files.

-
+

This is where all the infrastructure configuration lives. Terraform state is committed as encrypted files.

+
-
-

Step 6 — Decrypt your state

-

Terraform state is encrypted with SOPS. Your Vault login gives you access to only your stacks.

-
# Make sure you're logged into Vault
+		
+

Step 6 — Install tools

+

You need sops and terragrunt to work with infrastructure state:

+

macOS

+
brew install sops terragrunt
+

Linux

+
# sops
+curl -LO https://github.com/getsops/sops/releases/download/v3.9.4/sops-v3.9.4.linux.amd64
+sudo mv sops-*.linux.amd64 /usr/local/bin/sops && sudo chmod +x /usr/local/bin/sops
+
+# terragrunt
+curl -LO https://github.com/gruntwork-io/terragrunt/releases/latest/download/terragrunt_linux_amd64
+sudo mv terragrunt_linux_amd64 /usr/local/bin/terragrunt && sudo chmod +x /usr/local/bin/terragrunt
+
+ +
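The Linux sops asset name carries its version, so a download URL resolves only when the release tag and the filename agree. A minimal sketch of building a version-pinned URL follows; `sops_release_url` is a hypothetical helper (not part of the repo), and the default OS and architecture are assumptions for a typical amd64 Linux host.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a version-pinned sops download URL.
# Arguments: release tag (e.g. v3.9.4), optional OS, optional architecture.
sops_release_url() {
  local version="$1"
  local os="${2:-linux}"     # assumption: Linux host
  local arch="${3:-amd64}"   # assumption: amd64 CPU
  # The tag appears twice: once in the path and once in the asset name.
  printf 'https://github.com/getsops/sops/releases/download/%s/sops-%s.%s.%s\n' \
    "$version" "$version" "$os" "$arch"
}
```

It could be used as `curl -LO "$(sops_release_url v3.9.4)"`, which keeps the tag and the filename in sync when the pinned version changes.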
+

Step 7 — Decrypt your state

+

Terraform state is encrypted with SOPS. Your Vault login gives you access to only your stacks.

+
# Make sure you're logged into Vault
 vault login -method=oidc
 
 # Decrypt your stack's state
@@ -183,157 +95,160 @@ scripts/state-sync decrypt YOUR_NAMESPACE
 cd stacks/YOUR_NAMESPACE
 ../../scripts/tg plan
-
-

How state encryption works

-
-
-
vault login -method=oidc
-
-
Authentik SSO
-
-
~/.vault-token
-
-
-
-
scripts/tg plan
-
-
state-sync decrypt
-
-
Vault Transit
sops-state-YOUR_NS
-
-
-
-
terragrunt plan/apply
-
-
state-sync encrypt
-
-
git commit + push
-
+
+

How state encryption works

+
+
+
vault login -method=oidc
+
+
Authentik SSO
+
+
~/.vault-token
+
+
+
+
scripts/tg plan
+
+
state-sync decrypt
+
+
Vault Transit
sops-state-YOUR_NS
+
+
+
+
terragrunt plan/apply
+
+
state-sync encrypt
+
+
git commit + push
+
-
- Access control: You can only decrypt state for your own namespaces. - Each namespace has its own Vault Transit encryption key. Your Vault policy - (sops-user-YOUR_USERNAME) only grants access to your keys. -
-
+
+ Access control: You can only decrypt state for your own namespaces. + Each namespace has its own Vault Transit encryption key. Your Vault policy + (sops-user-YOUR_USERNAME) only grants access to your keys. +
+
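The naming scheme in the access-control note can be sketched as two small helpers. Both function names are hypothetical; only the `sops-state-` and `sops-user-` prefixes come from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that mirror the naming convention described above.

# Vault Transit key that encrypts a namespace's Terraform state.
transit_key_for_namespace() {
  printf 'sops-state-%s\n' "$1"
}

# Vault policy that grants a user access to their own keys.
vault_policy_for_user() {
  printf 'sops-user-%s\n' "$1"
}
```

To check which policies your current token actually carries, `vault token lookup` prints them in its `policies` field; the expected entry would be the output of `vault_policy_for_user YOUR_USERNAME`.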
-
-

Step 7 — Create your first app stack

-
    -
  1. Copy the template:
    cp -r stacks/_template stacks/myapp
    +		
    +

    Step 8 — Create your first app stack

    +
      +
    1. Copy the template:
      cp -r stacks/_template stacks/myapp
       mv stacks/myapp/main.tf.example stacks/myapp/main.tf
    2. -
    3. Edit stacks/myapp/main.tf — replace all <placeholders>
    4. -
    5. Store secrets in Vault: -
      vault kv put secret/YOUR_USERNAME/myapp DB_PASSWORD=secret123
      -
    6. -
    7. Apply your stack: -
      cd stacks/myapp && ../../scripts/tg apply
      -
    8. -
    9. Commit encrypted state: -
      cd ../..
      +				
    10. Edit stacks/myapp/main.tf — replace all <placeholders>
    11. +
    12. Store secrets in Vault: +
      vault kv put secret/YOUR_USERNAME/myapp DB_PASSWORD=secret123
      +
    13. +
    14. Apply your stack: +
      cd stacks/myapp && ../../scripts/tg apply
      +
    15. +
    16. Commit encrypted state: +
      cd ../..
       git add stacks/myapp/ state/stacks/myapp/terraform.tfstate.enc
       git commit -m "add myapp stack"
       git push
      -
    17. -
    -
    +
  2. +
+
-
-

Architecture Overview

-

Here's how your changes flow through the system:

+
+

Architecture Overview

+

Here's how your changes flow through the system:

-
-

Apply workflow

-
-
-
Your Machine
-
git pull
-
-
scripts/tg plan
-
auto-decrypt
-
scripts/tg apply
-
auto-encrypt
-
git push
-
-
-
Vault
-
OIDC auth
Authentik SSO
-
-
Transit decrypt
sops-state-*
-
-
Transit encrypt
per-stack key
-
-
-
Cluster
-
K8s API
-
-
Your namespace
pods, services
-
-
Traefik ingress
*.viktorbarzin.me
-
+
+

Apply workflow

+
+
+
Your Machine
+
git pull
+
+
scripts/tg plan
+
auto-decrypt
+
scripts/tg apply
+
auto-encrypt
+
git push
+
+
+
Vault
+
OIDC auth
Authentik SSO
+
+
Transit decrypt
sops-state-*
+
+
Transit encrypt
per-stack key
+
+
+
Cluster
+
K8s API
+
+
Your namespace
pods, services
+
+
Traefik ingress
*.viktorbarzin.me
+
-
-

Security model

- - - - - - - - - -
Layer | What | How
Authentication | Who are you? | Authentik SSO (OIDC) → Vault token
Authorization | What can you access? | Vault policy (sops-user-*) scoped to your namespaces
Encryption at rest | State in git | SOPS + Vault Transit (per-stack key)
Encryption fallback | Bootstrap / DR | age keys (admin only)
Network | Cluster access | Headscale VPN (private 10.0.20.0/24)
-
-
- {:else} -
-

Step 4 — Clone the repo

-
git clone https://github.com/ViktorBarzin/infra.git
+			
+

Security model

+ + + + + + + + + +
Layer | What | How
Authentication | Who are you? | Authentik SSO (OIDC) → Vault token
Authorization | What can you access? | Vault policy (sops-user-*) scoped to your namespaces
Encryption at rest | State in git | SOPS + Vault Transit (per-stack key)
Encryption fallback | Bootstrap / DR | age keys (admin only)
Network | Cluster access | Headscale VPN (private 10.0.20.0/24)
+
+
+ {:else} +
+

Step 3 — Verify access

+

Run this command. It will open your browser for login the first time:

+
kubectl get namespaces
+

You should see output like:

+
NAME              STATUS   AGE
+default           Active   200d
+kube-system       Active   200d
+monitoring        Active   200d
+...
+

If you get a connection error, make sure your VPN is connected (tailscale status).

+
+ +
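When the verify step fails, the error text usually tells you whether the problem is the VPN or the login. The sketch below sorts a kubectl error message into one of those buckets. `diagnose_kubectl_error` is a hypothetical helper, and the matched substrings are assumptions based on common kubectl messages, not an exhaustive list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a kubectl error message.
# Prints "auth", "network", or "unknown".
diagnose_kubectl_error() {
  local msg="$1"
  case "$msg" in
    # Login problem: re-run the command to trigger the browser login.
    *Unauthorized*|*"must be logged in"*)
      echo "auth" ;;
    # Cluster unreachable: check the VPN with `tailscale status`.
    *"Unable to connect"*|*"i/o timeout"*|*"connection refused"*|*"no route to host"*)
      echo "network" ;;
    *)
      echo "unknown" ;;
  esac
}
```

A possible use is `diagnose_kubectl_error "$(kubectl get namespaces 2>&1)"`: a `network` result points at the VPN, while `auth` points at the SSO login.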
+

Step 4 — Clone the repo

+
git clone https://github.com/ViktorBarzin/infra.git
 cd infra
-

This is where all the infrastructure configuration lives.

-
+

This is where all the infrastructure configuration lives.

+
-
-

Step 5 — Your first change

-
    -
  1. Create a branch:
    git checkout -b my-first-change
  2. -
  3. Edit a service file (e.g., change an image tag in stacks/echo/main.tf)
  4. -
  5. Commit and push:
    git add . && git commit -m "my first change" && git push -u origin my-first-change
  6. -
  7. Open a Pull Request on GitHub
  8. -
  9. Viktor reviews and merges
  10. -
  11. Woodpecker CI automatically applies the change to the cluster
  12. -
  13. Slack notification confirms it worked
  14. -
-
- {/if} -
+
+

Step 5 — Your first change

+
    +
  1. Create a branch:
    git checkout -b my-first-change
  2. +
  3. Edit a service file (e.g., change an image tag in stacks/echo/main.tf)
  4. +
  5. Commit and push:
    git add . && git commit -m "my first change" && git push -u origin my-first-change
  6. +
  7. Open a Pull Request on GitHub
  8. +
  9. Viktor reviews and merges
  10. +
  11. Woodpecker CI automatically applies the change to the cluster
  12. +
  13. Slack notification confirms it worked
  14. +
+
+ {/if}