Compare commits: master...wizard/gol (1 commit)

| Author | SHA1 | Date | |
|---|---|---|---|
| | e002fddede | | |

162 changed files with 4565 additions and 12322 deletions
|
|
@@ -16,7 +16,6 @@
|
|||
**ALL infrastructure changes MUST go through Terraform/Terragrunt.** Never use `kubectl apply/edit/patch/set`, `helm install/upgrade`, or any manual cluster mutation as the final state.
|
||||
|
||||
- **No exceptions for "quick fixes"** — even one-line changes must be in `.tf` files and applied via `scripts/tg apply`
|
||||
- **Apply locally OR let CI do it — but ALWAYS commit.** You don't have to wait for CI: with apply access you MAY run the apply yourself (`scripts/tg apply <stack>` / `homelab tf apply <stack>`), but **from the main checkout, never a worktree** (git-crypt'd `*.tfvars` come through as ciphertext under the worktree filter-bypass, so a worktree apply reads garbage). **Every applied change MUST be committed and pushed to `master` the same session** — the repo is the source of truth, so applied-but-uncommitted HCL is drift that the next CI apply / daily drift-detection will try to revert. Order either way: apply locally then commit + push (CI's changed-stack apply then no-ops), or commit + push and let CI apply. Never apply an uncommitted edit; never leave a committed change unapplied.
|
||||
- **kubectl is for read-only operations and temporary debugging only** (get, describe, logs, exec, port-forward)
|
||||
- **If a resource isn't in Terraform yet**, evaluate whether it can be added before making manual changes. If manual change is unavoidable (e.g., emergency), document it immediately and create the Terraform resource in the same session
|
||||
- **kubectl scale/patch during migrations is acceptable** as a transient step, but the final state must be in Terraform and applied via `scripts/tg apply`
|
||||
|
|
@@ -204,7 +203,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
|
|||
- **PDBs**: minAvailable=2 on Traefik and Authentik.
|
||||
- **Fallback proxies**: basicAuth when Authentik is down, fail-open when poison-fountain is down.
|
||||
- **CrowdSec enforcement is out-of-band** (no Traefik plugin/middleware — the dead Yaegi `crowdsec-bouncer-traefik-plugin` was removed on Traefik 3.7.5): banned IPs are dropped **in-kernel via nftables** by the `cs-firewall-bouncer` DaemonSet on **direct** hosts (drops in BOTH the `input` and `forward` hooks — Traefik is ETP=Local so client traffic is DNAT'd to the pod via `forward`; pulls ALL decisions incl. the ~31k CAPI blocklist), and **blocked at the Cloudflare edge** for **proxied** hosts (one `crowdsec_ban` Rules List + a zone WAF block rule, fed by the `crowdsec-cf-sync` CronJob in `rybbit` ns every 2 min — excludes CAPI). Zero per-request latency; **fails open** (LAPI down → no new bans, existing drops persist, legit traffic never blocked). Whitelist covers RFC1918 + tailnet + internal CIDRs. Full as-built: `docs/architecture/security.md`.
|
||||
- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations), authentik 100/1000 on `/`+`/static` (login SPA cold-loads ~70 flow chunks from `/static`; default burst 429'd them → blank login screen).
|
||||
- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations).
|
||||
- **Retry middleware**: 2 attempts, 100ms — in default ingress chain.
|
||||
- **Entrypoint transport timeouts** (`websecure` `respondingTimeouts`): `writeTimeout=0` (unlimited download duration), `readTimeout=3600s` (uploads ≤1h), `idleTimeout=600s`. These are **HARD total-duration caps**, not nginx-style per-read idle timeouts — a finite `writeTimeout` truncates *any* large download at that wall-clock mark (a prior `writeTimeout=60s` silently cut Immich videos at 60s). **Do NOT re-tighten `writeTimeout`**; keep `readTimeout` finite (slow-loris backstop) but ≥ longest expected upload. Full rationale: `docs/architecture/networking.md` → "Entrypoint Transport Timeouts".
|
||||
- **HTTP/3 (QUIC)**: Enabled on Traefik. Works for **direct (non-proxied) apps** via the dedicated LB IP below (ETP=Local). Proxied apps get QUIC at the Cloudflare edge.
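
The rate-limit note above (default 10/s burst 50 versus a ~70-request cold load) can be illustrated with a small token-bucket simulation. This is only a sketch: it assumes the limiter behaves like a plain token bucket with the average as the refill rate and the burst as the capacity, that the pairs `50/300` and `100/1000` mean average/burst, and that the ~70 requests arrive at effectively the same instant. `TokenBucket` and `count_429` are hypothetical names, not Traefik APIs.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Plain token bucket: refills at `rate` tokens/s, holds at most `burst`."""
    rate: float
    burst: int
    tokens: float = field(init=False)
    last: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        # The bucket starts full, so an idle client can spend the whole burst.
        self.tokens = float(self.burst)

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) passes."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def count_429(rate: float, burst: int, parallel_requests: int) -> int:
    """Count requests rejected (HTTP 429) when all arrive at t=0."""
    bucket = TokenBucket(rate=rate, burst=burst)
    return sum(0 if bucket.allow(0.0) else 1 for _ in range(parallel_requests))


if __name__ == "__main__":
    for name, rate, burst in [("default", 10, 50),
                              ("ActualBudget", 50, 300),
                              ("authentik", 100, 1000)]:
        print(name, count_429(rate, burst, 70))
```

Under these assumptions the default limiter rejects part of a 70-request burst while the dedicated limiters absorb it, which matches the blank-login symptom described in the bullet.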
|
||||
|
|
@@ -219,7 +218,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
|
|||
| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. 
**Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). 
**Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
|
||||
| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob. **Enforcement is out-of-band, NOT a Traefik plugin** (the Yaegi `crowdsec-bouncer-traefik-plugin` was dead on Traefik 3.7.5 and removed): `cs-firewall-bouncer` DaemonSet drops in-kernel via nftables on direct hosts (bouncer key `firewall`, v0.0.34 binary fetched at runtime, hostNetwork+NET_ADMIN, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`); `crowdsec-cf-sync` CronJob blocks at the CF edge for proxied hosts (bouncer key `kvsync`, `stacks/rybbit/crowdsec_edge.tf`). Both fail open. See `docs/architecture/security.md` |
|
||||
| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
|
||||
| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control; `/`+`/static` use a dedicated `authentik-rate-limit` (100/1000) so the cold-load chunk burst isn't 429'd into a blank screen. **Reliability (2026-06-28): the chart key is `deploymentStrategy`, NOT `strategy`** — the old `strategy:` key was inert, so live ran the chart default 25%/25% and dropped a server pod out of rotation on every roll; now `maxSurge:1/maxUnavailable:0`. Readiness `failureThreshold:8` (~80s, was 30s): the DB-coupled `/-/health/ready/` returns 503 on a PG/pgbouncer blip, and with too-tight tolerance all 3 server pods left the Service at once → Traefik 502/504 (the episodic blank-screen + 30s-hang). gunicorn `max_requests=10000`/jitter=1000 decorrelates worker recycles from DB blips. Redis is GONE since 2026.2 (sessions+cache+channels on PostgreSQL, no external-cache option) — a short PG transient is now survived, but a TOTAL CNPG outage still takes authentik down. 
**Custom overlay image (2026-06-28):** server+worker run `ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3` (built by `.github/workflows/build-authentik.yml` from `stacks/authentik/Dockerfile` + `patch-compat-sfe.py`) with TWO guarded patches: **#1 SLOW-1a** — narrows the identification-stage `select_subclasses()` query (~1.4s→~14ms; bare upstream call LEFT-JOINs every source subtype); **#2 old-browser blank login** — `patch-compat-sfe.py` (a) extends `compat_needs_sfe()` to serve authentik's built-in no-JS **SFE** login to old Safari/WebKit AND **any iOS browser** (Chrome/CriOS, Firefox/FxiOS — all share the system WebKit) on iOS≤16.3, and (b) **injects static social-login `<a>` links into the SFE shell** (`flow-sfe.html`) since the SFE can't render Identification-stage sources — required for password-less accounts (e.g. emo = Google-only). The modern flow SPA is ES2022 (needs Safari 16.4+) and renders BLANK on older WebKit; every iOS browser shares that WebKit, so it's not browser-choice (emo's iPadOS-15.8 iPad hit this). SFE = the *real* authentik login (password + MFA + reputation, no auth downgrade) — chosen over a Traefik basic-auth fallback which would have put a spoofable-UA single password in front of `vbarzin→wizard` passwordless-root. Social link = plain redirect to `/source/oauth/login/<slug>/` (works on any browser); slugs (google/github/facebook) are static — re-verify on source changes. **Keel un-enrolled** for the ns → image pinned in `global.image` (repo+tag), **upgraded manually**: bump the Dockerfile `FROM` + the values tag (+ re-verify both patches) together, GHA rebuilds, then apply. |
|
||||
| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control. |
|
||||
| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
|
||||
| MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
|
||||
| phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
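
The Immich row above mixes bitrates in Mbps, a cap written as `20000k`, and disk throughput in MB/s. The sketch below puts them on one scale. It assumes the `k` suffix means 1000 bits per second (the ffmpeg convention) and that 1 MB is 10^6 bytes; the helper names are hypothetical. It only converts units and does not model seek overhead on the shared `sdc` spindle.

```python
def parse_kbit_bitrate(value: str) -> float:
    """Parse a bitrate written like '20000k' into megabits per second.

    Only the 'k' suffix used in the row above is handled; anything else
    raises so that an unexpected unit is not silently misread.
    """
    text = value.strip()
    if not text.lower().endswith("k"):
        raise ValueError(f"unsupported bitrate format: {value!r}")
    return float(text[:-1]) / 1000.0


def mbps_to_mbytes_per_s(mbps: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8.0


if __name__ == "__main__":
    cap_mbps = parse_kbit_bitrate("20000k")
    print("capped stream :", mbps_to_mbytes_per_s(cap_mbps), "MB/s")
    print("uncapped, low :", mbps_to_mbytes_per_s(77), "MB/s")
    print("uncapped, high:", mbps_to_mbytes_per_s(264), "MB/s")
```

Read this way, the 20000k cap corresponds to a much smaller sustained read per stream than the uncapped transcodes quoted in the row.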
|
||||
|
|
@@ -232,10 +231,9 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
|
|||
- Alertmanager is now scraped (`extraScrapeConfigs` job `alertmanager`) → `alertmanager_notifications_total`/`_alerts`/`_notifications_failed_total` available; it had no `prometheus.io/scrape` annotation so notification volume was previously unmeasurable.
|
||||
- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
|
||||
- **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
|
||||
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction), AuthentikRootRouter5xxHigh (all-3-server-pods-NotReady cascade → 502/503/504 on the authentik `/` router). **The Traefik scrape keeps `traefik_router_requests_total`** (per-router `code` label) — the drop-regex in the `traefik` scrape job drops only the high-cardinality `*_duration_seconds_bucket` histogram, NOT the request counter, so per-router 429/5xx is queryable + alertable.
|
||||
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).
|
||||
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
|
||||
- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → posts to `#alerts` via the `slack-security` receiver, which keeps its `[SECURITY]` styling; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
|
||||
- **pfSense egress / WAN monitoring** (added 2026-06-28 after the 2026-06-27 egress-only incident — pfSense VMID 101 stopped passing internet egress for ~20 min while internal routing + Unbound stayed up, and NOTHING alerted: no egress probe existed and the cloudflared replica metric stayed green): `blackbox-exporter` gained `icmp_egress` + `dns_external` modules (+ `NET_RAW` on the pod) in `authentik_walloff_probe.tf`. Three in-cluster probe jobs (`wan-gateway-icmp` → 192.168.1.1, `internet-egress-icmp` → 9.9.9.9/1.1.1.1, `internet-egress-dns` → cloudflare.com via both) traverse the pod→node→pfSense-NAT path that fails. Alerts (group `Egress / pfSense` in `alerting_rules.yml`): `WANGatewayUnreachable`, `InternetEgressDown` (`max()==0` = both providers dead, not a single-provider blip), `ExternalDNSResolutionDown`, `EgressOnlyDivergence` (t3-probe `cloudflare` leg down WHILE `internal` leg up — the incident signature, reuses the existing t3-probe), `PfSenseVMDown` (`pve_up{id="qemu/101"}==0` while host up — does NOT catch a guest-internal reboot, `pve_up` tracks the qemu process). Plus Loki ruler `CloudflaredTunnelConnLoss` (>20 edge-conn failures/5m; calibrated live: steady-state ~2/6h vs 37-85/5m in-incident; the cloudflared replica metric is blind to tunnel-connection loss). `WANGatewayUnreachable`/`InternetEgressDown` **inhibit** the downstream egress symptoms (ExternalDNSResolutionDown/EgressOnlyDivergence/CloudflaredTunnelConnLoss/Email*/ExternalAccessDivergence). Runbook: `docs/runbooks/pfsense-egress.md`. **Deferred (needs a live-pfSense change, not in this monitoring-only change):** point dpinger's monitor at the local gateway + widen thresholds, disable `gw_down_kill_states`, add a failover gateway group + auto-recovery watchdog, and ship pfSense system/gateway/routing syslog to Loki (today only filterlog → CrowdSec; those logs are NOT centrally queryable — id #6717). 
No Uptime-Kuma egress monitor was added (the `external-monitor-sync` is purpose-built for `*.viktorbarzin.me` Cloudflare-path discovery; the blackbox probes cover egress directly).
|
||||
- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → `#security` Slack). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
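
The pass/fail rule of the walling-off guard can be restated as a pure function. This is a sketch of the decision only, not the blackbox-exporter configuration: it assumes the probe sees one status code and an optional `Location` header, and `is_walled_off` is a hypothetical name. The hostname is the one given in the bullet.

```python
from typing import Optional
from urllib.parse import urlsplit

AUTHENTIK_HOST = "authentik.viktorbarzin.me"


def is_walled_off(status: int, location: Optional[str]) -> bool:
    """Return True when a public carve-out is being redirected to Authentik.

    Redirects to any other host, and relative redirects, are treated as
    legitimate, mirroring the rule that only an Authentik `Location` fails.
    """
    if status not in (301, 302, 303, 307, 308) or not location:
        return False
    host = urlsplit(location).hostname
    return host is not None and host.lower() == AUTHENTIK_HOST


if __name__ == "__main__":
    print(is_walled_off(302, "https://authentik.viktorbarzin.me/application/o/authorize/"))
    print(is_walled_off(302, "/login"))
    print(is_walled_off(200, None))
```

A manual check with `curl -sI '<url>'` before adding a target amounts to evaluating the same predicate by eye.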
|
||||
|
||||
## Security Posture (Wave 1 — locked 2026-05-18)
|
||||
|
||||
|
|
@@ -243,10 +241,9 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se
|
|||
|
||||
- **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming.
|
||||
- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.)
|
||||
- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts` (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out). Severity labels carried in the alert (critical/warning/info). No paging. The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`.
|
||||
- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
|
||||
- **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
|
||||
- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
|
||||
- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. **WHISKER-WEDGE GOTCHA** (2026-06-28): the operator's `whisker` NetworkPolicy allows DNS egress only to kube-dns *pods*, but whisker-backend resolves goldmane via the kube-dns *ClusterIP* — Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule, so when whisker-backend's gRPC stream breaks and it re-resolves, it wedges and the UI goes **empty** (the aggregator, a separate pod, is unaffected). FIX = additive egress NP `whisker-allow-dns-clusterip` (`stacks/calico`, allows whisker→10.96.0.10/32:53); the `whisker-watchdog` CronJob is a backstop. Manual heal `kubectl -n calico-system delete pod -l k8s-app=whisker`. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
|
||||
- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
|
||||
- **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
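
The edge-set upsert and the allowlist query from the flow-trail bullets can be tried end to end against an in-memory database. This is a sketch under stated assumptions: SQLite stands in for the CNPG Postgres database, the primary key on `(src_ns, dst_ns, action)` and the text timestamps are guesses at the real schema, and `record_flow` / `allowed_destinations` are hypothetical names. The sample flows are made-up data. Only the column list and the `SELECT DISTINCT` query come from the text above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE edge (
    src_ns     TEXT NOT NULL,
    dst_ns     TEXT NOT NULL,
    action     TEXT NOT NULL,
    first_seen TEXT NOT NULL,
    last_seen  TEXT NOT NULL,
    flow_count INTEGER NOT NULL,
    PRIMARY KEY (src_ns, dst_ns, action)  -- assumed key, not confirmed
)
"""

UPSERT = """
INSERT INTO edge (src_ns, dst_ns, action, first_seen, last_seen, flow_count)
VALUES (?, ?, ?, ?, ?, 1)
ON CONFLICT (src_ns, dst_ns, action) DO UPDATE SET
    last_seen  = excluded.last_seen,
    flow_count = flow_count + 1
"""


def record_flow(db, src_ns, dst_ns, action, seen_at):
    """Upsert one flow; self-edges and empty-namespace flows are dropped."""
    if not src_ns or not dst_ns or src_ns == dst_ns:
        return
    db.execute(UPSERT, (src_ns, dst_ns, action, seen_at, seen_at))


def allowed_destinations(db, src_ns):
    """Internal half of an egress allowlist for one source namespace."""
    rows = db.execute(
        "SELECT DISTINCT dst_ns FROM edge "
        "WHERE src_ns = ? AND action = 'allow' ORDER BY dst_ns",
        (src_ns,),
    )
    return [row[0] for row in rows]


def demo():
    """Build an in-memory edge table from a handful of example flows."""
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    flows = [
        ("recruiter-responder", "dbaas", "allow", "2026-06-28T10:00:00Z"),
        ("recruiter-responder", "dbaas", "allow", "2026-06-28T10:05:00Z"),
        ("recruiter-responder", "kube-system", "allow", "2026-06-28T10:06:00Z"),
        ("recruiter-responder", "monitoring", "deny", "2026-06-28T10:07:00Z"),
        ("recruiter-responder", "", "allow", "2026-06-28T10:08:00Z"),
        ("recruiter-responder", "recruiter-responder", "allow", "2026-06-28T10:09:00Z"),
    ]
    for flow in flows:
        record_flow(db, *flow)
    return db


if __name__ == "__main__":
    print(allowed_destinations(demo(), "recruiter-responder"))
```

Because empty-namespace flows never reach the table, the derived list covers only namespace-to-namespace traffic, which is why external egress still has to come from the Calico flow-log snapshot.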
|
||||
|
||||
## Storage & Backup Architecture
|
||||
|
|
|
|||
|
|
@@ -13,8 +13,6 @@
|
|||
| authentik | Identity provider (SSO) | authentik |
|
||||
| cloudflared | Cloudflare tunnel | cloudflared |
|
||||
| authelia | Auth middleware (may be merged into ebooks or removed) | platform |
|
||||
| goldmane | Calico 3.30 OSS flow aggregator (`goldmane.calico-system.svc:7443`, gRPC/mTLS). Stamps identity (ns/pod/workload/labels + allow-deny) on every flow from Felix into a ~60-min in-memory ring buffer — no etcd/API writes. East-west "who-talks-to-whom" source (ADR-0014). Enabled via operator CR (`kubectl_manifest.goldmane`). | calico |
|
||||
| whisker | Calico 3.30 OSS live flow-observability UI (`whisker.calico-system.svc:8081`) at `whisker.viktorbarzin.me` (Authentik-gated, `auth=required` — no own login; additive NP ORs Traefik past the operator default-deny). ~60-min live view of Goldmane flows, NOT history. Enabled via operator CR (`kubectl_manifest.whisker`). | calico |
|
||||
| monitoring | Prometheus/Grafana/Loki stack | monitoring |
|
||||
|
||||
## Storage & Security (Tier: cluster)
|
||||
|
|
@@ -39,7 +37,6 @@
|
|||
## Active Use
|
||||
| Service | Description | Stack |
|
||||
|---------|-------------|-------|
|
||||
| goldmane-edge-aggregator | Durable who-talks-to-whom audit trail (ADR-0014 / #58). Go service: `aggregate` Deployment streams Goldmane's gRPC `Flows.Stream` (mTLS) and upserts the low-cardinality namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`) into CNPG DB `goldmane_edges`; `goldmane-edges-digest` CronJob posts first-seen edges daily to `#alerts` (the `#security` channel was abandoned 2026-06-25 — shared webhook's app isn't a member of it). mTLS client cert REUSES the operator's `whisker-backend-key-pair` (re-apply if rotated). Tier-4-aux. Image `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (private). Runbook: [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md). | goldmane-edge-aggregator |
|
||||
| mailserver | Email (docker-mailserver) | mailserver |
|
||||
| shadowsocks | Proxy | shadowsocks |
|
||||
| webhook_handler | Webhook processing | webhook_handler |
|
||||
|
|
@@ -164,4 +161,3 @@ procedures) are documented in `infra/docs/runbooks/`:
|
|||
| pfSense + Unbound DNS | [pfsense-unbound.md](../../docs/runbooks/pfsense-unbound.md) |
|
||||
| Mailserver PROXY-protocol / HAProxy | [mailserver-pfsense-haproxy.md](../../docs/runbooks/mailserver-pfsense-haproxy.md) |
|
||||
| Technitium apply flow | [technitium-apply.md](../../docs/runbooks/technitium-apply.md) |
|
||||
| Goldmane flow trail (east-west who-talks-to-whom) | [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md) |
|
||||
|
|
|
|||
|
|
@@ -11,8 +11,8 @@ description: |
|
|||
There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
|
||||
Always use Home Assistant for smart home control.
|
||||
author: Claude Code
|
||||
version: 2.1.0
|
||||
date: 2026-06-24
|
||||
version: 2.0.0
|
||||
date: 2026-02-07
|
||||
---
|
||||
|
||||
# Home Assistant Control
|
||||
|
|
@@ -395,27 +395,14 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
|
|||
## ha-london Knowledge Map
|
||||
|
||||
### Overview
|
||||
- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
|
||||
- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
|
||||
- **Location**: London, UK
|
||||
- **Platform**: Raspberry Pi 4, HA OS
|
||||
- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
|
||||
- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
|
||||
- **Config path**: `/config/`
|
||||
- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
|
||||
- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
|
||||
- **Config path**: `/config/` (requires `sudo` for file access)
|
||||
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
|
||||
- **Zone**: London (home)
|
||||
|
||||
### Dashboards (redesigned 2026-06-24)
|
||||
**Glossary** (HA terms — keep distinct):
|
||||
- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
|
||||
- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
|
||||
- **Card** = a widget inside a view.
|
||||
|
||||
- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
|
||||
- **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
|
||||
- **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
|
||||
- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.)
|
||||
- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
|
||||
|
||||
### Key Systems
|
||||
|
||||
#### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
|
||||
|
|
@ -437,15 +424,10 @@ Named plugs with power/energy tracking:
|
|||
- PM1.0/2.5/4.0/10 particulate sensors
|
||||
- VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
|
||||
|
||||
#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
|
||||
Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
|
||||
- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
|
||||
- `sensor.classic_performance_remaining_range`: Range km
|
||||
- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
|
||||
- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
|
||||
- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
|
||||
- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
|
||||
- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
|
||||
#### 3. Cowboy E-Bike
|
||||
- `sensor.bike_state_of_charge`: Battery %
|
||||
- `sensor.bike_total_distance`: Total km
|
||||
- `sensor.bike_total_co2_saved`: CO2 saved (grams)
|
||||
|
||||
#### 4. Uptime Monitoring (UptimeRobot)
|
||||
- `sensor.blog`: blog uptime
|
||||
|
|
@ -464,17 +446,12 @@ Bike named **"Classic Performance"** → entities are `sensor.classic_performanc
|
|||
- Scripts: `script.start_netflix`, `script.start_stremio`
|
||||
- Scene: `scene.night` (turns off Livia + Michelle plugs)
|
||||
|
||||
### Custom Components (HACS integrations)
|
||||
- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
|
||||
- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
|
||||
|
||||
### HACS frontend cards (plugins)
|
||||
- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
|
||||
### Custom Components
|
||||
- **cowboy**: Cowboy e-bike integration (HACS)
|
||||
- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
|
||||
|
||||
### Integrations
|
||||
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
|
||||
- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
|
||||
- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
|
||||
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
|
||||
|
||||
### AI / Voice Assistants
|
||||
- 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
|
||||
|
|
@ -489,8 +466,15 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ook
|
|||
- Anca arrival/departure notifications
|
||||
- Night scene: turns off Livia + Michelle
|
||||
|
||||
### Platform (HAOS — ignore any legacy `docker run` snippet)
|
||||
ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker).
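
The restart call described above can be sketched in Go as follows. This is a minimal illustration, assuming a long-lived access token is already in hand (for example from `homelab ha token --instance london`); the helper name `newRestartRequest` is hypothetical, and the sketch only builds the request rather than sending it.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// newRestartRequest (hypothetical helper) builds, but does not send, the POST
// that asks Home Assistant to restart. baseURL and token come from the caller.
func newRestartRequest(baseURL, token string) (*http.Request, error) {
	target := strings.TrimRight(baseURL, "/") + "/api/services/homeassistant/restart"
	// The service takes no arguments, so an empty JSON object is sent as the body.
	req, err := http.NewRequest(http.MethodPost, target, strings.NewReader("{}"))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// TOKEN_PLACEHOLDER is not a real credential.
	req, err := newRestartRequest("https://ha-london.viktorbarzin.me", "TOKEN_PLACEHOLDER")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

After sending such a request, polling `sensor.uptime` until it resets is the "back up" marker named above.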
|
||||
### Docker Setup
|
||||
```bash
|
||||
docker run -d --name homeassistant --privileged \
|
||||
-e TZ=Europe/London \
|
||||
-v /home/pi/docker/homeAssistant:/config \
|
||||
-v /run/dbus:/run/dbus:ro \
|
||||
--network=host --restart=unless-stopped \
|
||||
homeassistant/home-assistant:2025.9
|
||||
```
|
||||
|
||||
### SSH Access
|
||||
```bash
|
||||
|
|
|
|||
39
.github/workflows/build-authentik.yml
vendored
|
|
@ -1,39 +0,0 @@
|
|||
name: Build Custom Authentik Image
|
||||
|
||||
# ADR-0002: infra-owned image built off-infra on GHA → ghcr.
|
||||
# Thin SLOW-1a overlay over the official authentik server (narrows the login
|
||||
# identification stage's select_subclasses() to the login-capable source subtypes;
|
||||
# see stacks/authentik/Dockerfile). Rebuild only when the Dockerfile changes — on
|
||||
# every authentik bump, edit the FROM tag + the patchN suffix here + the image tag
|
||||
# in modules/authentik/values.yaml together.
|
||||
on:
|
||||
push:
|
||||
branches: [master]
|
||||
paths:
|
||||
- 'stacks/authentik/Dockerfile'
|
||||
workflow_dispatch: {}
|
||||
|
||||
permissions:
|
||||
contents: read
|
||||
packages: write
|
||||
|
||||
jobs:
|
||||
build:
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
- uses: docker/setup-buildx-action@v3
|
||||
- uses: docker/login-action@v3
|
||||
with:
|
||||
registry: ghcr.io
|
||||
username: ${{ github.actor }}
|
||||
password: ${{ secrets.GITHUB_TOKEN }}
|
||||
- uses: docker/build-push-action@v6
|
||||
with:
|
||||
context: stacks/authentik
|
||||
platforms: linux/amd64
|
||||
provenance: false
|
||||
push: true
|
||||
tags: |
|
||||
ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3
|
||||
ghcr.io/viktorbarzin/authentik-server:latest
|
||||
|
|
@ -65,21 +65,6 @@ steps:
|
|||
# don't need explicit token propagation.
|
||||
VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200
|
||||
commands:
|
||||
# ── Forge guard: apply ONLY on the canonical Forgejo forge ──
|
||||
# infra is registered in Woodpecker on BOTH the Forgejo canonical repo and
|
||||
# the legacy GitHub mirror, and BOTH fire this push pipeline. Without this
|
||||
# guard both run `terragrunt apply` on every push and race each other for
|
||||
# the per-stack PG state lock — the dominant cause of the "Error acquiring
|
||||
# the state lock" failures + push-supersede "killed" runs. The GitHub-mirror
|
||||
# registration keeps running the CRONS (drift-detection, renew-tls, …) — only
|
||||
# its duplicate push-apply no-ops here. Fail-open: an unknown forge (neither
|
||||
# env var set) still applies, preserving prior behaviour.
|
||||
- |
|
||||
if echo "${CI_REPO_URL:-}${CI_FORGE_URL:-}" | grep -qi 'github\.com'; then
|
||||
echo "[forge-guard] GitHub-mirror push — apply runs only on the Forgejo canonical repo (avoids double-apply + state-lock races). Skipping."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# ── Skip CI commits ──
|
||||
- |
|
||||
if echo "$CI_COMMIT_MESSAGE" | grep -q '\[CI SKIP\]\|\[ci skip\]'; then
|
||||
|
|
@ -228,40 +213,23 @@ steps:
|
|||
if [ -s .platform_apply ]; then
|
||||
echo "=== Applying platform stacks (serial, locked) ==="
|
||||
while read -r stack; do
|
||||
# Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role
|
||||
# lacks Vault-admin perms (sys/mounts + sys/policies/acl), so a CI
|
||||
# apply always 403s and fails the pipeline. Kept in PLATFORM_STACKS
|
||||
# (so the app-stack detector still excludes it) but skipped here.
|
||||
# (2026-06-27 — see docs/architecture/ci-cd.md)
|
||||
if [ "$stack" = "vault" ]; then echo "[vault] SKIPPED (Tier-0, human-applied via OIDC)"; continue; fi
|
||||
echo "[$stack] Starting apply..."
|
||||
ATTEMPT=0
|
||||
while :; do
|
||||
ATTEMPT=$((ATTEMPT + 1))
|
||||
set +e
|
||||
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
|
||||
EXIT=$?
|
||||
set -e
|
||||
if [ $EXIT -eq 0 ]; then
|
||||
echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
|
||||
if [ $EXIT -ne 0 ]; then
|
||||
if echo "$OUTPUT" | grep -q "is locked by"; then
|
||||
echo "[$stack] SKIPPED (locked by another session)"
|
||||
else
|
||||
echo "$OUTPUT" | tail -50
|
||||
echo "[$stack] FAILED (exit $EXIT)"
|
||||
FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
|
||||
fi
|
||||
# Lock contention → SKIP, not fail. Match BOTH the Tier-0 Vault lock
|
||||
# ("is locked by", from scripts/tg) AND the Tier-1 PG-backend lock
|
||||
# ("Error acquiring the state lock" / "already locked"). The PG case
|
||||
# was previously counted as a failure — the #1 source of false reds.
|
||||
if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
|
||||
echo "[$stack] SKIPPED (locked by another session/run)"; break
|
||||
else
|
||||
echo "$OUTPUT" | tail -3
|
||||
echo "[$stack] OK"
|
||||
fi
|
||||
# Transient: provider-registry download timeout / Vault 5xx → bounded
|
||||
# retry. Deliberately NOT helm atomic-timeouts or config errors
|
||||
# (missing arg, invalid index) — those must fail fast, retry can't fix
|
||||
# them and can worsen a stuck helm release.
|
||||
if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
|
||||
echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
|
||||
fi
|
||||
echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
|
||||
FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"; break
|
||||
done
|
||||
done < .platform_apply
|
||||
fi
|
||||
# Deferred until after app stacks so both lists get a chance to run.
|
||||
|
|
@ -274,27 +242,22 @@ steps:
|
|||
echo "=== Applying app stacks (serial, locked) ==="
|
||||
while read -r stack; do
|
||||
echo "[$stack] Starting apply..."
|
||||
ATTEMPT=0
|
||||
while :; do
|
||||
ATTEMPT=$((ATTEMPT + 1))
|
||||
set +e
|
||||
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
|
||||
EXIT=$?
|
||||
set -e
|
||||
if [ $EXIT -eq 0 ]; then
|
||||
echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
|
||||
if [ $EXIT -ne 0 ]; then
|
||||
if echo "$OUTPUT" | grep -q "is locked by"; then
|
||||
echo "[$stack] SKIPPED (locked by another session)"
|
||||
else
|
||||
echo "$OUTPUT" | tail -50
|
||||
echo "[$stack] FAILED (exit $EXIT)"
|
||||
FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"
|
||||
fi
|
||||
# Lock contention → SKIP, not fail (Tier-0 Vault + Tier-1 PG; see platform loop).
|
||||
if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
|
||||
echo "[$stack] SKIPPED (locked by another session/run)"; break
|
||||
else
|
||||
echo "$OUTPUT" | tail -3
|
||||
echo "[$stack] OK"
|
||||
fi
|
||||
# Transient provider-download / Vault 5xx → bounded retry (see platform loop).
|
||||
if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
|
||||
echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
|
||||
fi
|
||||
echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
|
||||
FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"; break
|
||||
done
|
||||
done < .app_apply
|
||||
fi
|
||||
# Fail the step loudly so the pipeline `default` workflow state
|
||||
|
|
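
The outcome handling in the two apply loops above (success, lock contention, bounded retry on transient errors, hard failure) can be restated as a small decision function. The Go sketch below is only an illustration of that branch order, not the pipeline's implementation; the function name `classify` and its return labels are assumptions, while the two patterns are copied from the `grep -qE` expressions in the diff.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Lock contention: Tier-0 Vault lock plus Tier-1 PG-backend lock messages.
	lockRe = regexp.MustCompile(`is locked by|Error acquiring the state lock|already locked`)
	// Transient failures worth retrying: provider download timeouts, Vault 5xx.
	transientRe = regexp.MustCompile(`Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]`)
)

const maxAttempts = 3

// classify (hypothetical name) mirrors the branch order of the shell loop:
// exit 0 wins, then lock contention, then a bounded retry, else failure.
func classify(exit, attempt int, output string) string {
	switch {
	case exit == 0:
		return "ok"
	case lockRe.MatchString(output):
		return "skipped"
	case attempt < maxAttempts && transientRe.MatchString(output):
		return "retry"
	default:
		return "failed"
	}
}

func main() {
	fmt.Println(classify(1, 1, "Error: Error acquiring the state lock"))
	fmt.Println(classify(1, 1, "Failed to install provider hashicorp/vault"))
	fmt.Println(classify(1, 3, "Failed to install provider hashicorp/vault"))
}
```

Checking the lock pattern before the transient pattern matters: a locked stack is skipped immediately instead of burning retries against a lock that another run holds.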
|
|||
|
|
@ -85,13 +85,6 @@ steps:
|
|||
stack=$(basename "$stack_dir")
|
||||
[ -f "$stack_dir/terragrunt.hcl" ] || continue
|
||||
|
||||
# Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role lacks
|
||||
# Vault-admin perms (sys/mounts + sys/policies/acl), so `terragrunt plan`
|
||||
# on it ERRORs (detailed-exitcode 1) and fails the whole nightly drift
|
||||
# run. Skip it — drift on Tier-0 vault is caught at human apply time.
|
||||
# (2026-06-27)
|
||||
[ "$stack" = "vault" ] && continue
|
||||
|
||||
echo -n "[$stack] planning... "
|
||||
OUTPUT=$(cd "$stack_dir" && terragrunt plan -detailed-exitcode -input=false 2>&1)
|
||||
EXIT=$?
|
||||
|
|
|
|||
|
|
@ -273,11 +273,8 @@ To land a finished change from such a clone:
|
|||
Slack audit feed; a no-op CI apply on a docs-only commit is harmless.
|
||||
4. Leave the clone on clean `master` so auto-refresh keeps working.
|
||||
5. Tell the user in plain language what happened. Stack changes are
|
||||
auto-applied by CI on push — or, with apply access, applied locally yourself
|
||||
(`scripts/tg apply`, from the main checkout, not a worktree); either path is
|
||||
fine, but the change must always be committed here, never applied
|
||||
uncommitted. Verify the live result with the user's read-only kubectl before
|
||||
saying "it's live".
|
||||
auto-applied by CI — verify the live result with the user's read-only
|
||||
kubectl before saying "it's live".
|
||||
|
||||
If a push to `master` is rejected by branch protection (user not on the
|
||||
whitelist — e.g. new users before Viktor grants it), fall back to a
|
||||
|
|
|
|||
|
|
@ -125,7 +125,7 @@ How a **Service** is named in flow/audit data — its **namespace** is the prima
|
|||
_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
|
||||
|
||||
**Goldmane / Whisker**:
|
||||
Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#alerts` digest (the `#security` channel was abandoned 2026-06-25). As-built: `docs/runbooks/goldmane-flow-trail.md`.
|
||||
Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). Durable history requires emitting Goldmane flows to **Loki**; the in-memory buffer alone is not an audit trail.
|
||||
_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
|
||||
|
||||
### Storage
|
||||
|
|
|
|||
|
|
@ -202,69 +202,6 @@ runs on the devvm, `setInputFiles` streams local files to the remote browser ove
|
|||
CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md`
|
||||
and `docs/adr/0013`.
|
||||
|
||||
### v0.9 verbs — edges (east-west "who-talks-to-whom" trail)
|
||||
|
||||
Read-only investigation helper over the `goldmane_edges` CNPG trail (ADR-0014):
|
||||
filters render to a single safe `SELECT` (namespace values validated to the k8s
|
||||
name charset) run via the dbaas primary pod — the same exec path as `k8s db`.
|
||||
|
||||
| Command | Tier | What it does |
|
||||
| --- | --- | --- |
|
||||
| `edges --ns <ns>` | read | edges touching `<ns>` (either direction) |
|
||||
| `edges --src <ns>` / `--dst <ns>` | read | directional: `<ns>`'s egress / ingress peers |
|
||||
| `edges --peers-of <ns>` | read | distinct peer namespaces of `<ns>` (both directions) |
|
||||
| `edges --new-since <24h\|7d\|YYYY-MM-DD>` | read | edges first seen since a duration or date |
|
||||
| `edges --denied` | read | only `action='deny'` edges (blocked / lateral-movement) |
|
||||
| `edges --json` / `--limit N` | read | JSON array output / row cap (default 200) |
|
||||
|
||||
### v0.10 — `vault get --all` (browse every field)
|
||||
|
||||
`vault get <name> --all` returns the **whole item** as a normalized JSON object,
|
||||
so an agent can discover and read fields the single-field `--field` allowlist
|
||||
can't reach — notably arbitrary **custom fields**.
|
||||
|
||||
| Command | Tier | What it does |
|
||||
| --- | --- | --- |
|
||||
| `vault get <name> --all` | read | all fields as JSON: `{name, username?, password?, uris?, totp?, notes?, fields?}` |
|
||||
|
||||
Shape notes: only standard fields that are present are emitted (empty ones are omitted); `fields` is a
|
||||
custom `name→value` map (duplicate names → last-wins; `linked` fields skipped).
|
||||
The TOTP **seed is never emitted** — `totp` is a presence flag (`true`), so the
|
||||
only seed-derived path stays the specially-audited `vault code`. Like
|
||||
`get --json`, the dump is all secret values, so it **refuses a terminal** — pipe
|
||||
it (`homelab vault get <name> --all | jq`).
|
||||
|
||||
### v0.10.1 — reads `bw sync` first (always fresh)
|
||||
|
||||
Every vault read (`get`, `get --all`, `list`, `code`, `status`) now runs `bw
|
||||
sync` when opening its session, so it reflects the latest server-side values.
|
||||
`bw unlock` only decrypts the *local* cache, so without this a persisted
|
||||
(already-logged-in) session served stale data — a password changed in the web
|
||||
vault wouldn't show up until the next login. The sync is **best-effort**: a
|
||||
transient failure warns on stderr and falls back to the cached vault rather than
|
||||
failing the read.
|
||||
|
||||
### v0.11 — `vault kv` (HashiCorp Vault / OpenBao infra secrets)
|
||||
|
||||
`homelab vault` now fronts **two unrelated stores**, made explicit in the bare
|
||||
`homelab vault` help and via `[vaultwarden]` / `[hashicorp-vault]` summary tags:
|
||||
|
||||
- **Vaultwarden** — your personal password manager (`vault get/list/code/…`, unchanged).
|
||||
- **HashiCorp Vault / OpenBao** — homelab infra secrets, the `secret/…` KV store, under `vault kv`.
|
||||
|
||||
| Command | Tier | What it does |
|
||||
| --- | --- | --- |
|
||||
| `vault kv get <path> [--field K]` | read | read a secret: `--field K` → one value (TTY-aware clipboard/stdout); no field → all fields as JSON (refuses a bare TTY) |
|
||||
| `vault kv list <path>` | read | list sub-paths under `<path>` (no values) |
|
||||
| `vault kv put <path> <key>` | write | write one key; **value via stdin** (piped or no-echo prompt, never argv); creates the path or **merges** (never clobbers siblings) |
|
||||
|
||||
**Different credentials:** the Vaultwarden verbs use the per-user *scoped* token
|
||||
(bound to `claude-users/<user>`); `vault kv` uses your **own** Vault token
|
||||
(`vault login -method=oidc` → `~/.vault-token`, or `$VAULT_TOKEN`) — the kv
|
||||
handlers set `VAULT_ADDR` but never inject the scoped token (which would 403 off
|
||||
its own path). Access is whatever your policy grants. Writes are merge-only;
|
||||
a replace-style `put` and `delete` are out of scope (use the raw `vault` CLI for those).
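
The merge-only write semantics above amount to "copy the existing keys, then set one". A minimal Go sketch of that idea follows; `mergeKey` is a hypothetical name, and the real command talks to the KV store rather than to an in-memory map.

```go
package main

import "fmt"

// mergeKey (hypothetical) returns a copy of existing with key set to value.
// Sibling keys are preserved, and a nil input behaves like a new, empty path.
func mergeKey(existing map[string]string, key, value string) map[string]string {
	out := make(map[string]string, len(existing)+1)
	for k, v := range existing {
		out[k] = v
	}
	out[key] = value
	return out
}

func main() {
	current := map[string]string{"db_user": "app", "db_pass": "old"}
	updated := mergeKey(current, "db_pass", "new")
	fmt.Println(updated["db_user"], updated["db_pass"], len(updated))
}
```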
|
||||
|
||||
## Build / install
|
||||
|
||||
Built from source to `/usr/local/bin/homelab` during devvm provisioning
|
||||
|
|
|
|||
|
|
@ -1 +1 @@
|
|||
v0.11.0
|
||||
v0.8.1
|
||||
|
|
|
|||
|
|
@ -1,69 +0,0 @@
|
|||
package main
|
||||
|
||||
import "fmt"
|
||||
|
||||
func edgesCommands() []Command {
|
||||
return []Command{
|
||||
{Path: []string{"edges"}, Tier: TierRead,
|
||||
Summary: "who-talks-to-whom trail: edges [--ns|--src|--dst|--peers-of N] [--new-since 24h] [--denied] [--json] [--limit N]",
|
||||
Run: edgesRun},
|
||||
}
|
||||
}
|
||||
|
||||
// edgesRun renders the filter flags to SQL and runs it read-only against the
|
||||
// goldmane_edges CNPG DB via the dbaas primary pod (same exec path as `k8s db`).
|
||||
func edgesRun(args []string) error {
|
||||
for _, a := range args {
|
||||
if a == "-h" || a == "--help" {
|
||||
fmt.Print(edgesUsage())
|
||||
return nil
|
||||
}
|
||||
}
|
||||
o, err := parseEdgesArgs(args)
|
||||
if err != nil {
|
||||
return fmt.Errorf("%w\n\n%s", err, edgesUsage())
|
||||
}
|
||||
sql, err := buildEdgesQuery(o)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
// pg-cluster-rw is a Service (not exec-able); resolve the primary POD.
|
||||
pod, err := kubectlCapture("dbaas", "get", "pod", "-l", "cnpg.io/instanceRole=primary",
|
||||
"-o", "jsonpath={.items[0].metadata.name}")
|
||||
if err != nil || pod == "" {
|
||||
return fmt.Errorf("could not resolve CNPG primary pod in dbaas: %v", err)
|
||||
}
|
||||
exec := []string{"exec", pod, "-c", "postgres", "--", "psql", "-U", "postgres", "-d", "goldmane_edges"}
|
||||
if o.asJSON {
|
||||
exec = append(exec, "-tAc", sql) // raw tuple → the JSON array
|
||||
} else {
|
||||
exec = append(exec, "-P", "pager=off", "-c", sql) // aligned table for humans
|
||||
}
|
||||
return kubectlStream("dbaas", exec...)
|
||||
}
|
||||
|
||||
func edgesUsage() string {
|
||||
return `homelab edges — query the who-talks-to-whom trail (goldmane_edges, ADR-0014)
|
||||
|
||||
Usage: homelab edges [filters]
|
||||
|
||||
Filters (AND-combined; namespace values are validated to the k8s name charset):
|
||||
--ns NAME edges touching NAME (either direction)
|
||||
--src NAME edges where source namespace = NAME
|
||||
--dst NAME edges where destination namespace = NAME
|
||||
--peers-of NAME distinct peer namespaces of NAME (both directions)
|
||||
--new-since SPEC first seen since SPEC: a duration (24h, 7d, 30m, 90s) or a date (YYYY-MM-DD)
|
||||
--denied only denied (action='deny') edges — blocked / lateral-movement attempts
|
||||
--json output a JSON array (for agents/pipelines)
|
||||
--limit N cap rows (default 200)
|
||||
|
||||
Examples:
|
||||
homelab edges --ns immich # everything immich talks to / is talked to by
|
||||
homelab edges --peers-of authentik # authentik's peer namespaces
|
||||
homelab edges --src recruiter-responder # that namespace's egress peers
|
||||
homelab edges --new-since 24h # edges first seen in the last day
|
||||
homelab edges --denied --json # blocked flows, machine-readable
|
||||
|
||||
Read-only SELECT against CNPG DB goldmane_edges via the dbaas primary pod.
|
||||
`
|
||||
}
|
||||
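
The usage text above says namespace values are validated to the k8s name charset before they reach SQL. The diff does not show `parseEdgesArgs` or `buildEdgesQuery`, so the following Go sketch is only one plausible shape for that check: the RFC 1123 label rule is the standard Kubernetes namespace constraint, while the function names and the table and column names (`edges`, `src_ns`, `dst_ns`) are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// RFC 1123 label: lowercase alphanumerics and '-', alphanumeric at both ends.
var nsRe = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// validNamespace reports whether s is a legal Kubernetes namespace name.
func validNamespace(s string) bool {
	return len(s) <= 63 && nsRe.MatchString(s)
}

// buildNsFilter (hypothetical) renders an "either direction" filter. Because
// the value is restricted to [a-z0-9-], it cannot carry quotes or SQL syntax.
func buildNsFilter(ns string, limit int) (string, error) {
	if !validNamespace(ns) {
		return "", fmt.Errorf("invalid namespace %q", ns)
	}
	if limit <= 0 {
		limit = 200 // default row cap named in the usage text
	}
	return fmt.Sprintf(
		"SELECT * FROM edges WHERE src_ns = '%s' OR dst_ns = '%s' LIMIT %d",
		ns, ns, limit), nil
}

func main() {
	q, err := buildNsFilter("immich", 0)
	fmt.Println(q, err)
	_, err = buildNsFilter("x'; DROP TABLE edges;--", 0)
	fmt.Println(err)
}
```

Validating against an allowlist charset, rather than escaping, is what lets a plain string-built `SELECT` stay safe here.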
|
|
@ -54,7 +54,10 @@ func printMemories(raw []byte, jsonOut bool) error {
|
|||
return nil
|
||||
}
|
||||
for _, m := range r.Memories {
|
||||
c := truncatePreview(strings.ReplaceAll(m.Content, "\n", " "), 240)
|
||||
c := strings.ReplaceAll(m.Content, "\n", " ")
|
||||
if len(c) > 240 {
|
||||
c = c[:240] + "…"
|
||||
}
|
||||
fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
|
||||
if m.Tags != "" {
|
||||
fmt.Printf(" tags: %s\n", m.Tags)
|
||||
|
|
@ -63,21 +66,6 @@ func printMemories(raw []byte, jsonOut bool) error {
|
|||
return nil
|
||||
}
|
||||
|
||||
// truncatePreview shortens s to at most maxRunes RUNES, appending "…" when it
|
||||
// trims. Counting runes (not bytes) is load-bearing: a byte slice like s[:240]
|
||||
// can cut through the middle of a multibyte UTF-8 character (e.g. 2-byte
|
||||
// Cyrillic), leaving a dangling lead byte = invalid UTF-8. That crashed strict
|
||||
// decoders downstream — notably the homelab-memory-recall.py UserPromptSubmit
|
||||
// hook (subprocess text=True), which surfaced as a recurring "UserPromptSubmit
|
||||
// hook error" for Cyrillic-language users.
|
||||
func truncatePreview(s string, maxRunes int) string {
|
||||
r := []rune(s)
|
||||
if len(r) <= maxRunes {
|
||||
return s
|
||||
}
|
||||
return string(r[:maxRunes]) + "…"
|
||||
}
|
||||
|
||||
func memoryRecall(args []string) error {
|
||||
req := memRecallReq{}
|
||||
jsonOut := false
|
||||
|
|
|
|||
357
cli/cmd_vault.go
|
|
@ -4,7 +4,6 @@ import (
|
|||
"bufio"
|
||||
"encoding/base64"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"fmt"
|
||||
"os"
|
||||
"os/exec"
|
||||
|
|
@ -16,60 +15,43 @@ import (
|
|||
// Identity is the kernel UID; per-user creds live in that user's isolated Vault
|
||||
// path (secret/workstation/claude-users/<user>) read via their scoped token, and
|
||||
// decryption is done by the official `bw` CLI. See
|
||||
// docs/runbooks/homelab-vault-onboarding.md.
|
||||
// docs/superpowers/specs/2026-06-24-homelab-vault-design.md.
|
||||
func vaultCommands() []Command {
|
||||
cmds := []Command{
|
||||
// Vaultwarden — your personal password manager (logins/passwords/TOTP).
|
||||
return []Command{
|
||||
{Path: []string{"vault", "setup"}, Tier: TierWrite,
|
||||
Summary: "[vaultwarden] one-time: store your master password + API key in your Vault path", Run: vaultSetup},
|
||||
Summary: "one-time: store your Vaultwarden master password + API key in your Vault path", Run: vaultSetup},
|
||||
{Path: []string{"vault", "status"}, Tier: TierRead,
|
||||
Summary: "[vaultwarden] show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
|
||||
Summary: "show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
|
||||
{Path: []string{"vault", "list"}, Tier: TierRead,
|
||||
Summary: "[vaultwarden] list your item names: vault list [--search Q]", Run: vaultList},
|
||||
Summary: "list your item names: vault list [--search Q]", Run: vaultList},
|
||||
{Path: []string{"vault", "get"}, Tier: TierRead,
|
||||
Summary: "[vaultwarden] fetch one login: vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]", Run: vaultGet},
|
||||
Summary: "fetch one item: vault get <name> [--field password|username|uri|notes|totp] [--json]", Run: vaultGet},
|
||||
{Path: []string{"vault", "search"}, Tier: TierRead,
|
||||
Summary: "[vaultwarden] search your item names: vault search <query>", Run: vaultSearch},
|
||||
Summary: "search your item names: vault search <query>", Run: vaultSearch},
|
||||
{Path: []string{"vault", "code"}, Tier: TierRead,
|
||||
Summary: "[vaultwarden] current TOTP code for an item: vault code <name>", Run: vaultCode},
|
||||
Summary: "current TOTP code for an item: vault code <name>", Run: vaultCode},
|
||||
{Path: []string{"vault", "lock"}, Tier: TierWrite,
|
||||
Summary: "[vaultwarden] lock/log out the local bw session", Run: vaultLock},
|
||||
Summary: "lock/log out the local bw session", Run: vaultLock},
|
||||
{Path: []string{"vault"}, Tier: TierRead,
|
||||
Summary: "two stores: Vaultwarden (logins) + HashiCorp Vault/OpenBao kv (infra secrets) — run `homelab vault` for help",
|
||||
Summary: "Vaultwarden access for your own vault (run `homelab vault` for help)",
|
||||
Run: func([]string) error { fmt.Print(vaultHelp()); return nil }},
|
||||
}
|
||||
// HashiCorp Vault / OpenBao — homelab INFRA secrets (the secret/… KV store).
|
||||
return append(cmds, vaultKVCommands()...)
|
||||
}
|
||||
|
||||
// vaultHelp is shown for bare `homelab vault`. It LEADS with the distinction
|
||||
// between the two unrelated "vaults" this command fronts, because the name
|
||||
// collides: Vaultwarden (a password manager) vs HashiCorp Vault / OpenBao (the
|
||||
// infra secrets store).
|
||||
// vaultHelp is shown for bare `homelab vault`.
|
||||
func vaultHelp() string {
|
||||
return `homelab vault — two different secret stores under one command:
|
||||
return `homelab vault — read YOUR OWN Vaultwarden logins (no-HITL after one-time setup)
|
||||
|
||||
• Vaultwarden your personal PASSWORD MANAGER (logins / passwords / TOTP)
|
||||
• HashiCorp Vault / OpenBao homelab INFRA secrets (the secret/… KV store) → 'vault kv …'
|
||||
|
||||
── Vaultwarden (reads YOUR OWN vault; no-HITL after one-time setup) ──
|
||||
homelab vault setup one-time: store your master password + API key in your Vault path
|
||||
homelab vault status configured / unlocked / reachable (no secrets)
|
||||
homelab vault list [--search Q] list your item names (no secrets)
|
||||
homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
|
||||
TTY → clipboard (auto-clears); piped → stdout
|
||||
homelab vault get <name> --all all fields (incl. custom) as JSON; piped only.
|
||||
TOTP shown as presence flag — use 'vault code' for a code.
|
||||
homelab vault code <name> current TOTP code
|
||||
homelab vault lock lock / log out the local bw session
|
||||
|
||||
── HashiCorp Vault / OpenBao (infra secrets; uses your own OIDC vault token) ──
|
||||
homelab vault kv get <path> [--field K] read an infra KV secret
|
||||
homelab vault kv list <path> list sub-paths
|
||||
homelab vault kv put <path> <key> write one key (value via stdin)
|
||||
|
||||
Vaultwarden creds live only in your own Vault path; the admin never sees them.
|
||||
Security model: docs/runbooks/homelab-vault-onboarding.md
|
||||
Creds live only in your own Vault path; the admin never sees them. Identity is
|
||||
your unix UID. Security model: docs/superpowers/specs/2026-06-24-homelab-vault-design.md
|
||||
(note: anything running as your user can decrypt your vault — the accepted no-HITL trade).
|
||||
`
|
||||
}
|
||||
|
|
@ -97,33 +79,7 @@ func realRunner(name string, argv, envv []string) (string, error) {
|
|||
out, err := cmd.Output()
|
||||
// Trim only the trailing newline the tool appends — NOT all whitespace, so a
|
||||
// fetched secret with significant leading/trailing spaces is preserved.
|
||||
return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err))
|
||||
}
|
||||
|
||||
// exitStderr returns the stderr captured by cmd.Output() on a failed exec (it
|
||||
// stows it on *exec.ExitError), or nil. The tools we shell out to (vault, bw)
|
||||
// write the actionable message there — "connection refused", "permission
|
||||
// denied" — which the caller would otherwise never see behind a bare
|
||||
// "exit status N".
|
||||
func exitStderr(err error) []byte {
|
||||
var ee *exec.ExitError
|
||||
if errors.As(err, &ee) {
|
||||
return ee.Stderr
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// augmentErr appends captured stderr to an error so failures are diagnosable
|
||||
// (not just "exit status 2"). Returns nil when err is nil, and err unchanged
|
||||
// when there's no stderr; preserves the wrapped error for errors.Is/As.
|
||||
func augmentErr(err error, stderr []byte) error {
|
||||
if err == nil {
|
||||
return nil
|
||||
}
|
||||
if s := strings.TrimSpace(string(stderr)); s != "" {
|
||||
return fmt.Errorf("%w: %s", err, s)
|
||||
}
|
||||
return err
|
||||
return strings.TrimRight(string(out), "\r\n"), err
|
||||
}
|
||||
|
||||
// realRunnerStdin runs a command feeding `stdin` to it, for secret values that
|
||||
|
|
@ -136,7 +92,7 @@ func realRunnerStdin(name string, argv, envv []string, stdin string) (string, er
|
|||
}
|
||||
cmd.Stdin = strings.NewReader(stdin)
|
||||
out, err := cmd.Output()
|
||||
return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err))
|
||||
return strings.TrimRight(string(out), "\r\n"), err
|
||||
}
|
||||
|
||||
func vwCredsPath(user string) string { return vwUserPathPrefix + user }
|
||||
|
|
@ -172,89 +128,6 @@ func loadCreds(run cmdRunner, user string) (vwCreds, error) {
|
|||
var vaultCurrentUser = func() string { return os.Getenv("USER") }
|
||||
var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) }
|
||||
|
||||
// scopedTokenPath is where claude-auth-sync keeps the user's scoped Vault token.
|
||||
// MUST match CAS_VAULT_TOKEN_FILE in scripts/workstation/claude-auth-sync.sh.
|
||||
func scopedTokenPath(home string) string {
|
||||
return home + "/.config/claude-auth-sync/vault-token"
|
||||
}
|
||||
|
||||
// vaultTokenSource decides which Vault token the `vault` child processes should
|
||||
// use. Precedence: an explicit $VAULT_TOKEN (deliberate override), then the
|
||||
// per-user scoped token claude-auth-sync maintains at scopedTokenPath(HOME)
|
||||
// (policy workstation-claude-<user>, which grants exactly the create/read/update
|
||||
// this tool needs on the user's own path), then a native ~/.vault-token.
|
||||
//
|
||||
// The scoped token MUST beat ~/.vault-token: this tool only ever touches the
|
||||
// caller's own secret/workstation/claude-users/<user> path, and a power-user who
|
||||
// ran `vault login -method=oidc` carries a read-only ~/.vault-token whose
|
||||
// capability on that path is `deny` — letting it win shadows the scoped token
|
||||
// and every op fails 403/deny (emo, 2026-06-28). ~/.vault-token is only the
|
||||
// right credential when there is no scoped token (admins). Returns the token to
|
||||
// export — "" when the vault CLI should read the ambient/native credential —
|
||||
// plus a source tag for tests/logging.
|
||||
func vaultTokenSource(envToken string, haveVaultTokenFile bool, scopedToken string) (token, source string) {
|
||||
switch {
|
||||
case envToken != "":
|
||||
return "", "env"
|
||||
case strings.TrimSpace(scopedToken) != "":
|
||||
return strings.TrimSpace(scopedToken), "scoped"
|
||||
case haveVaultTokenFile:
|
||||
return "", "file"
|
||||
default:
|
||||
return "", "none"
|
||||
}
|
||||
}
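The precedence described in the comment above is easiest to check as a small table. The following sketch copies the switch from `vaultTokenSource` into a hypothetical `pickToken` so it runs on its own; the token strings are made-up placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// pickToken reproduces the precedence: explicit env token, then the scoped
// token, then the native token file. It returns a token to export only for
// the scoped case; "" means "let the vault CLI use what it already has".
func pickToken(envToken string, haveTokenFile bool, scopedToken string) (token, source string) {
	switch {
	case envToken != "":
		return "", "env"
	case strings.TrimSpace(scopedToken) != "":
		return strings.TrimSpace(scopedToken), "scoped"
	case haveTokenFile:
		return "", "file"
	default:
		return "", "none"
	}
}

func main() {
	cases := []struct {
		env    string
		file   bool
		scoped string
	}{
		{"explicit", true, "scoped-tok"}, // explicit env wins; nothing exported
		{"", true, "scoped-tok\n"},       // scoped beats ~/.vault-token
		{"", true, ""},                   // only the native token file
		{"", false, ""},                  // nothing available
	}
	for _, c := range cases {
		tok, src := pickToken(c.env, c.file, c.scoped)
		fmt.Printf("source=%s export=%q\n", src, tok)
	}
}
```

The second case is the regression the comment describes: a token file is present, yet the scoped token is the one exported.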
|
||||
|
||||
// vaultAddrDefault is the cluster Vault the workstation talks to. The bw server
|
||||
// is likewise hardcoded (openSession), so a sane default here is consistent.
|
||||
const vaultAddrDefault = "https://vault.viktorbarzin.me"
|
||||
|
||||
// vaultAddrToSet returns the VAULT_ADDR to export when the caller's environment
|
||||
// doesn't already set one, else "". homelab vault is invoked by AFK agent
|
||||
// sessions — frequently non-login shells (tmux panes, agent subprocesses) that
|
||||
// never sourced /etc/environment — so, like claude-auth-sync, the CLI must NOT
|
||||
// depend on an ambient VAULT_ADDR; otherwise every `vault` child falls back to
|
||||
// the 127.0.0.1:8200 default and fails "connection refused" (exit 2).
|
||||
func vaultAddrToSet(envAddr string) string {
|
||||
if strings.TrimSpace(envAddr) == "" {
|
||||
return vaultAddrDefault
|
||||
}
|
||||
return ""
|
||||
}
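As a rough illustration of the defaulting rule, the sketch below applies the same "only when unset or blank" check to an explicit environment slice rather than calling `os.Setenv` as the diff does. That is a deliberate substitution to keep the example side-effect free. `fallbackAddr`, `addrToSet`, and `childEnv` are hypothetical names, and the address is a placeholder rather than the real cluster URL.

```go
package main

import (
	"fmt"
	"strings"
)

// fallbackAddr is a placeholder address used only for this sketch.
const fallbackAddr = "https://vault.example.test"

// addrToSet returns the address to export when envAddr is unset or blank,
// and "" when the caller already provided one.
func addrToSet(envAddr string) string {
	if strings.TrimSpace(envAddr) == "" {
		return fallbackAddr
	}
	return ""
}

// childEnv returns a copy of env with VAULT_ADDR appended when it is missing
// or blank. An explicit value is left untouched.
func childEnv(env []string) []string {
	current := ""
	for _, kv := range env {
		if strings.HasPrefix(kv, "VAULT_ADDR=") {
			current = strings.TrimPrefix(kv, "VAULT_ADDR=")
		}
	}
	out := append([]string{}, env...)
	if a := addrToSet(current); a != "" {
		out = append(out, "VAULT_ADDR="+a)
	}
	return out
}

func main() {
	fmt.Println(childEnv([]string{"HOME=/home/demo"}))
	fmt.Println(childEnv([]string{"VAULT_ADDR=https://other.example.test"}))
}
```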
|
||||
|
||||
// ensureVaultAddr exports the default VAULT_ADDR when none is set, so the vault
|
||||
// child processes reach the cluster Vault regardless of the caller's shell. An
|
||||
// explicit VAULT_ADDR (admins, CI) is left untouched.
|
||||
func ensureVaultAddr() {
|
||||
if a := vaultAddrToSet(os.Getenv("VAULT_ADDR")); a != "" {
|
||||
os.Setenv("VAULT_ADDR", a)
|
||||
}
|
||||
}
|
||||
|
||||
// fileNonEmpty reports whether path exists and has content.
|
||||
func fileNonEmpty(path string) bool {
|
||||
fi, err := os.Stat(path)
|
||||
return err == nil && fi.Size() > 0
|
||||
}
|
||||
|
||||
// ensureVaultToken wires vaultTokenSource to the real environment: unless an
|
||||
// explicit $VAULT_TOKEN is set, it exports the claude-auth-sync scoped token
|
||||
// (when one exists) so the `vault` child processes authenticate as
|
||||
// workstation-claude-<user>, even when a ~/.vault-token is also present. It is
|
||||
// idempotent and safe for admins: an explicit $VAULT_TOKEN always wins, and
|
||||
// ~/.vault-token is used as-is when no scoped token exists.
|
||||
func ensureVaultToken() {
|
||||
// Every vault verb funnels through here, so this is the one place that also
|
||||
// guarantees VAULT_ADDR is set (see vaultAddrToSet for why it can't be
|
||||
// assumed from the caller's shell).
|
||||
ensureVaultAddr()
|
||||
home := os.Getenv("HOME")
|
||||
scoped, _ := os.ReadFile(scopedTokenPath(home))
|
||||
tok, src := vaultTokenSource(os.Getenv("VAULT_TOKEN"), home != "" && fileNonEmpty(home+"/.vault-token"), string(scoped))
|
||||
if src == "scoped" {
|
||||
os.Setenv("VAULT_TOKEN", tok)
|
||||
}
|
||||
}
|
||||
|
||||
// bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately
|
||||
// do NOT inherit the full parent env (keeps stray secrets out of the child).
|
||||
func bwBaseEnv(appdata string) []string {
|
||||
|
|
@ -287,9 +160,7 @@ func bwSecretEnv(appdata string, c vwCreds, session string) []string {
|
|||
func bwLoginArgs() []string { return []string{"login", "--apikey"} }
|
||||
func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} }
|
||||
func bwGetArgs(field, name string) []string { return []string{"get", field, name} }
|
||||
func bwItemArgs(name string) []string { return []string{"get", "item", name} }
|
||||
func bwStatusArgs() []string { return []string{"status"} }
|
||||
func bwSyncArgs() []string { return []string{"sync"} }
|
||||
|
||||
// bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is
|
||||
// required. Unparseable/empty output → true (safer to attempt login).
|
||||
|
|
@ -456,23 +327,13 @@ func openSession(run cmdRunner, user, uid string) (session, error) {
|
|||
if err != nil {
|
||||
return session{}, err
|
||||
}
|
||||
sessEnv := bwSecretEnv(appdata, creds, sess)
|
||||
// Pull the latest server-side state so reads reflect current values. `bw
|
||||
// unlock` only decrypts the LOCAL cache, so a persisted (already-logged-in)
|
||||
// session would otherwise serve stale data until the next login. Best-effort:
|
||||
// a transient sync failure must not break a read — fall back to the cached
|
||||
// vault and warn (status reports reachability separately).
|
||||
if _, err := run("bw", bwSyncArgs(), sessEnv); err != nil {
|
||||
fmt.Fprintln(os.Stderr, "homelab vault: warning: bw sync failed; using cached vault (values may be stale): "+err.Error())
|
||||
}
|
||||
return session{env: sessEnv}, nil
|
||||
return session{env: bwSecretEnv(appdata, creds, sess)}, nil
|
||||
}
|
||||
|
||||
type getOpts struct {
|
||||
name string
|
||||
field string
|
||||
json bool
|
||||
all bool // dump every field (incl. custom) as normalized JSON
|
||||
}
|
||||
|
||||
var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true}
|
||||
|
|
@ -484,8 +345,6 @@ func parseGetArgs(args []string) (getOpts, error) {
|
|||
switch {
|
||||
case a == "--json":
|
||||
o.json = true
|
||||
case a == "--all":
|
||||
o.all = true
|
||||
case a == "--field" && i+1 < len(args):
|
||||
o.field = args[i+1]
|
||||
i++
|
||||
|
|
@ -496,10 +355,9 @@ func parseGetArgs(args []string) (getOpts, error) {
|
|||
}
|
||||
}
|
||||
if o.name == "" {
|
||||
return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]")
|
||||
return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json]")
|
||||
}
|
||||
// --all dumps the whole item, so --field is irrelevant — skip its allowlist.
|
||||
if !o.all && !validGetFields[o.field] {
|
||||
if !validGetFields[o.field] {
|
||||
return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field)
|
||||
}
|
||||
return o, nil
|
||||
|
|
@ -515,81 +373,6 @@ func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) {
|
|||
return bwGet(run, s.env, o.field, o.name)
|
||||
}
|
||||
|
||||
// getItem opens a session and returns the whole item as raw `bw get item` JSON.
|
||||
// Used by `get --all`; normalization is a separate, pure step (normalizeItem).
|
||||
func getItem(run cmdRunner, user, uid, name string) (string, error) {
|
||||
s, err := openSession(run, user, uid)
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
return run("bw", bwItemArgs(name), s.env)
|
||||
}
|
||||
|
||||
// normalizedItem is the browse-all-fields projection of a Vaultwarden item: the
|
||||
// standard login fields that are present, notes, and a flat map of custom field
|
||||
// name→value. bw internals (id, object, reprompt, passwordHistory) are dropped,
|
||||
// and the TOTP *seed* is reduced to a presence flag — the only seed-derived path
|
||||
// stays the specially-audited `vault code` (see the design §10/§16).
|
||||
type normalizedItem struct {
|
||||
Name string `json:"name"`
|
||||
Username string `json:"username,omitempty"`
|
||||
Password string `json:"password,omitempty"`
|
||||
URIs []string `json:"uris,omitempty"`
|
||||
TOTP bool `json:"totp,omitempty"` // presence only, never the seed
|
||||
Notes string `json:"notes,omitempty"`
|
||||
Fields map[string]string `json:"fields,omitempty"` // custom field name→value
|
||||
}
|
||||
|
||||
// bwFieldLinked is the Bitwarden custom-field type for a "linked" field: it
|
||||
// references another field and carries a null value, so it is not real data.
|
||||
const bwFieldLinked = 3
|
||||
|
||||
// normalizeItem parses a `bw get item` payload into the browse projection. It is
|
||||
// pure (no I/O), so it is the unit-tested heart of `get --all`.
|
||||
func normalizeItem(raw string) (normalizedItem, error) {
|
||||
var it struct {
|
||||
Name string `json:"name"`
|
||||
Notes string `json:"notes"`
|
||||
Login *struct {
|
||||
Username string `json:"username"`
|
||||
Password string `json:"password"`
|
||||
Totp string `json:"totp"`
|
||||
URIs []struct {
|
||||
URI string `json:"uri"`
|
||||
} `json:"uris"`
|
||||
} `json:"login"`
|
||||
Fields []struct {
|
||||
Name string `json:"name"`
|
||||
Value string `json:"value"`
|
||||
Type int `json:"type"`
|
||||
} `json:"fields"`
|
||||
}
|
||||
if err := json.Unmarshal([]byte(raw), &it); err != nil {
|
||||
return normalizedItem{}, fmt.Errorf("parse bw item: %w", err)
|
||||
}
|
||||
n := normalizedItem{Name: it.Name, Notes: it.Notes}
|
||||
if it.Login != nil {
|
||||
n.Username = it.Login.Username
|
||||
n.Password = it.Login.Password
|
||||
n.TOTP = it.Login.Totp != ""
|
||||
for _, u := range it.Login.URIs {
|
||||
if u.URI != "" {
|
||||
n.URIs = append(n.URIs, u.URI)
|
||||
}
|
||||
}
|
||||
}
|
||||
for _, f := range it.Fields {
|
||||
if f.Type == bwFieldLinked {
|
||||
continue // references another field, no value of its own
|
||||
}
|
||||
if n.Fields == nil {
|
||||
n.Fields = map[string]string{}
|
||||
}
|
||||
n.Fields[f.Name] = f.Value // duplicate names: last-wins (rare; documented)
|
||||
}
|
||||
return n, nil
|
||||
}
|
||||
|
||||
// clipboardDecision picks how to return a secret value. "stdout" prints it (a
|
||||
// pipe/agent — the intended machine path); "clipboard" copies via OSC52;
|
||||
// "refuse" emits nothing sensitive (would otherwise risk dumping the secret's
|
||||
|
|
@ -660,7 +443,6 @@ func runList(run cmdRunner, user, uid, search string) ([]string, error) {
|
|||
|
||||
func vaultList(args []string) error {
|
||||
hardenProcess()
|
||||
ensureVaultToken()
|
||||
search := ""
|
||||
for i := 0; i < len(args); i++ {
|
||||
if args[i] == "--search" && i+1 < len(args) {
|
||||
|
|
@ -695,7 +477,6 @@ func vaultSearch(args []string) error {
|
|||
|
||||
func vaultCode(args []string) error {
|
||||
hardenProcess()
|
||||
ensureVaultToken()
|
||||
if len(args) == 0 {
|
||||
return fmt.Errorf("usage: homelab vault code <name>")
|
||||
}
|
||||
|
|
@ -727,9 +508,7 @@ func statusSummary(run cmdRunner, user, uid string) string {
|
|||
if err != nil {
|
||||
return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error()
|
||||
}
|
||||
// openSession already did a best-effort sync; status re-runs it explicitly so
|
||||
// a reachability failure surfaces in this report rather than only on stderr.
|
||||
if _, err := run("bw", bwSyncArgs(), s.env); err != nil {
|
||||
if _, err := run("bw", []string{"sync"}, s.env); err != nil {
|
||||
return "vault: configured + unlocked, but sync/reachability failed: " + err.Error()
|
||||
}
|
||||
return "vault: configured, unlocked, reachable ✓"
|
||||
|
|
@ -737,7 +516,6 @@ func statusSummary(run cmdRunner, user, uid string) string {
|
|||
|
||||
func vaultStatus(args []string) error {
|
||||
hardenProcess()
|
||||
ensureVaultToken()
|
||||
uid := vaultCurrentUID()
|
||||
unlock, err := withUserLock(uid)
|
||||
if err != nil {
|
||||
|
|
@ -764,61 +542,32 @@ func vaultLock(args []string) error {
|
|||
return nil // lock/logout best-effort; never error the caller
|
||||
}
|
||||
|
||||
// kvWriteVerb selects the KV write semantics. merge=true → `kv patch -method=rw`
|
||||
// (read-modify-write: needs only read+update, NOT the `patch` capability the
|
||||
// scoped workstation-claude-<user> policy lacks, and preserves co-located keys
|
||||
// such as claude-auth-sync's claude_ai_oauth_json). merge=false → `kv put`
|
||||
// (creates the path on first use, before any sibling keys exist).
|
||||
func kvWriteVerb(merge bool) []string {
|
||||
if merge {
|
||||
return []string{"kv", "patch", "-method=rw"}
|
||||
}
|
||||
return []string{"kv", "put"}
|
||||
}
|
||||
|
||||
// vaultWritePublicArgs writes the non-secret identifiers via argv. Neither the
|
||||
// vaultPatchPublicArgs writes the non-secret identifiers via argv. Neither the
|
||||
// email nor the API client_id is a usable credential on its own.
|
||||
func vaultWritePublicArgs(merge bool, user, email, clientID string) []string {
|
||||
return append(kvWriteVerb(merge), vwCredsPath(user),
|
||||
"vaultwarden_email="+email,
|
||||
"vaultwarden_client_id="+clientID,
|
||||
)
|
||||
func vaultPatchPublicArgs(user, email, clientID string) []string {
|
||||
return []string{"kv", "patch", vwCredsPath(user),
|
||||
"vaultwarden_email=" + email,
|
||||
"vaultwarden_client_id=" + clientID,
|
||||
}
|
||||
}
|
||||
|
||||
// vaultWriteSecretArgs writes ONE secret value via the `key=-` stdin form, so the
|
||||
// value never appears in argv (ps / /proc/<pid>/cmdline). Fed on stdin by
|
||||
// realRunnerStdin.
|
||||
func vaultWriteSecretArgs(merge bool, user, key string) []string {
|
||||
return append(kvWriteVerb(merge), vwCredsPath(user), key+"=-")
|
||||
// vaultPatchSecretArgs writes ONE secret value via the `key=-` stdin form, so
|
||||
// the value never appears in argv (ps / /proc/<pid>/cmdline). The value is fed
|
||||
// on stdin by realRunnerStdin.
|
||||
func vaultPatchSecretArgs(user, key string) []string {
|
||||
return []string{"kv", "patch", vwCredsPath(user), key + "=-"}
|
||||
}
|
||||
|
||||
// credsPathExists reports whether the user's KV path already holds data. Used to
|
||||
// pick create (`kv put`) vs merge (`kv patch -method=rw`) for the first write:
|
||||
// claude-auth-sync usually creates the path first (Claude OAuth backup), but a
|
||||
// user could run `homelab vault setup` before that ever happens.
|
||||
func credsPathExists(run cmdRunner, user string) bool {
|
||||
_, err := run("vault", []string{"kv", "get", "-format=json", vwCredsPath(user)}, nil)
|
||||
return err == nil
|
||||
}
|
||||
|
||||
// cmdRunnerStdin is realRunnerStdin's shape, injected so writeCreds is testable.
|
||||
type cmdRunnerStdin func(name string, argv, envv []string, stdin string) (string, error)
|
||||
|
||||
// writeCreds stores all four fields in the user's Vault path using only the
|
||||
// capabilities the scoped policy grants (create/read/update — NOT `patch`). The
|
||||
// first (public) write creates the path when absent; the two real secrets then
|
||||
// merge in via read-modify-write so the public keys — and any claude-auth-sync
|
||||
// keys already present — survive. Secret values travel on stdin, never argv.
|
||||
func writeCreds(run cmdRunner, runStdin cmdRunnerStdin, user string, c vwCreds) error {
|
||||
merge := credsPathExists(run, user)
|
||||
if _, err := run("vault", vaultWritePublicArgs(merge, user, c.Email, c.ClientID), nil); err != nil {
|
||||
// writeCreds stores all four fields in the user's Vault path. The two real
|
||||
// secrets (master password, API client_secret) go via stdin — never argv.
|
||||
func writeCreds(user string, c vwCreds) error {
|
||||
if _, err := realRunner("vault", vaultPatchPublicArgs(user, c.Email, c.ClientID), nil); err != nil {
|
||||
return err
|
||||
}
|
||||
// The path now exists regardless of the branch above → merge the secrets in.
|
||||
if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
|
||||
if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
|
||||
return err
|
||||
}
|
||||
if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
|
||||
if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
|
||||
return err
|
||||
}
|
||||
return nil
|
||||
|
|
@ -844,7 +593,6 @@ func promptLine(prompt string) (string, error) {
|
|||
|
||||
func vaultSetup(args []string) error {
|
||||
hardenProcess()
|
||||
ensureVaultToken()
|
||||
fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.")
|
||||
fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.")
|
||||
email, err := promptLine("Vaultwarden email: ")
|
||||
|
|
@ -867,7 +615,7 @@ func vaultSetup(args []string) error {
|
|||
return fmt.Errorf("all fields are required")
|
||||
}
|
||||
c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret}
|
||||
if err := writeCreds(realRunner, realRunnerStdin, vaultCurrentUser(), c); err != nil {
|
||||
if err := writeCreds(vaultCurrentUser(), c); err != nil {
|
||||
return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err)
|
||||
}
|
||||
fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…")
|
||||
|
|
@ -886,7 +634,6 @@ func vaultSetup(args []string) error {
|
|||
|
||||
func vaultGet(args []string) error {
|
||||
hardenProcess()
|
||||
ensureVaultToken()
|
||||
o, err := parseGetArgs(args)
|
||||
if err != nil {
|
||||
return err
|
||||
|
|
@ -898,9 +645,6 @@ func vaultGet(args []string) error {
|
|||
}
|
||||
defer unlock()
|
||||
user := vaultCurrentUser()
|
||||
if o.all {
|
||||
return getAllFields(user, uid, o.name)
|
||||
}
|
||||
val, err := getValue(realRunner, user, uid, o)
|
||||
if err != nil {
|
||||
return err
|
||||
|
|
@ -917,28 +661,3 @@ func vaultGet(args []string) error {
|
|||
return nil
|
||||
}
|
||||
|
||||
// getAllFields prints every field of one item as normalized JSON. Like
|
||||
// `get --json`, the payload is all secret values, so it refuses a terminal
|
||||
// (pipe it). The TOTP seed is never emitted — only a presence flag — so no extra
|
||||
// TOTP audit is needed; the op-log uses a distinct verb so a bulk dump is
|
||||
// distinguishable from a single-field get (the item name is still never logged).
|
||||
func getAllFields(user, uid, name string) error {
|
||||
if !jsonToStdoutOK(stdoutIsTTY()) {
|
||||
return fmt.Errorf("refusing to print all fields as JSON to a terminal; pipe it (e.g. | jq)")
|
||||
}
|
||||
raw, err := getItem(realRunner, user, uid, name)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
item, err := normalizeItem(raw)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
out, err := json.Marshal(item)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
writeOpLog(opRecord{User: user, Verb: "get-all", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
|
||||
fmt.Println(string(out))
|
||||
return nil
|
||||
}
|
||||
|
|
|
|||
|
|
@ -1,248 +0,0 @@
|
|||
package main
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"io"
|
||||
"os"
|
||||
"strings"
|
||||
)
|
||||
|
||||
// The `vault kv` verbs talk to HashiCorp Vault / OpenBao — the homelab INFRA
|
||||
// secrets store (the `secret/…` KV-v2 mount at vault.viktorbarzin.me) — NOT
|
||||
// Vaultwarden. They are a thin, TTY-aware wrapper over the `vault` CLI that adds
|
||||
// the same conveniences as the Vaultwarden verbs: a self-defaulted VAULT_ADDR
|
||||
// (so non-login agent shells work) and clipboard/refuse-on-TTY secret handling.
|
||||
//
|
||||
// CREDENTIALS DIFFER FROM THE VAULTWARDEN VERBS. Those use the per-user *scoped*
|
||||
// token (bound only to secret/workstation/claude-users/<user>). A general kv read
|
||||
// of e.g. secret/viktor must use the caller's OWN Vault token (the OIDC
|
||||
// ~/.vault-token or an explicit $VAULT_TOKEN) — the scoped token has `deny`
|
||||
// everywhere else and would 403. So the kv handlers call ensureVaultAddr() to
|
||||
// guarantee VAULT_ADDR but deliberately do NOT call ensureVaultToken() (which
|
||||
// injects the scoped token). Access is then whatever the caller's policy grants.
|
||||
func vaultKVCommands() []Command {
|
||||
return []Command{
|
||||
{Path: []string{"vault", "kv", "get"}, Tier: TierRead,
|
||||
Summary: "[hashicorp-vault] read an infra KV secret: vault kv get <path> [--field K]", Run: vaultKVGet},
|
||||
{Path: []string{"vault", "kv", "list"}, Tier: TierRead,
|
||||
Summary: "[hashicorp-vault] list infra KV sub-paths: vault kv list <path>", Run: vaultKVList},
|
||||
{Path: []string{"vault", "kv", "put"}, Tier: TierWrite,
|
||||
Summary: "[hashicorp-vault] write one KV key (value via stdin): vault kv put <path> <key>", Run: vaultKVPut},
|
||||
{Path: []string{"vault", "kv"}, Tier: TierRead,
|
||||
Summary: "[hashicorp-vault] infra secrets (run `homelab vault kv` for help)",
|
||||
Run: func([]string) error { fmt.Print(vaultKVHelp()); return nil }},
|
||||
}
|
||||
}
|
||||
|
||||
func vaultKVHelp() string {
|
||||
return `homelab vault kv — HashiCorp Vault / OpenBao (homelab INFRA secrets, the secret/… KV store)
|
||||
|
||||
homelab vault kv get <path> [--field K] read a secret
|
||||
--field K → one value (TTY → clipboard; piped → stdout)
|
||||
no --field → all fields as JSON (piped only)
|
||||
homelab vault kv list <path> list sub-paths under <path> (no values)
|
||||
homelab vault kv put <path> <key> write one key; value read from stdin
|
||||
(piped, or no-echo prompt); merges — never clobbers siblings
|
||||
|
||||
Uses YOUR Vault token (vault login -method=oidc → ~/.vault-token); access is
|
||||
whatever your policy grants. This is NOT Vaultwarden — for your personal logins
|
||||
use 'homelab vault get' (see 'homelab vault').
|
||||
`
|
||||
}
|
||||
|
||||
// --- arg builders (pure; values never travel via argv) --------------------
|
||||
|
||||
func vaultKVGetFieldArgs(path, field string) []string {
|
||||
return []string{"kv", "get", "-field=" + field, path}
|
||||
}
|
||||
func vaultKVGetJSONArgs(path string) []string { return []string{"kv", "get", "-format=json", path} }
|
||||
func vaultKVListArgs(path string) []string { return []string{"kv", "list", "-format=json", path} }
|
||||
|
||||
// vaultKVPutArgs builds the write argv. merge=true → `kv patch -method=rw`
|
||||
// (read-modify-write: merges, needs only read+update — not the `patch` capability
|
||||
// — and preserves sibling keys); merge=false → `kv put` (creates the path on
|
||||
// first write). The value is ALWAYS read from stdin via the `<key>=-` form, so it
|
||||
// never appears in argv (visible via ps / /proc/<pid>/cmdline to same-UID procs).
|
||||
func vaultKVPutArgs(merge bool, path, key string) []string {
|
||||
return append(kvWriteVerb(merge), path, key+"=-")
|
||||
}
|
||||
|
||||
// --- pure parsers ----------------------------------------------------------
|
||||
|
||||
// extractKVData returns the inner secret object from a `vault kv get -format=json`
|
||||
// envelope (`{"data":{"data":{…},"metadata":{…}}}`), dropping the metadata/request
|
||||
// wrapper so only the secret's own key→value data is emitted.
|
||||
func extractKVData(jsonOut string) (string, error) {
|
||||
var env struct {
|
||||
Data struct {
|
||||
Data json.RawMessage `json:"data"`
|
||||
} `json:"data"`
|
||||
}
|
||||
if err := json.Unmarshal([]byte(jsonOut), &env); err != nil {
|
||||
return "", fmt.Errorf("parse vault kv json: %w", err)
|
||||
}
|
||||
if len(env.Data.Data) == 0 {
|
||||
return "", fmt.Errorf("no secret data at that path")
|
||||
}
|
||||
return string(env.Data.Data), nil
|
||||
}
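The envelope unwrapping can be exercised without a Vault server. The sketch below assumes the `{"data":{"data":{…},"metadata":{…}}}` shape named in the comment; `innerData` and the sample envelope are illustrative only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// innerData returns the inner secret object of a KV-v2 JSON envelope and
// drops the metadata and request wrapper.
func innerData(jsonOut string) (string, error) {
	var env struct {
		Data struct {
			Data json.RawMessage `json:"data"`
		} `json:"data"`
	}
	if err := json.Unmarshal([]byte(jsonOut), &env); err != nil {
		return "", fmt.Errorf("parse vault kv json: %w", err)
	}
	if len(env.Data.Data) == 0 {
		return "", fmt.Errorf("no secret data at that path")
	}
	return string(env.Data.Data), nil
}

func main() {
	envelope := `{"request_id":"r1","data":{"data":{"user":"alice","pass":"s3cret"},"metadata":{"version":3}}}`
	out, err := innerData(envelope)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // {"user":"alice","pass":"s3cret"}
}
```

Using `json.RawMessage` keeps the secret's own JSON untouched, so values are not re-encoded or reordered on the way out.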
|
||||
|
||||
// parseKVList parses the JSON array `vault kv list -format=json` prints.
|
||||
func parseKVList(jsonOut string) ([]string, error) {
|
||||
var keys []string
|
||||
if err := json.Unmarshal([]byte(jsonOut), &keys); err != nil {
|
||||
return nil, fmt.Errorf("parse vault kv list json: %w", err)
|
||||
}
|
||||
return keys, nil
|
||||
}
|
||||
|
||||
// --- testable cores (injected cmdRunner) -----------------------------------
|
||||
|
||||
func kvGetField(run cmdRunner, path, field string) (string, error) {
|
||||
return run("vault", vaultKVGetFieldArgs(path, field), nil)
|
||||
}
|
||||
|
||||
func kvGetJSON(run cmdRunner, path string) (string, error) {
|
||||
out, err := run("vault", vaultKVGetJSONArgs(path), nil)
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
return extractKVData(out)
|
||||
}
|
||||
|
||||
func kvList(run cmdRunner, path string) ([]string, error) {
|
||||
out, err := run("vault", vaultKVListArgs(path), nil)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return parseKVList(out)
|
||||
}
|
||||
|
||||
// kvPathExists reports whether the KV path already holds data, to pick create
|
||||
// (`kv put`) vs merge (`kv patch -method=rw`) — so a write never clobbers
|
||||
// sibling keys on an existing path.
|
||||
func kvPathExists(run cmdRunner, path string) bool {
|
||||
_, err := run("vault", vaultKVGetJSONArgs(path), nil)
|
||||
return err == nil
|
||||
}
|
||||
|
||||
// kvPut writes one key, creating the path when absent and merging when present.
|
||||
// The value travels on stdin only (never argv).
|
||||
func kvPut(run cmdRunner, runStdin cmdRunnerStdin, path, key, value string) error {
|
||||
merge := kvPathExists(run, path)
|
||||
_, err := runStdin("vault", vaultKVPutArgs(merge, path, key), nil, value)
|
||||
return err
|
||||
}
|
||||
|
||||
// --- handlers --------------------------------------------------------------
|
||||
|
||||
func vaultKVGet(args []string) error {
|
||||
hardenProcess()
|
||||
ensureVaultAddr() // own token, NOT the scoped one (see file header)
|
||||
var path, field string
|
||||
for i := 0; i < len(args); i++ {
|
||||
a := args[i]
|
||||
switch {
|
||||
case a == "--field" && i+1 < len(args):
|
||||
field = args[i+1]
|
||||
i++
|
||||
case strings.HasPrefix(a, "--field="):
|
||||
field = strings.TrimPrefix(a, "--field=")
|
||||
case !strings.HasPrefix(a, "-") && path == "":
|
||||
path = a
|
||||
}
|
||||
}
|
||||
if path == "" {
|
||||
return fmt.Errorf("usage: homelab vault kv get <path> [--field <key>]")
|
||||
}
|
||||
if field != "" {
|
||||
val, err := kvGetField(realRunner, path, field)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
emitSecret(val) // TTY-aware: clipboard on a terminal, stdout when piped
|
||||
return nil
|
||||
}
|
||||
// No --field → the whole secret. All values, so refuse a bare TTY (like
|
||||
// `vault get --json`): pick a --field for the clipboard path, or pipe it.
|
||||
if !jsonToStdoutOK(stdoutIsTTY()) {
|
||||
return fmt.Errorf("refusing to print all KV fields as JSON to a terminal; use --field <key>, or pipe it (e.g. | jq)")
|
||||
}
|
||||
out, err := kvGetJSON(realRunner, path)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
fmt.Println(out)
|
||||
return nil
|
||||
}
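The flag loop in this handler accepts both `--field K` and `--field=K`, and takes the first non-flag argument as the path. A minimal sketch of just that parsing step, with the hypothetical name `parseGet` and placeholder paths:

```go
package main

import (
	"fmt"
	"strings"
)

// parseGet extracts the path and optional field from the argument list.
// Unknown flags are ignored; the first non-flag argument is the path.
func parseGet(args []string) (path, field string) {
	for i := 0; i < len(args); i++ {
		a := args[i]
		switch {
		case a == "--field" && i+1 < len(args):
			field = args[i+1]
			i++ // skip the value just consumed
		case strings.HasPrefix(a, "--field="):
			field = strings.TrimPrefix(a, "--field=")
		case !strings.HasPrefix(a, "-") && path == "":
			path = a
		}
	}
	return path, field
}

func main() {
	inputs := [][]string{
		{"secret/example", "--field", "password"},
		{"--field=password", "secret/example"},
		{"secret/example"},
	}
	for _, args := range inputs {
		path, field := parseGet(args)
		fmt.Printf("path=%q field=%q\n", path, field)
	}
}
```

An empty `field` is what sends the handler down the whole-secret JSON branch, which refuses to print to a terminal.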
|
||||
|
||||
func vaultKVList(args []string) error {
|
||||
ensureVaultAddr()
|
||||
var path string
|
||||
for _, a := range args {
|
||||
if !strings.HasPrefix(a, "-") {
|
||||
path = a
|
||||
break
|
||||
}
|
||||
}
|
||||
if path == "" {
|
||||
return fmt.Errorf("usage: homelab vault kv list <path>")
|
||||
}
|
||||
keys, err := kvList(realRunner, path)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
for _, k := range keys {
|
||||
fmt.Println(k)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func vaultKVPut(args []string) error {
|
||||
hardenProcess()
|
||||
ensureVaultAddr()
|
||||
var path, key string
|
||||
for _, a := range args {
|
||||
if strings.HasPrefix(a, "-") {
|
||||
continue
|
||||
}
|
||||
switch {
|
||||
case path == "":
|
||||
path = a
|
||||
case key == "":
|
||||
key = a
|
||||
}
|
||||
}
|
||||
if path == "" || key == "" {
|
||||
return fmt.Errorf("usage: homelab vault kv put <path> <key> (value read from stdin)")
|
||||
}
|
||||
value, err := readSecretValue("Value for " + key + ": ")
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
if value == "" {
|
||||
return fmt.Errorf("empty value; aborting (nothing written)")
|
||||
}
|
||||
if err := kvPut(realRunner, realRunnerStdin, path, key, value); err != nil {
|
||||
return fmt.Errorf("writing %q to %s failed (does your token have write access? path correct?): %w", key, path, err)
|
||||
}
|
||||
fmt.Fprintln(os.Stderr, "wrote "+key+" to "+path)
|
||||
return nil
|
||||
}
|
||||
|
||||
// readSecretValue obtains a secret value WITHOUT putting it in argv: piped stdin
|
||||
// is read verbatim (trailing newline trimmed, internal newlines preserved so
|
||||
// multi-line values like PEM keys survive); an interactive TTY is prompted
|
||||
// without echo.
|
||||
func readSecretValue(prompt string) (string, error) {
|
||||
fi, err := os.Stdin.Stat()
|
||||
if err == nil && fi.Mode()&os.ModeCharDevice == 0 {
|
||||
b, rerr := io.ReadAll(os.Stdin)
|
||||
if rerr != nil {
|
||||
return "", rerr
|
||||
}
|
||||
return strings.TrimRight(string(b), "\r\n"), nil
|
||||
}
|
||||
return promptNoEcho(prompt)
|
||||
}
|
||||
|
|
@ -2,8 +2,6 @@ package main
|
|||
|
||||
import (
|
||||
"encoding/base64"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"fmt"
|
||||
"os"
|
||||
"reflect"
|
||||
|
|
@ -235,181 +233,12 @@ func TestStatusSummaryUnconfigured(t *testing.T) {
|
|||
}
|
||||
}
|
||||
|
||||
func TestEnsureVaultTokenSetsScopedFallback(t *testing.T) {
|
||||
dir := t.TempDir()
|
||||
cfg := dir + "/.config/claude-auth-sync"
|
||||
if err := os.MkdirAll(cfg, 0o700); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK\n"), 0o600); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
t.Setenv("HOME", dir)
|
||||
t.Setenv("VAULT_TOKEN", "") // no ambient token
|
||||
|
||||
ensureVaultToken()
|
||||
if got := os.Getenv("VAULT_TOKEN"); got != "SCOPED-TOK" {
|
||||
t.Fatalf("VAULT_TOKEN = %q, want scoped fallback to be exported", got)
|
||||
}
|
||||
}
|
||||
|
||||
func TestEnsureVaultTokenKeepsExplicitEnv(t *testing.T) {
|
||||
dir := t.TempDir()
|
||||
cfg := dir + "/.config/claude-auth-sync"
|
||||
if err := os.MkdirAll(cfg, 0o700); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK"), 0o600); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
t.Setenv("HOME", dir)
|
||||
t.Setenv("VAULT_TOKEN", "ADMIN-TOK")
|
||||
|
||||
ensureVaultToken()
|
||||
if got := os.Getenv("VAULT_TOKEN"); got != "ADMIN-TOK" {
|
||||
t.Fatalf("VAULT_TOKEN = %q, must not override an explicit token", got)
|
||||
}
|
||||
}
|
||||
|
||||
func TestEnsureVaultTokenPrefersScopedOverFile(t *testing.T) {
|
||||
// Regression: a power-user's read-only OIDC ~/.vault-token must NOT shadow the
|
||||
// purpose-built scoped token (emo's setup hit 403 because it did, 2026-06-28).
|
||||
dir := t.TempDir()
|
||||
cfg := dir + "/.config/claude-auth-sync"
|
||||
if err := os.MkdirAll(cfg, 0o700); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK"), 0o600); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
if err := os.WriteFile(dir+"/.vault-token", []byte("STALE-OIDC-TOK"), 0o600); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
t.Setenv("HOME", dir)
|
||||
t.Setenv("VAULT_TOKEN", "")
|
||||
|
||||
ensureVaultToken()
|
||||
if got := os.Getenv("VAULT_TOKEN"); got != "SCOPED-TOK" {
|
||||
t.Fatalf("VAULT_TOKEN = %q, want the scoped token to win over a stale ~/.vault-token", got)
|
||||
}
|
||||
}
|
||||
|
||||
func TestScopedTokenPath(t *testing.T) {
|
||||
if got := scopedTokenPath("/home/emo"); got != "/home/emo/.config/claude-auth-sync/vault-token" {
|
||||
t.Fatalf("scopedTokenPath = %q", got)
|
||||
}
|
||||
}
|
||||
|
||||
func TestVaultTokenSource(t *testing.T) {
|
||||
// Precedence: explicit $VAULT_TOKEN > the claude-auth-sync per-user scoped
|
||||
// token > a native ~/.vault-token. Scoped beats the file so a power-user's
|
||||
// read-only OIDC ~/.vault-token can't shadow the scoped token on the user's
|
||||
// own path (emo, 2026-06-28).
|
||||
cases := []struct {
|
||||
name string
|
||||
env string
|
||||
haveVaultToken bool
|
||||
scoped string
|
||||
wantTok, wantSrc string
|
||||
}{
|
||||
{"explicit env wins", "abc", true, "S", "", "env"},
|
||||
{"scoped beats a stale ~/.vault-token", "", true, "S-TOK", "S-TOK", "scoped"},
|
||||
{"scoped used when no file", "", false, "S-TOK", "S-TOK", "scoped"},
|
||||
{"native ~/.vault-token only when no scoped", "", true, "", "", "file"},
|
||||
{"scoped value is trimmed", "", false, " S-TOK\n", "S-TOK", "scoped"},
|
||||
{"whitespace-only scoped falls back to file", "", true, " \n", "", "file"},
|
||||
{"nothing configured", "", false, "", "", "none"},
|
||||
}
|
||||
for _, c := range cases {
|
||||
tok, src := vaultTokenSource(c.env, c.haveVaultToken, c.scoped)
|
||||
if tok != c.wantTok || src != c.wantSrc {
|
||||
t.Errorf("%s: vaultTokenSource(%q,%v,%q) = (%q,%q), want (%q,%q)",
|
||||
c.name, c.env, c.haveVaultToken, c.scoped, tok, src, c.wantTok, c.wantSrc)
|
||||
}
|
||||
}
|
||||
}
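Review note: the table above pins down the precedence fully enough to sketch a function that would satisfy it. This is an assumption-laden reconstruction from the test cases only; the real `vaultTokenSource` is not shown in this hunk, and `pickToken` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// pickToken (hypothetical) follows the precedence the test table describes:
// explicit env > scoped token > native ~/.vault-token > nothing.
// It returns a token only when the caller would need to export one.
func pickToken(env string, haveFile bool, scoped string) (tok, src string) {
	if env != "" {
		return "", "env" // already exported; nothing to set
	}
	if s := strings.TrimSpace(scoped); s != "" {
		return s, "scoped" // beats a stale ~/.vault-token
	}
	if haveFile {
		return "", "file" // the vault CLI reads ~/.vault-token by itself
	}
	return "", "none"
}

func main() {
	tok, src := pickToken("", true, " S-TOK\n")
	fmt.Printf("tok=%q src=%q\n", tok, src)
}
```

Returning an empty token for the `env` and `file` sources keeps the function pure: the caller only touches the environment in the one case where it has something to export.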
|
||||
|
||||
func TestVaultAddrToSet(t *testing.T) {
|
||||
// homelab vault is invoked by AFK agent sessions (non-login shells that
|
||||
// never sourced /etc/environment), so the CLI must self-default VAULT_ADDR
|
||||
// rather than rely on the ambient env — else every `vault` child hits the
|
||||
// 127.0.0.1:8200 default and fails "connection refused" (exit 2).
|
||||
cases := []struct {
|
||||
name, env, want string
|
||||
}{
|
||||
{"unset -> default", "", vaultAddrDefault},
|
||||
{"whitespace-only -> default", " \n", vaultAddrDefault},
|
||||
{"explicit kept (empty = leave alone)", "https://vault.example.com", ""},
|
||||
}
|
||||
for _, c := range cases {
|
||||
if got := vaultAddrToSet(c.env); got != c.want {
|
||||
t.Errorf("%s: vaultAddrToSet(%q) = %q, want %q", c.name, c.env, got, c.want)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestEnsureVaultTokenSetsDefaultAddr(t *testing.T) {
|
||||
dir := t.TempDir() // no scoped token, no ~/.vault-token
|
||||
t.Setenv("HOME", dir)
|
||||
t.Setenv("VAULT_TOKEN", "")
|
||||
t.Setenv("VAULT_ADDR", "") // emo's non-login-shell situation
|
||||
|
||||
ensureVaultToken()
|
||||
if got := os.Getenv("VAULT_ADDR"); got != vaultAddrDefault {
|
||||
t.Fatalf("VAULT_ADDR = %q, want default %q to be exported", got, vaultAddrDefault)
|
||||
}
|
||||
}
|
||||
|
||||
func TestEnsureVaultTokenKeepsExplicitAddr(t *testing.T) {
|
||||
dir := t.TempDir()
|
||||
t.Setenv("HOME", dir)
|
||||
t.Setenv("VAULT_TOKEN", "")
|
||||
t.Setenv("VAULT_ADDR", "https://vault.example.com")
|
||||
|
||||
ensureVaultToken()
|
||||
if got := os.Getenv("VAULT_ADDR"); got != "https://vault.example.com" {
|
||||
t.Fatalf("VAULT_ADDR = %q, must not override an explicit addr", got)
|
||||
}
|
||||
}
|
||||
|
||||
func TestAugmentErrSurfacesStderr(t *testing.T) {
|
||||
if got := augmentErr(nil, []byte("ignored")); got != nil {
|
||||
t.Fatalf("augmentErr(nil, …) = %v, want nil", got)
|
||||
}
|
||||
base := errors.New("exit status 2")
|
||||
got := augmentErr(base, []byte(" dial tcp 127.0.0.1:8200: connect: connection refused\n"))
|
||||
if got == nil || !strings.Contains(got.Error(), "connection refused") || !strings.Contains(got.Error(), "exit status 2") {
|
||||
t.Fatalf("augmentErr did not surface stderr: %v", got)
|
||||
}
|
||||
if !errors.Is(got, base) {
|
||||
t.Fatal("augmentErr lost the wrapped error (errors.Is failed)")
|
||||
}
|
||||
if got := augmentErr(base, []byte(" ")); got != base {
|
||||
t.Fatalf("augmentErr with blank stderr = %v, want the original error unchanged", got)
|
||||
}
|
||||
}
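Review note: the three properties this test asserts (nil passthrough, blank stderr leaves the error untouched, otherwise stderr is appended while `errors.Is` still works) can be met with `%w` wrapping. The sketch below is one way to do it, not necessarily how `augmentErr` is written; `wrapWithStderr` and `errBase` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errBase stands in for the error an exec call would return.
var errBase = errors.New("exit status 2")

// wrapWithStderr (hypothetical) appends trimmed stderr text to err.
// The %w verb keeps the original error reachable through errors.Is.
func wrapWithStderr(err error, stderr []byte) error {
	if err == nil {
		return nil
	}
	msg := strings.TrimSpace(string(stderr))
	if msg == "" {
		return err // nothing useful to add; return the same value
	}
	return fmt.Errorf("%w: %s", err, msg)
}

func main() {
	got := wrapWithStderr(errBase, []byte(" dial tcp 127.0.0.1:8200: connect: connection refused\n"))
	fmt.Println(got)
	fmt.Println(errors.Is(got, errBase))
}
```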
|
||||
|
||||
func TestKvWriteVerb(t *testing.T) {
|
||||
// merge=true → read-modify-write patch (needs only read+update, NOT the
|
||||
// `patch` capability the scoped workstation policy lacks).
|
||||
if got := kvWriteVerb(true); !reflect.DeepEqual(got, []string{"kv", "patch", "-method=rw"}) {
|
||||
t.Fatalf("kvWriteVerb(true) = %v", got)
|
||||
}
|
||||
// merge=false → put (creates the path on first use)
|
||||
if got := kvWriteVerb(false); !reflect.DeepEqual(got, []string{"kv", "put"}) {
|
||||
t.Fatalf("kvWriteVerb(false) = %v", got)
|
||||
}
|
||||
}
|
||||
|
||||
func TestVaultWritePublicArgs(t *testing.T) {
|
||||
got := vaultWritePublicArgs(true, "emo", "e@x.me", "user.ci")
|
||||
want := []string{"kv", "patch", "-method=rw", "secret/workstation/claude-users/emo",
|
||||
func TestVaultPatchPublicArgs(t *testing.T) {
|
||||
got := vaultPatchPublicArgs("emo", "e@x.me", "user.ci")
|
||||
want := []string{"kv", "patch", "secret/workstation/claude-users/emo",
|
||||
"vaultwarden_email=e@x.me", "vaultwarden_client_id=user.ci"}
|
||||
if !reflect.DeepEqual(got, want) {
|
||||
t.Fatalf("vaultWritePublicArgs(merge) = %v", got)
|
||||
}
|
||||
if got := vaultWritePublicArgs(false, "emo", "e@x.me", "user.ci"); got[0] != "kv" || got[1] != "put" {
|
||||
t.Fatalf("vaultWritePublicArgs(create) must use `kv put`, got %v", got)
|
||||
t.Fatalf("vaultPatchPublicArgs = %v", got)
|
||||
}
|
||||
for _, a := range got {
|
||||
if strings.Contains(a, "master_password") || strings.Contains(a, "client_secret") {
|
||||
|
|
@ -418,12 +247,12 @@ func TestVaultWritePublicArgs(t *testing.T) {
|
|||
}
|
||||
}
|
||||
|
||||
func TestVaultWriteSecretArgsNoValueInArgv(t *testing.T) {
|
||||
func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) {
|
||||
for _, key := range []string{"vaultwarden_master_password", "vaultwarden_client_secret"} {
|
||||
got := vaultWriteSecretArgs(true, "emo", key)
|
||||
want := []string{"kv", "patch", "-method=rw", "secret/workstation/claude-users/emo", key + "=-"}
|
||||
got := vaultPatchSecretArgs("emo", key)
|
||||
want := []string{"kv", "patch", "secret/workstation/claude-users/emo", key + "=-"}
|
||||
if !reflect.DeepEqual(got, want) {
|
||||
t.Fatalf("vaultWriteSecretArgs(%q) = %v", key, got)
|
||||
t.Fatalf("vaultPatchSecretArgs(%q) = %v", key, got)
|
||||
}
|
||||
if got[len(got)-1] != key+"=-" {
|
||||
t.Fatalf("secret value must be read from stdin (`%s=-`), got %v", key, got)
|
||||
|
|
@ -431,90 +260,6 @@ func TestVaultWriteSecretArgsNoValueInArgv(t *testing.T) {
|
|||
}
|
||||
}
|
||||
|
||||
// recStdin records a stdin-bearing call for assertions.
|
||||
type recStdin struct {
|
||||
argv []string
|
||||
stdin string
|
||||
}
|
||||
|
||||
// TestWriteCredsCreatesThenMerges: when the path is ABSENT the first (public)
|
||||
// write must `kv put` (create), and the two secrets must merge via patch -rw
|
||||
// with values on stdin only — never the buggy plain `kv patch` (needs `patch`).
|
||||
func TestWriteCredsCreatesThenMerges(t *testing.T) {
|
||||
var calls [][]string
|
||||
var stdinCalls []recStdin
|
||||
run := func(name string, argv, envv []string) (string, error) {
|
||||
calls = append(calls, append([]string{name}, argv...))
|
||||
if len(argv) >= 2 && argv[0] == "kv" && argv[1] == "get" {
|
||||
return "", fmt.Errorf("no value found") // path absent
|
||||
}
|
||||
return "", nil
|
||||
}
|
||||
runStdin := func(name string, argv, envv []string, stdin string) (string, error) {
|
||||
stdinCalls = append(stdinCalls, recStdin{append([]string{name}, argv...), stdin})
|
||||
return "", nil
|
||||
}
|
||||
c := vwCreds{Email: "e@x.me", MasterPassword: "PW", ClientID: "user.ci", ClientSecret: "CS"}
|
||||
if err := writeCreds(run, runStdin, "emo", c); err != nil {
|
||||
t.Fatalf("writeCreds: %v", err)
|
||||
}
|
||||
var sawPut, sawPlainPatch bool
|
||||
for _, cl := range calls {
|
||||
j := strings.Join(cl, " ")
|
||||
if strings.Contains(j, "kv put") {
|
||||
sawPut = true
|
||||
}
|
||||
if strings.Contains(j, "kv patch") && !strings.Contains(j, "-method=rw") {
|
||||
sawPlainPatch = true
|
||||
}
|
||||
}
|
||||
if !sawPut {
|
||||
t.Fatalf("path absent → public write must be `kv put`; calls=%v", calls)
|
||||
}
|
||||
if sawPlainPatch {
|
||||
t.Fatalf("must never use plain `kv patch` (needs `patch` capability); calls=%v", calls)
|
||||
}
|
||||
if len(stdinCalls) != 2 {
|
||||
t.Fatalf("want 2 stdin secret writes, got %d", len(stdinCalls))
|
||||
}
|
||||
for _, sc := range stdinCalls {
|
||||
if !strings.Contains(strings.Join(sc.argv, " "), "kv patch -method=rw") {
|
||||
t.Errorf("secret write must use patch -method=rw: %v", sc.argv)
|
||||
}
|
||||
for _, a := range sc.argv {
|
||||
if strings.Contains(a, "PW") || strings.Contains(a, "CS") {
|
||||
t.Errorf("secret leaked into argv: %v", sc.argv)
|
||||
}
|
||||
}
|
||||
}
|
||||
if stdinCalls[0].stdin != "PW" || stdinCalls[1].stdin != "CS" {
|
||||
t.Errorf("stdin values wrong: %q,%q", stdinCalls[0].stdin, stdinCalls[1].stdin)
|
||||
}
|
||||
}
|
||||
|
||||
// TestWriteCredsMergesWhenPresent: when the path EXISTS, every write must merge
|
||||
// (patch -rw) — a `kv put` would wipe sibling keys (e.g. claude_ai_oauth_json).
|
||||
func TestWriteCredsMergesWhenPresent(t *testing.T) {
|
||||
var calls [][]string
|
||||
run := func(name string, argv, envv []string) (string, error) {
|
||||
calls = append(calls, append([]string{name}, argv...))
|
||||
return "{}", nil // get succeeds → path exists
|
||||
}
|
||||
runStdin := func(name string, argv, envv []string, stdin string) (string, error) {
|
||||
calls = append(calls, append([]string{name}, argv...))
|
||||
return "", nil
|
||||
}
|
||||
c := vwCreds{Email: "e@x.me", MasterPassword: "PW", ClientID: "user.ci", ClientSecret: "CS"}
|
||||
if err := writeCreds(run, runStdin, "emo", c); err != nil {
|
||||
t.Fatalf("writeCreds: %v", err)
|
||||
}
|
||||
for _, cl := range calls {
|
||||
if strings.Contains(strings.Join(cl, " "), "kv put") {
|
||||
t.Fatalf("path exists → must NOT `kv put` (wipes siblings): %v", cl)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// TestNoSecretInArgvAcrossFlow is the load-bearing security test: across the
|
||||
// whole get flow (vault reads, bw config/status/login/unlock/get) NO secret
|
||||
// value may appear in any command's argv — secrets travel via env/stdin only.
|
||||
|
|
@ -621,437 +366,3 @@ func TestGetValueFlow(t *testing.T) {
|
|||
t.Fatalf("getValue = %q, %v", val, err)
|
||||
}
|
||||
}
|
||||
|
||||
// --- vault get --all (browse all fields) ----------------------------------
|
||||
|
||||
func TestParseGetArgsAll(t *testing.T) {
|
||||
o, err := parseGetArgs([]string{"github", "--all"})
|
||||
if err != nil || o.name != "github" || !o.all {
|
||||
t.Fatalf("parseGetArgs(--all) = %+v err=%v", o, err)
|
||||
}
|
||||
// --all must skip --field validation (field is irrelevant for a full dump).
|
||||
if _, err := parseGetArgs([]string{"github", "--all", "--field", "evil"}); err != nil {
|
||||
t.Fatalf("--all must ignore an otherwise-invalid --field, got err=%v", err)
|
||||
}
|
||||
// A name is still required.
|
||||
if _, err := parseGetArgs([]string{"--all"}); err == nil {
|
||||
t.Fatal("get --all with no name must error")
|
||||
}
|
||||
// Without --all, the field allowlist still applies.
|
||||
if _, err := parseGetArgs([]string{"github", "--field", "evil"}); err == nil {
|
||||
t.Fatal("invalid --field without --all must still error")
|
||||
}
|
||||
}
|
||||
|
||||
func TestBwItemArgs(t *testing.T) {
|
||||
argv := bwItemArgs("github")
|
||||
if !reflect.DeepEqual(argv, []string{"get", "item", "github"}) {
|
||||
t.Fatalf("bwItemArgs = %v", argv)
|
||||
}
|
||||
for _, a := range argv {
|
||||
if strings.Contains(a, "SESSION") || a == "--session" {
|
||||
t.Fatalf("session must travel via env, not argv: %v", argv)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// sampleLoginItemJSON is a representative `bw get item` payload: login fields, multiple URIs, a TOTP
|
||||
// seed, notes, custom fields (text/hidden/boolean), plus bw internals that MUST
|
||||
// be dropped (id/object/reprompt/passwordHistory).
|
||||
const sampleLoginItemJSON = `{
|
||||
"object":"item","id":"abc-123","folderId":null,"type":1,"reprompt":0,
|
||||
"name":"GitHub","notes":"my notes","favorite":false,
|
||||
"fields":[
|
||||
{"name":"PIN","value":"1234","type":1},
|
||||
{"name":"endpoint","value":"https://api.gh","type":0},
|
||||
{"name":"enabled","value":"true","type":2}
|
||||
],
|
||||
"login":{
|
||||
"username":"octocat","password":"hunter2",
|
||||
"totp":"otpauth://totp/GitHub:octocat?secret=SEEDSEEDSEED",
|
||||
"uris":[{"match":null,"uri":"https://github.com"},{"match":null,"uri":"https://gist.github.com"}]
|
||||
},
|
||||
"passwordHistory":[{"password":"OLD-PASSWORD-XYZ"}]
|
||||
}`
|
||||
|
||||
func TestNormalizeItemLogin(t *testing.T) {
|
||||
n, err := normalizeItem(sampleLoginItemJSON)
|
||||
if err != nil {
|
||||
t.Fatalf("normalizeItem: %v", err)
|
||||
}
|
||||
if n.Name != "GitHub" || n.Username != "octocat" || n.Password != "hunter2" || n.Notes != "my notes" {
|
||||
t.Fatalf("standard fields wrong: %+v", n)
|
||||
}
|
||||
if !n.TOTP {
|
||||
t.Fatal("TOTP presence flag must be true when a seed exists")
|
||||
}
|
||||
if !reflect.DeepEqual(n.URIs, []string{"https://github.com", "https://gist.github.com"}) {
|
||||
t.Fatalf("URIs = %v", n.URIs)
|
||||
}
|
||||
want := map[string]string{"PIN": "1234", "endpoint": "https://api.gh", "enabled": "true"}
|
||||
if !reflect.DeepEqual(n.Fields, want) {
|
||||
t.Fatalf("custom fields = %v want %v", n.Fields, want)
|
||||
}
|
||||
}
|
||||
|
||||
// The load-bearing security test: the raw TOTP seed (more powerful than a
|
||||
// one-time code) and the password history must NEVER appear in the dump.
|
||||
func TestNormalizeItemNeverLeaksSeedOrHistory(t *testing.T) {
|
||||
n, err := normalizeItem(sampleLoginItemJSON)
|
||||
if err != nil {
|
||||
t.Fatalf("normalizeItem: %v", err)
|
||||
}
|
||||
out, err := json.Marshal(n)
|
||||
if err != nil {
|
||||
t.Fatalf("marshal: %v", err)
|
||||
}
|
||||
for _, leak := range []string{"SEEDSEEDSEED", "otpauth", "OLD-PASSWORD-XYZ", "passwordHistory", "abc-123"} {
|
||||
if strings.Contains(string(out), leak) {
|
||||
t.Fatalf("dump leaked %q: %s", leak, out)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestNormalizeItemNoTOTP(t *testing.T) {
|
||||
n, err := normalizeItem(`{"name":"X","type":1,"login":{"username":"u","password":"p"}}`)
|
||||
if err != nil {
|
||||
t.Fatalf("normalizeItem: %v", err)
|
||||
}
|
||||
if n.TOTP {
|
||||
t.Fatal("TOTP must be false when no seed present")
|
||||
}
|
||||
out, _ := json.Marshal(n)
|
||||
if strings.Contains(string(out), "totp") {
|
||||
t.Fatalf("no-totp item must omit the totp key entirely: %s", out)
|
||||
}
|
||||
}
|
||||
|
||||
func TestNormalizeItemEmptyStandardFieldsOmitted(t *testing.T) {
|
||||
n, err := normalizeItem(`{"name":"Bare","type":1,"login":{"username":"","password":"","totp":"","uris":[]},"fields":[{"name":"only","value":"x","type":0}]}`)
|
||||
if err != nil {
|
||||
t.Fatalf("normalizeItem: %v", err)
|
||||
}
|
||||
out, _ := json.Marshal(n)
|
||||
for _, k := range []string{"username", "password", "uris", "notes", "totp"} {
|
||||
if strings.Contains(string(out), `"`+k+`"`) {
|
||||
t.Fatalf("empty standard field %q must be omitted: %s", k, out)
|
||||
}
|
||||
}
|
||||
if !strings.Contains(string(out), `"name":"Bare"`) || !strings.Contains(string(out), `"only":"x"`) {
|
||||
t.Fatalf("name + custom field must survive: %s", out)
|
||||
}
|
||||
}
|
||||
|
||||
func TestNormalizeItemSecureNoteNullLogin(t *testing.T) {
|
||||
// type 2 (secure note): login is null — must not panic; notes + custom fields survive.
|
||||
n, err := normalizeItem(`{"name":"SN","type":2,"notes":"secret note","login":null,"fields":[{"name":"k","value":"v","type":1}]}`)
|
||||
if err != nil {
|
||||
t.Fatalf("normalizeItem(null login): %v", err)
|
||||
}
|
||||
if n.Name != "SN" || n.Notes != "secret note" || n.Fields["k"] != "v" {
|
||||
t.Fatalf("secure-note normalize wrong: %+v", n)
|
||||
}
|
||||
if n.Username != "" || n.Password != "" || n.TOTP {
|
||||
t.Fatalf("login fields must be empty for a login-less item: %+v", n)
|
||||
}
|
||||
}
|
||||
|
||||
func TestNormalizeItemDuplicateCustomNames(t *testing.T) {
|
||||
// Bitwarden permits duplicate custom-field names; a JSON object can't hold
|
||||
// dups, so last-wins (documented).
|
||||
n, err := normalizeItem(`{"name":"D","fields":[{"name":"k","value":"first","type":0},{"name":"k","value":"second","type":0}]}`)
|
||||
if err != nil {
|
||||
t.Fatalf("normalizeItem: %v", err)
|
||||
}
|
||||
if n.Fields["k"] != "second" {
|
||||
t.Fatalf("duplicate custom names must be last-wins, got %q", n.Fields["k"])
|
||||
}
|
||||
}
|
||||
|
||||
func TestNormalizeItemLinkedFieldSkipped(t *testing.T) {
|
||||
// type 3 (linked) fields reference another field and carry a null value —
|
||||
// they are not real data and must be skipped.
|
||||
n, err := normalizeItem(`{"name":"L","login":{"username":"u"},"fields":[{"name":"linked","value":null,"type":3},{"name":"real","value":"r","type":0}]}`)
|
||||
if err != nil {
|
||||
t.Fatalf("normalizeItem: %v", err)
|
||||
}
|
||||
if _, ok := n.Fields["linked"]; ok {
|
||||
t.Fatalf("linked field must be skipped: %v", n.Fields)
|
||||
}
|
||||
if n.Fields["real"] != "r" {
|
||||
t.Fatalf("real custom field dropped: %v", n.Fields)
|
||||
}
|
||||
}
|
||||
|
||||
func TestNormalizeItemMalformed(t *testing.T) {
|
||||
if _, err := normalizeItem("not json"); err == nil {
|
||||
t.Fatal("malformed item JSON must error")
|
||||
}
|
||||
}
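Review note: taken together, the `normalizeItem` tests describe an allowlist transform: copy a fixed set of fields, reduce the TOTP seed to a presence flag, skip linked (type 3) custom fields, and let later duplicates win. The sketch below is reconstructed from those tests alone. The struct layout, JSON key names on the output side, and the name `sketchNormalize` are assumptions, not the code under test.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// normalized is the allowlisted output shape; omitempty drops empty fields.
type normalized struct {
	Name     string            `json:"name"`
	Username string            `json:"username,omitempty"`
	Password string            `json:"password,omitempty"`
	TOTP     bool              `json:"totp,omitempty"` // presence flag only, never the seed
	URIs     []string          `json:"uris,omitempty"`
	Notes    string            `json:"notes,omitempty"`
	Fields   map[string]string `json:"fields,omitempty"`
}

// rawItem decodes only what is needed; id, passwordHistory, etc. are never read.
type rawItem struct {
	Name  string `json:"name"`
	Notes string `json:"notes"`
	Login *struct {
		Username string `json:"username"`
		Password string `json:"password"`
		TOTP     string `json:"totp"`
		URIs     []struct {
			URI string `json:"uri"`
		} `json:"uris"`
	} `json:"login"`
	Fields []struct {
		Name  string  `json:"name"`
		Value *string `json:"value"`
		Type  int     `json:"type"`
	} `json:"fields"`
}

func sketchNormalize(raw string) (normalized, error) {
	var it rawItem
	if err := json.Unmarshal([]byte(raw), &it); err != nil {
		return normalized{}, fmt.Errorf("parsing item: %w", err)
	}
	n := normalized{Name: it.Name, Notes: it.Notes}
	if it.Login != nil { // secure notes carry "login": null
		n.Username = it.Login.Username
		n.Password = it.Login.Password
		n.TOTP = it.Login.TOTP != ""
		for _, u := range it.Login.URIs {
			if u.URI != "" {
				n.URIs = append(n.URIs, u.URI)
			}
		}
	}
	for _, f := range it.Fields {
		if f.Type == 3 || f.Value == nil { // linked fields hold no data
			continue
		}
		if n.Fields == nil {
			n.Fields = map[string]string{}
		}
		n.Fields[f.Name] = *f.Value // duplicate names: last one wins
	}
	return n, nil
}

func main() {
	n, err := sketchNormalize(`{"name":"G","id":"abc-123","login":{"username":"u","totp":"otpauth://x?secret=SEED"}}`)
	if err != nil {
		panic(err)
	}
	out, _ := json.Marshal(n)
	fmt.Println(string(out))
}
```

The design choice worth noting is that safety comes from the decode target rather than from deleting keys afterwards: a field that is never declared in `rawItem` or `normalized` cannot leak, even if Bitwarden adds new sensitive keys later.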
|
||||
|
||||
// getItem opens a session and runs `bw get item <name>`, returning raw JSON.
|
||||
func TestGetItemFlow(t *testing.T) {
|
||||
f := &fakeRunner{out: map[string]string{
|
||||
"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
|
||||
"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x",
|
||||
"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs",
|
||||
"bw status": `{"status":"locked"}`,
|
||||
"bw unlock": "SESS",
|
||||
"bw get item github": sampleLoginItemJSON,
|
||||
}}
|
||||
uid := fmt.Sprintf("%d", os.Getuid())
|
||||
raw, err := getItem(f.run, "emo", uid, "github")
|
||||
if err != nil || !strings.Contains(raw, `"name":"GitHub"`) {
|
||||
t.Fatalf("getItem = %q, %v", raw, err)
|
||||
}
|
||||
// The session key must reach bw via env, never argv.
|
||||
for _, call := range f.calls {
|
||||
for _, arg := range call {
|
||||
if strings.Contains(arg, "SESS") {
|
||||
t.Errorf("session leaked into argv: %v", call)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestVaultHelpMentionsAll(t *testing.T) {
|
||||
if !strings.Contains(vaultHelp(), "--all") {
|
||||
t.Error("vault help must document --all")
|
||||
}
|
||||
}
|
||||
|
||||
// --- bw sync on read (freshness) ------------------------------------------
|
||||
|
||||
func TestBwSyncArgs(t *testing.T) {
|
||||
if got := bwSyncArgs(); !reflect.DeepEqual(got, []string{"sync"}) {
|
||||
t.Fatalf("bwSyncArgs = %v", got)
|
||||
}
|
||||
}
|
||||
|
||||
// Every read opens a session that first `bw sync`s, so reads reflect the latest
|
||||
// server-side values: `bw unlock` is local-only, so without a sync a persisted
|
||||
// (already-logged-in) session serves a stale local cache.
|
||||
func TestOpenSessionSyncsBeforeRead(t *testing.T) {
|
||||
f := &fakeRunner{out: map[string]string{
|
||||
"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
|
||||
"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x",
|
||||
"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs",
|
||||
"bw status": `{"status":"locked"}`,
|
||||
"bw unlock": "SESS",
|
||||
"bw sync": "Syncing complete.",
|
||||
"bw get password github": "p@ss",
|
||||
}}
|
||||
uid := fmt.Sprintf("%d", os.Getuid())
|
||||
if _, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"}); err != nil {
|
||||
t.Fatalf("getValue: %v", err)
|
||||
}
|
||||
idx := func(prefix string) int {
|
||||
for i, c := range f.calls {
|
||||
if strings.HasPrefix(strings.Join(c, " "), prefix) {
|
||||
return i
|
||||
}
|
||||
}
|
||||
return -1
|
||||
}
|
||||
syncAt, unlockAt, getAt := idx("bw sync"), idx("bw unlock"), idx("bw get password github")
|
||||
if syncAt < 0 {
|
||||
t.Fatal("expected a `bw sync` before the read")
|
||||
}
|
||||
if !(unlockAt < syncAt && syncAt < getAt) {
|
||||
t.Fatalf("order wrong: unlock=%d sync=%d get=%d (want unlock<sync<get)", unlockAt, syncAt, getAt)
|
||||
}
|
||||
}
|
||||
|
||||
// Sync is best-effort: a transient sync failure must NOT fail the read — the
|
||||
// cached value is still returned (a stderr warning is emitted, not asserted here).
|
||||
func TestReadSucceedsWhenSyncFails(t *testing.T) {
|
||||
f := &fakeRunner{
|
||||
out: map[string]string{
|
||||
"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
|
||||
"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x",
|
||||
"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs",
|
||||
"bw status": `{"status":"locked"}`,
|
||||
"bw unlock": "SESS",
|
||||
"bw get password github": "p@ss",
|
||||
},
|
||||
err: map[string]error{"bw sync": errors.New("Failed to sync: network error")},
|
||||
}
|
||||
uid := fmt.Sprintf("%d", os.Getuid())
|
||||
val, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"})
|
||||
if err != nil || val != "p@ss" {
|
||||
t.Fatalf("read must succeed despite a sync failure: val=%q err=%v", val, err)
|
||||
}
|
||||
}
|
||||
|
||||
// --- vault kv (HashiCorp Vault / OpenBao infra secrets) --------------------
|
||||
|
||||
func TestVaultKVCommandsRegistered(t *testing.T) {
|
||||
want := map[string]Tier{
|
||||
"vault kv get": TierRead,
|
||||
"vault kv list": TierRead,
|
||||
"vault kv put": TierWrite,
|
||||
}
|
||||
got := map[string]Tier{}
|
||||
for _, c := range vaultCommands() {
|
||||
got[c.name()] = c.Tier
|
||||
}
|
||||
for name, tier := range want {
|
||||
if got[name] != tier {
|
||||
t.Errorf("command %q: tier=%q, want %q", name, got[name], tier)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestVaultKVArgs(t *testing.T) {
|
||||
if got := vaultKVGetFieldArgs("secret/viktor", "github_pat"); !reflect.DeepEqual(got, []string{"kv", "get", "-field=github_pat", "secret/viktor"}) {
|
||||
t.Fatalf("vaultKVGetFieldArgs = %v", got)
|
||||
}
|
||||
if got := vaultKVGetJSONArgs("secret/viktor"); !reflect.DeepEqual(got, []string{"kv", "get", "-format=json", "secret/viktor"}) {
|
||||
t.Fatalf("vaultKVGetJSONArgs = %v", got)
|
||||
}
|
||||
if got := vaultKVListArgs("secret/"); !reflect.DeepEqual(got, []string{"kv", "list", "-format=json", "secret/"}) {
|
||||
t.Fatalf("vaultKVListArgs = %v", got)
|
||||
}
|
||||
// create (path absent) → put; merge (path present) → patch -method=rw. Either
|
||||
// way the VALUE travels via the `key=-` stdin form, never argv.
|
||||
create := vaultKVPutArgs(false, "secret/x", "api_key")
|
||||
if !reflect.DeepEqual(create, []string{"kv", "put", "secret/x", "api_key=-"}) {
|
||||
t.Fatalf("vaultKVPutArgs(create) = %v", create)
|
||||
}
|
||||
merge := vaultKVPutArgs(true, "secret/x", "api_key")
|
||||
if !reflect.DeepEqual(merge, []string{"kv", "patch", "-method=rw", "secret/x", "api_key=-"}) {
|
||||
t.Fatalf("vaultKVPutArgs(merge) = %v", merge)
|
||||
}
|
||||
for _, args := range [][]string{create, merge} {
|
||||
for _, a := range args {
|
||||
if strings.Contains(a, "SECRETVALUE") {
|
||||
t.Fatalf("value must not appear in argv: %v", args)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestExtractKVData(t *testing.T) {
|
||||
// `vault kv get -format=json` wraps the secret in {"data":{"data":{...},"metadata":{...}}}.
|
||||
env := `{"request_id":"x","data":{"data":{"github_pat":"ghp_abc","email":"e@x.me"},"metadata":{"version":3}}}`
|
||||
out, err := extractKVData(env)
|
||||
if err != nil {
|
||||
t.Fatalf("extractKVData: %v", err)
|
||||
}
|
||||
// Round-trip to a map so key order doesn't matter.
|
||||
var m map[string]string
|
||||
if err := json.Unmarshal([]byte(out), &m); err != nil {
|
||||
t.Fatalf("result not a JSON object: %q (%v)", out, err)
|
||||
}
|
||||
if m["github_pat"] != "ghp_abc" || m["email"] != "e@x.me" {
|
||||
t.Fatalf("extractKVData inner data wrong: %v", m)
|
||||
}
|
||||
// metadata must NOT leak into the output.
|
||||
if strings.Contains(out, "metadata") || strings.Contains(out, "request_id") {
|
||||
t.Fatalf("envelope internals leaked: %s", out)
|
||||
}
|
||||
if _, err := extractKVData("not json"); err == nil {
|
||||
t.Fatal("malformed envelope must error")
|
||||
}
|
||||
}
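Review note: the envelope unwrapping that `TestExtractKVData` checks can be sketched with `json.RawMessage`, which returns the inner `data.data` object without touching `metadata` or `request_id`. This is an illustrative version under that assumption; `innerKVData` is a hypothetical name and the real `extractKVData` may differ.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// innerKVData (hypothetical) returns the data.data object from the JSON
// printed by `vault kv get -format=json`, dropping the rest of the envelope.
func innerKVData(envelope string) (string, error) {
	var env struct {
		Data struct {
			Data json.RawMessage `json:"data"`
		} `json:"data"`
	}
	if err := json.Unmarshal([]byte(envelope), &env); err != nil {
		return "", fmt.Errorf("parsing kv envelope: %w", err)
	}
	if len(env.Data.Data) == 0 || string(env.Data.Data) == "null" {
		return "", errors.New("kv envelope has no data.data object")
	}
	return string(env.Data.Data), nil
}

func main() {
	out, err := innerKVData(`{"request_id":"x","data":{"data":{"email":"e@x.me"},"metadata":{"version":3}}}`)
	fmt.Println(out, err)
}
```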
|
||||
|
||||
func TestParseKVList(t *testing.T) {
|
||||
keys, err := parseKVList(`["app1","app2/","viktor"]`)
|
||||
if err != nil {
|
||||
t.Fatalf("parseKVList: %v", err)
|
||||
}
|
||||
if !reflect.DeepEqual(keys, []string{"app1", "app2/", "viktor"}) {
|
||||
t.Fatalf("parseKVList = %v", keys)
|
||||
}
|
||||
if _, err := parseKVList("not json"); err == nil {
|
||||
t.Fatal("malformed list must error")
|
||||
}
|
||||
}
|
||||
|
||||
func TestKVGetFieldFlow(t *testing.T) {
|
||||
f := &fakeRunner{out: map[string]string{
|
||||
"vault kv get -field=github_pat secret/viktor": "ghp_secret",
|
||||
}}
|
||||
val, err := kvGetField(f.run, "secret/viktor", "github_pat")
|
||||
if err != nil || val != "ghp_secret" {
|
||||
t.Fatalf("kvGetField = %q, %v", val, err)
|
||||
}
|
||||
}
|
||||
|
||||
func TestKVListFlow(t *testing.T) {
|
||||
f := &fakeRunner{out: map[string]string{
|
||||
"vault kv list -format=json secret/": `["app1","app2/"]`,
|
||||
}}
|
||||
keys, err := kvList(f.run, "secret/")
|
||||
if err != nil || !reflect.DeepEqual(keys, []string{"app1", "app2/"}) {
|
||||
t.Fatalf("kvList = %v, %v", keys, err)
|
||||
}
|
||||
}
|
||||
|
||||
// kvPut creates the path on first write and merges thereafter, with the value on
|
||||
// stdin only (mirrors writeCreds). Never plain `kv patch` (needs the patch cap).
|
||||
func TestKVPutCreatesThenMerges(t *testing.T) {
|
||||
for _, tc := range []struct {
|
||||
name string
|
||||
exists bool
|
||||
wantCreate bool
|
||||
}{
|
||||
{"absent path → create (put)", false, true},
|
||||
{"present path → merge (patch -rw)", true, false},
|
||||
} {
|
||||
t.Run(tc.name, func(t *testing.T) {
|
||||
var stdinCalls []recStdin
|
||||
run := func(name string, argv, envv []string) (string, error) {
|
||||
if len(argv) >= 2 && argv[0] == "kv" && argv[1] == "get" {
|
||||
if tc.exists {
|
||||
return `{"data":{"data":{}}}`, nil
|
||||
}
|
||||
return "", fmt.Errorf("No value found at secret/x")
|
||||
}
|
||||
return "", nil
|
||||
}
|
||||
runStdin := func(name string, argv, envv []string, stdin string) (string, error) {
|
||||
stdinCalls = append(stdinCalls, recStdin{append([]string{name}, argv...), stdin})
|
||||
return "", nil
|
||||
}
|
||||
if err := kvPut(run, runStdin, "secret/x", "api_key", "SECRETVALUE"); err != nil {
|
||||
t.Fatalf("kvPut: %v", err)
|
||||
}
|
||||
if len(stdinCalls) != 1 {
|
||||
t.Fatalf("want exactly 1 stdin write, got %d", len(stdinCalls))
|
||||
}
|
||||
sc := stdinCalls[0]
|
||||
joined := strings.Join(sc.argv, " ")
|
||||
if tc.wantCreate && !strings.Contains(joined, "kv put") {
|
||||
t.Fatalf("absent path must use `kv put`: %v", sc.argv)
|
||||
}
|
||||
if !tc.wantCreate && !strings.Contains(joined, "kv patch -method=rw") {
|
||||
t.Fatalf("present path must merge via `kv patch -method=rw`: %v", sc.argv)
|
||||
}
|
||||
if strings.Contains(joined, "kv patch") && !strings.Contains(joined, "-method=rw") {
|
||||
t.Fatalf("must never use plain `kv patch`: %v", sc.argv)
|
||||
}
|
||||
if sc.stdin != "SECRETVALUE" {
|
||||
t.Fatalf("value must travel via stdin, got %q", sc.stdin)
|
||||
}
|
||||
for _, a := range sc.argv {
|
||||
if strings.Contains(a, "SECRETVALUE") {
|
||||
t.Fatalf("value leaked into argv: %v", sc.argv)
|
||||
}
|
||||
}
|
||||
})
|
||||
}
|
||||
}
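Review note: the create-versus-merge choice exercised here comes down to which argv is built. The sketch below shows that selection in isolation, matching the shapes asserted in `TestVaultKVArgs`; `putArgs` is a hypothetical stand-in for `vaultKVPutArgs`, and the existence check that decides `merge` is assumed to happen in the caller.

```go
package main

import (
	"fmt"
	"strings"
)

// putArgs (hypothetical) builds the vault argv for a single-key write.
// merge=false creates the path with `kv put`; merge=true uses the
// read-modify-write form of `kv patch` so sibling keys are preserved.
// The value is never placed in argv: `key=-` tells vault to read it from stdin.
func putArgs(merge bool, path, key string) []string {
	verb := []string{"kv", "put"}
	if merge {
		verb = []string{"kv", "patch", "-method=rw"}
	}
	return append(verb, path, key+"=-")
}

func main() {
	fmt.Println(strings.Join(putArgs(false, "secret/x", "api_key"), " "))
	fmt.Println(strings.Join(putArgs(true, "secret/x", "api_key"), " "))
}
```

Per the test comments, the `-method=rw` form is chosen because it needs only read and update on the path, whereas plain `kv patch` needs the separate `patch` capability that the scoped policy lacks.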
|
||||
|
||||
func TestVaultHelpMentionsBothSystems(t *testing.T) {
|
||||
h := vaultHelp()
|
||||
for _, want := range []string{"Vaultwarden", "vault kv"} {
|
||||
if !strings.Contains(h, want) {
|
||||
t.Errorf("vault help must mention %q (distinguish the two systems)", want)
|
||||
}
|
||||
}
|
||||
// Must name the infra-secrets system so the distinction is unambiguous.
|
||||
if !strings.Contains(h, "HashiCorp") && !strings.Contains(h, "OpenBao") {
|
||||
t.Error("vault help must name HashiCorp Vault / OpenBao (the infra secrets store)")
|
||||
}
|
||||
}
|
||||
|
|
|
|||
164
cli/edges.go
|
|
@ -1,164 +0,0 @@
|
|||
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// edgesOpts is the parsed filter set for `homelab edges` (the who-talks-to-whom
// investigation helper over the goldmane_edges trail; see ADR-0014).
type edgesOpts struct {
	ns       string // edges touching this namespace (either direction)
	src      string // edges where src_ns = this
	dst      string // edges where dst_ns = this
	peersOf  string // distinct peers of this namespace (both directions)
	newSince string // first_seen >= duration (24h/7d/30m) or date (YYYY-MM-DD)
	denied   bool   // action = 'deny' only
	asJSON   bool   // wrap result as a JSON array
	limit    int    // row cap (default 200)
}

// parseEdgesArgs parses the edges flag surface. Unknown flags error out so a
// typo surfaces instead of silently dumping the whole table.
func parseEdgesArgs(args []string) (edgesOpts, error) {
	o := edgesOpts{limit: 200}
	i := 0
	for i < len(args) {
		a := args[i]
		key, inline, hasInline := a, "", false
		if eq := strings.IndexByte(a, '='); eq >= 0 {
			key, inline, hasInline = a[:eq], a[eq+1:], true
		}
		needVal := func() (string, error) {
			if hasInline {
				return inline, nil
			}
			if i+1 < len(args) {
				i++
				return args[i], nil
			}
			return "", fmt.Errorf("flag %s needs a value", key)
		}
		var err error
		switch key {
		case "--ns":
			o.ns, err = needVal()
		case "--src":
			o.src, err = needVal()
		case "--dst":
			o.dst, err = needVal()
		case "--peers-of":
			o.peersOf, err = needVal()
		case "--new-since":
			o.newSince, err = needVal()
		case "--denied":
			o.denied = true
		case "--json":
			o.asJSON = true
		case "--limit":
			var v string
			if v, err = needVal(); err == nil {
				if o.limit, err = strconv.Atoi(v); err != nil {
					err = fmt.Errorf("--limit must be an integer: %q", v)
				}
			}
		default:
			return o, fmt.Errorf("unknown flag: %s", a)
		}
		if err != nil {
			return o, err
		}
		i++
	}
	return o, nil
}

// nsRE is the safe namespace-token charset (k8s names + "Global"). Used as the
// injection guard: anything else is rejected rather than quoted-and-hoped.
var nsRE = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`)

func validateNS(s string) error {
	if s == "" || len(s) > 63 || !nsRE.MatchString(s) {
		return fmt.Errorf("invalid namespace name: %q", s)
	}
	return nil
}

// sqlStr renders a SQL string literal (belt-and-suspenders on top of validateNS).
func sqlStr(s string) string { return "'" + strings.ReplaceAll(s, "'", "''") + "'" }

var (
	durRE  = regexp.MustCompile(`^(\d+)([smhd])$`)
	dateRE = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}([ T]\d{2}:\d{2}(:\d{2})?)?$`)
)

// newSinceCond turns a duration (24h/7d/30m/90s) or a date (YYYY-MM-DD[ HH:MM])
// into a first_seen predicate.
func newSinceCond(v string) (string, error) {
	if m := durRE.FindStringSubmatch(v); m != nil {
		unit := map[string]string{"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}[m[2]]
		return fmt.Sprintf("first_seen >= now() - interval '%s %s'", m[1], unit), nil
	}
	if dateRE.MatchString(v) {
		return "first_seen >= " + sqlStr(v), nil
	}
	return "", fmt.Errorf("--new-since must be a duration (e.g. 24h, 7d, 30m) or a date (YYYY-MM-DD): %q", v)
}

// buildEdgesQuery renders the SQL for the given filters against the `edge` table.
func buildEdgesQuery(o edgesOpts) (string, error) {
	limit := o.limit
	if limit <= 0 {
		limit = 200
	}

	// peers-of is a distinct-peer summary, a different shape from the row list.
	if o.peersOf != "" {
		if err := validateNS(o.peersOf); err != nil {
			return "", err
		}
		p := sqlStr(o.peersOf)
		return fmt.Sprintf("SELECT DISTINCT peer, action FROM ("+
			"SELECT dst_ns AS peer, action FROM edge WHERE src_ns = %s "+
			"UNION SELECT src_ns AS peer, action FROM edge WHERE dst_ns = %s"+
			") t ORDER BY peer LIMIT %d", p, p, limit), nil
	}

	var conds []string
	for _, f := range []struct{ val, tmpl string }{
		{o.ns, "(src_ns = %[1]s OR dst_ns = %[1]s)"},
		{o.src, "src_ns = %s"},
		{o.dst, "dst_ns = %s"},
	} {
		if f.val == "" {
			continue
		}
		if err := validateNS(f.val); err != nil {
			return "", err
		}
		conds = append(conds, fmt.Sprintf(f.tmpl, sqlStr(f.val)))
	}
	if o.denied {
		conds = append(conds, "action = 'deny'")
	}
	if o.newSince != "" {
		c, err := newSinceCond(o.newSince)
		if err != nil {
			return "", err
		}
		conds = append(conds, c)
	}

	q := "SELECT src_ns, dst_ns, action, flow_count, first_seen, last_seen FROM edge"
	if len(conds) > 0 {
		q += " WHERE " + strings.Join(conds, " AND ")
	}
	q += fmt.Sprintf(" ORDER BY first_seen DESC LIMIT %d", limit)

	if o.asJSON {
		q = "SELECT coalesce(json_agg(row_to_json(t)), '[]') FROM (" + q + ") t"
	}
	return q, nil
}
|
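The guard in `buildEdgesQuery` is allowlist-first: a namespace has to pass `validateNS` before `sqlStr` ever quotes it. A minimal standalone sketch of that pattern follows, reusing the regex and quoting rule from the deleted file. `nsFilter` is a hypothetical helper introduced only for illustration; it is not part of `cli/edges.go`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nsRE mirrors the allowlist in edges.go: k8s-style names plus "Global".
var nsRE = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`)

// validateNS rejects anything outside the allowlist instead of trying to escape it.
func validateNS(s string) error {
	if s == "" || len(s) > 63 || !nsRE.MatchString(s) {
		return fmt.Errorf("invalid namespace name: %q", s)
	}
	return nil
}

// sqlStr doubles single quotes; it is the second layer, not the primary guard.
func sqlStr(s string) string { return "'" + strings.ReplaceAll(s, "'", "''") + "'" }

// nsFilter is a hypothetical helper (an assumption for this sketch): it renders
// the both-directions predicate only after the value passes validation.
func nsFilter(ns string) (string, error) {
	if err := validateNS(ns); err != nil {
		return "", err
	}
	return fmt.Sprintf("(src_ns = %[1]s OR dst_ns = %[1]s)", sqlStr(ns)), nil
}

func main() {
	for _, ns := range []string{"immich", "a'; DROP TABLE edge;--"} {
		cond, err := nsFilter(ns)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("ok:", cond)
	}
}
```

Rejecting outright keeps the set of values that can reach the SQL text small and easy to reason about; the quote-doubling in `sqlStr` then only has to cover input that already looks like a namespace.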
@ -1,163 +0,0 @@
|
|||
package main

import (
	"strings"
	"testing"
)

func TestParseEdgesArgs(t *testing.T) {
	cases := []struct {
		name string
		args []string
		want edgesOpts
	}{
		{"defaults", nil, edgesOpts{limit: 200}},
		{"ns", []string{"--ns", "immich"}, edgesOpts{ns: "immich", limit: 200}},
		{"ns equals", []string{"--ns=immich"}, edgesOpts{ns: "immich", limit: 200}},
		{"src dst", []string{"--src", "a", "--dst", "b"}, edgesOpts{src: "a", dst: "b", limit: 200}},
		{"peers-of", []string{"--peers-of", "authentik"}, edgesOpts{peersOf: "authentik", limit: 200}},
		{"denied json", []string{"--denied", "--json"}, edgesOpts{denied: true, asJSON: true, limit: 200}},
		{"new-since", []string{"--new-since", "24h"}, edgesOpts{newSince: "24h", limit: 200}},
		{"limit", []string{"--limit", "50"}, edgesOpts{limit: 50}},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			got, err := parseEdgesArgs(c.args)
			if err != nil {
				t.Fatalf("parseEdgesArgs(%v) error: %v", c.args, err)
			}
			if got != c.want {
				t.Fatalf("parseEdgesArgs(%v) = %+v, want %+v", c.args, got, c.want)
			}
		})
	}
}

func TestParseEdgesArgsErrors(t *testing.T) {
	for _, args := range [][]string{
		{"--limit", "abc"},
		{"--bogus"},
	} {
		if _, err := parseEdgesArgs(args); err == nil {
			t.Errorf("parseEdgesArgs(%v) expected error, got nil", args)
		}
	}
}

func TestBuildEdgesQueryDefaults(t *testing.T) {
	q, err := buildEdgesQuery(edgesOpts{limit: 200})
	if err != nil {
		t.Fatal(err)
	}
	for _, want := range []string{"FROM edge", "ORDER BY first_seen DESC", "LIMIT 200"} {
		if !strings.Contains(q, want) {
			t.Errorf("query %q missing %q", q, want)
		}
	}
	if strings.Contains(q, "WHERE") {
		t.Errorf("no-filter query should have no WHERE: %q", q)
	}
}

func TestBuildEdgesQueryFilters(t *testing.T) {
	cases := []struct {
		name string
		o    edgesOpts
		want string
	}{
		{"ns both directions", edgesOpts{ns: "immich", limit: 10}, "(src_ns = 'immich' OR dst_ns = 'immich')"},
		{"src only", edgesOpts{src: "authentik", limit: 10}, "src_ns = 'authentik'"},
		{"dst only", edgesOpts{dst: "dbaas", limit: 10}, "dst_ns = 'dbaas'"},
		{"denied", edgesOpts{denied: true, limit: 10}, "action = 'deny'"},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			q, err := buildEdgesQuery(c.o)
			if err != nil {
				t.Fatal(err)
			}
			if !strings.Contains(q, "WHERE") || !strings.Contains(q, c.want) {
				t.Errorf("query %q missing WHERE/%q", q, c.want)
			}
		})
	}
}

func TestBuildEdgesQueryCombinedFiltersAnded(t *testing.T) {
	q, err := buildEdgesQuery(edgesOpts{src: "a", denied: true, limit: 5})
	if err != nil {
		t.Fatal(err)
	}
	if !strings.Contains(q, "src_ns = 'a' AND action = 'deny'") {
		t.Errorf("combined filters not AND'd: %q", q)
	}
}

func TestBuildEdgesQueryPeersOf(t *testing.T) {
	q, err := buildEdgesQuery(edgesOpts{peersOf: "authentik", limit: 100})
	if err != nil {
		t.Fatal(err)
	}
	for _, want := range []string{"DISTINCT", "src_ns = 'authentik'", "dst_ns = 'authentik'", "UNION"} {
		if !strings.Contains(q, want) {
			t.Errorf("peers-of query %q missing %q", q, want)
		}
	}
}

func TestBuildEdgesQueryJSON(t *testing.T) {
	q, err := buildEdgesQuery(edgesOpts{asJSON: true, limit: 200})
	if err != nil {
		t.Fatal(err)
	}
	if !strings.Contains(q, "json_agg") || !strings.Contains(q, "row_to_json") {
		t.Errorf("json query missing json_agg wrapper: %q", q)
	}
}

func TestBuildEdgesQueryRejectsInjection(t *testing.T) {
	for _, bad := range []string{"a'; DROP TABLE edge;--", "a b", "a;b", "a\"b"} {
		if _, err := buildEdgesQuery(edgesOpts{ns: bad, limit: 10}); err == nil {
			t.Errorf("buildEdgesQuery(ns=%q) expected validation error, got nil", bad)
		}
	}
}

func TestNewSinceCond(t *testing.T) {
	cases := []struct {
		in   string
		want string
	}{
		{"24h", "first_seen >= now() - interval '24 hours'"},
		{"7d", "first_seen >= now() - interval '7 days'"},
		{"30m", "first_seen >= now() - interval '30 minutes'"},
		{"2026-06-28", "first_seen >= '2026-06-28'"},
	}
	for _, c := range cases {
		got, err := newSinceCond(c.in)
		if err != nil {
			t.Fatalf("newSinceCond(%q) error: %v", c.in, err)
		}
		if got != c.want {
			t.Errorf("newSinceCond(%q) = %q, want %q", c.in, got, c.want)
		}
	}
	for _, bad := range []string{"yesterday", "1y", "'; DROP", ""} {
		if _, err := newSinceCond(bad); err == nil {
			t.Errorf("newSinceCond(%q) expected error, got nil", bad)
		}
	}
}

func TestValidateNS(t *testing.T) {
	for _, ok := range []string{"immich", "calico-system", "kube-system", "Global", "pg-cluster-rw"} {
		if err := validateNS(ok); err != nil {
			t.Errorf("validateNS(%q) unexpected error: %v", ok, err)
		}
	}
	for _, bad := range []string{"", "a b", "a'b", "a;b", "../x", "a$b"} {
		if err := validateNS(bad); err == nil {
			t.Errorf("validateNS(%q) expected error, got nil", bad)
		}
	}
}
|
|
@ -20,7 +20,6 @@ func buildRegistry() []Command {
|
|||
	reg = append(reg, deployCommands()...)
	reg = append(reg, netCommands()...)
	reg = append(reg, obsCommands()...)
	reg = append(reg, edgesCommands()...)
	reg = append(reg, usageCommands()...)
	reg = append(reg, haCommands()...)
	reg = append(reg, browserCommands()...)
|
|
|
|||
|
|
@ -5,31 +5,8 @@ import (
	"os"
	"strings"
	"testing"
	"unicode/utf8"
)

func TestTruncatePreviewKeepsValidUTF8(t *testing.T) {
	// Byte-slicing a long Cyrillic string at 240 splits a 2-byte rune and emits
	// invalid UTF-8, the bug that crashed the recall hook. truncatePreview must
	// cut on a rune boundary and always stay valid UTF-8.
	long := strings.Repeat("я", 300) // 300 runes / 600 bytes
	got := truncatePreview(long, 240)
	if !utf8.ValidString(got) {
		t.Fatalf("truncatePreview produced invalid UTF-8: %q", got)
	}
	if r := []rune(got); len(r) != 241 || string(r[:240]) != strings.Repeat("я", 240) || r[240] != '…' {
		t.Fatalf("truncatePreview = %d runes, want 240 Cyrillic + ellipsis", len(r))
	}
	// Short multibyte strings pass through untouched (no ellipsis).
	if got := truncatePreview("кратко", 240); got != "кратко" {
		t.Fatalf("short string altered: %q", got)
	}
	// ASCII boundary still works.
	if got := truncatePreview(strings.Repeat("a", 500), 240); got != strings.Repeat("a", 240)+"…" {
		t.Fatalf("ascii truncation wrong: %q", got)
	}
}
|
||||
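The test above pins down the contract of `truncatePreview` (a limit counted in runes, an ellipsis appended only when a cut happens) but the implementation itself is not part of this hunk. One way such a function could be written is sketched below. The signature and the rune-counted limit are inferred from the test's assertions, so treat this as an assumption rather than the project's actual code.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncatePreview keeps at most max runes of s and appends an ellipsis when it
// had to cut. Sketch only: inferred from the test, not copied from the repo.
func truncatePreview(s string, max int) string {
	if max < 0 {
		max = 0
	}
	if utf8.RuneCountInString(s) <= max {
		return s // short input passes through untouched
	}
	// Converting to []rune guarantees the cut lands on a rune boundary,
	// which a plain byte slice s[:max] does not.
	return string([]rune(s)[:max]) + "…"
}

func main() {
	got := truncatePreview("кратко и ясно", 6)
	fmt.Println(got, utf8.ValidString(got))
}
```

The `[]rune` conversion copies the string, which is acceptable for short previews; a loop over `utf8.DecodeRuneInString` would avoid the allocation if that ever mattered.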
func TestResolveMemoryBase(t *testing.T) {
|
||||
old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL")
|
||||
defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }()
|
||||
|
|
|
|||
|
|
@ -13,7 +13,7 @@ The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling
|
|||
Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`:
|
||||
|
||||
- Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.**
|
||||
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`. `Plotting-Your-Dream-Book` (owned by Anca, dev in her org) keeps its GHA build in-place and pushes the image to **its own org's ghcr** (`ghcr.io/passionprojectsanca/book-plotter`, private) via the workflow's built-in `GITHUB_TOKEN` — no Forgejo mirror, no `viktorbarzin`-namespace push, no shared PAT in her repo (2026-06-27, migrated off DockerHub).
|
||||
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.
|
||||
- `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge.
|
||||
- Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror."
|
||||
|
||||
|
|
|
|||
|
|
@ -5,14 +5,6 @@ exists to answer the question that drove the whole CLI — *which verbs are wort
|
|||
adding next* — with data instead of one maintainer's habits (the earlier mining
|
||||
covered a single user's ~51k commands, so the surface is shaped to that user).
|
||||
|
||||
> **Update (2026-06-26) — the cross-user privacy *norm* below is superseded by
|
||||
> [ADR-0015](0015-os-is-the-authorization-boundary.md).** The prohibition this
|
||||
> ADR leaned on ("reading another user's `~/.claude` is off-limits even for an
|
||||
> owner in-session") no longer holds: the managed-settings policy now **defers
|
||||
> to OS/sudo authorization**. The `usage top` telemetry design itself is
|
||||
> unchanged and still current — only the "never read homes" framing in the
|
||||
> third decision below is overtaken.
|
||||
|
||||
## Decisions
|
||||
|
||||
- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows
|
||||
|
|
|
|||
|
|
@ -27,9 +27,3 @@ As the Service count grows we want an audit-grade record of which Service talks
|
|||
- **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
|
||||
- **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
|
||||
- **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
|
||||
|
||||
## As-built (2026-06-25)
|
||||
|
||||
Implemented across infra issues #57–#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (all Slack consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
|
||||
|
||||
Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.
|
||||
|
|
|
|||
|
|
@ -1,57 +0,0 @@
|
|||
# OS is the authorization boundary: agents defer to Unix/sudo, not a stricter in-policy rule
|
||||
|
||||
Supersedes the cross-user privacy *norm* that the devvm managed-settings policy
|
||||
carried and that ADR-0011 leaned on ("never read another user's home /
|
||||
`~/.claude`, off-limits even for an owner in-session"). ADR-0011's actual
|
||||
subject — `usage top` telemetry and its emit design — is unchanged and still
|
||||
current; only the privacy prohibition it referenced is superseded here.
|
||||
|
||||
## Context
|
||||
|
||||
The devvm managed-settings policy (`/etc/claude-code/managed-settings.json`,
|
||||
`claudeMd`) carried two rules that were, in practice, *stricter than the OS*:
|
||||
"you are not the admin, do not escalate privileges" and "never read another
|
||||
user's home directory, credentials, tokens, or `~/.claude`." The OS told a
|
||||
different story: `wizard` holds `(ALL) NOPASSWD: ALL` — full passwordless root.
|
||||
The kernel had already granted total read access; the policy was layering an
|
||||
artificial refusal on top of an authorization the OS already permits, and the
|
||||
"not the admin" framing was factually wrong for a NOPASSWD-root user.
|
||||
|
||||
Two honest ways to resolve the inconsistency: tighten sudo to match the policy,
|
||||
or loosen the policy to match the OS. The owner chose the latter on 2026-06-26,
|
||||
for analytics/debugging across the shared box.
|
||||
|
||||
## Decision
|
||||
|
||||
- **Authorization follows the OS, not this policy.** Agents may access whatever
|
||||
their OS user can access — directly or via `sudo` where they hold sudo rights
|
||||
— and must not impose restrictions stricter than the OS. On this box that
|
||||
includes other users' home directories and `~/.claude` for users who hold
|
||||
broad sudo.
|
||||
- **No separate prompt or carve-out** for OS-authorized access. The Unix
|
||||
permission model + sudoers is the single source of truth for who may read
|
||||
what. Other users' homes are mode `0750`, so a cross-home read necessarily transits
|
||||
`sudo` and is therefore captured in the sudo/auth audit log.
|
||||
- **Cluster/infra RBAC tiering is unchanged.** kubectl / Vault / infra access
|
||||
stays scoped to each user's RBAC tier; "defer to the OS" is about OS-level
|
||||
file access, not a licence to exceed cluster RBAC.
|
||||
- **Scope is symmetric and multi-user.** The rule lives in the *shared*
|
||||
managed-settings, so every user's agents defer to that user's own sudo grant.
|
||||
Any user with broad sudo gets the same cross-home read capability over other
|
||||
users' files. Accepted by the owner with that understanding; emo's and
|
||||
ancamilea's `~/.claude` are now agent-readable by sudo-holders.
|
||||
- **Takes effect in a fresh session.** managed-settings loads at session start;
|
||||
the session that made the change keeps running under the old policy.
|
||||
|
||||
## Consequences
|
||||
|
||||
- The privacy-preserving telemetry rationale in ADR-0011 (`usage top` as the
|
||||
"cross-user analytics without reading homes" answer) remains useful but is no
|
||||
longer the *only* sanctioned path; direct reads via `sudo` are now permitted.
|
||||
- Larger blast radius: if an agent session running as a sudo-holder is
|
||||
prompt-injected or otherwise compromised, it can now read every user's secrets
|
||||
with no in-agent friction (sudo here is passwordless). The sudo/auth audit log
|
||||
is the remaining accountability control.
|
||||
- Reversible: restore the prior `claudeMd` bullets (backup kept at
|
||||
`/etc/claude-code/managed-settings.json.bak-2026-06-26`) and start a fresh
|
||||
session.
|
||||
|
|
@ -86,56 +86,10 @@ Signin latency is dominated by screen count and round trips, not server time
|
|||
use the explicit-consent flow (it re-prompted every 4 weeks per app).
|
||||
- **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
|
||||
are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
|
||||
15m policy cache, gunicorn `max_requests=10000`/jitter=1000 (recycle
|
||||
hardening — decorrelates the 9 workers' recycles from PG blips). **No
|
||||
`CONN_MAX_AGE`** — persistent Django connections pin a PgBouncer server conn
|
||||
1:1 and saturate the session-mode pool (reverted 2026-06-10).
|
||||
15m policy cache, 60s persistent DB connections.
|
||||
- **Static assets cached immutable**: `/static` ingress carve-out adds
|
||||
`Cache-Control: public, max-age=31536000, immutable` (assets are
|
||||
version-fingerprinted; authentik itself sends no max-age).
|
||||
- **Rate-limit carve-out** (2026-06-28): `/` and `/static` use a dedicated
|
||||
`authentik-rate-limit` (100/1000) instead of the shared 10/50 default — the
|
||||
login SPA cold-loads ~70 flow-executor chunks from `/static`; the default
|
||||
burst 429'd the tail and a failed ES-module import left a blank login screen.
|
||||
- **Readiness tolerance** (2026-06-28): server `readinessProbe.failureThreshold:8`
|
||||
(~80s, was the chart-default ~30s). The probe (`/-/health/ready/`) queries the
|
||||
DB; too-tight tolerance let a sub-60s PG/pgbouncer transient return 503 on all
|
||||
3 server pods at once → Traefik had no healthy backend → 502/503/504 (episodic
|
||||
blank login + 30s hangs). 80s absorbs a full CNPG failover reconnect. Sessions
|
||||
+ cache are PostgreSQL-only since Redis was removed in 2026.2 (no external-cache
|
||||
option), so request-serving is coupled to PG — this survives a short transient,
|
||||
not a total CNPG outage.
|
||||
- **Rolling-update strategy** (2026-06-28): the chart key is `deploymentStrategy`
|
||||
(the repo's old `strategy:` key was silently inert → live ran the chart-default
|
||||
25%/25% and dropped a server pod out of rotation on every roll). Now
|
||||
`maxSurge:1/maxUnavailable:0` keeps all 3 ready throughout a roll.
|
||||
- **Old-browser login (SFE)** (2026-06-28): authentik's modern flow SPA is ES2022
|
||||
and renders a **blank login** on Safari/WebKit ≤16.3 (every iOS browser shares
|
||||
the system WebKit, so it's not browser-choice — e.g. iPadOS ≤15). The overlay
|
||||
image patches `flows/views/interface.py::compat_needs_sfe()` to also serve
|
||||
authentik's built-in no-JS **Simplified Flow Executor** (SFE, ES5) to old Safari
|
||||
**and any iOS browser** (Chrome/Firefox on iOS are WebKit skins) on iOS ≤16.3,
|
||||
so those clients get the *real* authentik login (password + MFA + reputation —
|
||||
no auth downgrade). The SFE can't render Identification-stage **sources**
|
||||
(authentik limitation), so the patch also injects static social-login `<a>`
|
||||
links into `flow-sfe.html` (→ `/source/oauth/login/<slug>/`, plain redirects) —
|
||||
required for password-less accounts (e.g. Google-only users). A Traefik
|
||||
basic-auth fallback was rejected: it would have put a single spoofable-UA
|
||||
password in front of `vbarzin→wizard` (passwordless root on the devvm). See
|
||||
`stacks/authentik/patch-compat-sfe.py`.
|
||||
- **SFE + forced-WebAuthn MFA gotcha** (2026-06-28): the `default-authentication-flow`
|
||||
MFA stage (`not_configured_action=configure`, `conf_stages=[webauthn]`) force-enrols
|
||||
a WebAuthn passkey for any **password**-path user with no MFA device — but the SFE
|
||||
**cannot render WebAuthn** (enrol *or* validate), so that user gets
|
||||
`unsupported state: ak-stage-authenticator-webauthn`. Two escape hatches, **no MFA
|
||||
downgrade**: (1) **social login** — sources run `default-source-authentication`
|
||||
(UserLoginStage only, **no MFA stage**), so the SFE's "Continue with <provider>"
|
||||
button always completes; (2) **enrol TOTP** — the SFE *can* validate TOTP codes, and
|
||||
≥1 confirmed device flips the stage from force-enrol to validate. User MFA devices are
|
||||
runtime data (not Terraform): enrol via `ak shell`
|
||||
(`TOTPDevice.objects.create(user=…, confirmed=True)`) and store the secret in the
|
||||
user's own Vaultwarden item. (Done for emo — the Google-only iPadOS-15 case: TOTP in
|
||||
his `authentik.viktorbarzin.me` Bitwarden item; e2e-verified the BW code is accepted.)
|
||||
- **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
|
||||
- **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
|
||||
TCP setup on the forward-auth subrequest path.
|
||||
|
|
|
|||
|
|
@ -205,43 +205,6 @@ healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts*
|
|||
wrapper in `main.tf` (so it applies deterministically even though the image is
|
||||
`:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix
|
||||
as the android-emulator stack.
|
||||
|
||||
### noVNC black after a browser-container restart (x11vnc supervision)
|
||||
|
||||
A **distinct** failure from the fd-sweep gotcha above: the noVNC client *connects*
|
||||
but the view is **black**, and the novnc container logs spew
|
||||
`connecting to: localhost:5900` → `Failed to connect ... [Errno 111] Connection
|
||||
refused` (x11vnc is **down**, not slow). Cause: `x11vnc` and `websockify` both run
|
||||
in the **novnc** container, but x11vnc attaches to the **chrome-service** (browser)
|
||||
container's Xvfb over `localhost:6099` (shared pod network). When the browser
|
||||
container restarts — Chrome exits cleanly (exit 0, "Completed") or crashes — its
|
||||
Xvfb vanishes and x11vnc loses its X connection and exits.
|
||||
|
||||
`entrypoint.sh` **supervises** x11vnc: it launches x11vnc and websockify as
|
||||
background children and `wait -n`s on them, exiting non-zero if **either** dies, so
|
||||
the kubelet restarts the novnc container, which re-waits for Xvfb on `:6099` and
|
||||
relaunches x11vnc — the bridge **self-heals** across browser-container restarts.
|
||||
(Before 2026-06-27, x11vnc was an unsupervised background child of an `exec`ed
|
||||
websockify; a dead x11vnc was never relaunched, leaving `:5900` dead — a
|
||||
`<defunct>` zombie — and the view black until a manual pod restart. Same
|
||||
supervision pattern as the android-emulator stack's entrypoint.)
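
The supervision described above (start both children, block until the first one exits, then exit non-zero so the kubelet restarts the container) is implemented in shell in `entrypoint.sh`. Purely to illustrate the same control flow, here is a Go sketch. `waitFirst` and the stand-in child closures are hypothetical; a real supervisor would wrap `exec.Cmd` processes, and this is not how the image is built.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// child is a stand-in for a supervised process such as x11vnc or websockify.
type child struct {
	name string
	run  func() error // blocks until the "process" exits
}

// waitFirst starts every child and returns as soon as any one of them exits,
// mirroring `wait -n` in shell. The returned error is never nil: a supervised
// child exiting for any reason, even cleanly, should take the container down.
func waitFirst(children ...child) error {
	type exit struct {
		name string
		err  error
	}
	done := make(chan exit, len(children)) // buffered so late exits never block
	for _, c := range children {
		go func(c child) { done <- exit{c.name, c.run()} }(c)
	}
	first := <-done
	if first.err == nil {
		return fmt.Errorf("%s exited", first.name)
	}
	return fmt.Errorf("%s exited: %w", first.name, first.err)
}

func main() {
	err := waitFirst(
		child{"x11vnc", func() error {
			time.Sleep(50 * time.Millisecond)
			return errors.New("lost X connection")
		}},
		child{"websockify", func() error { select {} }}, // runs forever
	)
	// A real entrypoint would call os.Exit(1) here so the kubelet restarts
	// the container; the sketch only reports what it would do.
	fmt.Println("supervisor would exit non-zero:", err)
}
```

The key property is that the supervisor never outlives a dead child, which is what turns a silent half-broken bridge into a container restart.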
|
||||
|
||||
**Diagnose:** `kubectl exec -n chrome-service deploy/chrome-service -c novnc -- ps aux | grep x11vnc` (a `<defunct>`/Z
|
||||
entry = the bug); or the RFB-banner probe from a sibling container (`python3 -c
|
||||
"import socket;s=socket.socket();s.settimeout(2);s.connect(('127.0.0.1',5900));print(s.recv(12))"`
|
||||
— healthy returns `b'RFB 003.008\n'`, broken = `ConnectionRefused`). **Immediate
|
||||
recovery** (no image change): restart just the novnc container with `kubectl exec
|
||||
-n chrome-service deploy/chrome-service -c novnc -- kill 1` — re-runs its entrypoint
|
||||
and relaunches x11vnc **without** touching the browser session/in-flight CDP jobs.
|
||||
|
||||
> **Deploying a rebuilt novnc entrypoint:** Keel is **off** for this deployment
|
||||
> (`keel.sh/policy=never`, because the browser container's playwright image is
|
||||
> version-pinned to f1-stream) and the image is `:latest`/`IfNotPresent`, so a
|
||||
> rebuilt `:latest` will **not** redeploy on its own. After the
|
||||
> `build-chrome-service-novnc.yml` GHA build pushes `:latest` + `:<sha>`,
|
||||
> **SHA-pin** the novnc `image` in `main.tf` to the new `:<sha>` to force the pull
|
||||
> and rollout (the novnc image is TF-managed — not in the deployment's
|
||||
> `lifecycle.ignore_changes`).
|
||||
- **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
|
||||
serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
|
||||
bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088
|
||||
|
|
@ -293,42 +256,6 @@ Key facts:
|
|||
byte-identical copy of `files/stealth.js`, guarded by a drift test — so the
|
||||
CLI's stealth never diverges from the in-cluster callers'.
|
||||
|
||||
## Multi-user access (sharing the browser)
|
||||
|
||||
There is ONE chrome-service browser with ONE persistent profile, warmed with
|
||||
**Viktor's** logged-in sessions. CDP has no per-context auth, so anyone who can
|
||||
drive the browser — over the noVNC view OR the CDP/`homelab browser` path — can
|
||||
reach the persistent profile (`browser.contexts[0]`) and therefore Viktor's
|
||||
sessions. Access is gated accordingly, per user.
|
||||
|
||||
**Decision (2026-06-28):** emo (`emil.barzin` / `emil.barzin@gmail.com`) SHARES
|
||||
Viktor's browser for form-filling + captcha solving, rather than getting an
|
||||
isolated instance. The session-exposure trade-off above was explicitly accepted.
|
||||
|
||||
Two independent grants make up "browser access" for a user:
|
||||
|
||||
1. **noVNC (interactive view, `chrome.viktorbarzin.me`)** — gated by the Authentik
|
||||
`admin-services-restriction` policy: the `CHROME_ALLOWED` set
|
||||
(`stacks/authentik/admin-services-restriction.tf`) matches the user's Authentik
|
||||
username OR email. Add the user there. No kubeconfig/RBAC needed.
|
||||
2. **CLI (`homelab browser`, CDP over port-forward)** — needs `pods/portforward`
|
||||
in `chrome-service` PLUS a non-interactive credential (a normal devvm user's
|
||||
kubeconfig is interactive-OIDC-only and can't authenticate a headless agent
|
||||
session). Provided by a per-user **ServiceAccount** with a long-lived token
|
||||
(`stacks/chrome-service/rbac.tf`, e.g. `emo-browser`): `pods/portforward` in
|
||||
this namespace + cluster read-only (`oidc-power-user-readonly`, so it can also
|
||||
resolve the Service and doesn't regress the user's normal read). The devvm
|
||||
provisioner (`scripts/t3-provision-users.sh` → `install_browser_kubeconfig`)
|
||||
reads that token and installs it as the user's DEFAULT kubeconfig context
|
||||
(`<user>-browser@homelab`), keeping their personal OIDC login as the
|
||||
`oidc@homelab` named context. The SA's existence is the source of truth for who
|
||||
gets the CLI — the provisioner no-ops for users without a `<user>-browser` SA.
|
||||
|
||||
**To grant another user:** add them to `CHROME_ALLOWED` (noVNC) and/or add a
|
||||
`<user>-browser` SA + bindings mirroring `emo-browser` in `rbac.tf` (CLI), then run
|
||||
the provisioner. To revoke: remove from `CHROME_ALLOWED` and delete the SA (rotate
|
||||
a token by deleting its `<user>-browser-token` Secret).
|
||||
|
||||
## Limits + risks
|
||||
|
||||
- **Anti-bot vs stealth arms race** — when an upstream beats us (DRM
|
||||
|
|
|
|||
|
|
@ -115,67 +115,9 @@ claude-agent-service, claude-memory-mcp, kms-website, Freedify,
|
|||
instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`),
|
||||
fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original
|
||||
pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website,
|
||||
k8s-portal, apple-health-data, audiblez-web, insta2spotify,
|
||||
k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify,
|
||||
audiobook-search) now also land on ghcr.
|
||||
|
||||
**plotting-book** is a special case (a GitHub-first repo owned by Anca,
|
||||
ADR-0003): the build runs in *her* GitHub repo
|
||||
(`PassionProjectsAnca/Plotting-Your-Dream-Book`) and pushes to **private
|
||||
`ghcr.io/passionprojectsanca/book-plotter`** — under her org's ghcr namespace,
|
||||
not `viktorbarzin`, using the workflow's built-in `GITHUB_TOKEN` (no shared
|
||||
PAT). The cluster pulls it via the Kyverno-synced `ghcr-credentials` secret (the
|
||||
`plotting-book` namespace is on the allowlist; the shared `ghcr_pull_token` has
|
||||
read access). Migrated off public DockerHub (`viktorbarzin/book-plotter`) on
|
||||
2026-06-27. The Woodpecker deploy hook (repo 43, registered to Anca's repo) is
|
||||
unchanged. Flow:
|
||||
|
||||
```text
DEVELOP ───────────────────────────────────────────────────────────────────────
Anca (Codex / t3 web agent)
│ git push → main
▼
┌──────────────────────────────────────────────────────────────┐
│ GitHub: PassionProjectsAnca/Plotting-Your-Dream-Book (private)│ ← canonical
│ .github/workflows/build-and-deploy.yml on: push → main │
└───────────────────────────┬──────────────────────────────────┘
│ GitHub Actions runner (off-infra build · ADR-0002)
┌────────────────────┴─────────────────────────────────┐
▼ ▼
┌─────────────────────────────────────────────┐ ╔═══════════════════════════════════════╗
│ build job │ push ║ GHCR · PRIVATE package ║
│ • svu next --always → tag vX.Y.Z (→ repo) │═════▶║ ghcr.io/passionprojectsanca/ ║
│ • buildx linux/amd64, provenance:false │ tags ║ book-plotter :vX.Y.Z :latest ║
│ • login ghcr (GITHUB_TOKEN, packages:write)│ ╚═══════════════════╤═══════════════════╝
│ • delete-package-versions (keep newest 10) │ │
└───────────────────────┬─────────────────────┘ │ pull (private,
▼ deploy job [gate: repo var DEPLOY_ENABLED ≠ "false"] via secret)
POST ci.viktorbarzin.me/api/repos/43/pipelines {IMAGE_TAG, IMAGE_NAME} │
▼ │
┌─────────────────────────────────────────────────────────────┐ │
│ Woodpecker repo 43 · .woodpecker/deploy.yml (event: manual) │ │
│ kubectl set image deployment/plotting-book = <ghcr>:vX.Y.Z │ │
│ kubectl rollout status │ │
└───────────────────────────┬─────────────────────────────────┘ │
▼ │
═══════════════ Kubernetes · ns: plotting-book ════════════════════════════ │
┌─────────────────────────────────────────────────────────────┐ │
│ Deployment plotting-book (Recreate · image = ignore_changes)│ │
│ imagePullSecrets: ghcr-credentials ────────pull───────────┼─────────────────┘
│ Pod → Express :3001 + SQLite on PVC (proxmox-lvm) │
└─────────────────────────────────────────────────────────────┘
guards / supporting:
• Kyverno require-trusted-registries [Enforce] → ghcr.io/* ALLOWED (admission)
• Keel policy=patch @1h → watches GHCR via ghcr-credentials (backstop)
• ghcr-credentials ⇐ Kyverno generate-clone ⇐ Vault secret/viktor/ghcr_pull_token

═══════════════ Serving path (unchanged) ══════════════════════════════════
Browser ─▶ plotting-book.viktorbarzin.me (non-proxied DNS → Traefik .203)
─▶ Authentik forward-auth (gate) ─▶ Service :80 ─▶ Pod :3001
```
|
||||
|
||||
Governance: the Deployment + Kyverno allowlist are Terraform (`stacks/plotting-book`,
|
||||
`stacks/kyverno`); the live image *tag* is CI-owned (`ignore_changes`).
|
||||
|
||||
### Infra-owned images (issues #29 / #30)
|
||||
|
||||
Images owned by the infra repo build on GHA workflows **in the infra repo's own
|
||||
|
|
@ -221,9 +163,9 @@ Woodpecker is **deploy + cluster-touching steps only**:
|
|||
| Pipeline | File | Purpose |
|
||||
|----------|------|---------|
|
||||
| per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) |
|
||||
| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`). **Skips Tier-0 `vault`** — it's human-applied via OIDC; the CI `ci` role lacks Vault-admin perms (`sys/mounts`, `sys/policies/acl`) so a CI apply 403s |
|
||||
| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) |
|
||||
| certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron |
|
||||
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`). **Skips Tier-0 `vault`** (its `plan` 403s under the `ci` role and would fail the whole run) |
|
||||
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) |
|
||||
| provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
|
||||
| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change |
|
||||
| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE |
|
||||
|
|
@ -234,38 +176,6 @@ Woodpecker is **deploy + cluster-touching steps only**:
|
|||
|
||||
**No build/test pipeline exists on any repo.** Do not (re)introduce one.
|
||||
|
||||
### `default.yml` apply: dual-registration de-dup + reliability (2026-06-28)
|
||||
|
||||
infra is registered in Woodpecker on **both** the canonical Forgejo repo (id 82)
|
||||
and the legacy GitHub mirror (id 1), and **both fire `default.yml` on every
|
||||
push**. Left unguarded, two `terragrunt apply` runs race each other for the
|
||||
per-stack PG state lock — historically the #1 source of `Error acquiring the
|
||||
state lock` failures and push-supersede "killed" runs.
|
||||
|
||||
- **Forge guard** (first command in the `apply` step): the push-apply runs **only
|
||||
on the canonical Forgejo forge**; on the GitHub mirror it logs `[forge-guard]`
|
||||
and `exit 0`s. Detection: `CI_REPO_URL`/`CI_FORGE_URL` contains `github.com` →
|
||||
skip. Fail-open (unknown forge still applies). The mirror keeps running the
|
||||
**crons** (drift-detection, renew-tls, …), which live on repo 1 — only its
|
||||
duplicate push-apply no-ops. (Crons were NOT moved; deactivating repo 1 would
|
||||
have killed them.)
|
||||
- **Lock-skip matches both tiers**: a stack whose apply hits a lock is SKIPPED,
|
||||
not failed. The grep now matches the Tier-0 Vault message (`is locked by`) **and**
|
||||
the Tier-1 PG-backend message (`Error acquiring the state lock` / `already
|
||||
locked`) — the PG case was previously miscounted as a hard failure.
|
||||
- **Transient retry** (bounded, 3 attempts): only provider-registry download
|
||||
timeouts (`Failed to install provider` / `Client.Timeout`) and Vault 5xx are
|
||||
retried. Config errors (missing arg, invalid index) and helm `atomic` timeouts
|
||||
are NOT retried — they fail fast.
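
The three behaviours above (forge guard, lock-skip, bounded transient retry) can be sketched as one decision flow. This is a minimal illustration, not the pipeline's actual shell: the function names and return labels are hypothetical, the matched strings are the ones quoted in the bullets, and the `Code: 5xx` pattern for Vault errors is an assumption about how those errors appear in the apply output.

```python
import re

# Message fragments quoted in the text above.
LOCK_PATTERNS = (
    "is locked by",                    # Tier-0 Vault message
    "Error acquiring the state lock",  # Tier-1 PG-backend message
    "already locked",
)
TRANSIENT_PATTERNS = ("Failed to install provider", "Client.Timeout")
# Assumption: Vault 5xx responses surface as "Code: 5xx" in the output.
VAULT_5XX = re.compile(r"Code: 5\d\d")
MAX_ATTEMPTS = 3


def should_apply(repo_url: str, forge_url: str) -> bool:
    """Forge guard: skip only when the forge is clearly GitHub (fail-open)."""
    return "github.com" not in f"{repo_url} {forge_url}"


def classify_apply_output(output: str) -> str:
    """Return 'skip' for lock contention, 'retry' for transient errors, else 'fail'."""
    if any(p in output for p in LOCK_PATTERNS):
        return "skip"
    if any(p in output for p in TRANSIENT_PATTERNS) or VAULT_5XX.search(output):
        return "retry"
    return "fail"  # config errors, helm atomic timeouts, and so on: fail fast


def apply_stack(run_apply) -> str:
    """Apply one stack; run_apply() is a hypothetical callable returning (ok, output)."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        ok, output = run_apply()
        if ok:
            return "applied"
        verdict = classify_apply_output(output)
        if verdict != "retry" or attempt == MAX_ATTEMPTS:
            return "skipped" if verdict == "skip" else "failed"
    return "failed"
```

Checking lock contention before transient errors keeps a locked stack from burning retries against a lock that another run still holds.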
|
||||
|
||||
A pre-apply off-infra validate gate was evaluated and rejected: `terraform
|
||||
validate` runs without state but catches ~0 of the observed failures (they are
|
||||
provider-config-from-Vault-data, server-side-apply conflicts, helm installs, and
|
||||
lock contention — all invisible to static validate), and `plan` cannot run
|
||||
off-infra (no Vault/PG access). `terragrunt apply` already fails at its plan
|
||||
phase without mutating on config errors, so a separate in-pipeline plan-gate was
|
||||
also dropped as redundant.
|
||||
|
||||
### Woodpecker API
|
||||
|
||||
Uses **numeric repo IDs** (`/api/repos/<id>/pipelines`), NOT owner/name paths
|
||||
|
|
|
|||
|
|
@ -286,7 +286,7 @@ Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (por
|
|||
|
||||
#### Security Alerts (Wave 1 — planned, beads `code-8ywc`)
|
||||
|
||||
Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there). Same handling path as infra alerts; severity labels carried in the alert (critical/warning/info). The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
|
||||
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
|
||||
|
||||
| # | Source | Event | Severity |
|---|---|---|---|
|
||||
|
|
@ -318,20 +318,9 @@ IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-poli
|
|||
Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out.
|
||||
|
||||
- **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m).
|
||||
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → posts to **`#alerts`** via the `slack-security` receiver, which keeps its `[SECURITY]` styling (Slack-only, no paging; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
|
||||
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
|
||||
- **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)
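
The header check described under **Mechanism** can be reproduced offline to sanity-check a candidate URL's `Location` value before adding it to the target map. A minimal sketch: the regex is the one quoted above, while the helper name `walled_off` is hypothetical and this is not the blackbox-exporter implementation itself.

```python
import re
from typing import Optional

# Regex quoted from the http_no_authentik_redirect module description.
AUTHENTIK_LOCATION = re.compile(
    r"authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize"
)


def walled_off(location: Optional[str]) -> bool:
    """True when a response's Location header points at Authentik."""
    return location is not None and AUTHENTIK_LOCATION.search(location) is not None
```

A legitimate redirect (for example a short-link 302 to an unrelated host) does not match, which is why 301/302 can stay in the accepted status codes without masking a wall-off.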
|
||||
|
||||
#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
|
||||
|
||||
Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both route the default `slack-warning` receiver → **`#alerts`**.
|
||||
|
||||
| Alert | Expr (abridged) | For | Severity |
|---|---|---|---|
| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning |
| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning |
|
||||
|
||||
The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`).
|
||||
|
||||
#### Backup Alerts
|
||||
- **PostgreSQLBackupStale**: >36h since last backup
|
||||
- **MySQLBackupStale**: >36h since last backup
|
||||
|
|
|
|||
|
|
@ -541,7 +541,7 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1
|
|||
|
||||
**RBAC tiers:** `admin` (Viktor — cluster-admin, unlocked tree, secrets) · `power-user` (cluster-wide read-only, NO Secrets, via a dedicated `oidc-power-user-readonly` ClusterRole) · `namespace-owner` (admin in own namespace only). Each session acts as the user's **own** OIDC identity (kubelogin), never the admin's.
|
||||
|
||||
**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **(2026-06-26: the managed `claudeMd` now defers OS-level file access to the OS/sudo — a user holding broad `sudo` may read other users' files incl. `~/.claude`; the mode-600 / no-symlink posture is unchanged but is no longer reinforced by an agent "never read other homes" rule. See [ADR-0015](../adr/0015-os-is-the-authorization-boundary.md).)** **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.
|
||||
**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.
|
||||
|
||||
**Memory — homelab CLI hooks (rolled out 2026-06-21, deploy-fixed 2026-06-22):** the per-user `claude_memory` MCP was retired for the **homelab-memory hooks** — the reconcile's `install_memory` (re)installs four scripts into `~/.claude/hooks/` each run (`homelab-memory-recall.py` UserPromptSubmit recall, `auto-learn.py` Stop-hook extraction, `pre-compact-backup.sh`/`post-compact-recovery.sh`), wires them into `settings.json` if-absent + additive, and removes the old `claude_memory` MCP. **The provisioner binary itself now self-deploys from the repo** (step 0: `bash -n`-gated `install` + re-exec when `scripts/t3-provision-users.sh` differs from `/usr/local/bin/t3-provision-users`, guarded against re-exec loops / DRY_RUN mutation) — added after this very rollout sat committed-but-undeployed for a day (only the manual `setup-devvm.sh` had ever deployed the binary), so the hourly reconcile kept running the pre-memory version and emo/anca silently lost memory (recall + auto-learn never wired). A latent `set -e` abort in `install_memory` (a bare `[[ -d plugin-dir ]] && …` returning non-zero) was also fixed; it had killed the reconcile after the first user the first time it actually ran. The hooks need a `MEMORY_API_KEY` (or `CLAUDE_MEMORY_API_KEY`) in the user's `settings.json` env — the `homelab` CLI defaults the API URL, so **the key is the only hard requirement**; `install_memory` reuses an existing key and only WARNs if absent (it does NOT mint one — that's an admin Vault step, see Remaining). wizard + emo carry a key from their original MCP setup; **ancamilea is keyless → her memory no-ops until a key is minted.** (`auto-learn.py`'s passive store calls the API directly, so it additionally needs `*_API_URL` in env to avoid its local-SQLite fallback; recall + manual `homelab memory store` go through the URL-defaulting CLI and need only the key.)
|
||||
|
||||
|
|
|
|||
|
|
@ -261,7 +261,7 @@ Traefik chain:
|
|||
|
||||
1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`).
|
||||
2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend.
|
||||
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), and authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients).
|
||||
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads) and ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load).
|
||||
4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors).
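
To see why a burst of 50 clips a ~70-request cold load, here is a small token-bucket model of the rate-limit step. It is a sketch under stated assumptions: it assumes the token-bucket behaviour that Traefik documents for its RateLimit middleware, the class and function names are hypothetical, and all requests are modelled as arriving at the same instant from one client IP.

```python
class TokenBucket:
    """Bucket holding up to `burst` tokens, refilled at `average` tokens/second."""

    def __init__(self, average: float, burst: int, now: float = 0.0):
        self.average = average
        self.burst = burst
        self.tokens = float(burst)  # start full
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.average)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the middleware would answer 429 here


def count_rejected(average: float, burst: int, requests: int) -> int:
    """Number of simultaneous requests that would be rejected."""
    bucket = TokenBucket(average, burst)
    return sum(0 if bucket.allow(0.0) else 1 for _ in range(requests))
```

Under this model, `count_rejected(10, 50, 70)` rejects the last 20 requests, while the dedicated limits (`50/300`, `100/1000`) let all 70 through.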
|
||||
|
||||
Additional middleware:
|
||||
|
|
@ -550,7 +550,7 @@ chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Che
|
|||
|
||||
**Diagnosis**: Check Traefik middleware config for the affected IngressRoute.
|
||||
|
||||
**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, authentik 100/1000 (login SPA `/static` chunk burst → blank screen).
|
||||
**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300.
|
||||
|
||||
### Large Downloads or Uploads Truncate / Fail Partway
|
||||
|
||||
|
|
|
|||
|
|
@ -132,13 +132,6 @@ for the supersession history — there is no longer an inline Traefik bouncer.)
|
|||
account hard-limits to **one** list), and CAPI is already covered in-kernel on
|
||||
direct hosts and by Cloudflare's own managed protections on proxied hosts.
|
||||
Registered bouncer key: **`kvsync`**.
|
||||
- **Rate-limit resilient (2026-06-27):** Cloudflare's Lists-API *write* endpoint
|
||||
is throttled (~per-60s; `429 retry-after`). The CronJob runs `backoff_limit=0`
|
||||
(one POST per cycle — the `*/2` schedule IS the retry cadence) and treats a CF
|
||||
`429` as a soft-skip (exit 0, retry next cycle), the same fail-safe pattern it
|
||||
uses for LAPI. An earlier `backoff_limit=2` fired 3 rapid POSTs/cycle and
|
||||
escalated the throttle into a stuck state that left the list empty — a
|
||||
self-inflicted DoS that this change prevents.
|
||||
- **Block-only**: the single-list limit precludes a separate
|
||||
captcha/managed-challenge list, so both ban and captcha decisions are enforced
|
||||
as a plain block at the edge.
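
The rate-limit handling described above comes down to two choices: what exit code a Cloudflare reply maps to, and how many attempts one cycle may make. A minimal sketch, with hypothetical function names; treating every other non-2xx status as a hard failure is an assumption, since the text only specifies the `429` case.

```python
def exit_code(cf_status: int, soft_skip_429: bool = True) -> int:
    """Map a Cloudflare Lists-API reply to the Job's exit code."""
    if 200 <= cf_status < 300:
        return 0
    if cf_status == 429 and soft_skip_429:
        return 0  # soft-skip: the next scheduled cycle is the retry
    return 1      # assumption: anything else fails the Job


def posts_in_cycle(backoff_limit: int, statuses, soft_skip_429: bool = True) -> int:
    """Count POSTs in one CronJob cycle; `statuses` is the reply per attempt."""
    posts = 0
    # A Job makes at most backoff_limit + 1 attempts before giving up.
    for status in statuses[: backoff_limit + 1]:
        posts += 1
        if exit_code(status, soft_skip_429) == 0:
            break
    return posts
```

With `backoff_limit=2` and no soft-skip, a throttled cycle issues three POSTs; with `backoff_limit=0` and the soft-skip it issues exactly one.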
|
||||
|
|
@ -279,7 +272,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
|
|||
|
||||
The block below documents the locked design.
|
||||
|
||||
Response model: **(I) Slack-only, daily skim.** All security alerts post to **`#alerts`** via Alertmanager (the `slack-security` receiver keeps its distinct `[SECURITY/<sev>]` title styling so security-lane alerts still stand out). The dedicated `#security` channel was abandoned (2026-06-25) — the shared `alertmanager_slack_api_url` incoming webhook's Slack app isn't a member of it, so a channel override there returns HTTP `404 channel_not_found`; everything consolidated to `#alerts`. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
|
||||
Response model: **(I) Slack-only, daily skim.** All security alerts land in a new `#security` Slack channel via Alertmanager. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
|
||||
|
||||
#### Detection sources
|
||||
|
||||
|
|
@ -292,7 +285,7 @@ Response model: **(I) Slack-only, daily skim.** All security alerts post to **`#
|
|||
|
||||
#### Alert rules (16 total)
|
||||
|
||||
Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it). Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) carried in the alert.
|
||||
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) inside the single `#security` channel.
|
||||
|
||||
**K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):**
|
||||
|
||||
|
|
@ -371,69 +364,6 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
|
|||
- Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
|
||||
- Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).
|
||||
|
||||
#### Deriving the per-namespace egress allowlist from the edge trail (Wave 1 W1.7)
|
||||
|
||||
The durable **east-west flow trail** (below) is now the preferred data source for
|
||||
the *internal* (namespace-to-namespace) half of each Wave-1 egress allowlist —
|
||||
faster and identity-stamped vs the original iptables-`LOG`→journald→Loki path
|
||||
(ADR-0014: "Enforcement gains a better data source"). The unique observed
|
||||
namespace pairs live in CNPG DB `goldmane_edges`, table `edge`. To derive the
|
||||
namespaces a source is observed talking to (the `allow` set that seeds its
|
||||
NetworkPolicy):
|
||||
|
||||
```sql
SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow' ORDER BY dst_ns;
```
|
||||
|
||||
The full SQL recipe (whole-cluster matrix, deny sanity-checks, the ≥7-day
|
||||
observation caveat) is in
|
||||
[runbooks/goldmane-flow-trail.md → Deriving the Wave-1 egress allowlist](../runbooks/goldmane-flow-trail.md#deriving-the-wave-1-egress-allowlist-from-the-edge-table-infra-62).
|
||||
**External / public-internet egress is NOT in this table** (empty-namespace flows
|
||||
are dropped) — for those destinations keep using the Calico flow-log observation
|
||||
(the W1.6 snapshot, `wave1-egress-observation-2026-05-22.md`). This feeds the
|
||||
existing observe-then-enforce effort (beads `code-8ywc`); **enforce-flips remain
|
||||
out of scope** of the trail — it is observe-and-derive only.
|
||||
|
||||
### East-west flow observability (Goldmane / Whisker + edge trail) (ADR-0014)
|
||||
|
||||
The "who-talks-to-whom" data plane that succeeds raw iptables-`LOG` lines (which
|
||||
carried no identity). **Service identity = the workload's namespace** (primary),
|
||||
refined by a `service-identity` label in the few multi-Service namespaces
|
||||
(`monitoring`, `kube-system`, `dbaas`). End-to-end trail, three layers:
|
||||
|
||||
1. **Calico Goldmane + Whisker** (`calico-system`) — Goldmane aggregates
|
||||
identity-stamped flows (ns/pod/workload/labels + allow-deny + policy-trace)
|
||||
streamed from Felix over gRPC into a **~60-min in-memory ring buffer** (no
|
||||
etcd/API writes — the etcd-cost constraint that drove the design). **Whisker**
|
||||
is its live web UI at `whisker.viktorbarzin.me` (Authentik-gated,
|
||||
`auth = "required"` — Whisker has no own login; an additive NetworkPolicy ORs
|
||||
Traefik past the operator's default-deny `whisker` NP). The ring buffer is
|
||||
**not** a trail (lost on Goldmane restart). Enabled via operator CRs in
|
||||
`stacks/calico/main.tf`; reversible toggle (Goldmane is OSS tech-preview).
|
||||
2. **`goldmane-edge-aggregator`** (`stacks/goldmane-edge-aggregator`) — streams
|
||||
Goldmane's gRPC `Flows.Stream` over **mTLS** and upserts the low-cardinality
|
||||
namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,
|
||||
flow_count)`) into CNPG DB `goldmane_edges`. Self-edges and empty-namespace
|
||||
(public-internet) flows are dropped — in-cluster relationships only. The mTLS
|
||||
client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`**
|
||||
(Goldmane verifies CA-chain only, not identity) rather than copying the CA
|
||||
private key into TF state — **re-apply the stack if the operator rotates that
|
||||
Secret**.
|
||||
3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to
|
||||
**`#alerts`** (reuses the alert-digest webhook). All Slack now consolidates to
|
||||
`#alerts`; the `#security` channel was abandoned 2026-06-25 because that
|
||||
webhook's Slack app isn't a member of it (a `#security` override 404s). See
|
||||
runbook.
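
The aggregator's edge bookkeeping in layer 2 can be modelled in a few lines. This is an in-memory sketch, not the real service: the `EdgeStore` class is hypothetical, keying edges on `(src_ns, dst_ns, action)` is an assumption, and a plain dict stands in for the CNPG `edge` table.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    first_seen: int
    last_seen: int
    flow_count: int


class EdgeStore:
    """In-memory stand-in for the low-cardinality namespace-pair edge table."""

    def __init__(self):
        self.edges = {}

    def observe(self, src_ns: str, dst_ns: str, action: str, ts: int) -> bool:
        """Upsert one flow; return True when the edge is seen for the first time."""
        # Self-edges and empty-namespace (public-internet) flows are dropped.
        if not src_ns or not dst_ns or src_ns == dst_ns:
            return False
        key = (src_ns, dst_ns, action)
        edge = self.edges.get(key)
        if edge is None:
            self.edges[key] = Edge(first_seen=ts, last_seen=ts, flow_count=1)
            return True
        edge.last_seen = max(edge.last_seen, ts)
        edge.flow_count += 1
        return False

    def allowed_destinations(self, src_ns: str):
        """Sorted distinct destinations a source was allowed to reach."""
        return sorted({d for (s, d, a) in self.edges if s == src_ns and a == "allow"})
```

The `True` return from `observe` marks the first-seen edges that a daily digest would report, and `allowed_destinations` is the set that would seed a namespace's egress allowlist.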
|
||||
|
||||
The trail is **attribution-grade, not cryptographic** (reconstructs events in a
|
||||
trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model
|
||||
limit; east-west stays plaintext, no mTLS between app pods). Health is covered by
|
||||
the **`AggregatorDown`** + **`DigestFailing`** alerts and cluster-health check #48
|
||||
(see monitoring.md). Full as-built, query recipes, and troubleshooting:
|
||||
[runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md). Decision:
|
||||
[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md); glossary
|
||||
`CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
|
||||
|
||||
### TLS & HTTP/3
|
||||
|
||||
**Traefik** handles TLS termination:
|
||||
|
|
|
|||
|
|
@ -1,117 +0,0 @@
|
|||
# k8s-upgrade compat-gate: classify "actionable" vs "held" blocks
|
||||
|
||||
**Date:** 2026-06-28
|
||||
**Status:** design → implementation
|
||||
**Stack:** `stacks/k8s-version-upgrade` (+ `stacks/monitoring` alert rules)
|
||||
|
||||
## Problem
|
||||
|
||||
The cluster is on k8s 1.35.6. The nightly `k8s-version-check` chain detects the
|
||||
next minor (1.36.2), runs the preflight compat-gate, and the gate **refuses**
|
||||
it — because no released kyverno/ESO supports k8s 1.36 yet, and gpu-operator is
|
||||
deliberately pinned (its 26.3 bump needs a newer NVIDIA driver image + Ubuntu
|
||||
release we're not ready for). The result, **every single night**:
|
||||
|
||||
- a **Failed** preflight Job (`block()` exits 1), and
|
||||
- `k8s_upgrade_blocked=1` → the **K8sUpgradeBlocked** alert.
|
||||
|
||||
But this block is **not actionable** — there's nothing we can upgrade to clear
|
||||
it; we can only wait for upstream (kyverno/ESO) and, separately, do the
|
||||
gpu-operator/Ubuntu work. The gate is crying wolf: a "blocked, needs attention"
|
||||
signal that's indistinguishable from a block we could actually fix.
|
||||
|
||||
## Goal
|
||||
|
||||
Make the gate **classify** each blocker and behave accordingly:
|
||||
|
||||
| Class | Definition | Behaviour |
|-------|-----------|-----------|
| **actionable** | the compat matrix has a newer version of the addon whose `max_k8s >= target`, and the running version is older — upgrading it would clear the block | **alert** (`k8s_upgrade_blocked=1` → K8sUpgradeBlocked), with the specific "upgrade X → Y" remediation in the nightly report |
| **waiting-upstream** | **no** matrix version of the addon supports the target yet (kyverno/ESO for 1.36) | **quiet** (`k8s_upgrade_held=1`, no alert) — nightly report only |
| **pinned** | a supporting version exists but the addon carries `"pinned": true` in the matrix (gpu-operator) | **quiet** (held) |
|
||||
|
||||
Removed-API and containerd blocks are always **actionable**. **Held wins:** if
|
||||
*any* blocker is waiting-or-pinned, the whole target is **HELD** (quiet) —
|
||||
acting on the actionable blockers wouldn't unblock it yet. The nightly report
|
||||
still lists everything so the full eventual scope is visible.
|
||||
|
||||
Also (scope decision: "tidy the block path"): deliberate gate decisions
|
||||
(actionable-block **and** held) now make the preflight Job **Complete cleanly**
|
||||
(exit 0) instead of Failing. Chain progression is gated on the verdict, not the
|
||||
exit code. Real failures (unhealthy nodes, kubeadm errors, crashes) still exit
|
||||
1 → `K8sUpgradeChainJobFailed`.
|
||||
|
||||
## Design
|
||||
|
||||
### `compat-gate.py`
|
||||
- New exit codes: `0` safe · `2` actionable-block · `3` gate-error (fail-safe) · **`4` held**.
|
||||
- Each stdout reason line is tagged `[ACTIONABLE]` / `[WAITING]` / `[PINNED]`.
|
||||
- `check_addons`: when an addon blocks, decide its class:
|
||||
- `pinned: true` in its matrix entry → `[PINNED]`.
|
||||
- else a higher matrix version with `max_k8s >= target` exists → `[ACTIONABLE]` (`upgrade X to >= V`).
|
||||
- else → `[WAITING]` (`no released X version supports k8s T yet`).
|
||||
- unreadable image / below-matrix → `[ACTIONABLE]` (fail-safe — a human must look).
|
||||
- `check_removed_apis`, `check_containerd`: tag `[ACTIONABLE]`.
|
||||
- `exit_code(reasons)`: `0` if none; `4` if any `held_reason` (WAITING/PINNED); else `2`.
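
The classification rules above can be sketched roughly as follows. This is an illustration only, not the real `compat-gate.py`: the matrix shape (a `versions` list of `{"version", "max_k8s"}` rows), the helper names `parse`, `supports` and `classify_addon`, the plain dotted-number version parsing, and the choice to apply the fail-safe check before the `pinned` check are all assumptions.

```python
"""Hypothetical sketch of addon classification; not the real compat-gate.py."""

HELD_TAGS = ("[WAITING]", "[PINNED]")


def parse(version):
    """Turn '1.36' or '1.36.2' into a comparable tuple of ints (assumed format)."""
    return tuple(int(part) for part in version.split("."))


def supports(max_k8s, target):
    """Compare on (major, minor) only, so max_k8s '1.36' covers target '1.36.2'."""
    return parse(max_k8s)[:2] >= parse(target)[:2]


def classify_addon(name, entry, running, target):
    """Return a tagged reason line for a blocking addon, or None if it does not block."""
    rows = entry["versions"]  # assumed shape: [{"version": "...", "max_k8s": "..."}]
    current = next((r for r in rows if r["version"] == running), None)
    if current is None:
        # Unreadable or below-matrix version: fail safe, a human must look.
        # (Checking this before `pinned` is an assumption of this sketch.)
        return f"[ACTIONABLE] {name}: running version {running!r} not in matrix"
    if supports(current["max_k8s"], target):
        return None
    if entry.get("pinned"):
        return f"[PINNED] {name}: {entry.get('pin_reason', 'pinned')}"
    newer = [r for r in rows
             if parse(r["version"]) > parse(running) and supports(r["max_k8s"], target)]
    if newer:
        lowest = min(newer, key=lambda r: parse(r["version"]))
        return f"[ACTIONABLE] upgrade {name} to >= {lowest['version']}"
    return f"[WAITING] no released {name} version supports k8s {target} yet"


def exit_code(reasons):
    """0 = safe, 4 = held (any WAITING/PINNED reason wins), 2 = actionable block."""
    if not reasons:
        return 0
    if any(r.startswith(HELD_TAGS) for r in reasons):
        return 4
    return 2
```

Exit code `3` (gate-error) is left out on purpose: per the list above it signals that the gate itself failed, not a classification outcome.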
|
||||
|
||||
### `upgrade-step.sh`
|
||||
- New global `HALT_CHAIN=0`; `spawn_next()` returns early (no next Job) when set.
|
||||
- Replace `block()` with `record_blocked()` / `record_held()` — push the gauge,
|
||||
set `HALT_CHAIN=1`, **do not exit**.
|
||||
- `phase_preflight` gate handling routes on the gate's exit code:
|
||||
- `0` → push `blocked=0`+`held=0`, proceed.
|
||||
- `2`/`3` → `record_blocked`, `return 0` (Job Completes, K8sUpgradeBlocked fires).
|
||||
- `4` → `record_held`, `return 0` (Job Completes, **no alert**).
|
||||
- Push the gauge **definitively once** per run (remove the pre-reset `blocked=0`
|
||||
at gate start) so a standing block doesn't flap 1→0→1 and re-notify.
|
||||
- postflight also clears `held=0` alongside the existing gauge resets.
|
||||
|
||||
### detector (`main.tf`, the `k8s-version-check` CronJob)
|
||||
- Consequence of the tidy change: refusals now **Complete** instead of Failing,
|
||||
so the old "re-spawn only a *Failed* preflight" idempotency would skip a
|
||||
refused-but-Complete preflight until its 7d TTL. Fix: re-spawn nightly when the
|
||||
preflight is **Complete but no `k8s-upgrade-master-<target>` Job exists** (the
|
||||
gate refused — chain never advanced) — **silently** (no Slack), so a standing
|
||||
hold re-evaluates each night without noise.
|
||||
- The per-night `slack "K8s upgrade available…"` becomes an `echo`; the spawn
|
||||
Slack fires only for a genuinely new spawn or a Failed-respawn (`ANNOUNCE`
|
||||
flag), not for silent re-evaluations — killing the last nightly-noise source.
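
The detector's re-spawn rule can be summarised as a small decision function. This is a sketch under assumptions: the real logic lives in the `k8s-version-check` CronJob script in `main.tf`, and the state names (`None`, `"Active"`, `"Failed"`, `"Complete"`) and the function name `detector_decision` are hypothetical.

```python
def detector_decision(preflight_state, master_job_exists):
    """Return (spawn, announce) for tonight's detector run.

    preflight_state: None (no preflight Job for this target yet),
    "Active", "Failed" or "Complete" (hypothetical state names).
    announce mirrors the ANNOUNCE flag: True means post to Slack.
    """
    if preflight_state is None:
        return True, True    # genuinely new target: spawn and announce
    if preflight_state == "Failed":
        return True, True    # real failure: re-spawn and announce
    if preflight_state == "Complete" and not master_job_exists:
        return True, False   # gate refused last time: re-evaluate silently
    return False, False      # chain advanced or still running: skip
```

A standing hold therefore maps to `(True, False)` every night: the gate is re-evaluated, but nothing is posted.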
|
||||
|
||||
### `addon-compat.json`
|
||||
- Add `"pinned": true` + `"pin_reason"` to the gpu-operator entry (its
|
||||
`26.3 → 1.36` row stays; `pinned` overrides classification to held). Document
|
||||
the `pinned` flag in `_comment`. Unpinning later = delete two keys.
|
||||
|
||||
### `stacks/monitoring` alert rules (`prometheus_chart_values.tpl`)
|
||||
- `K8sUpgradeBlocked` (`k8s_upgrade_blocked == 1`): unchanged trigger, now
|
||||
actionable-only; reword annotation (reasons are in the nightly report, not a
|
||||
per-run chain Slack).
|
||||
- `K8sUpgradeChainJobFailed`: **drop** the `unless on() (k8s_upgrade_blocked == 1)`
|
||||
clause — deliberate blocks no longer create Failed Jobs, so the alert again
|
||||
means a genuine wedge.
|
||||
- **No alert** for `k8s_upgrade_held` (intentional — nothing to action; the
|
||||
nightly report surfaces it). Add a comment recording this.
|
||||
|
||||
### `nightly-report.py`
|
||||
- Read `k8s_upgrade_held`. New `⏸️ HELD — <target> not yet upgradable` headline.
|
||||
- Group reasons by tag: *Action needed* / *Waiting on upstream* / *Pinned (held by us)*
|
||||
(fallback bullets for untagged lines, so older reason strings still render).
|
||||
- Fetch reasons when avail AND (blocked OR held).
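
The tag grouping with an untagged fallback might look like the sketch below. It is not the real `nightly-report.py`: the function names, the `*heading*` and bullet rendering, and the exact regular expression are assumptions; only the three tags and the three group headings come from the design above.

```python
import re

# Tag -> report heading, in display order (headings from the design above).
HEADINGS = {
    "ACTIONABLE": "Action needed",
    "WAITING": "Waiting on upstream",
    "PINNED": "Pinned (held by us)",
}
TAG_RE = re.compile(r"^\[(ACTIONABLE|WAITING|PINNED)\]\s*(.*)$")


def group_reasons(lines):
    """Split reason lines into per-heading lists plus a list of untagged lines."""
    groups = {heading: [] for heading in HEADINGS.values()}
    untagged = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = TAG_RE.match(line)
        if match:
            groups[HEADINGS[match.group(1)]].append(match.group(2))
        else:
            untagged.append(line)  # older reason strings without a tag
    return groups, untagged


def render(lines):
    """Render grouped reasons; untagged lines fall back to plain bullets."""
    groups, untagged = group_reasons(lines)
    out = []
    for heading, items in groups.items():
        if items:
            out.append(f"*{heading}*")
            out.extend(f"• {item}" for item in items)
    out.extend(f"• {item}" for item in untagged)
    return "\n".join(out)
```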
|
||||
|
||||
## Net effect on 1.36 today
|
||||
**HELD, quiet** — waiting on kyverno + ESO (upstream) + gpu-operator (pinned);
|
||||
Calico listed as the lone actionable piece. No nightly Failed Job, no alert —
|
||||
just the nightly report's ⏸️ line. Flips to actionable (→ alert) only once
|
||||
kyverno/ESO ship support **and** gpu-operator is unpinned.
|
||||
|
||||
## Tests (TDD)
|
||||
- `compat-gate`: waiting / actionable / pinned-is-held / mixed-held-wins,
|
||||
removed-API & containerd are actionable, exit_code mapping, + existing
|
||||
patch/safe cases stay green.
|
||||
- `nightly-report`: held headline + grouped reasons; existing tests stay green.
|
||||
- `upgrade-step.sh`: shellcheck; manual review of the HALT_CHAIN + gauge flow
|
||||
(bash, not unit-tested).
|
||||
|
||||
## Out of scope (separate follow-up)
|
||||
Auto-refreshing the matrix when upstream ships 1.36 support (a periodic
|
||||
addon-readiness probe). This change only *consumes* the matrix.
|
||||
|
|
@ -1,128 +0,0 @@
|
|||
# Post-Mortem: MetalLB ServiceL2Status Stuck Immutable → PG LB VIP Flap → Woodpecker CI Tier 1 Applies Broken
|
||||
|
||||
| Field | Value |
|-------|-------|
| **Date** | 2026-05-16 (mitigated) / 2026-05-26 (closed) |
| **Duration** | ~25 days of degraded CI (2026-04-21 first observed → 2026-05-16 mitigated). Symptom-only; no human-visible service downtime. |
| **Severity** | SEV3 — Woodpecker CI default.yml apply step failed on Tier 1 (PG-backend) stacks. Drift-detection ran silently broken. Manual `scripts/tg apply` continued to work. No data loss, no app downtime. |
| **Affected Services** | Woodpecker CI pipelines applying any of the 28+ Tier 1 stacks (monitoring, crowdsec, authentik, headscale, etc.). PostgreSQL backend itself was healthy. |
| **Issue** | Beads `code-aoxk` (closed 2026-05-26). |
| **Status** | Closed |
|
||||
|
||||
## Summary
|
||||
|
||||
In Woodpecker CI the failure surfaced as `ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc` from `scripts/tg` whenever a pipeline tried to apply a Tier 1 stack. The error was misleading on two counts:
|
||||
|
||||
1. **Vault was healthy.** A direct `vault read database/static-creds/pg-terraform-state` from inside a Woodpecker pipeline pod (using K8s SA JWT → `auth/kubernetes/login role=ci`) succeeded every time when run in isolation.
|
||||
2. **The "Cannot read PG credentials" message in `scripts/tg` was a catch-all** that fired for *any* Terraform/Terragrunt failure during PG state-lock acquire-release, including TCP RSTs against the PG LoadBalancer VIP.
|
||||
|
||||
Actual root cause: the MetalLB `ServiceL2Status` CR for the `postgresql-lb` service (`dbaas` namespace, VIP `10.0.20.200`) had a stuck `status.node` field that the controller treated as immutable. The L2 speaker kept failing to update it with `Invalid value: "k8s-nodeX": Value is immutable`, so the leader-elected announcer flapped between k8s-node3 and k8s-node4 every few seconds. Each flap dropped open TCP connections (RST). Terraform's state-lock acquire → operation → release sequence straddled flaps and failed mid-operation. `scripts/tg` surfaced this as the misleading "Cannot read PG credentials" message.
|
||||
|
||||
Manual `scripts/tg apply` from the DevVM kept working because the developer's session happened to land on whichever node currently held the VIP and complete fast enough to not straddle a flap. CI pipelines, being slower (full stack walk), reliably straddled at least one flap.
|
||||
|
||||
## Impact
|
||||
|
||||
- **CI degradation**: Tier 1 stack changes pushed to master were NOT auto-applied. Required manual `scripts/tg apply` from DevVM after every push touching one of 28+ stacks.
|
||||
- **Drift-detection broken**: The daily `drift-detection.yml` Woodpecker pipeline silently failed on every Tier 1 stack — meaning unannounced manual changes to those stacks could have persisted undetected for the duration.
|
||||
- **No user-facing outage**: PG cluster itself, all apps that use PG, and all in-cluster traffic to `10.0.20.200` worked normally. Only the very specific `acquire-state-lock → run operation → release-state-lock` round-trip pattern from CI was unreliable.
|
||||
|
||||
## Timeline (UTC)
|
||||
|
||||
| Time | Event |
|------|-------|
| 2026-04-21 | First broken CI pipelines (#411, #412, #413). Drift-detection failures noticed. `code-aoxk` filed. Initial hypothesis: Vault auth/role mismatch. |
| 2026-04-22 — 2026-05-15 | Multiple investigation attempts. Verified Vault K8s `auth/kubernetes/role/ci` has correct policies (`terraform-state`, `ci`). Verified `database/static-creds/pg-terraform-state` exists, rotates on schedule, credentials valid. Could not reproduce the failure in isolated `vault read` from Woodpecker pods. |
| 2026-05-16 (~12:14 UTC) | `pg-cluster-3` came up (third CNPG replica); endpoint set churn likely triggered MetalLB L2 announcer to attempt to update the existing `ServiceL2Status` CR (was `l2-rgt9d`). Update was rejected as immutable. Speaker kept retrying. VIP flapped. |
| 2026-05-16 | RCA breakthrough: noticed `kubectl logs -n metallb-system -l component=speaker` was full of `Invalid value: "k8s-node…": Value is immutable` on the postgresql-lb ServiceL2Status. Correlated with `kubectl get servicel2status` returning multiple stale entries for the same service. |
| 2026-05-16 | **Mitigation**: `kubectl delete servicel2status.metallb.io l2-rgt9d -n metallb-system`. Speaker recreated the CR cleanly (became `l2-zj9ss`). Flap stopped. PG connections stable. Manual CI re-runs of `monitoring` stack apply succeeded immediately. |
| 2026-05-17 | Audit: acceptance criteria 1 + 2 met implicitly. #3 (post-mortem) remained pending. Beads task reverted from `in_progress` → `open`. |
| 2026-05-25 | Node2 SCSI LUN remap → encrypted PVC emergency_ro → containerd boltdb corruption outage. Unrelated, but pulled Woodpecker server off node2. Subsequent server pod restart on k8s-node4. |
| 2026-05-26 | Verification: from a live Woodpecker pipeline pod (`wp-01kshph6pa0w6ch0zf5x9bfqgr`), `vault write auth/kubernetes/login role=ci jwt=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)` succeeded. `vault read database/static-creds/pg-terraform-state` returned valid creds (`username=terraform_state`, last_vault_rotation 2026-05-21, TTL 58h). Live `default.yml` pipeline confirmed applying Tier 1 stacks: dbaas, authentik, crowdsec, monitoring, nvidia, cloudflared, kyverno, metallb — all `OK`. `postgresql-lb` ServiceL2Status currently single allocation (`l2-sv9vv` on k8s-node3, no flap). Beads task closed. |
|
||||
|
||||
## Root Cause
|
||||
|
||||
`metallb-speaker` reconciler in the deployed MetalLB version treats `ServiceL2Status.status.node` as immutable after first set. When the L2 announcer's leader-election picks a different node to announce a given VIP (which happens on speaker pod restart, node loss, endpoint set churn, or pod-anti-affinity reshuffles), the reconciler fails to patch the existing CR and gets stuck in a retry loop. Without manual deletion, the reconciler will not progress.
|
||||
|
||||
Why it manifested as Vault credential errors:
|
||||
|
||||
1. CI's `scripts/tg` pre-flight runs `vault read database/static-creds/pg-terraform-state` (line 83 in current code) to get PG credentials. That call succeeds.
|
||||
2. CI then runs `terragrunt apply` against the Tier 1 stack. Terragrunt connects to `10.0.20.200:5432` for state-lock acquire (via `pg_advisory_lock`). The TCP connection lands on whichever node MetalLB last announced the VIP from.
|
||||
3. Mid-operation, MetalLB tries to re-announce from a different node, sends gratuitous ARPs, and the upstream switch updates its MAC table. Open TCP sessions on the previous announcer's node are immediately RST.
|
||||
4. Terragrunt's state-lock release (or any subsequent PG operation) fails with broken pipe / connection refused.
|
||||
5. `scripts/tg` interpreted the wrapper-level failure as "PG creds bad" because that's the most common failure mode it handles. The actual error from terragrunt was buried in `2>/dev/null` suppression (since fixed — see Fix #1 below).
|
||||
|
||||
## Detection
|
||||
|
||||
We did not have any of:
|
||||
- A direct alert for "MetalLB ServiceL2Status reconciler errors".
|
||||
- An alert for "PG LB VIP node changed N times in M minutes".
|
||||
- An end-to-end probe for the CI state-lock pattern (terragrunt against `10.0.20.200`).
|
||||
|
||||
The detection mechanism was a human reading `kubectl logs -n metallb-system` for unrelated reasons. It took 25 days from the first observed symptom to RCA.
|
||||
|
||||
## Fixes & Mitigations
|
||||
|
||||
### 1. Surface real error from `scripts/tg` (DONE)
|
||||
|
||||
The original `scripts/tg` swallowed the real `vault read` / terragrunt error behind `2>/dev/null` and printed a static "Cannot read PG credentials from Vault" message. Fixed in the script:
|
||||
|
||||
```sh
# scripts/tg lines 79-89 (current)
if ! command -v vault >/dev/null 2>&1; then
  echo "ERROR: vault CLI not found on PATH. Install it or use an image that includes it (ci/Dockerfile)." >&2
  exit 1
fi
VAULT_OUT=$(vault read -format=json database/static-creds/pg-terraform-state 2>&1) || {
  echo "ERROR: Cannot read PG credentials from Vault. Vault output follows:" >&2
  echo "$VAULT_OUT" >&2
  echo "" >&2
  echo "Hint: humans run 'vault login -method=oidc'; CI auths via K8s SA (role=ci)." >&2
  exit 1
}
```
|
||||
|
||||
Comment in the code explicitly references this incident.
|
||||
|
||||
### 2. Stuck-CR cleanup procedure (DOCUMENTED)
|
||||
|
||||
Reproduction check for future sessions (also in `code-aoxk` beads notes):
|
||||
|
||||
```sh
kubectl logs -n metallb-system -l component=speaker --tail=200 | grep -iE 'Invalid value.*immutable'
# If matches found → same root cause. Delete the stuck CR:
kubectl get servicel2status -n metallb-system
kubectl delete servicel2status.metallb.io <name> -n metallb-system
```
|
||||
|
||||
Speaker recreates the CR cleanly within seconds.
|
||||
|
||||
### 3. Long-term MetalLB controller fix (DEFERRED)
|
||||
|
||||
The underlying bug — speaker not recreating the CR when the immutable field needs to change — is upstream MetalLB behaviour. Two paths possible:
|
||||
|
||||
- **Upgrade MetalLB** to a version where this is fixed (needs research — check changelogs).
|
||||
- **File upstream issue / patch** with reproducer.
|
||||
|
||||
Not done as part of this post-mortem; tracked separately. Risk acceptance: until then, the manual `delete servicel2status` workaround is the playbook, and is fast (<10s).
|
||||
|
||||
### 4. Alerting (DEFERRED)
|
||||
|
||||
Suggested but not implemented:
|
||||
- Prometheus alert on `metallb_speaker_reconcile_errors_total{kind="ServiceL2Status"}` rate.
|
||||
- Synthetic probe: a CronJob that does `pg_advisory_lock` + release against the PG VIP every 5min from CI namespace, alert if it ever fails.
|
||||
|
||||
Tracked as future hardening (no beads task yet — only worth filing if recurrence happens).
|
||||
|
||||
## Lessons
|
||||
|
||||
1. **`2>/dev/null` is a time-bomb.** It hid the real error for weeks. Fix #1 already lands the principle; audit other places in `scripts/` for the same anti-pattern next time we touch them.
|
||||
2. **CRD `status.*` immutability is a non-obvious failure mode.** When debugging weird LB / VIP / endpoint behaviour, always grep speaker logs for `immutable`, `cannot update`, and reconciler errors. Add to cluster-health checks.
|
||||
3. **Misleading wrapper errors cost weeks.** `scripts/tg` claimed "Cannot read PG credentials" — that's what the operator believed. The actual `vault read` step worked. The real failure was three steps later in a completely different subsystem. When a wrapper script makes a definitive claim about which subsystem failed, distrust it; reproduce the subsystem in isolation before chasing the claim.
|
||||
4. **CNPG primary changes / endpoint churn can trigger L2 announcer flap.** The trigger (within the timeline) was likely the `pg-cluster-3` pod coming up. Worth flagging for any future CNPG topology changes.
|
||||
|
||||
## References
|
||||
|
||||
- Beads: `code-aoxk` — closed 2026-05-26.
|
||||
- `scripts/tg` lines 65-95 — current pre-flight with explicit error surfacing.
|
||||
- `kubectl get servicel2status -A` — current state, single allocation per service.
|
||||
- This file: `infra/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md`.
|
||||
|
|
@ -1,97 +0,0 @@
|
|||
# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
|
||||
|
||||
> Filename kept for inbound links. The originally-suspected cause (kubeadm-config
|
||||
> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
|
||||
> drift was a real *separate* latent bug fixed in the same change.
|
||||
|
||||
**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
|
||||
the master control-plane phase for the first time — preflight passed, etcd
|
||||
snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the
|
||||
kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
|
||||
static-pod-hash window across all internal retries, then auto-rolled-back to
|
||||
v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
|
||||
the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
|
||||
No data loss; no user-facing outage (the master carries control-plane taints, so
|
||||
no workloads were displaced).
|
||||
|
||||
**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
|
||||
first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
|
||||
static pods, i.e. the first time the upgrade pushes real write-IO at etcd.
|
||||
|
||||
## Root cause — etcd IO starvation on the shared HDD
|
||||
|
||||
The new kube-apiserver could not establish/keep a working connection to etcd
|
||||
during the upgrade because **etcd was IO-starved**. etcd's surviving container log
|
||||
from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows:
|
||||
|
||||
- **1,180** `apply request took too long` warnings in 16 minutes;
|
||||
- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms),
|
||||
clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
|
||||
to bring the new apiserver up.
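
Counts like these can be pulled from the container log with a short script. The sketch below assumes the usual CRI log layout (a timestamp/stream prefix followed by etcd's JSON record) and that each slow-apply record carries a Go-style `took` duration; both are assumptions about the log format rather than something this post-mortem records, and the function names are hypothetical.

```python
import json
import re

SLOW_APPLY_MSG = "apply request took too long"
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")
_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
                 "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_go_duration(text):
    """Convert a Go-style duration such as '4.3s' or '1m2.5s' to seconds."""
    return sum(float(num) * _UNIT_SECONDS[unit]
               for num, unit in _DURATION_PART.findall(text))


def slow_applies(log_lines):
    """Yield the 'took' duration in seconds for each slow-apply warning."""
    for line in log_lines:
        start = line.find("{")  # skip the assumed CRI prefix before the JSON
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except ValueError:
            continue  # not a JSON record; ignore
        if record.get("msg") == SLOW_APPLY_MSG and "took" in record:
            yield parse_go_duration(record["took"])
```

Feeding it the lines of the etcd container log and taking `len()` and `max()` of the result gives the warning count and the worst apply latency for the window.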
|
||||
|
||||
A reproduced 1.35.6 apiserver with no etcd dies with
|
||||
`F instance.go:233 Error creating leases: error creating storage factory: context
|
||||
deadline exceeded` — the same failure mode a multi-second etcd produces. etcd
|
||||
lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
|
||||
shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
|
||||
that spindle:
|
||||
|
||||
1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
|
||||
2. kubeadm dumping a full **~400MB etcd DB backup** to
|
||||
`/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
|
||||
etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
|
||||
cleans them up), pushing master root fs to **73%**, above the 70% kubelet
|
||||
image-GC threshold, so image GC churned during the drain too;
|
||||
3. master-drain pod evictions.
|
||||
|
||||
### Correction — it was NOT the OIDC flag swap
|
||||
|
||||
`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
|
||||
`--authentication-config` (structured multi-issuer OIDC) back to legacy
|
||||
single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
|
||||
was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
|
||||
those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
|
||||
(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test
|
||||
etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
|
||||
the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
|
||||
were also ruled out.
|
||||
|
||||
## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
|
||||
|
||||
apiserver auth is configured in three places that must agree:
|
||||
(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
|
||||
+ `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest
|
||||
(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM —
|
||||
which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
|
||||
the manifest from (3), so it would have reverted structured auth → **dashboard +
|
||||
kubectl SSO break after a successful upgrade** (recoverable: the chain's
|
||||
post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.
|
||||
|
||||
## Resolution
|
||||
|
||||
1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%.
|
||||
2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps.
|
||||
3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).
|
||||
|
||||
## Prevention (landed in this change)
|
||||
|
||||
| Gap | Fix |
|-----|-----|
| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
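
The prune rule in the first row can be illustrated with a selection function. The real implementation is shell inside `upgrade-step.sh`; this Python sketch only shows the "matching name, older than 3 days, best-effort" logic. The function name `stale_scratch` is hypothetical, and it deliberately returns the candidates instead of deleting them.

```python
import fnmatch
import os
import time

PATTERNS = ("kubeadm-backup-*", "kubeadm-upgraded-manifests*")


def stale_scratch(tmp_dir, max_age_days=3, now=None):
    """Return sorted scratch paths under tmp_dir last modified over max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    try:
        names = os.listdir(tmp_dir)
    except OSError:
        return stale  # best-effort: a missing or unreadable directory is not an error
    for name in names:
        if not any(fnmatch.fnmatch(name, pattern) for pattern in PATTERNS):
            continue
        path = os.path.join(tmp_dir, name)
        try:
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
        except OSError:
            continue  # entry vanished or is unreadable; skip it
    return sorted(stale)
```

Swallowing `OSError` mirrors the "never aborts" property: a failed prune must not stop the preflight.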
|
||||
|
||||
## Lessons
|
||||
|
||||
- **Capture the failing component's own logs before concluding.** The `kubeadm
|
||||
upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
|
||||
applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is
|
||||
"what config changes," not "why it crashed."
|
||||
- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
|
||||
2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
|
||||
backup copy + drain) onto that spindle. code-oflt is the real fix.
|
||||
- **Tools that leave per-operation scratch must be reaped.** kubeadm's
|
||||
`/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
|
||||
GC'd; 28GB had silently accumulated.
|
||||
- **Out-of-band control-plane edits must be written back to kubeadm-config** — else
|
||||
`kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).
|
||||
|
|
@ -11,11 +11,6 @@ inference every six hours and backs up only the `claudeAiOauth` object to:
|
|||
secret/workstation/claude-users/<os-user>
|
||||
```
|
||||
|
||||
The backup **merges** into that path (`vault kv patch -method=rw`, falling back to
|
||||
`kv put` only when the path does not exist yet), so keys that other tools
|
||||
co-locate there — notably `homelab vault`'s `vaultwarden_*` credentials — survive.
|
||||
A blind `kv put` here silently wiped them on every six-hourly run (fixed 2026-06-26).
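
The difference between the two write modes can be modelled with plain dictionaries. This is a toy model of the semantics described above, not Vault's API: `kv_put` and `kv_patch` are hypothetical helper names, and `vaultwarden_password` is an illustrative key standing in for the `vaultwarden_*` family.

```python
def kv_put(existing, new):
    """Model of a blind put: the new payload replaces the whole secret."""
    return dict(new)


def kv_patch(existing, new):
    """Model of a patch/merge: only the supplied keys change, siblings survive."""
    merged = dict(existing)
    merged.update(new)
    return merged


secret = {"claude_ai_oauth_json": "old", "vaultwarden_password": "keep-me"}
update = {"claude_ai_oauth_json": "new"}

after_put = kv_put(secret, update)      # the vaultwarden_* key is gone
after_patch = kv_patch(secret, update)  # the vaultwarden_* key survives
```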
|
||||
|
||||
The user's unrelated `mcpOAuth` credentials never leave their home directory.
|
||||
Each renewal service has a distinct 32-day periodic Vault token, mode `0600`, at
|
||||
`~/.config/claude-auth-sync/vault-token`. Its policy can access only that user's
|
||||
|
|
@ -80,64 +75,8 @@ sudo --preserve-env=VAULT_ADDR,VAULT_TOKEN /usr/local/bin/t3-provision-users
|
|||
```
|
||||
|
||||
Never copy another user's `.credentials.json` or scoped Vault token. Never restore
|
||||
a **shared** `CLAUDE_CODE_OAUTH_TOKEN` across users; environment credentials
|
||||
outrank per-user login and would silently collapse all users onto one identity.
|
||||
(A **per-user**, non-rotating setup-token tied to the user's OWN Enterprise
|
||||
identity is a different, sanctioned thing — see "Long-lived per-user token" below.)
|
||||
|
||||
## Long-lived per-user token (heavy concurrent-agent users)
|
||||
|
||||
The six-hourly renewal above assumes Claude owns refresh-token rotation in a
|
||||
single `~/.claude/.credentials.json`. A user who runs **many concurrent Claude
|
||||
sessions** (interactive tmux panes + their `t3-serve` instance + always-on
|
||||
`start-claude.sh` agents) breaks that assumption: when the shared access token
|
||||
expires, the processes refresh **simultaneously**, the OAuth server rotates the
|
||||
refresh token, and the losing writer persists an **empty** refresh token —
|
||||
logging the user out roughly every access-token lifetime (~8h). Re-issuing the
|
||||
credential does not help; the race recurs.
|
||||
|
||||
The fix is a **per-user, long-lived setup-token** (`sk-ant-oat01-…`, ~1y,
|
||||
**non-rotating**). With `CLAUDE_CODE_OAUTH_TOKEN` set, Claude uses it directly and
|
||||
never touches `.credentials.json` — so there is nothing to race on. This is the
|
||||
user's OWN Enterprise identity (scope `user:inference`; local MCP servers are
|
||||
client-side and unaffected), stored only in their OWN Vault path — **NOT** the
|
||||
forbidden shared token, and it never crosses OS users.
|
||||
|
||||
**Enable it (one-time, per user):**
|
||||
|
||||
1. The user mints their own token (interactive Enterprise SSO):
|
||||
|
||||
```bash
claude setup-token   # opens an SSO URL; paste the code back -> prints sk-ant-oat01-…
```
|
||||
|
||||
2. An admin stores it in that user's Vault path (MERGE, never `kv put` — siblings
|
||||
like `claude_ai_oauth_json` / `vaultwarden_*` must survive):
|
||||
|
||||
```bash
vault kv patch -method=rw secret/workstation/claude-users/<os-user> \
  setup_token=sk-ant-oat01-…
```
|
||||
|
||||
3. Materialize + activate (or just wait ≤6h for the timer):
|
||||
|
||||
```bash
systemctl start claude-auth-sync@<os-user>.service
```
|
||||
|
||||
`claude-auth-sync` writes `~/.config/claude-auth-sync/claude-oauth.env`
|
||||
(`CLAUDE_CODE_OAUTH_TOKEN=…`, mode 0600) and, while a token is present, **skips**
|
||||
the rotating-credential validate/backup/restore (so no false
|
||||
`WorkstationClaudeAuthInvalid`). `start-claude.sh` and `t3-serve@.service` load
|
||||
that env file. **Sessions started before activation keep the old credential
|
||||
until relaunched** — the user must restart their agents / `t3-serve` to cut over.
|
||||
|
||||
**Disable it:** clear the field (`vault kv patch -method=rw
|
||||
secret/workstation/claude-users/<os-user> setup_token=""`) — the next sync removes
|
||||
the env file and the user reverts to the per-user SSO credential flow.
|
||||
|
||||
**Rotate before expiry:** setup-tokens expire 1y after mint. Re-mint (step 1) and
|
||||
re-store (step 2); the env file refreshes on the next sync.
|
||||
the old shared `CLAUDE_CODE_OAUTH_TOKEN`; environment credentials outrank per-user
|
||||
login and would silently collapse all users onto one identity.
|
||||
|
||||
## Verification
|
||||
|
||||
|
|
|
|||
|
|
@ -1,346 +0,0 @@
|
|||
# Goldmane Flow Trail — east-west "who-talks-to-whom" observability
|
||||
|
||||
> As-built runbook for the Calico Goldmane + Whisker flow plane and the
|
||||
> `goldmane-edge-aggregator` durable audit trail. Design + rationale:
|
||||
> [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
|
||||
> Glossary: `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
|
||||
> Implements infra issues #57 (Whisker ingress), #58 (aggregator), #61
|
||||
> (monitoring), #62 (egress allowlist queries), #63 (these docs).
|
||||
|
||||
## What the trail is
|
||||
|
||||
Three layers turn raw east-west traffic into a queryable, durable record of
|
||||
which Service talks to which. **Service identity = the workload's namespace**
|
||||
(primary), refined by a `service-identity` label in the few multi-Service
|
||||
namespaces (`monitoring`, `kube-system`, `dbaas`) — see ADR-0014.
|
||||
|
||||
| Layer | Component | Lifetime | Where it lives |
|---|---|---|---|
| **Live map** | Calico **Goldmane** + **Whisker** | ~60-min in-memory ring buffer (lost on Goldmane restart) | `calico-system`; Whisker UI at `whisker.viktorbarzin.me` |
| **Durable trail** | `goldmane-edge-aggregator` (`aggregate` mode) | persistent | CNPG Postgres DB `goldmane_edges`, table `edge` |
| **Notification** | `goldmane-edges-digest` CronJob (`digest` mode) | daily | Slack `#alerts` |
|
||||
|
||||
**Goldmane** aggregates identity-stamped flows (namespace / pod / workload /
|
||||
labels + allow-deny + policy-trace) streamed from Felix (the existing
|
||||
`calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer —
|
||||
**nothing is written to etcd or the K8s API** (the etcd-cost constraint that
|
||||
drove the whole design). **Whisker** is its live web UI. Because the ring
|
||||
buffer is *not* a trail (a Goldmane restart loses the window), the
|
||||
`goldmane-edge-aggregator` consumes Goldmane's gRPC `Flows.Stream` API over
|
||||
mTLS and upserts the unique **namespace-pair edge set** into Postgres; a daily
|
||||
CronJob posts first-seen edges to Slack.
|
||||
|
||||
The edge set is deliberately **low-cardinality** — one row per
|
||||
`(src_ns, dst_ns, action)`, *not* per-pod or per-port — so the table stays
|
||||
small no matter how much traffic flows.
|
||||
|
||||
## Where the data lives
|
||||
|
||||
### Whisker UI — live, ~60 min
|
||||
- `https://whisker.viktorbarzin.me` (Authentik-gated — Whisker ships no own
|
||||
login; `auth = "required"`). Shows the live flow stream + a service graph for
|
||||
roughly the last hour. Use it for "what is talking right now"; it is **not**
|
||||
history.
|
||||
- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
|
||||
(HTTP), both in `calico-system`.
|
||||
- **DNS fix + self-heal:** whisker's egress to the kube-dns ClusterIP is allowed
|
||||
by `whisker-allow-dns-clusterip` (`stacks/calico`) — without it the UI goes
|
||||
empty after any gRPC-stream break (see Troubleshooting → "Whisker UI empty").
|
||||
The `whisker-watchdog` CronJob (every 10 min) is a backstop that restarts
|
||||
whisker if its backend ever wedges for another reason.
|
||||
|
||||
### CNPG `goldmane_edges` — durable
|
||||
- Postgres DB `goldmane_edges` on the CNPG cluster
|
||||
(`pg-cluster-rw.dbaas.svc.cluster.local:5432`). One table:
|
||||
|
||||
```
|
||||
edge(src_ns text, dst_ns text, action text,
|
||||
first_seen timestamptz, last_seen timestamptz, flow_count bigint,
|
||||
PRIMARY KEY (src_ns, dst_ns, action))
|
||||
```
|
||||
|
||||
- `action` ∈ `allow` / `deny` / `pass` / `unspecified` (normalised Goldmane
|
||||
action).
|
||||
- **Self-edges (`src_ns == dst_ns`) and empty-namespace flows** (host-endpoint
|
||||
/ public-internet) are **dropped** — the trail is about in-cluster service
|
||||
relationships only. (Egress to the public internet is therefore NOT in this
|
||||
table; it lives in the Wave-1 Calico flow-log path — see security.md.)
|
||||
- A **"new edge"** = a row whose `first_seen` falls inside the digest window.
|
||||
- Role `goldmane_edges` (Vault-rotated, 7-day) owns the DB. The `edge` table
|
||||
is created idempotently by the aggregator at startup (canonical DDL also in
|
||||
the repo at `migrations/0001_edge.sql`).
|
||||
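The aggregation rules above (one row per `(src_ns, dst_ns, action)`, self-edges and empty-namespace flows dropped) can be illustrated with a minimal shell sketch. This is not the aggregator's code: it assumes flows arrive as tab-separated `src_ns`, `dst_ns`, `action` lines (a made-up input format), uses `awk` in place of the Postgres upsert, and the function name `aggregate_edges` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: mimics the edge-set rules on tab-separated input
# (src_ns, dst_ns, action). The real aggregator upserts into Postgres instead.
aggregate_edges() {
  LC_ALL=C awk -F'\t' '
    $1 == "" || $2 == "" { next }   # empty namespace: host-endpoint / public internet
    $1 == $2             { next }   # self-edge
    { count[$1 FS $2 FS $3]++ }     # one key per (src_ns, dst_ns, action)
    END { for (k in count) print k FS count[k] }
  ' | LC_ALL=C sort
}
```

Piping any number of flow lines through `aggregate_edges` yields at most one output line per namespace pair and action, with the last column playing the role of `flow_count`. This is why the table size tracks the number of distinct relationships rather than traffic volume.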
|
||||
### Slack `#alerts` — daily digest
|
||||
|
||||
> **Channel note (2026-06-25):** posts to **`#alerts`**. The dedicated `#security` channel was abandoned — the shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of it, so a channel override there returns HTTP `404 channel_not_found`. Everything now posts to `#alerts` (this digest plus alertmanager's `slack-security` receiver, which keeps its `[SECURITY]` styling so security-lane alerts still stand out there).
|
||||
|
||||
- CronJob `goldmane-edges-digest` (08:00 Europe/London) posts edges first seen
|
||||
in the last 24h. Quiet when there are none. Reuses the existing alert-digest
|
||||
Slack incoming webhook (Vault `secret/viktor` → `alertmanager_slack_api_url`)
|
||||
— no new webhook was created.
|
||||
|
||||
## How to enable / disable
|
||||
|
||||
### Goldmane + Whisker (the flow plane)
|
||||
Operator CRs in **`stacks/calico/main.tf`** — NOT the Helm `goldmane`/`whisker`
|
||||
flags (those stay `false`; the operator's own `installation`/`apiServer` are
|
||||
operator-managed via the `goldmanes`/`whiskers.operator.tigera.io` CRDs):
|
||||
|
||||
- `kubectl_manifest.goldmane` (kind `Goldmane`) — creating it makes the operator
|
||||
re-render `calico-node` with the `FELIX_FLOWLOGSGOLDMANESERVER` env (the
|
||||
operator auto-wires Felix — **do NOT patch FelixConfiguration**), triggering a
|
||||
supervised `calico-node` DaemonSet roll. Yields `Deployment` + `Service
|
||||
goldmane:7443`.
|
||||
- `kubectl_manifest.whisker` (kind `Whisker`, `depends_on` goldmane;
|
||||
`notifications = Disabled`). Yields `Deployment` + `Service whisker:8081`.
|
||||
|
||||
**To disable:** delete those two CRs and re-apply `stacks/calico`. Reversible
|
||||
toggle (Goldmane is tech-preview in OSS Calico 3.30 — the main standing risk per
|
||||
ADR-0014).
|
||||
|
||||
### Whisker public ingress (infra #57)
|
||||
Also in `stacks/calico/main.tf`:
|
||||
- `module "ingress_whisker"` (`ingress_factory`, `auth = "required"`,
|
||||
`dns_type = "proxied"`) → `whisker.viktorbarzin.me`.
|
||||
- `kubernetes_network_policy_v1.whisker_allow_traefik` — **required alongside the
|
||||
ingress**: the operator's own `whisker` NetworkPolicy (owned by the Whisker CR)
|
||||
is `policyTypes: [Ingress]` with no rules = default-deny ingress to the pod.
|
||||
This additive NP ORs in an allow for `namespaceSelector
|
||||
kubernetes.io/metadata.name=traefik` on TCP 8081. Without it Traefik 502s.
|
||||
|
||||
### The aggregator + digest (the durable trail) — `stacks/goldmane-edge-aggregator`
|
||||
A Tier-1 stack (PG state) mirroring the claude-memory pattern. `scripts/tg
|
||||
apply` from `stacks/goldmane-edge-aggregator/`. It provisions: the namespace,
|
||||
the mTLS client material, the Postgres DB-init Job, the `DATABASE_URL`
|
||||
ExternalSecret (Vault static role `pg-goldmane-edges`), the Slack ExternalSecret,
|
||||
the `aggregate` Deployment, and the `digest` CronJob. **To disable the trail
|
||||
without touching the flow plane:** scale `deployment/goldmane-edge-aggregator` to
|
||||
0 (transient) or remove the stack (permanent) — Goldmane/Whisker keep running.
|
||||
|
||||
Image: `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (PRIVATE) — the
|
||||
`goldmane-edge-aggregator` namespace must be in the `ghcr-credentials` Kyverno
|
||||
allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`,
|
||||
`local.ghcr_private_namespaces`) or image pulls fail with HTTP 401. Code repo:
|
||||
`~/code/goldmane-edge-aggregator` (see its `README.md` + `DEPLOY.md`).
|
||||
|
||||
## mTLS cert — the REUSE decision (cert-reuse gotcha)
|
||||
|
||||
The aggregator dials `goldmane:7443` over **mutual TLS**. Goldmane requires the
|
||||
client cert to chain to the **Tigera CA**, but it does **NOT authorize by client
|
||||
identity** — any Tigera-CA-signed cert is accepted.
|
||||
|
||||
Rather than copy the Tigera CA **private key** into Terraform state to mint our
|
||||
own cert (a needless CA-key exposure; the `hashicorp/tls` provider also clashes
|
||||
with this repo's global generate-providers/lockfile pattern), the stack
|
||||
**REUSES the operator-minted, Tigera-CA-signed `whisker-backend-key-pair`
|
||||
Secret** (`calico-system`), copying its `tls.crt`/`tls.key` into the
|
||||
`goldmane-client-tls` Secret in the aggregator namespace. The CA *bundle* that
|
||||
verifies Goldmane's serving cert (`tigera-ca-bundle` ConfigMap, key
|
||||
`tigera-ca-bundle.crt`) is likewise copied verbatim (a ConfigMap can't be
|
||||
cross-namespace-mounted).
|
||||
|
||||
> **GOTCHA — if the operator rotates `whisker-backend-key-pair`, re-apply
|
||||
> `stacks/goldmane-edge-aggregator`** to re-sync the copied cert. Symptom of a
|
||||
> stale copy: the `aggregate` pod logs TLS handshake / `Flows.Stream` failures
|
||||
> and no `last_seen` updates land in the `edge` table. Hardening follow-up
|
||||
> (noted in the stack): mint an own-identity cert in-namespace if Whisker is ever
|
||||
> removed (which would delete the reused source Secret).
|
||||
|
||||
The Deployment leaves `GOLDMANE_HOST=goldmane.calico-system.svc.cluster.local:7443`
|
||||
and the default cert/CA paths; the default ServerName (host sans port) is a SAN
|
||||
on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` /
|
||||
`GOLDMANE_TLS_INSECURE` override is needed.
|
||||
|
||||
## How to query who-talks-to-whom
|
||||
|
||||
**Quickest — the `homelab edges` CLI** (the investigation helper; read-only
|
||||
SELECT against the DB via the dbaas primary pod, no creds/SQL to remember):
|
||||
|
||||
```
|
||||
homelab edges --ns <ns> # edges touching <ns> (either direction)
|
||||
homelab edges --peers-of <ns> # <ns>'s distinct peer namespaces
|
||||
homelab edges --src <ns> # <ns>'s egress peers (--dst <ns> for ingress)
|
||||
homelab edges --new-since 24h # edges first seen in the last day (or a date)
|
||||
homelab edges --denied # blocked / lateral-movement attempts
|
||||
homelab edges --json [...] # machine-readable, for agents/pipelines
|
||||
homelab edges --help # full flag list
|
||||
```
|
||||
|
||||
For ad-hoc SQL, `psql` into the DB (creds: Vault static role
|
||||
`static-creds/pg-goldmane-edges`, or exec a CNPG pod). All queries are against
|
||||
the single `edge` table.
|
||||
|
||||
```sql
|
||||
-- Everything talking to a namespace (inbound), most-active first
|
||||
SELECT src_ns, action, flow_count, first_seen, last_seen
|
||||
FROM edge WHERE dst_ns = '<ns>' ORDER BY flow_count DESC;
|
||||
|
||||
-- Everything a namespace talks TO (outbound)
|
||||
SELECT dst_ns, action, flow_count, first_seen, last_seen
|
||||
FROM edge WHERE src_ns = '<ns>' ORDER BY last_seen DESC;
|
||||
|
||||
-- New edges in the last 24h (what the digest reports)
|
||||
SELECT src_ns, dst_ns, action, flow_count, first_seen
|
||||
FROM edge WHERE first_seen > now() - interval '24 hours'
|
||||
ORDER BY first_seen DESC;
|
||||
|
||||
-- Any DENIED edges (policy is dropping this pair)
|
||||
SELECT src_ns, dst_ns, flow_count, last_seen
|
||||
FROM edge WHERE action = 'deny' ORDER BY last_seen DESC;
|
||||
|
||||
-- Full edge set as a graph adjacency list
|
||||
SELECT src_ns, dst_ns, action, flow_count FROM edge ORDER BY src_ns, dst_ns;
|
||||
```
|
||||
|
||||
For the **live** (sub-hour) view including pod/port detail, use the Whisker UI —
|
||||
the `edge` table intentionally aggregates that away.
|
||||
|
||||
## Deriving the Wave-1 egress allowlist from the edge table (infra #62)
|
||||
|
||||
The durable edge set is a faster, identity-stamped data source for the existing
|
||||
**observe-then-enforce** egress effort (beads `code-8ywc`; snapshot
|
||||
`docs/architecture/wave1-egress-observation-2026-05-22.md`) than the original
|
||||
iptables-`LOG` → journald → Loki path (ADR-0014 consequence: "Enforcement gains
|
||||
a better data source"). It replaces the *internal* (namespace-to-namespace) leg
|
||||
of the allowlist; **external/public-internet egress is NOT in this table** (empty
|
||||
dst namespace, dropped) — for those destinations keep using the Calico flow-log
|
||||
path described in security.md.
|
||||
|
||||
**Per-namespace internal egress allowlist** — the set of in-cluster namespaces a
|
||||
given source is *observed* talking to with `action='allow'`:
|
||||
|
||||
```sql
|
||||
-- Internal egress allowlist for one namespace (feeds its NetworkPolicy)
|
||||
SELECT DISTINCT dst_ns
|
||||
FROM edge
|
||||
WHERE src_ns = '<ns>' AND action = 'allow'
|
||||
ORDER BY dst_ns;
|
||||
```
|
||||
|
||||
```sql
|
||||
-- Full internal egress matrix for all namespaces at once
|
||||
SELECT src_ns, array_agg(DISTINCT dst_ns ORDER BY dst_ns) AS allowed_dst_ns
|
||||
FROM edge
|
||||
WHERE action = 'allow'
|
||||
GROUP BY src_ns
|
||||
ORDER BY src_ns;
|
||||
```
|
||||
|
||||
```sql
|
||||
-- Sanity: namespaces with a DENY edge already (policy is biting; investigate
|
||||
-- before tightening further)
|
||||
SELECT DISTINCT src_ns, dst_ns FROM edge WHERE action = 'deny';
|
||||
```
|
||||
|
||||
**How this feeds enforcement (scope):** the derived `dst_ns` set is the
|
||||
*internal* half of a namespace's egress allowlist — it tells you which
|
||||
in-cluster namespaces to permit before flipping that namespace to default-deny.
|
||||
The universal baseline (kube-dns :53, often dbaas :3306/:5432, redis :6379) and
|
||||
the external destinations still come from the Wave-1 observation snapshot.
|
||||
**Enforce-flips remain OUT OF SCOPE** here — this is observe-and-derive only;
|
||||
the phased per-namespace default-deny rollout (starting `recruiter-responder`)
|
||||
is tracked under `code-8ywc`. Cross-links:
|
||||
[security.md → NetworkPolicy Default-Deny Egress](../architecture/security.md#networkpolicy-default-deny-egress-wave-1--observe-then-enforce-tier-34),
|
||||
[wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md),
|
||||
[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
|
||||
|
||||
> **Caveat (same as the Wave-1 snapshot):** an edge only exists if it was
|
||||
> *observed*. A weekly CronJob or a 7-day Vault rotation may not have fired yet —
|
||||
> collect ≥7 days of edges before treating a namespace's `allow` set as
|
||||
> complete. The `first_seen` column tells you how long an edge has been known;
|
||||
> the digest surfaces brand-new ones daily.
|
||||
|
||||
## Monitoring & health (infra #61)
|
||||
|
||||
The aggregator pod has **no `/metrics` endpoint** — health is inferred from
|
||||
kube-state-metrics. Three complementary signals (memory ids 6598, 6599;
|
||||
see also [monitoring.md → Security Alerts](../architecture/monitoring.md#security-alerts-wave-1--planned-beads-code-8ywc)):
|
||||
|
||||
| Signal | What | Where |
|
||||
|---|---|---|
|
||||
| **`AggregatorDown`** | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` for 15m → warning | Prometheus alert group `Network Observability (Goldmane)` in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`; routes `slack-warning` → `#alerts` |
|
||||
| **`DigestFailing`** | `kube_job_status_failed{...job_name=~"goldmane-edges-digest.*"} > 0` within 24h, for 30m → warning | same alert group → `#alerts` |
|
||||
| **cluster-health #48** | `check_goldmane_aggregator` reads the Deployment's `Available` condition (missing or not-Available → FAIL) | `scripts/cluster_healthcheck.sh` (human / `--quiet` / `--json` modes; emits `goldmane_aggregator`) |
|
||||
|
||||
The two alert layers are deliberately complementary: `AggregatorDown` →
|
||||
**no new edges land** in the DB; `DigestFailing` → **edges still land but nobody
|
||||
is told**. A freshness probe (#61b) was intentionally skipped — `AggregatorDown`
|
||||
is the agreed floor.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**Whisker UI 502 / unreachable.** The additive
|
||||
`kubernetes_network_policy_v1.whisker_allow_traefik` is missing or the
|
||||
operator's default-deny `whisker` NP regenerated — re-apply `stacks/calico`. A
|
||||
brand-new ingress host is also invisible to LAN split-horizon until the hourly
|
||||
`technitium-ingress-dns-sync` runs (memory #5349); test meanwhile with
|
||||
`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
|
||||
(expect a 302 to Authentik — the gate working).
|
||||
|
||||
**Whisker UI empty (but reachable — 302s to Authentik fine).** ROOT CAUSE (the
|
||||
2026-06-28 incident): the operator's own `whisker` NetworkPolicy is
|
||||
policyTypes:[Ingress,**Egress**], and its egress allows DNS only to the kube-dns
|
||||
*pods* (podSelector `k8s-app=kube-dns`). But whisker-backend resolves
|
||||
`goldmane.calico-system.svc` via the kube-dns **ClusterIP** (10.96.0.10), and
|
||||
**Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule**.
|
||||
Verified: from the whisker pod's netns, ClusterIP DNS = 100% timeout while direct
|
||||
kube-dns *pod-IP* DNS = OK, and a pod with no egress policy resolves fine.
|
||||
whisker-backend resolves goldmane ONCE in the brief startup window before the
|
||||
policy programs, holds its long-lived gRPC stream, and only re-resolves when that
|
||||
stream breaks (e.g. a node-reboot blip) — at which point the blocked ClusterIP
|
||||
DNS wedges its Go resolver (`failed to stream flows` / `code = Unavailable: dns
|
||||
... i/o timeout` forever) and the UI goes blank. The durable **aggregator is a
|
||||
SEPARATE pod in its own (unrestricted) namespace** and is unaffected.
|
||||
|
||||
FIX (applied 2026-06-28): `kubernetes_network_policy_v1.whisker_allow_dns_clusterip`
|
||||
(`stacks/calico`) — an additive egress NP allowing whisker → the kube-dns
|
||||
ClusterIP (`10.96.0.10/32`) on 53/UDP+TCP; k8s egress policies are additive so
|
||||
the operator NP is untouched. Backstop: the `whisker-watchdog` CronJob restarts
|
||||
the pod if it ever wedges for another reason. Immediate manual heal:
|
||||
`kubectl -n calico-system delete pod -l k8s-app=whisker`. Diagnose by comparing,
|
||||
from the whisker pod's netns, `nslookup goldmane.calico-system.svc.cluster.local
|
||||
10.96.0.10` (the ClusterIP — times out if the NP fix is missing) against the same
|
||||
query aimed at a kube-dns *pod IP* (always works).
|
||||
|
||||
**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
|
||||
pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).
|
||||
Common causes, in order:
|
||||
1. **Stale mTLS cert** — the operator rotated `whisker-backend-key-pair`; re-apply
|
||||
`stacks/goldmane-edge-aggregator` (see cert-reuse gotcha above). Symptom: TLS
|
||||
handshake / `Flows.Stream` errors.
|
||||
2. **Stale DB password** — the 7-day Vault rotation bounced the credential but
|
||||
the pod kept the old one. The Deployment carries
|
||||
`secret.reloader.stakater.com/reload: goldmane-edges-db-creds`; if it's not
|
||||
restarting on rotation, verify the Reloader annotation and the ExternalSecret.
|
||||
3. **Goldmane restarted** — the in-memory window was lost (expected); the stream
|
||||
reconnects automatically and resumes upserting. No data loss in the DB
|
||||
(only the sub-hour live window in Whisker is gone).
|
||||
|
||||
**Digest never posts / `DigestFailing` firing.** Inspect the most recent
|
||||
`goldmane-edges-digest-*` Job (`kubectl get jobs -n goldmane-edge-aggregator`;
|
||||
`kubectl logs -n goldmane-edge-aggregator job/<name>`). The CronJob's `ttl_seconds_after_finished=86400` GCs
|
||||
pods after a day, so check soon after a failed run. With `SLACK_WEBHOOK_URL`
|
||||
empty the binary forces a dry-run (no post) — verify the `goldmane-edges-slack`
|
||||
ExternalSecret resolved. A dry run / smoke test: run the image with `args:
|
||||
["digest"]` + `DRY_RUN=1` to print the message instead of POSTing.
|
||||
> Resolved (2026-06-28): the digest posts cleanly to `#alerts`
|
||||
> (`lastSuccessfulTime` current, `DigestFailing` clear; e.g. the 2026-06-28 08:00
|
||||
> London run reported "8 new edges in last 24h"). The 2026-06-25 failures were
|
||||
> the `#security` channel override returning HTTP 404 — the shared
|
||||
> `alertmanager_slack_api_url` webhook's Slack app isn't a member of `#security`;
|
||||
> consolidating all Slack output to `#alerts` fixed it.
|
||||
|
||||
**No edges at all in the table.** Confirm Goldmane is enabled
|
||||
(`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the
|
||||
`FELIX_FLOWLOGSGOLDMANESERVER` env; confirm the `goldmane-edges-db-init` Job
|
||||
completed; confirm the aggregator pod is `Running` and not `ImagePullBackOff`
|
||||
(ghcr allowlist).
|
||||
|
||||
## Related
|
||||
- [ADR-0014 — Service identity & east-west observability](../adr/0014-service-identity-and-east-west-observability.md)
|
||||
- [security.md — NetworkPolicy Default-Deny Egress + east-west flow observability](../architecture/security.md)
|
||||
- [monitoring.md — east-west flow observability + alerts](../architecture/monitoring.md)
|
||||
- [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md)
|
||||
- `CONTEXT.md` glossary — **Service identity**, **Goldmane / Whisker**
|
||||
- Code: `~/code/goldmane-edge-aggregator` (`README.md`, `DEPLOY.md`); stacks
|
||||
`stacks/goldmane-edge-aggregator`, `stacks/calico`
|
||||
|
|
@ -1,164 +0,0 @@
|
|||
# `homelab vault` onboarding (Vaultwarden access + `vault kv` infra secrets)
|
||||
|
||||
## Scope
|
||||
|
||||
`homelab vault` fronts **two unrelated secret stores** — the name collides, so
|
||||
the command keeps them clearly separated:
|
||||
|
||||
- **Vaultwarden** — your personal *password manager* (logins/passwords/TOTP).
|
||||
The verbs below give each devvm roster user no-HITL access to **their own**
|
||||
Vaultwarden vault (and any Organization Collection shared with their account).
|
||||
It shells out to the official `bw` CLI; the user's Vaultwarden credentials live
|
||||
only in their isolated Vault path `secret/workstation/claude-users/<os-user>`
|
||||
and are decrypted as that OS user — the admin never sees them.
|
||||
- **HashiCorp Vault / OpenBao** — the homelab *infra* secrets store (the
|
||||
`secret/…` KV mount at `vault.viktorbarzin.me`), under `homelab vault kv`.
|
||||
These use the caller's **own** Vault token (`vault login -method=oidc` →
|
||||
`~/.vault-token`), **not** the scoped Vaultwarden token (which is limited to the
|
||||
`claude-users/<user>` path); access is whatever your Vault policy grants.
|
||||
|
||||
```text
|
||||
# Vaultwarden (password manager)
|
||||
homelab vault setup one-time: store VW email + master password + API key
|
||||
homelab vault status configured / unlocked / reachable (no secrets)
|
||||
homelab vault list [--search Q] item names (no secrets)
|
||||
homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
|
||||
homelab vault get <name> --all all fields (incl. custom) as JSON; pipe it (| jq)
|
||||
homelab vault code <name> current TOTP code
|
||||
homelab vault lock lock / log out the local bw session
|
||||
|
||||
# HashiCorp Vault / OpenBao (infra secrets; uses your own OIDC token)
|
||||
homelab vault kv get <path> [--field K] read an infra KV secret
|
||||
homelab vault kv list <path> list sub-paths
|
||||
homelab vault kv put <path> <key> write one key (value via stdin; merges)
|
||||
```
|
||||
|
||||
## How auth works (why a non-admin can use it)
|
||||
|
||||
`homelab vault` runs `vault` as the calling user. It resolves a Vault token in
|
||||
this order (`ensureVaultToken`, `cli/cmd_vault.go`):
|
||||
|
||||
1. an explicit `$VAULT_TOKEN` (a deliberate override), then
|
||||
2. the per-user **scoped token** that `claude-auth-sync` maintains at
|
||||
`~/.config/claude-auth-sync/vault-token` (policy `workstation-claude-<user>`), then
|
||||
3. a native `~/.vault-token` (admins who carry one; non-admins usually don't).
|
||||
|
||||
**The scoped token deliberately beats `~/.vault-token`.** This tool only touches
|
||||
your own `secret/workstation/claude-users/<user>` path, and a power-user who ran
|
||||
`vault login -method=oidc` carries a read-only `~/.vault-token` (capability
|
||||
`deny` on that path); letting it win would shadow the scoped token and fail every
|
||||
op with `403 permission denied` (this is exactly what bit emo, 2026-06-28). The
|
||||
CLI also **self-defaults `VAULT_ADDR`** to `https://vault.viktorbarzin.me` when
|
||||
unset, so it works from non-login shells (tmux panes, AFK agent subprocesses)
|
||||
that never sourced `/etc/environment` — otherwise every `vault` child hits the
|
||||
`127.0.0.1:8200` default and fails `connection refused` (exit 2).
|
||||
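The lookup order and the `VAULT_ADDR` default described above can be sketched in shell. This is an illustration of the documented precedence, not the Go code in `cli/cmd_vault.go`; the function names `resolve_vault_token` and `ensure_vault_addr` are hypothetical, and treating an empty token file as absent is an assumption.

```bash
#!/usr/bin/env bash
# Sketch of the documented token precedence (not the real ensureVaultToken).
resolve_vault_token() {
  local scoped="${HOME}/.config/claude-auth-sync/vault-token"
  local native="${HOME}/.vault-token"
  if [ -n "${VAULT_TOKEN:-}" ]; then
    printf '%s\n' "$VAULT_TOKEN"      # 1. explicit override
  elif [ -s "$scoped" ]; then
    cat "$scoped"                     # 2. per-user scoped token
  elif [ -s "$native" ]; then
    cat "$native"                     # 3. native ~/.vault-token
  else
    return 1                          # no token found
  fi
}

# Default VAULT_ADDR when the shell never sourced /etc/environment.
ensure_vault_addr() {
  export VAULT_ADDR="${VAULT_ADDR:-https://vault.viktorbarzin.me}"
}
```

With both token files present and `VAULT_TOKEN` unset, the scoped token is returned, which is the behaviour that avoids the `403 permission denied` case described above.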
|
||||
That scoped policy grants exactly `create`/`read`/`update` on the user's own
|
||||
`secret/workstation/claude-users/<user>` path — no `patch` capability — so the
|
||||
tool writes with `vault kv patch -method=rw` (read-modify-write), falling back to
|
||||
`kv put` only when the path does not exist yet. This preserves the
|
||||
`claude_ai_oauth_json` key that [claude-auth-sync](claude-auth-renew-workstation.md)
|
||||
co-locates there. (The admin-only bugs were fixed 2026-06-27; the
|
||||
`VAULT_ADDR`/token-precedence bugs above were fixed 2026-06-28.)
|
||||
|
||||
## Prerequisites (per user)
|
||||
|
||||
- The user is in `scripts/workstation/roster.yaml` and the **vault** stack has
|
||||
been applied → their `workstation-claude-<user>` policy exists.
|
||||
- The user's workstation was provisioned (`setup-devvm.sh`) → their scoped Vault
|
||||
token exists at `~/.config/claude-auth-sync/vault-token`.
|
||||
- `bw` is installed **system-wide** at `/usr/bin/bw` (see below).
|
||||
- The user has a Vaultwarden account at `https://vaultwarden.viktorbarzin.me`
|
||||
(self-service signup is open; admin panel is disabled).
|
||||
|
||||
## One-time admin steps (devvm)
|
||||
|
||||
`bw` must be system-wide so every user resolves it (it is a Node script, and
|
||||
`node` is already system-wide at `/usr/bin/node`). `setup-devvm.sh` installs it
|
||||
to the npm `/usr` prefix; the guard checks the **system** path, not
|
||||
`command -v bw` (an admin's own `~/.local/bin/bw` used to mask the system
|
||||
install, leaving non-admins with no backend). To install on a running box:
|
||||
|
||||
```bash
|
||||
sudo npm install -g --prefix /usr "@bitwarden/cli@^2024"
|
||||
bw --version # confirm /usr/bin/bw resolves
|
||||
```
|
||||
|
||||
After landing a `cli/` change, rebuild the binary so users pick it up:
|
||||
|
||||
```bash
|
||||
# version is stamped from cli/VERSION, exactly as setup-devvm.sh does it
|
||||
sudo bash -c 'cd /home/wizard/code/infra/cli && \
|
||||
go build -ldflags "-X main.version=$(cat VERSION 2>/dev/null || echo dev)" \
|
||||
-o /usr/local/bin/homelab .'
|
||||
```
|
||||
|
||||
(or just re-run `scripts/workstation/setup-devvm.sh` as root, which rebuilds it.)
|
||||
|
||||
## User onboarding
|
||||
|
||||
The user runs these as themselves. The master password / API key are entered
|
||||
interactively (never on the command line) and stored only in the user's Vault
|
||||
path.
|
||||
|
||||
1. In the Vaultwarden web vault → **Settings → Security → Keys → View API key**,
|
||||
copy the `client_id` (`user.xxxx`) and `client_secret`.
|
||||
2. Configure:
|
||||
|
||||
```bash
|
||||
homelab vault setup # prompts: VW email, API client_id/secret, master password
|
||||
homelab vault status # → "vault: configured, unlocked, reachable ✓"
|
||||
homelab vault list # item names (own vault + any shared Collections)
|
||||
```
|
||||
|
||||
## Shared-Collection access (sharing passwords with a user)
|
||||
|
||||
`homelab vault` surfaces Organization Collection items automatically once the
|
||||
user's Vaultwarden account is a confirmed member. These steps are done by the
|
||||
vault owner in the **Vaultwarden web UI** (they need the owner's master
|
||||
password — not an infra/Terraform operation):
|
||||
|
||||
1. Create or reuse an **Organization** and a **Collection** of shared logins.
|
||||
2. **Invite** the user's Vaultwarden account to the Organization, granting
|
||||
**"Can view"** on that Collection (least privilege).
|
||||
3. The user accepts the email invite and confirms membership.
|
||||
4. The user runs `homelab vault list` — the shared items now appear alongside
|
||||
their own (a `homelab vault status` sync picks them up).
|
||||
|
||||
## Security model (the no-HITL trade)
|
||||
|
||||
Identity is the kernel UID. Anything running as the user can decrypt the user's
|
||||
vault — this is the accepted trade for no-human-in-the-loop fetches. Secrets
|
||||
never appear in `argv` (passed via env or stdin), core dumps are disabled, TOTP
|
||||
fetches are logged to syslog/Loki, and on a TTY values go to the clipboard
|
||||
(auto-clearing) rather than scrollback. The admin's Vault token is never used by
|
||||
a non-admin: each user authenticates with their own scoped token.
|
||||
|
||||
## Verification
|
||||
|
||||
```bash
|
||||
# the scoped token carries the right policy
|
||||
VAULT_TOKEN="$(sudo cat /home/<user>/.config/claude-auth-sync/vault-token)" \
|
||||
vault token lookup -format=json | jq '.data.display_name, .data.policies'
|
||||
# → "token-devvm-claude-auth-<user>", [..., "workstation-claude-<user>"]
|
||||
|
||||
sudo -u <user> -i bw --version # /usr/bin/bw resolves for the user
|
||||
sudo -u <user> -i homelab vault status
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**`homelab vault setup` (or any verb) fails with `exit status 2`** — older
|
||||
binaries swallowed the underlying `vault` error; the message now includes it.
|
||||
Two historical causes (both fixed in-CLI 2026-06-28, kept here for diagnosis):
|
||||
|
||||
- `... connection refused` to `127.0.0.1:8200` → `VAULT_ADDR` wasn't set in the
|
||||
caller's shell. The CLI now self-defaults it, but if you see this on an old
|
||||
binary: `export VAULT_ADDR=https://vault.viktorbarzin.me`.
|
||||
- `403 permission denied` on `PUT .../secret/data/workstation/claude-users/<user>`
|
||||
→ a stale read-only `~/.vault-token` (e.g. from `vault login -method=oidc`,
|
||||
policy `default`, capability `deny` on that path) was shadowing the scoped
|
||||
token. The CLI now prefers the scoped token; on an old binary, `rm
|
||||
~/.vault-token` (or `unset VAULT_TOKEN`) and retry. Confirm with
|
||||
`VAULT_TOKEN="$(sudo cat /home/<user>/.config/claude-auth-sync/vault-token)" vault token capabilities secret/data/workstation/claude-users/<user>`
|
||||
→ must be `create, read, update`.
|
||||
|
|
@ -36,13 +36,11 @@ envsubst on /template/job-template.yaml | kubectl apply -f -
|
|||
▼
|
||||
|
||||
Job 0 — preflight (pinned: k8s-node1)
|
||||
├── compat-gate: addon/API/containerd support for target (else BLOCK-actionable+alert / HOLD-quiet)
|
||||
├── compat-gate: addon/API/containerd support for target (else BLOCK+alert)
|
||||
├── All nodes Ready + no Mem/Disk pressure
|
||||
├── halt-on-alert (kured-style ignore-list)
|
||||
├── 24h-quiet baseline (no Ready transitions <24h ago)
|
||||
├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume)
|
||||
├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block)
|
||||
├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups)
|
||||
├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
|
||||
├── Trigger backup-etcd Job, wait, verify snapshot byte count
|
||||
├── SSH master: containerd skew fix (if master < workers)
|
||||
|
|
@ -114,36 +112,18 @@ inert for a patch (no API removal or containerd floor occurs inside a minor).
|
|||
|
||||
This is the **"auto-upgrade when we can, halt + alert when we can't"** contract.
|
||||
|
||||
**The gate classifies each refusal** (2026-06-28) so it only cries wolf when
|
||||
there's something to do — `compat-gate.py` exit code + a `[TAG]` on every reason:
|
||||
**On a block**, the gate:
|
||||
- pushes `k8s_upgrade_blocked=1` to Pushgateway (→ the `K8sUpgradeBlocked`
|
||||
Prometheus alert),
|
||||
- Slacks the **specific reasons** (which addon/API/node, current vs required), and
|
||||
- **halts the chain** — it exits **non-fatal** (the upgrade simply isn't safe yet,
|
||||
this is not a failure). Because the block happens **before any mutation, no
|
||||
rollback is involved**; nothing was changed.
|
||||
|
||||
- **`[ACTIONABLE]`** (exit 2) — a newer version of the lagging addon **exists in
|
||||
the compat matrix** and upgrading it would clear the block (or an in-use
|
||||
deprecated API must be migrated / a node's containerd bumped).
|
||||
- **`[WAITING]`** (exit 4 = held) — **no released addon version supports the
|
||||
target yet** (e.g. kyverno/ESO behind a brand-new k8s minor). Only an upstream
|
||||
release can clear it.
|
||||
- **`[PINNED]`** (exit 4 = held) — a supporting version exists but the addon is
|
||||
**deliberately pinned** in the matrix (`"pinned": true`, e.g. gpu-operator,
|
||||
whose bump is coupled to a newer NVIDIA driver image + Ubuntu/kernel).
|
||||
- **Held wins on a mix**: if any blocker is waiting/pinned the whole target is
|
||||
held — acting on the actionable ones wouldn't unblock it yet.
|
||||
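The "held wins on a mix" rule can be sketched as a small shell function. This is only an illustration of the classification above, not `compat-gate.py` itself: the function name `classify_refusal` is hypothetical, it takes the bare tag names as arguments, and returning 0 when there are no blockers is an assumption (the text only specifies exit 2 and exit 4).

```bash
#!/usr/bin/env bash
# Sketch: classify_refusal TAG...  (tags: ACTIONABLE, WAITING, PINNED)
# Prints the verdict and returns the matching exit code.
classify_refusal() {
  local verdict="safe" tag
  for tag in "$@"; do
    case "$tag" in
      WAITING|PINNED) verdict="held"; break ;;   # any held blocker holds the whole target
      ACTIONABLE)     verdict="actionable" ;;
    esac
  done
  printf '%s\n' "$verdict"
  case "$verdict" in
    held)       return 4 ;;
    actionable) return 2 ;;
    *)          return 0 ;;   # assumed: no blockers means exit 0
  esac
}
```

For example, `classify_refusal ACTIONABLE PINNED` reports `held` with status 4, because clearing the actionable blocker alone would not unblock the target.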
|
||||
**On any refusal** the preflight pushes the verdict gauge (`k8s_upgrade_blocked=1`
|
||||
for actionable, `k8s_upgrade_held=1` for held), sets `HALT_CHAIN` so the chain
|
||||
doesn't advance, and **exits 0 — the Job Completes cleanly** (a refusal is a
|
||||
decision, not a failure: no Failed Job, no `K8sUpgradeChainJobFailed`). It's
|
||||
before any mutation, so no rollback. Reasons (grouped by class) appear in the
|
||||
**morning nightly report**, not a per-run Slack.
|
||||
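How the gate's exit code maps to the two verdict gauges can be sketched as follows. This is a hedged illustration, not the preflight script: the function name `verdict_gauges` is hypothetical, the output uses the Prometheus text format, the actual Pushgateway push is omitted, and treating any exit code other than 2 or 4 as "no refusal" is an assumption.

```bash
#!/usr/bin/env bash
# Sketch: verdict_gauges GATE_EXIT_CODE
# Prints both gauges so each run is definitive (0 when safe).
verdict_gauges() {
  local blocked=0 held=0
  case "$1" in
    2) blocked=1 ;;   # actionable refusal
    4) held=1 ;;      # held: waiting on upstream or pinned
  esac
  printf 'k8s_upgrade_blocked %s\nk8s_upgrade_held %s\n' "$blocked" "$held"
}
```

Emitting both gauges on every run means a cleared block resets its gauge to 0 the next night. The preflight would still exit 0 after a refusal, as described above, so the Job completes cleanly either way.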
|
||||
- **Actionable** → `K8sUpgradeBlocked` fires (once, via alert-on-change). Clear
|
||||
it by doing the named upgrade/migration; the next nightly run proceeds.
|
||||
- **Held** → **deliberately NO alert** — only the nightly report's `⏸️ HELD`
|
||||
line, because it can't be actioned now (a nightly alert would cry wolf). It
|
||||
clears itself once upstream ships support (refresh `addon-compat.json`) or the
|
||||
pin is lifted (delete `pinned`+`pin_reason`). The detector re-evaluates every
|
||||
night, silently re-spawning the refused-but-Complete preflight (so a cleared
|
||||
block is picked up next run, not after the 7d Job TTL).
|
||||
**To clear a block**: upgrade the named addon (or migrate the API caller off the
|
||||
deprecated group/version, or bump containerd on the named node) so the offending
|
||||
condition no longer holds. The **next nightly run then proceeds automatically** —
|
||||
no manual chain restart needed.
|
||||
|
||||
The **compat matrix** lives in
|
||||
`stacks/k8s-version-upgrade/scripts/addon-compat.json` — a map of `addon → highest
|
||||
|
|
@ -183,8 +163,6 @@ Pushed by upgrade-step.sh during phase execution; observed by the
|
|||
| `k8s_upgrade_in_flight` (1/0) | preflight Job (set to 1) | postflight Job (set to 0) |
|
||||
| `k8s_upgrade_started_timestamp` (epoch s) | preflight Job | postflight Job (set to 0) |
|
||||
| `k8s_upgrade_snapshot_taken` (1/0) | preflight Job (set to 1 after Job=`pre-upgrade-etcd-*` completes with `Backup done:` log of ≥1 KiB) | postflight Job (0) |
|
||||
| `k8s_upgrade_blocked` (1/0) | preflight Job — set 1 on an **actionable** compat refusal (→ `K8sUpgradeBlocked`) | preflight (definitive each run; 0 when safe) / postflight (0) |
|
||||
| `k8s_upgrade_held` (1/0) | preflight Job — set 1 on a **held** (waiting-upstream/pinned) refusal; **no alert** | preflight (definitive each run; 0 when safe) / postflight (0) |
|
||||
| `k8s_upgrade_available{kind,running,target}` | detection CronJob | next detection run (overwrite) |
|
||||
| `k8s_version_check_last_run_timestamp` | detection CronJob | (cumulative) |
|
||||
|
||||
|
|
@ -193,8 +171,8 @@ Pushed by upgrade-step.sh during phase execution; observed by the
|
|||
- **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout.
|
||||
- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
|
||||
- **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
|
||||
- **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The old `unless on() (k8s_upgrade_blocked == 1)` clause was **dropped 2026-06-28**: compat-gate refusals now Complete cleanly (exit 0) instead of Failing, so a terminally-Failed chain Job again means a genuine wedge with nothing to exclude.
|
||||
- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). An **ACTIONABLE** compat-gate refusal — a newer version of the lagging addon exists and upgrading it would clear the block (or an in-use deprecated API must be migrated / a node's containerd bumped). Reasons (grouped by class) are in the **morning nightly report**; clear it by doing the named upgrade/migration, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert. **There is deliberately NO companion alert for the held verdict** (`k8s_upgrade_held=1` — waiting-on-upstream / pinned): nothing can be actioned now, so it is surfaced only by the nightly report's `⏸️ HELD` line.
|
||||
- **`K8sUpgradeChainJobFailed`** — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The `unless k8s_upgrade_blocked == 1` clause (added 2026-06-21) excludes a preflight that failed because the **compat gate deliberately refused** the target — that's owned by `K8sUpgradeBlocked` and was double-firing here; a genuine wedge exits without setting the blocked gauge, so it still fires.
|
||||
- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). A k8s **auto-upgrade was refused** by the compat gate because a critical addon, an in-use deprecated API, or a node's containerd is too old for the detected target. The **specific reasons are in Slack**; clear it by upgrading the named addon / migrating the API caller / bumping containerd, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert.
|
||||
- The first four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
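To make the two `in_flight`-based expressions concrete, here is a rough Python rendering of their instantaneous conditions. It is an illustration under stated assumptions, not the Prometheus rule itself: the function names are hypothetical, and the `for 10m` / `for 5m` hold durations are deliberately not modelled.

```python
import time

STALL_SECONDS = 5400  # 90 minutes, the threshold in the K8sUpgradeStalled expression


def snapshot_missing(in_flight, snapshot_taken):
    """Condition behind EtcdPreUpgradeSnapshotMissing (without the 10m hold)."""
    return in_flight == 1 and snapshot_taken == 0


def upgrade_stalled(in_flight, started_timestamp, now=None):
    """Condition behind K8sUpgradeStalled (without the 5m hold)."""
    now = time.time() if now is None else now
    return in_flight == 1 and (now - started_timestamp) > STALL_SECONDS
```

Both conditions require `k8s_upgrade_in_flight == 1`, which is why a preflight that fails before setting that gauge is invisible to them and needs `K8sUpgradeChainJobFailed`.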
|
||||
|
||||
### Nightly upgrade report (Slack)
|
||||
|
|
@ -203,8 +181,8 @@ CronJob `k8s-upgrade-nightly-report` (k8s-upgrade ns, `var.report_schedule`,
|
|||
default `7 6 * * *` = 06:07 UTC — after the 23:00 chain, before the 08:00 London
|
||||
alert-digest) posts ONE Slack summary each morning of the previous night's run:
|
||||
running version, detector freshness, detected target + kind, the outcome
|
||||
(⚪ no upgrade needed / 🔴 blocked-actionable + reasons / ⏸️ held = waiting-upstream/pinned /
|
||||
🟢 upgraded / 🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads
|
||||
(⚪ no upgrade needed / 🔴 blocked + live blocker reasons / 🟢 upgraded /
|
||||
🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads
|
||||
the Pushgateway gauges + live nodes/jobs and re-runs `compat-gate.py` for fresh
|
||||
blocker reasons; reuses the chain's SA + `slack_webhook` + scripts ConfigMap.
|
||||
Logic + unit tests: `scripts/nightly-report.py`, `scripts/test_nightly_report.py`.
|
||||
|
|
@ -244,34 +222,22 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names
|
|||
|
||||
## Common Operations
|
||||
|
||||
### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24)
|
||||
### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19)
|
||||
|
||||
`kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
|
||||
from kubeadm-config**. apiserver auth uses a structured multi-issuer
|
||||
`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to
|
||||
still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade
|
||||
reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does
|
||||
NOT crash on this — verified by isolated repro; it's recoverable via the restore
|
||||
script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue —
|
||||
etcd IO starvation**, not this drift; post-mortem:
|
||||
`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`.
|
||||
and drops the `--authentication-config` flag**, silently disabling apiserver
|
||||
OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get
|
||||
401). This used to require a manual re-apply after **every** control-plane bump.
|
||||
|
||||
**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now
|
||||
**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting
|
||||
`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of
|
||||
its remote script. So kubeadm regenerates a **correct** manifest and the apiserver
|
||||
upgrades with a pure image bump — `kubeadm upgrade diff <target>` shows only the
|
||||
image change. Zero live impact (the CM is read only during an upgrade).
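The `extraArgs` rewrite can be pictured as a small dictionary transformation. This is a sketch under assumptions: it treats `apiServer.extraArgs` as a flat name-to-value map with no leading dashes, the helper name is hypothetical, and the path in the usage note is a placeholder rather than the path the `rbac` stack actually uses.

```python
def reconcile_extra_args(extra_args, auth_config_path):
    """Drop legacy oidc-* flags and set authentication-config (hypothetical helper).

    Returns a new dict; the input mapping is left untouched.
    """
    cleaned = {
        name: value
        for name, value in extra_args.items()
        if not name.startswith("oidc-")
    }
    cleaned["authentication-config"] = auth_config_path
    return cleaned
```

Called with `{"oidc-issuer-url": "...", "oidc-client-id": "kubernetes", "profiling": "false"}` and a placeholder path such as `/etc/kubernetes/auth-config.yaml`, it keeps `profiling`, removes both `oidc-*` entries, and adds `authentication-config`. Running it again on its own output changes nothing, which matches the idempotent behaviour the restore script is described as having.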
|
||||
|
||||
**Backstops:**
|
||||
- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does
|
||||
NOT block — the drift only breaks SSO, which is recoverable) if
|
||||
`--authentication-config` would still be dropped.
|
||||
- The `rbac` stack still publishes its restore script to the
|
||||
`kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on
|
||||
master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with
|
||||
auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also*
|
||||
re-reconciles kubeadm-config. Self-skips when master is already at target.
|
||||
**Now automated:** the `rbac` stack publishes its OIDC restore script to the
|
||||
`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's
|
||||
`phase_master` re-runs it on master immediately after `kubeadm upgrade apply`
|
||||
(while tigera-operator is still quiesced, so the flag-add apiserver restart can't
|
||||
crashloop the operator). It's idempotent, health-gates `/livez` with
|
||||
auto-rollback, and is **non-fatal** — a failure only lags SSO until the next rbac
|
||||
apply (the version upgrade itself already succeeded). So a chain-driven
|
||||
control-plane bump no longer breaks SSO. The master phase self-skips when master
|
||||
is already at target, so this only runs when master was actually upgraded.
|
||||
|
||||
**Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the
|
||||
chain logged `WARN: --authentication-config absent after re-apply`:
|
||||
|
|
|
|||
|
|
@ -1,72 +0,0 @@
|
|||
# Runbook: pfSense WAN / egress outage
|
||||
|
||||
**Scope:** the cluster (and home) loses **internet egress** while pfSense is
|
||||
otherwise alive — internal VLAN routing and DNS keep working. This is the
|
||||
**2026-06-27 incident class**: pfSense (Proxmox **VMID 101**) stopped passing
|
||||
IPv4 egress for ~20 min (00:02→00:23 UTC) while LAN/OPT1 routing + Unbound
|
||||
stayed up; recovery required a manual reboot, and **nothing alerted** (no egress
|
||||
probe existed; the cloudflared replica metric stayed green). The alerts +
|
||||
probes below close that gap. Incident detail: memory ids #6715–#6723.
|
||||
|
||||
pfSense is a **single point of failure** (no HA): it is the k8s default gateway
|
||||
(`10.0.20.1`), Kea DHCP, Unbound DNS, NAT, and the WireGuard hub. WAN is
|
||||
**static** `192.168.1.2/24`, upstream gateway `WANGW = 192.168.1.1` (the TP-Link
|
||||
Archer AX6000). WANGW is the sole IPv4 default gateway, with no gateway group or failover.
|
||||
|
||||
## Alerts (all in `stacks/monitoring/modules/monitoring/`)
|
||||
|
||||
| Alert | Signal | Means |
|
||||
|-------|--------|-------|
|
||||
| `WANGatewayUnreachable` (critical) | in-cluster ICMP to `192.168.1.1` fails >3m | pfSense's upstream gateway is unreachable from the cluster |
|
||||
| `InternetEgressDown` (critical) | in-cluster ICMP to **both** `9.9.9.9` and `1.1.1.1` fails >2m | internet egress through pfSense NAT is black-holed |
|
||||
| `ExternalDNSResolutionDown` (warning) | UDP/53 to both public resolvers fails >3m | egress or external-DNS path broken |
|
||||
| `EgressOnlyDivergence` (critical) | t3-probe `cloudflare` leg down **while** `internal` leg up >3m | egress-specific failure, internal healthy (the exact 2026-06-27 signature) |
|
||||
| `PfSenseVMDown` (critical) | `pve_up{id="qemu/101"}==0` while host up >2m | the pfSense VM stopped/crashed (host fine) |
|
||||
| `CloudflaredTunnelConnLoss` (warning, Loki) | >20 cloudflared edge-conn failures/5m | tunnel/egress trouble (canary that fires first; replica metric is blind) |
|
||||
|
||||
Probes run **from inside the cluster** (blackbox-exporter, pod → node → pfSense
|
||||
NAT), so they exercise the exact egress path that fails. `WANGatewayUnreachable`
|
||||
/ `InternetEgressDown` **inhibit** the downstream egress symptoms so one root
|
||||
alert pages, not a storm.
|
||||
|
||||
`PfSenseVMDown` **does not** catch a *guest-internal* reboot — `pve_up` tracks
|
||||
the qemu process, which survives an in-guest reboot (this is why 2026-06-27 was
|
||||
metric-invisible). `CloudflaredTunnelConnLoss` + the probe alerts cover that case.
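The boolean logic behind two of the signals in the table can be sketched as follows. The function and parameter names are hypothetical, each argument stands for the current result of one probe, and the `>2m` / `>3m` hold times and the inhibition rules are not modelled.

```python
def internet_egress_down(quad9_ok, cloudflare_ok):
    """InternetEgressDown: ICMP to BOTH 9.9.9.9 and 1.1.1.1 must be failing."""
    return not quad9_ok and not cloudflare_ok


def egress_only_divergence(cloudflare_leg_up, internal_leg_up):
    """EgressOnlyDivergence: cloudflare leg down while the internal leg is still up."""
    return internal_leg_up and not cloudflare_leg_up
```

Requiring both public targets to fail keeps a single unreachable resolver from paging, and the divergence check stays quiet when the internal leg is down too, since that is no longer an egress-specific failure.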
|
||||
|
||||
## Diagnose (read-only first)
|
||||
|
||||
1. **Confirm scope** — is it egress-only or total?
|
||||
- `kubectl -n monitoring` Grafana → `probe_success{job=~"wan-gateway-icmp|internet-egress-icmp"}` and `t3probe_connected` by `leg`.
|
||||
- Internal still up? `pve_up{id="qemu/101"}` should be `1`; internal k8s DNS (`10.0.20.1`) still resolving = pfSense alive, egress-only.
|
||||
2. **Capture pfSense on-box logs BEFORE rebooting** (they persist on disk — no RAM-disk — and are the only source that proves the mechanism; they are NOT shipped to Loki):
|
||||
```
|
||||
ssh -i ~/.ssh/id_ed25519 admin@10.0.20.1 # devvm wizard key (id #6784)
|
||||
clog /var/log/gateways.log | grep -iE 'WANGW|down|up|delay|loss' # dpinger gateway alarms
|
||||
clog /var/log/routing.log | grep -iE 'default|route' # default-route add/delete
|
||||
clog /var/log/system.log | tail -200
|
||||
netstat -rn | head # is the default route present?
|
||||
ls -la /var/crash/ # panic/textdump?
|
||||
```
|
||||
(If SSH is rejected post-reboot, the reboot regenerated `authorized_keys` from
|
||||
config.xml — re-add the key via console or WebGUI; see id #6718.)
|
||||
3. **Upstream check** — is the TP-Link / ISP up? It held the same public IP with
|
||||
clean DHCP renewals through the 2026-06-27 event, so a *sustained* upstream
|
||||
fault is unlikely; a reboot fixing it points at **pfSense-side state**.
|
||||
|
||||
## Recover
|
||||
|
||||
- **Fast path (known fix):** reboot pfSense — re-adds the default route, re-arms
|
||||
dpinger, flushes pf state. **Capture the logs above FIRST** (a reboot wipes
|
||||
the volatile evidence needed to find the real mechanism).
|
||||
- Targeted (if logs show a dpinger gateway-down): System → Routing → Gateways →
|
||||
WANGW; check the monitor IP + dpinger state; re-enable the gateway / let it
|
||||
re-eval. Confirm `netstat -rn` shows the default route restored.
|
||||
|
||||
## Prevent / harden (deferred, needs a live-pfSense change)
|
||||
|
||||
Not done in this monitoring change; tracked for a follow-up with hands-on
pfSense access:

- point dpinger's monitor at the local gateway (`192.168.1.1`) instead of an
  external IP, and widen its thresholds;
- disable `gw_down_kill_states` for the single WAN;
- add a failover gateway group;
- add a 60s auto-recovery watchdog;
- ship pfSense system/gateway/routing syslog to the cluster so these logs become
  centrally queryable.
|
||||
|
|
@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
|
|||
[[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
|
||||
KUBECTL=""
|
||||
JSON_RESULTS=()
|
||||
TOTAL_CHECKS=48
|
||||
TOTAL_CHECKS=47
|
||||
|
||||
# Parallel execution settings. Each check function is self-contained — it
|
||||
# only reads cluster state and mutates the in-memory counters / JSON_RESULTS
|
||||
|
|
@ -3156,44 +3156,6 @@ PYEOF
|
|||
esac
|
||||
}
|
||||
|
||||
# --- 48. Goldmane edge-aggregator availability ---
|
||||
#
|
||||
# The goldmane-edge-aggregator Deployment (ADR-0014 / infra #58) streams Calico
|
||||
# Goldmane flows into the goldmane_edges CNPG DB — the durable who-talks-to-whom
|
||||
# trail. The pod has NO /metrics endpoint, so its liveness can't be scraped;
|
||||
# this check reads the Deployment's Available condition directly so the trail
|
||||
# silently dying surfaces in the health board (mirrors the AggregatorDown
|
||||
# Prometheus alert). Missing Deployment / not-Available -> FAIL.
|
||||
check_goldmane_aggregator() {
|
||||
section 48 "Goldmane Edge-Aggregator"
|
||||
local ns="goldmane-edge-aggregator" dep="goldmane-edge-aggregator"
|
||||
local avail desired ready
|
||||
|
||||
# One get; absent Deployment is a hard fail (the trail isn't deployed).
|
||||
if ! $KUBECTL get deploy "$dep" -n "$ns" >/dev/null 2>&1; then
|
||||
[[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
|
||||
fail "Deployment $ns/$dep not found — who-talks-to-whom edge trail is not running"
|
||||
json_add "goldmane_aggregator" "FAIL" "deployment missing"
|
||||
return 0
|
||||
fi
|
||||
|
||||
avail=$($KUBECTL get deploy "$dep" -n "$ns" \
|
||||
-o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)
|
||||
ready=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.status.readyReplicas}' 2>/dev/null)
|
||||
desired=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.spec.replicas}' 2>/dev/null)
|
||||
ready=${ready:-0}
|
||||
desired=${desired:-0}
|
||||
|
||||
if [[ "$avail" == "True" ]]; then
|
||||
pass "Edge-aggregator Available ($ready/$desired ready)"
|
||||
json_add "goldmane_aggregator" "PASS" "${ready}/${desired} ready"
|
||||
else
|
||||
[[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
|
||||
fail "Edge-aggregator NOT Available ($ready/$desired ready) — edge trail has stopped recording"
|
||||
json_add "goldmane_aggregator" "FAIL" "${ready}/${desired} ready; Available=${avail:-unknown}"
|
||||
fi
|
||||
}
|
||||
|
||||
# --- Summary ---
|
||||
print_summary() {
|
||||
if [[ "$JSON" == true ]]; then
|
||||
|
|
@ -3262,7 +3224,7 @@ main() {
|
|||
check_monitoring_prom_am check_monitoring_vault check_monitoring_css
|
||||
check_external_replicas check_external_divergence check_pve_thermals
|
||||
check_pve_load check_external_traefik_5xx check_ha_status_dashboard
|
||||
check_immich_search check_csi_ghost_drift check_goldmane_aggregator
|
||||
check_immich_search check_csi_ghost_drift
|
||||
)
|
||||
|
||||
# Auto-fix mutates cluster state inside individual checks — keep that
|
||||
|
|
|
|||
|
|
@ -240,79 +240,6 @@ EOF
|
|||
log "wrote OIDC kubeconfig -> $user:~/.kube/config"
|
||||
}
|
||||
|
||||
# Hands-off chrome-service browser credential. For a user who has a
|
||||
# `<os_user>-browser` ServiceAccount in the chrome-service namespace (created in
|
||||
# stacks/chrome-service/rbac.tf), install a DUAL-CONTEXT kubeconfig whose DEFAULT
|
||||
# context authenticates with that SA's long-lived token — so `homelab browser`
|
||||
# (which shells out to `kubectl port-forward -n chrome-service`) works
|
||||
# non-interactively, even from a headless agent session (the user's interactive
|
||||
# OIDC login can't authenticate a headless kubectl). The user's personal OIDC
|
||||
# identity is retained as the `oidc@homelab` named context
|
||||
# (`kubectl --context oidc@homelab`). TF (the SA's existence) is the source of
|
||||
# truth for WHO gets this — there is no roster flag. Idempotent (cmp-guarded; SA
|
||||
# tokens are stable) + best-effort (cluster/secret unreachable -> WARN, never aborts).
|
||||
install_browser_kubeconfig() {
|
||||
local user="$1" home kc sa secret token server ca tmp
|
||||
home="$(getent passwd "$user" | cut -d: -f6)"
|
||||
[[ -z "$home" ]] && return 0
|
||||
sa="${user}-browser"
|
||||
secret="${sa}-token"
|
||||
[[ -r "$ADMIN_KUBECONFIG" ]] || return 0
|
||||
# Gate: only users with a chrome-service browser SA (TF-driven). Best-effort read.
|
||||
KUBECONFIG="$ADMIN_KUBECONFIG" kubectl --request-timeout=10s -n chrome-service get serviceaccount "$sa" >/dev/null 2>&1 || return 0
|
||||
token="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl --request-timeout=10s -n chrome-service get secret "$secret" -o jsonpath='{.data.token}' 2>/dev/null | base64 -d 2>/dev/null || true)"
|
||||
[[ -n "$token" ]] || { log "WARN: browser SA token not ready for $user (secret chrome-service/$secret) — skipped"; return 0; }
|
||||
server="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl config view --raw --minify -o jsonpath='{.clusters[0].cluster.server}')"
|
||||
ca="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl config view --raw --minify -o jsonpath='{.clusters[0].cluster.certificate-authority-data}')"
|
||||
[[ -n "$server" && -n "$ca" ]] || { log "WARN: could not read cluster server/CA -> skip browser kubeconfig for $user"; return 0; }
|
||||
kc="$home/.kube/config"
|
||||
tmp="$(mktemp)"
|
||||
cat > "$tmp" <<EOF
|
||||
apiVersion: v1
|
||||
kind: Config
|
||||
clusters:
|
||||
- name: homelab
|
||||
cluster:
|
||||
server: $server
|
||||
certificate-authority-data: $ca
|
||||
contexts:
|
||||
- name: ${sa}@homelab
|
||||
context:
|
||||
cluster: homelab
|
||||
user: $sa
|
||||
- name: oidc@homelab
|
||||
context:
|
||||
cluster: homelab
|
||||
user: oidc
|
||||
current-context: ${sa}@homelab
|
||||
users:
|
||||
- name: $sa
|
||||
user:
|
||||
token: $token
|
||||
- name: oidc
|
||||
user:
|
||||
exec:
|
||||
apiVersion: client.authentication.k8s.io/v1beta1
|
||||
command: kubectl
|
||||
args:
|
||||
- oidc-login
|
||||
- get-token
|
||||
- --oidc-issuer-url=$OIDC_ISSUER
|
||||
- --oidc-client-id=kubernetes
|
||||
- --oidc-extra-scope=email
|
||||
- --oidc-extra-scope=profile
|
||||
- --oidc-extra-scope=groups
|
||||
interactiveMode: IfAvailable
|
||||
EOF
|
||||
if cmp -s "$tmp" "$kc" 2>/dev/null; then rm -f "$tmp"; return 0; fi # already current -> no churn
|
||||
if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] dual-context (SA default + OIDC) browser kubeconfig -> $user:$kc"; rm -f "$tmp"; return 0; fi
|
||||
install -d -o "$user" -g "$user" -m 0700 "$home/.kube"
|
||||
install -o "$user" -g "$user" -m 0600 "$tmp" "$kc" || { log "WARN: failed to write browser kubeconfig for $user"; rm -f "$tmp"; return 0; }
|
||||
rm -f "$tmp"
|
||||
log "wrote dual-context browser kubeconfig (SA default + OIDC) -> $user:~/.kube/config"
|
||||
return 0
|
||||
}
|
||||
|
||||
# Idempotently set KEY=VALUE in a t3-serve env file, PRESERVING other lines — so writing
|
||||
# T3_PORT never clobbers an injected CLAUDE_CODE_OAUTH_TOKEN, and vice-versa. Mode 0600.
|
||||
env_set() {
|
||||
|
|
@ -667,7 +594,6 @@ while IFS=$'\t' read -r os_user tier shell groups_csv code_layout repos_csv; do
|
|||
refresh_user_clone "$os_user" code
|
||||
fi
|
||||
install_user_kubeconfig "$os_user"
|
||||
install_browser_kubeconfig "$os_user" # hands-off chrome-service CLI cred (no-op unless the user has a browser SA)
|
||||
deploy_user_launcher "$os_user" # keep ~/start-claude.sh current (skel only seeds new accounts)
|
||||
fi
|
||||
refresh_codex_mirror "$os_user" # all tiers — mirror of the managed claudeMd
|
||||
|
|
|
|||
|
|
@ -11,12 +11,6 @@ Environment=HOME=/home/%i
|
|||
Environment=PATH=/usr/local/bin:/usr/bin:/bin:/home/%i/.local/bin
|
||||
Environment=NODE_ENV=production
|
||||
EnvironmentFile=/etc/t3-serve/%i.env
|
||||
# Optional per-user long-lived CLAUDE_CODE_OAUTH_TOKEN, materialized by
|
||||
# claude-auth-sync from the user's own Vault path. Non-rotating, so t3's
|
||||
# concurrent agent sessions can't race on OAuth refresh-token rotation and wipe
|
||||
# the shared ~/.claude/.credentials.json. Leading '-' = optional (absent for
|
||||
# users on the normal per-user Enterprise-SSO credential flow).
|
||||
EnvironmentFile=-/home/%i/.config/claude-auth-sync/claude-oauth.env
|
||||
WorkingDirectory=/home/%i
|
||||
ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3
|
||||
Restart=on-failure
|
||||
|
|
|
|||
|
|
@ -28,61 +28,5 @@ ok "accept own scoped Vault token" cas_vault_identity_ok token-devvm-claude-auth
|
|||
no "reject another user's token" cas_vault_identity_ok token-devvm-claude-auth-anca default,workstation-claude-anca
|
||||
no "reject wrong policy" cas_vault_identity_ok token-devvm-claude-auth-emo default,workstation-claude-anca
|
||||
|
||||
# --- Regression: cas_backup must MERGE into the shared Vault path, preserving
|
||||
# sibling keys that other tools co-locate there (e.g. `homelab vault`'s
|
||||
# vaultwarden_* creds) — NOT overwrite the whole KV document. A blind `kv put`
|
||||
# wiped them every 6h (claude-auth-sync clobber, 2026-06-26).
|
||||
fakebin="$tmp/bin"; mkdir -p "$fakebin"
|
||||
store="$tmp/vault-store.json"
|
||||
cat > "$fakebin/vault" <<'FAKE'
|
||||
#!/usr/bin/env bash
|
||||
# Minimal KV-v2 fake backed by $VAULT_FAKE_STORE (a flat JSON object).
|
||||
[[ "$1" == kv ]] || { echo '{}'; exit 0; } # token lookup etc. -> ignore
|
||||
op="$2"; shift 2
|
||||
store="$VAULT_FAKE_STORE"
|
||||
case "$op" in
|
||||
get)
|
||||
for a in "$@"; do [[ "$a" == -field=* ]] && field="${a#-field=}"; done
|
||||
if [[ "$*" == *-format=json* ]]; then
|
||||
[[ -f "$store" ]] || { echo "No value found"; exit 2; }
|
||||
jq -n --argjson d "$(cat "$store")" '{data:{data:$d}}'; exit 0
|
||||
fi
|
||||
[[ -f "$store" ]] || exit 2 # bare get == existence check
|
||||
if [[ -n "${field:-}" ]]; then
|
||||
v="$(jq -r --arg k "$field" '.[$k] // empty' "$store")"; [[ -n "$v" ]] || exit 1
|
||||
printf '%s' "$v"; exit 0
|
||||
fi
|
||||
exit 0 ;;
|
||||
put) echo '{}' > "$store" ;; # full replace
|
||||
patch) [[ -f "$store" ]] || { echo "No value found"; exit 2; } ;; # merge (rw)
|
||||
*) exit 1 ;;
|
||||
esac
|
||||
for a in "$@"; do
|
||||
case "$a" in
|
||||
-*|secret/*) continue ;; # flags + the path arg
|
||||
*=*) k="${a%%=*}"; v="${a#*=}"
|
||||
t="$(mktemp)"; jq --arg k "$k" --arg v "$v" '.[$k]=$v' "$store" > "$t" && mv "$t" "$store" ;;
|
||||
esac
|
||||
done
|
||||
exit 0
|
||||
FAKE
|
||||
chmod +x "$fakebin/vault"
|
||||
|
||||
CAS_VAULT_PATH="secret/workstation/claude-users/test"
|
||||
CAS_CREDENTIALS="$tmp/credentials.json"
|
||||
CAS_STATE_DIR="$tmp/state"
|
||||
_oldpath="$PATH"; PATH="$fakebin:$PATH"; export VAULT_FAKE_STORE="$store"
|
||||
|
||||
printf '{"vaultwarden_master_password":"keep-me"}\n' > "$store" # pretend `homelab vault setup` ran
|
||||
ok "backup succeeds (existing doc)" cas_backup
|
||||
eq "merge preserves sibling key" keep-me "$(jq -r '.vaultwarden_master_password' "$store")"
|
||||
eq "merge writes claude oauth" access "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")"
|
||||
|
||||
rm -f "$store" # fresh user: no doc yet
|
||||
ok "backup succeeds (creates doc)" cas_backup
|
||||
eq "create writes claude oauth" access "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")"
|
||||
|
||||
PATH="$_oldpath"; unset VAULT_FAKE_STORE
|
||||
|
||||
printf '\n%d passed, %d failed\n' "$pass" "$fail"
|
||||
(( fail == 0 ))
|
||||
|
|
|
|||
|
|
@ -13,10 +13,6 @@ CAS_VAULT_TOKEN_FILE="${CLAUDE_AUTH_VAULT_TOKEN_FILE:-$CAS_CONFIG_DIR/vault-toke
|
|||
CAS_VAULT_PATH="${CLAUDE_AUTH_VAULT_PATH:-secret/workstation/claude-users/$CAS_USER}"
|
||||
CAS_STATE_DIR="${CLAUDE_AUTH_STATE_DIR:-$CAS_HOME/.local/state/claude-auth-sync}"
|
||||
CAS_LOG="$CAS_STATE_DIR/sync.log"
|
||||
# Where a long-lived per-user setup-token is materialized as an env file
|
||||
# (KEY=VALUE) for start-claude.sh + t3-serve@.service to load. Lives under the
|
||||
# already-ReadWritePaths config dir so the sandboxed service may write it.
|
||||
CAS_TOKEN_ENV_FILE="${CLAUDE_AUTH_TOKEN_ENV_FILE:-$CAS_CONFIG_DIR/claude-oauth.env}"
|
||||
|
||||
cas_log() {
|
||||
mkdir -p "$CAS_STATE_DIR"
|
||||
|
|
@ -86,17 +82,7 @@ cas_backup() {
|
|||
return 1
|
||||
}
|
||||
expires="$(jq -r '.expiresAt' <<<"$oauth")"
|
||||
# MERGE into the shared path so sibling keys other tools co-locate there
|
||||
# (e.g. `homelab vault`'s vaultwarden_* creds) survive. `kv patch -method=rw`
|
||||
# is read+update (needs no `patch` capability) but requires the secret to
|
||||
# already exist, so create it with `kv put` on the very first backup only.
|
||||
local -a write_cmd
|
||||
if vault kv get "$CAS_VAULT_PATH" >/dev/null 2>&1; then
|
||||
write_cmd=(vault kv patch -method=rw "$CAS_VAULT_PATH")
|
||||
else
|
||||
write_cmd=(vault kv put "$CAS_VAULT_PATH")
|
||||
fi
|
||||
"${write_cmd[@]}" \
|
||||
vault kv put "$CAS_VAULT_PATH" \
|
||||
claude_ai_oauth_json="$oauth" \
|
||||
credential_expires_at_ms="$expires" \
|
||||
backed_up_at="$(date -Is)" >/dev/null || {
|
||||
|
|
@ -137,41 +123,6 @@ cas_restore() {
|
|||
cas_log "RECOVERED restored Claude OAuth state from Vault"
|
||||
}
|
||||
|
||||
# A user-scoped, long-lived setup-token (`sk-ant-oat01-…`, ~1y, NON-rotating) may
|
||||
# be stored in this user's OWN Vault path (field `setup_token`). When present it
|
||||
# is the authoritative credential: it bypasses the shared
|
||||
# ~/.claude/.credentials.json OAuth refresh-token rotation entirely — the fix for
|
||||
# users running many concurrent Claude sessions (interactive + t3-serve + always-on
|
||||
# agents) that otherwise race on refresh and wipe each other's refresh token.
|
||||
# We materialize it to a user-owned env file that start-claude.sh and
|
||||
# t3-serve@.service load as CLAUDE_CODE_OAUTH_TOKEN. This is the user's OWN
|
||||
# Enterprise identity, NOT the forbidden legacy SHARED token — it never crosses
|
||||
# OS users. Returns 0 when a token is active, so the caller skips the
|
||||
# rotating-credential validate/backup/restore (probing the now-vestigial
|
||||
# credential would otherwise emit false WorkstationClaudeAuthInvalid alerts).
|
||||
cas_sync_setup_token() {
|
||||
local token desired tmp
|
||||
token="$(vault kv get -field=setup_token "$CAS_VAULT_PATH" 2>/dev/null)" || token=""
|
||||
if [[ "$token" != sk-ant-oat01-* ]]; then
|
||||
if [[ -e "$CAS_TOKEN_ENV_FILE" ]]; then
|
||||
rm -f "$CAS_TOKEN_ENV_FILE"
|
||||
cas_log "removed stale CLAUDE_CODE_OAUTH_TOKEN env (no setup-token in Vault)"
|
||||
fi
|
||||
return 1
|
||||
fi
|
||||
desired="CLAUDE_CODE_OAUTH_TOKEN=$token"
|
||||
if [[ -r "$CAS_TOKEN_ENV_FILE" && "$(<"$CAS_TOKEN_ENV_FILE")" == "$desired" ]]; then
|
||||
cas_log "OK long-lived setup-token active (CLAUDE_CODE_OAUTH_TOKEN current); credential checks skipped"
|
||||
return 0
|
||||
fi
|
||||
tmp="$(mktemp "${CAS_TOKEN_ENV_FILE}.XXXXXX")" || { cas_log "FAIL could not stage token env file"; return 1; }
|
||||
printf '%s\n' "$desired" > "$tmp"
|
||||
chmod 0600 "$tmp"
|
||||
mv "$tmp" "$CAS_TOKEN_ENV_FILE"
|
||||
cas_log "OK long-lived setup-token active; CLAUDE_CODE_OAUTH_TOKEN materialized; credential checks skipped"
|
||||
return 0
|
||||
}
|
||||
|
||||
cas_main() {
|
||||
umask 077
|
||||
for bin in jq vault claude timeout flock; do
|
||||
|
|
@ -182,11 +133,6 @@ cas_main() {
|
|||
flock -n 9 || { cas_log "SKIP another sync is already running"; return 0; }
|
||||
|
||||
cas_prepare_vault || return 1
|
||||
# A long-lived per-user setup-token, if provisioned, is authoritative and
|
||||
# non-rotating — materialize it and skip the rotating-credential dance.
|
||||
if cas_sync_setup_token; then
|
||||
return 0
|
||||
fi
|
||||
if cas_live_auth_ok; then
|
||||
cas_backup
|
||||
return
|
||||
|
|
|
|||
|
|
@ -45,15 +45,9 @@ def main() -> None:
|
|||
try:
|
||||
res = subprocess.run(
|
||||
[homelab, "memory", "recall", prompt, "--limit", "5"],
|
||||
capture_output=True, text=True, errors="replace", timeout=4,
|
||||
env=os.environ,
|
||||
capture_output=True, text=True, timeout=4, env=os.environ,
|
||||
)
|
||||
except Exception:
|
||||
# Best-effort: ANY failure — timeout, OSError, or a UnicodeDecodeError on
|
||||
# truncated multibyte (Cyrillic) output — must silently skip recall this
|
||||
# turn, exactly like the MCP being unavailable. errors="replace" above
|
||||
# also keeps a mid-rune-truncated payload from raising here at all. Never
|
||||
# let this hook surface a "UserPromptSubmit hook error".
|
||||
except (subprocess.TimeoutExpired, OSError):
|
||||
return
|
||||
|
||||
out = (res.stdout or "").strip()
|
||||
|
|
|
|||
|
|
@ -19,29 +19,13 @@ unpinned-CLI dependencies out of the hourly **root** reconcile.
|
|||
|
||||
- `mattpocock/skills` (https://github.com/mattpocock/skills) — all except `find-skills`
|
||||
- `vercel-labs/skills` (https://github.com/vercel-labs/skills) — `find-skills`
|
||||
- **homelab-local, emo-PERSONALIZED** — `cluster-health` here is an
|
||||
**emo-specific variant**, not a copy of the canonical skill. It started as a
|
||||
copy of this repo's `.claude/skills/cluster-health/` but was rewritten on
|
||||
2026-06-26 to focus on ha-sofia + emo's Sofia devices (emo is the only entry
|
||||
in `SKILL_USERS`, a read-only power-user). The canonical admin skill
|
||||
(`.claude/skills/cluster-health/`) is the full 47-check version and is left
|
||||
untouched. **Do NOT `cp -a` the canonical copy over this one** — that would
|
||||
clobber the personalization. Maintain the two independently.
|
||||
|
||||
## Refreshing
|
||||
|
||||
Re-snapshot the upstream skills from a current install and commit the diff:
|
||||
Re-snapshot from a current install and commit the diff:
|
||||
|
||||
```sh
cp -a ~/.agents/skills/. scripts/workstation/claude-skills/
```
|
||||
|
||||
`cluster-health` is hand-maintained (emo variant) — it is **not** covered by the
|
||||
`cp -a` above and must **not** be overwritten from `.claude/skills/`. Edit it in
|
||||
place here when emo's needs change, then refresh his live copy (the provisioner's
|
||||
`install_skills()` is if-absent, so it won't update an existing `~/.agents/skills`
|
||||
copy — `cp` the new `SKILL.md` to `/home/emo/.agents/skills/cluster-health/` and
|
||||
`chown emo:emo`, or remove emo's copy and re-run the reconcile).
|
||||
|
||||
Snapshot taken 2026-06-23 (upstream); `cluster-health` vendored 2026-06-26,
|
||||
personalized for emo 2026-06-26.
|
||||
Snapshot taken 2026-06-23.
|
||||
|
|
|
|||
|
|
@ -1,146 +0,0 @@
|
|||
---
|
||||
name: cluster-health
|
||||
description: |
|
||||
Personalized for emo. Check whether the homelab Kubernetes cluster is
|
||||
affecting ha-sofia or the Sofia smart-home devices it runs (Tuya devices,
|
||||
the MPPT ATS, lights, climate, security, irrigation). Use when:
|
||||
(1) "is ha-sofia ok", "are my devices / the ATS / the lights down",
|
||||
(2) "is the cluster affecting Sofia / my devices",
|
||||
(3) "check the cluster", "cluster health", "is everything running",
|
||||
(4) a device on the Барзини → Статус dashboard looks offline.
|
||||
Runs the cluster-wide healthcheck read-only and triages it by what
|
||||
ha-sofia actually depends on; the rest of the cluster is the admin's area.
|
||||
author: Claude Code
|
||||
version: 3.0.0-emo
|
||||
date: 2026-06-26
|
||||
---
|
||||
|
||||
# Cluster Health — personalized for emo (ha-sofia focus)
|
||||
|
||||
## What you actually care about
|
||||
|
||||
You care about **ha-sofia** and the **Sofia smart-home devices** it runs —
|
||||
the Tuya devices, the **MPPT ATS**, and the lights / climate / security /
|
||||
irrigation on your **Барзини → Статус** dashboard. The wider Kubernetes
|
||||
cluster matters to you **only when it's breaking something ha-sofia or your
|
||||
devices depend on.** Anything else is the admin's (wizard's) area — note it in
|
||||
one line and move on; don't chase it.
|
||||
|
||||
You have **read-only** cluster access. You can SEE everything but change
|
||||
nothing — so when something on your chain is broken, the job is to confirm it
|
||||
and hand it off, not to repair it.
|
||||
|
||||
## How ha-sofia depends on the cluster
|
||||
|
||||
ha-sofia itself runs at the house (HAOS at https://ha-sofia.viktorbarzin.me) —
|
||||
**not** in the cluster. The cluster reaches it through exactly two things:
|
||||
|
||||
1. **tuya-bridge** (namespace `tuya-bridge`) — the REST API ha-sofia calls for
|
||||
every Tuya device **and the MPPT ATS**. If it's unhealthy, your Tuya devices
|
||||
and the ATS stop responding. **This is the #1 thing to check.**
|
||||
2. **The path that carries ha-sofia ⇄ tuya-bridge and keeps ha-sofia
|
||||
reachable**: cloudflared (tunnel) → Traefik (LB) → the ingress + TLS cert
|
||||
for `tuya-bridge.viktorbarzin.me` and `ha-sofia.viktorbarzin.me`, plus
|
||||
Technitium DNS. If any of these break, ha-sofia can't reach tuya-bridge and
|
||||
you can't reach ha-sofia remotely.
|
||||
|
||||
Everything else in the cluster is unrelated to you unless it's hosting one of
|
||||
those pods.
|
||||
|
||||
## Step 1 — run the healthcheck (read-only, with your HA token)
|
||||
|
||||
Your account can't read Vault, so load your own ha-sofia token first (it was
|
||||
minted for you and lives at `~/.config/cluster-health/haos_token`). Then run
|
||||
the script from YOUR clone, read-only:
|
||||
|
||||
```bash
cd /home/emo/code
export HOME_ASSISTANT_SOFIA_TOKEN="$(cat ~/.config/cluster-health/haos_token)"
bash scripts/cluster_healthcheck.sh --no-fix --quiet
# machine-readable instead:
# bash scripts/cluster_healthcheck.sh --no-fix --quiet --json | tee /tmp/cluster-health.json
```
|
||||
|
||||
- **Never pass `--fix`** — it deletes pods (a write); you're read-only and it
|
||||
will fail.
|
||||
- Exit codes: `0` healthy, `1` warnings, `2` failures.
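If the run is wrapped in a script, the exit-code contract above can be turned into a plain-language status. This is a small sketch; `describe_exit` and `run_and_describe` are hypothetical helper names, and only the three exit codes listed above come from the skill text.

```python
import subprocess

# Exit-code contract as stated in the list above.
EXIT_MEANING = {0: "healthy", 1: "warnings", 2: "failures"}


def describe_exit(code: int) -> str:
    """Map a healthcheck exit code to its meaning; flag anything undocumented."""
    return EXIT_MEANING.get(code, f"unexpected exit code {code}")


def run_and_describe(cmd: list[str]) -> str:
    """Run a command without raising on non-zero exit and describe the result."""
    completed = subprocess.run(cmd, check=False)
    return describe_exit(completed.returncode)
```

A call would look like `run_and_describe(["bash", "scripts/cluster_healthcheck.sh", "--no-fix", "--quiet"])`, run from the clone directory.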
|
||||
|
||||
With the token exported, the **ha-sofia checks run for you**:
|
||||
26 Entity Availability · 27 Integration Health · 28 Automation Status ·
|
||||
29 System Resources · **45 Status Dashboard** — your Барзини → Статус view,
|
||||
classifying every device tile as OK / ⚠️ / Offline across Сигурност, Мрежа &
|
||||
IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна. Check 30 also
|
||||
covers the **tuya** exporter.
|
||||
|
||||
## Step 2 — triage the output by relevance to YOU
|
||||
|
||||
Read the PASS/WARN/FAIL summary, then split the WARN/FAIL items in two:
|
||||
|
||||
- **On your chain → this is what matters.** Anything touching: `tuya-bridge`,
|
||||
`cloudflared`, `traefik`, DNS (check 21), the TLS cert / ingress for your two
|
||||
hosts (checks 12, 22, 31, 32), or a **node** hosting those pods — plus all the
|
||||
**ha-sofia** checks (26–29, 45) and the **tuya** exporter (30).
|
||||
- **Not on your chain → one line, then drop it.** Summarise as "N unrelated
|
||||
cluster issues (admin's area)" and don't investigate.
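The split above can be sketched as a filter over check results. The record shape used here (`id`, `name`, `status` keys) is an assumption made for illustration; the actual `--json` output of the script is not shown in this document and may differ. The check numbers and component names are the ones listed above. Node-level problems are not covered, because knowing which node hosts the pods needs a separate lookup.

```python
# Check numbers named in the triage list: DNS (21), cert/ingress (12, 22, 31, 32),
# ha-sofia (26-29), tuya exporter (30), Status Dashboard (45).
ON_CHAIN_CHECKS = {12, 21, 22, 31, 32, 45} | set(range(26, 31))
ON_CHAIN_KEYWORDS = ("tuya-bridge", "cloudflared", "traefik")


def triage(results: list[dict]) -> tuple[list[dict], str]:
    """Split WARN/FAIL items into on-chain items and a one-line note for the rest."""
    on_chain: list[dict] = []
    unrelated = 0
    for item in results:
        if item["status"] not in ("WARN", "FAIL"):
            continue
        name = item.get("name", "").lower()
        if item["id"] in ON_CHAIN_CHECKS or any(k in name for k in ON_CHAIN_KEYWORDS):
            on_chain.append(item)
        else:
            unrelated += 1
    return on_chain, f"{unrelated} unrelated cluster issues (admin's area)"
```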
|
||||
|
||||
## Step 3 — read-only checks for your chain
|
||||
|
||||
All of these work with your read-only access:
|
||||
|
||||
```bash
# tuya-bridge — your devices + the ATS
kubectl get pods -n tuya-bridge
kubectl rollout status deploy/tuya-bridge -n tuya-bridge
kubectl logs -n tuya-bridge deploy/tuya-bridge --tail=50

# the reachability path ha-sofia uses
kubectl get pods -n cloudflared
kubectl get pods -n traefik
kubectl get ingress -A | grep -Ei 'tuya-bridge|ha-sofia'

# whole external path in one shot (DNS + tunnel + Traefik + cert):
curl -sI --max-time 10 https://tuya-bridge.viktorbarzin.me | head -1
# reachable -> HTTP/2 200 / 401 / 403 (any HTTP response = path is up)
# broken -> curl: timeout / could not resolve host
```
|
||||
|
||||
The fastest **device-level** signal is your own dashboard: open
|
||||
**https://ha-sofia.viktorbarzin.me → Барзини → Статус**. If devices show
|
||||
Offline / Разкачен / ⚠️ **but tuya-bridge is healthy**, the problem is at the
|
||||
house (device power / Wi-Fi / the Sofia TP-Link network) — **not** the cluster.
|
||||
|
||||
## Step 4 — if something on your chain is broken
|
||||
|
||||
You can't fix the cluster (read-only), so **capture + hand off**:
|
||||
|
||||
```bash
kubectl describe pod -n tuya-bridge <pod>
kubectl logs -n tuya-bridge <pod> --previous --tail=200
```
|
||||
|
||||
Then file it for the admin with the **`/file-issue`** skill — e.g. *"ha-sofia
|
||||
Tuya devices + ATS unresponsive; tuya-bridge pod CrashLooping"* with the output
|
||||
above. cloudflared / Traefik / DNS outages are cluster-wide — the admin's
|
||||
alerting is already firing, but file it so it's tracked from your side too.
|
||||
|
||||
## What will skip for you (expected — not failures)
|
||||
|
||||
A few checks need access your account doesn't have. They warn/skip — that's
|
||||
normal, and **none of them are on your ha-sofia chain**:
|
||||
|
||||
- **Uptime Kuma (14)** — needs an admin password from Vault.
|
||||
- **PVE host checks** — 36 (LVM snapshots), 43 (host thermals), 44 (host load),
|
||||
and the Proxmox CSI ghost-disk check — all need root SSH to the Proxmox host.
|
||||
- **`--fix`** — pod deletion (a write); not available to you.
|
||||
|
||||
(The ha-sofia checks are **not** in this list — your token makes them work.)
|
||||
|
||||
## Your ha-sofia token
|
||||
|
||||
- Stored at `~/.config/cluster-health/haos_token` (yours, mode 600).
|
||||
- It's a **dedicated** long-lived token, named `emo-cluster-health` under
|
||||
ha-sofia → your profile → **Long-Lived Access Tokens**. Revoking it there
|
||||
affects only you.
|
||||
- It currently carries admin-level HA scope (Home Assistant only lets a token
|
||||
be minted for the account that created it, and it was minted via the admin
|
||||
account). If it ever stops working, tell wizard and a fresh one can be minted.
|
||||
|
|
@ -1,4 +1,4 @@
|
|||
{
|
||||
"claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier for cluster/infra. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner) — don't try to exceed it. At the OS level, authorization follows the OS: where your user holds `sudo`, using it is permitted and audited.\n- OS access follows OS permissions, not a separate rule here. You may read whatever your OS user can read — directly or via `sudo` where you hold it — including other users' home directories, credentials, tokens, and `~/.claude`. Do not impose restrictions stricter than the OS; no extra prompt is required for OS-authorized access. Protect your own secrets at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. 
Org red-lines on top:\n - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n - Push rejected by branch protection (user not whitelisted) → fall back to a <os-user>/<topic> branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n - Keep every clone on a clean master when done; tell the user in plain words what happened.\n - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.\n",
|
||||
"claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner). Do not attempt to escalate privileges or reach another user's resources.\n- Secrets are per-user. Never read another user's home directory, credentials, tokens, or ~/.claude secrets. Your own secrets live in your home at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. 
Org red-lines on top:\n - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n - Push rejected by branch protection (user not whitelisted) → fall back to a <os-user>/<topic> branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n - Keep every clone on a clean master when done; tell the user in plain words what happened.\n - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.\n",
|
||||
"model": "claude-opus-4-8"
|
||||
}
|
||||
|
|
|
|||
|
|
@ -72,14 +72,11 @@ if [[ -n "$want_t3" && "$(t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/
|
|||
fi
|
||||
|
||||
# 2c) Bitwarden CLI — backs `homelab vault` (per-user no-HITL Vaultwarden access).
|
||||
# Install SYSTEM-WIDE (npm prefix /usr → /usr/bin/bw) so EVERY user's PATH
|
||||
# resolves it. The guard tests the SYSTEM path, NOT `command -v bw`: the
|
||||
# latter is satisfied by an admin's own ~/.local/bin/bw and would skip the
|
||||
# system install, leaving non-admins (emo, anca, …) with no backend. Pinned
|
||||
# major; best-effort (a failure only disables `homelab vault`).
|
||||
if [ ! -x /usr/bin/bw ] && [ ! -x /usr/local/bin/bw ]; then
|
||||
log "npm: installing @bitwarden/cli system-wide (homelab vault backend)"
|
||||
npm install -g --prefix /usr "@bitwarden/cli@^2024" >/dev/null 2>&1 || log "WARN: @bitwarden/cli install failed; homelab vault unavailable"
|
||||
# npm-global so every user's PATH resolves it. Pinned major; best-effort (a
|
||||
# failure only disables `homelab vault`, nothing else on the box).
|
||||
if ! command -v bw >/dev/null; then
|
||||
log "npm: installing @bitwarden/cli (homelab vault backend)"
|
||||
npm install -g "@bitwarden/cli@^2024" >/dev/null 2>&1 || log "WARN: @bitwarden/cli install failed; homelab vault unavailable"
|
||||
fi
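The two guards in this hunk answer different questions: `command -v bw` asks whether the current user's PATH resolves the binary, while the `-x /usr/bin/bw` test asks whether a system-wide copy exists. A short Python sketch of the difference follows; both function names are hypothetical.

```python
import os
import shutil


def system_binary_present(name: str, system_dirs=("/usr/bin", "/usr/local/bin")) -> bool:
    """True only if an executable copy exists in one of the system directories."""
    for directory in system_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return True
    return False


def path_lookup_present(name: str, search_path: str) -> bool:
    """True if the given PATH string resolves the binary (like `command -v`)."""
    return shutil.which(name, path=search_path) is not None
```

When an admin has a private copy in `~/.local/bin`, the PATH lookup succeeds for that admin while the system check still reports the binary as missing for everyone else.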
|
||||
|
||||
# 3) kubelogin (kubectl oidc-login) system-wide — NOT the apt 'kubelogin' (= Azure tool).
|
||||
|
|
|
|||
|
|
@ -93,15 +93,6 @@ ensure_onboarding() {
|
|||
}
|
||||
ensure_onboarding
|
||||
|
||||
# Load a per-user long-lived CLAUDE_CODE_OAUTH_TOKEN if claude-auth-sync has
|
||||
# materialized one from this user's own Vault path. A non-rotating setup-token
|
||||
# sidesteps the shared ~/.claude/.credentials.json OAuth refresh-token race that
|
||||
# logs out users running many concurrent agents (interactive + t3 + always-on).
|
||||
# Absent file -> no-op (normal per-user Enterprise-SSO flow). The user's OWN
|
||||
# token; never shared between OS users.
|
||||
_oauth_env="$HOME/.config/claude-auth-sync/claude-oauth.env"
|
||||
if [ -r "$_oauth_env" ]; then set -a; . "$_oauth_env"; set +a; fi
|
||||
|
||||
# Deliberately not `exec` so we can branch on the exit code: clean quit ends the
|
||||
# pane (ttyd closes the terminal); a crash drops to a shell so the tmux session
|
||||
# isn't destroyed-and-recreated in a ttyd auto-reconnect loop.
|
||||
|
|
|
|||
|
|
@ -5,9 +5,6 @@ variable "tls_secret_name" {
|
|||
variable "nfs_server" { type = string }
|
||||
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -5,9 +5,6 @@ variable "tls_secret_name" {
|
|||
variable "nfs_server" { type = string }
|
||||
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
@ -45,9 +42,6 @@ data "kubernetes_secret" "eso_secrets" {
|
|||
# DB credentials from Vault database engine (rotated automatically)
|
||||
# Provides DATABASE_URL that auto-updates when password rotates
|
||||
resource "kubernetes_manifest" "db_external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -1,46 +0,0 @@
|
|||
# SLOW-1a overlay over the official authentik server image.
|
||||
#
|
||||
# The login flow's identification stage renders each enabled source's UI login
|
||||
# button. Upstream authentik/stages/identification/stage.py does:
|
||||
# current_stage.sources.filter(enabled=True).order_by("name").select_subclasses()
|
||||
# The bare no-arg select_subclasses() (django-model-utils InheritanceManager)
|
||||
# LEFT-JOINs EVERY Source subtype table; on the cold-login hot path that is ~1.5s
|
||||
# (verified live on 2026.2.4: 1527ms vs 14ms). Passing only the subtypes that
|
||||
# actually render a UI login button — every concrete Source type that overrides
|
||||
# ui_login_button: oauth/saml/plex/telegram/kerberos, NOT the sync-only ldap/scim —
|
||||
# is ~100x faster and BYTE-IDENTICAL output (verified: concrete types + rendered
|
||||
# buttons match). django-model-utils accepts the lowercase subclass *accessor
|
||||
# names* as strings, so no new import is needed (no circular-import risk) — the
|
||||
# patch is a single, reviewable line edit.
|
||||
#
|
||||
# RE-VERIFY ON EVERY AUTHENTIK BUMP: bump the FROM tag below AND the image tag in
|
||||
# modules/authentik/values.yaml together. The grep guards fail the build LOUDLY if
|
||||
# the upstream target line moved. If a future authentik version adds a NEW
|
||||
# login-capable source type, add its lowercase accessor to the list below.
|
||||
# Upstream: the bare select_subclasses() is still present in main (no fix/PR as of
|
||||
# 2026-06-28) — drop this overlay once upstream narrows the query.
|
||||
FROM ghcr.io/goauthentik/server:2026.2.4
|
||||
|
||||
USER root
|
||||
RUN set -eux; \
|
||||
F=/authentik/stages/identification/stage.py; \
|
||||
grep -q 'order_by("name").select_subclasses()' "$F"; \
|
||||
sed -i 's/order_by("name")\.select_subclasses()/order_by("name").select_subclasses("oauthsource", "samlsource", "plexsource", "telegramsource", "kerberossource")/' "$F"; \
|
||||
grep -q 'select_subclasses("oauthsource", "samlsource", "plexsource", "telegramsource", "kerberossource")' "$F"; \
|
||||
PY="$(command -v python || command -v python3)"; "$PY" -c "import ast,sys; ast.parse(open('$F').read())"; \
|
||||
rm -f /authentik/stages/identification/__pycache__/stage.*.pyc
|
||||
|
||||
# PATCH #2 — old-browser BLANK LOGIN. authentik's modern flow SPA is ES2022 and
|
||||
# hard-fails (blank login) on Safari<=16.3 (e.g. iPadOS<=16.3). authentik already
|
||||
# ships a no-JS Simplified Flow Executor (SFE, ES5) but only serves it to
|
||||
# IE/old-Edge/PKeyAuth. patch-compat-sfe.py (a) extends compat_needs_sfe() to
|
||||
# serve the SFE to old Safari AND any iOS browser (Chrome/CriOS, Firefox/FxiOS —
|
||||
# all share the system WebKit) on iOS<=16.3, and (b) injects static social-login
|
||||
# <a> links into the SFE shell (the SFE can't render Identification-stage sources;
|
||||
# needed for password-less Google-only accounts). Clients get the REAL authentik
|
||||
# login (password + MFA + reputation, NO auth downgrade) instead of a blank page.
|
||||
# The script is guarded (asserts both upstream anchors + ast-parses) so the build
|
||||
# fails loudly if upstream moves — re-verify on every authentik bump.
|
||||
COPY patch-compat-sfe.py /tmp/patch-compat-sfe.py
|
||||
RUN python3 /tmp/patch-compat-sfe.py && rm -f /tmp/patch-compat-sfe.py
|
||||
USER authentik
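Both patches in this Dockerfile follow the same guard pattern: confirm the upstream anchor is present, apply the edit, confirm the result, and syntax-check the file so the build fails loudly if upstream moved. A minimal Python sketch of that pattern follows; `apply_guarded_patch` is a hypothetical helper. It requires the anchor exactly once, as `patch-compat-sfe.py` does, which is stricter than the `grep -q` guard (at least once) in the first `RUN` step.

```python
import ast


def apply_guarded_patch(source: str, anchor: str, replacement: str) -> str:
    """Replace anchor with replacement, failing loudly if the anchor has moved."""
    count = source.count(anchor)
    if count != 1:
        raise ValueError(f"anchor found {count} times, expected exactly 1")
    patched = source.replace(anchor, replacement)
    if replacement not in patched:
        raise ValueError("replacement missing after patching")
    ast.parse(patched)  # raises SyntaxError if the edit broke the module
    return patched
```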
|
||||
|
|
@ -49,15 +49,14 @@ resource "authentik_policy_expression" "admin_services_restriction" {
|
|||
|
||||
host = request.context.get("host", "")
|
||||
|
||||
# chrome-service noVNC (chrome.viktorbarzin.me) exposes LIVE logged-in browser
|
||||
# sessions from the SHARED persistent profile. Originally Viktor-only.
|
||||
# 2026-06-28 (Viktor's explicit decision): emo SHARES Viktor's browser, so emo
|
||||
# (emil.barzin / emil.barzin@gmail.com) is allowed in for noVNC form-filling +
|
||||
# captcha solving. Trade-off accepted: emo can therefore reach Viktor's warmed
|
||||
# sessions (the CLI half is the emo-browser ServiceAccount in
|
||||
# stacks/chrome-service/rbac.tf). akadmin kept as break-glass. Match username OR
|
||||
# email so neither attribute alone can lock anyone out.
|
||||
CHROME_ALLOWED = {"akadmin", "akadmin@viktorbarzin.me", "vbarzin@gmail.com", "emil.barzin", "emil.barzin@gmail.com"}
|
||||
# chrome-service noVNC (chrome.viktorbarzin.me) exposes Viktor's LIVE
|
||||
# logged-in browser sessions, so lock it to Viktor's own accounts ONLY.
|
||||
# "Home Server Admins" is NOT sufficient — emo (emil.barzin@gmail.com) is a
|
||||
# member. akadmin kept as break-glass. The homelab-browser CDP path is
|
||||
# already RBAC-gated (emo = oidc-power-user-readonly, no pods/portforward),
|
||||
# so this closes the only remaining, human, noVNC path. Match username OR
|
||||
# email so neither attribute alone can lock Viktor out.
|
||||
CHROME_ALLOWED = {"akadmin", "akadmin@viktorbarzin.me", "vbarzin@gmail.com"}
|
||||
if host == "chrome.viktorbarzin.me":
|
||||
return request.user.username in CHROME_ALLOWED or request.user.email in CHROME_ALLOWED
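The gate above can be exercised outside authentik as a plain function. This is a sketch under assumptions: inside authentik the expression receives a `request` object, which is replaced here by explicit arguments. Returning `None` for other hosts is a placeholder, because the rest of the expression is outside this hunk.

```python
from typing import Optional

CHROME_HOST = "chrome.viktorbarzin.me"


def chrome_access_allowed(host: str, username: str, email: str,
                          allowed: set[str]) -> Optional[bool]:
    """Allow the noVNC host only for identities whose username OR email is listed."""
    if host != CHROME_HOST:
        return None  # placeholder: other hosts are decided elsewhere in the policy
    return username in allowed or email in allowed
```

Matching on either attribute means a renamed username or a changed email does not lock an allowed account out by itself, which is the trade-off the comment describes.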
|
||||
|
||||
|
|
|
|||
|
|
@ -6,9 +6,6 @@
|
|||
# are non-secret and live in values.yaml. The reloader annotation rolls the
|
||||
# authentik pods if the password ever changes.
|
||||
resource "kubernetes_manifest" "authentik_email_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -29,12 +29,7 @@ resource "kubernetes_namespace" "authentik" {
|
|||
labels = {
|
||||
tier = var.tier
|
||||
"resource-governance/custom-quota" = "true"
|
||||
# Keel intentionally NOT enrolled: server+worker run our custom overlay image
|
||||
# (ghcr.io/viktorbarzin/authentik-server — see values.yaml global.image +
|
||||
# stacks/authentik/Dockerfile). The tag is pinned explicitly and bumped
|
||||
# manually (rebuild the overlay FROM the new authentik version + repoint), so
|
||||
# a Keel auto-bump would only risk re-introducing the upstream tag / the
|
||||
# 2026-06-10 downgrade-boot-storm class. Re-enroll only if the overlay is dropped.
|
||||
"keel.sh/enrolled" = "true"
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
|
|
@ -87,11 +82,6 @@ module "ingress" {
|
|||
service_name = "goauthentik-server"
|
||||
tls_secret_name = var.tls_secret_name
|
||||
anti_ai_scraping = false
|
||||
# Swap the shared 10/50 default limiter for a dedicated 100/1000 carve-out:
|
||||
# the login SPA + flow-executor API burst on a cold load otherwise 429s into
|
||||
# a blank screen (see traefik middleware "authentik-rate-limit").
|
||||
skip_default_rate_limit = true
|
||||
extra_middlewares = ["traefik-authentik-rate-limit@kubernetescrd"]
|
||||
extra_annotations = {
|
||||
"gethomepage.dev/enabled" = "true"
|
||||
"gethomepage.dev/name" = "Authentik"
|
||||
|
|
@ -159,12 +149,5 @@ module "ingress-static" {
|
|||
tls_secret_name = var.tls_secret_name
|
||||
anti_ai_scraping = false
|
||||
homepage_enabled = false
|
||||
# /static serves ALL the SPA JS/CSS chunks; the default 10/50 limiter 429s the
|
||||
# cold-load fan-out → blank screen. Dedicated 100/1000 carve-out (note the two
|
||||
# namespaces: cache-headers is in ns authentik, rate-limit is in ns traefik).
|
||||
skip_default_rate_limit = true
|
||||
extra_middlewares = [
|
||||
"authentik-static-cache-headers@kubernetescrd",
|
||||
"traefik-authentik-rate-limit@kubernetescrd",
|
||||
]
|
||||
extra_middlewares = ["authentik-static-cache-headers@kubernetescrd"]
|
||||
}
|
||||
|
|
|
|||
|
|
@ -39,16 +39,6 @@ server:
|
|||
value: "3"
|
||||
- name: AUTHENTIK_WEB__THREADS
|
||||
value: "4"
|
||||
# Gunicorn worker recycle hardening (defaults max_requests=1000/jitter=50).
|
||||
# A worker recycle that coincides with a transient PG/pgbouncer blip stalls
|
||||
# in-flight requests (sessions+cache are on PostgreSQL since Redis was removed
|
||||
# in 2026.2), and with 9 workers recycling on a tight 50-jitter window the
|
||||
# recycles cluster — feeding the episodic all-pods-NotReady 502/504 cascade.
|
||||
# 10x rarer recycles + 20x wider jitter (1000) decorrelate them from DB blips.
|
||||
- name: AUTHENTIK_WEB__MAX_REQUESTS
|
||||
value: "10000"
|
||||
- name: AUTHENTIK_WEB__MAX_REQUESTS_JITTER
|
||||
value: "1000"
|
||||
# Cache flow plans for 30m and policy evaluations for 15m (defaults 300s).
|
||||
# Authentik 2026.2 stores cache in Postgres, so a TTL hit is still a
|
||||
# SELECT — but a single indexed lookup beats re-planning the flow
|
||||
|
|
@ -97,28 +87,11 @@ server:
|
|||
livenessProbe:
|
||||
failureThreshold: 6
|
||||
timeoutSeconds: 5
|
||||
# Readiness widened from the chart default (3x10s/3s ~= 30s) to ~80s. The
|
||||
# readiness probe (/-/health/ready/) queries the DB, so a sub-~60s PG/pgbouncer
|
||||
# transient otherwise returns 503 and drops ALL 3 server pods from the Service
|
||||
# at once -> Traefik has no healthy backend -> 502/504 (the episodic blank
|
||||
# screen + 30s hang). 80s absorbs a full CNPG failover reconnect; liveness
|
||||
# still reaps a truly hung pod. Partial override — the chart deep-merges the
|
||||
# httpGet path /-/health/ready/ (same as the livenessProbe override above).
|
||||
readinessProbe:
|
||||
failureThreshold: 8
|
||||
periodSeconds: 10
|
||||
timeoutSeconds: 5
|
||||
# RollingUpdate strategy. The chart key is `deploymentStrategy`, NOT `strategy`
|
||||
# (authentik.server reads .Values.server.deploymentStrategy) — the old
|
||||
# `strategy:` key was silently ignored, so live ran the chart default 25%/25%
|
||||
# and every rolling event dropped a server pod out of rotation, amplifying the
|
||||
# NotReady cascade. maxSurge:1 + maxUnavailable:0 keeps all 3 ready throughout
|
||||
# a roll (PDB minAvailable:2 + ResourceQuota headroom allow the transient pod).
|
||||
deploymentStrategy:
|
||||
strategy:
|
||||
type: RollingUpdate
|
||||
rollingUpdate:
|
||||
maxSurge: 1
|
||||
maxUnavailable: 0
|
||||
maxSurge: 0
|
||||
maxUnavailable: 1
|
||||
resources:
|
||||
requests:
|
||||
cpu: 100m
|
||||
|
|
@ -145,23 +118,15 @@ server:
|
|||
global:
|
||||
addPrometheusAnnotations: true
|
||||
image:
|
||||
# CUSTOM OVERLAY: two thin patches over the official authentik server image
|
||||
# (see stacks/authentik/Dockerfile): (1) SLOW-1a — narrows the login-flow
|
||||
# select_subclasses() query, ~1.5s -> ~14ms; (2) serve authentik's no-JS SFE
|
||||
# login to old Safari/WebKit AND any iOS browser (Chrome/Firefox = WebKit) on
|
||||
# iOS<=16.3 so old devices (e.g. iPadOS<=15) get a working login instead of a
|
||||
# blank page, and injects social-login links into the SFE (it can't render
|
||||
# sources; needed for password-less Google-only accounts). Built by
|
||||
# .github/workflows/build-authentik.yml to ghcr.io/viktorbarzin/authentik-server
|
||||
# (public package, anonymous pull — no imagePullSecret needed, like the
|
||||
# upstream goauthentik image). Keel is NO LONGER enrolled for this namespace
|
||||
# (see main.tf) so it can't bump/downgrade the tag; helm also defaults the tag
|
||||
# to the chart appVersion (2026.2.2) — so BOTH repository AND tag are pinned
|
||||
# explicitly here to prevent the 2026-06-10 downgrade-boot-storm class.
|
||||
# UPGRADE = bump the Dockerfile FROM tag + this tag together (e.g. ->
|
||||
# 2026.3.0-patch1), let GHA rebuild, then apply.
|
||||
repository: ghcr.io/viktorbarzin/authentik-server
|
||||
tag: "2026.2.4-patch3"
|
||||
# Pin to the Keel-managed live tag. Keel (diun-annotated, keel.sh/enrolled
|
||||
# namespace) bumps the IMAGE between chart releases, while helm defaults
|
||||
# the tag to the chart appVersion — so any helm upgrade silently
|
||||
# DOWNGRADES the running pods to the chart pin (2026-06-10: a values-only
|
||||
# apply rolled live 2026.2.4 back to 2026.2.2 against a 2026.2.4-migrated
|
||||
# DB → boot storm, see docs/post-mortems/2026-06-10-authentik-downgrade-
|
||||
# boot-storm.md). Keep this tag in sync with what Keel has deployed when
|
||||
# touching this chart; clear it only when bumping the chart version itself.
|
||||
tag: "2026.2.4"
|
||||
|
||||
worker:
|
||||
# 2 replicas: workers handle background tasks (LDAP sync, email,
|
||||
|
|
@ -201,10 +166,7 @@ worker:
|
|||
secretKeyRef:
|
||||
name: authentik-email
|
||||
key: AUTHENTIK_EMAIL__PASSWORD
|
||||
# Chart key is `deploymentStrategy`, not `strategy` (see server above). Workers
|
||||
# serve no user traffic, so maxSurge:0/maxUnavailable:1 is fine — this is just
|
||||
# the dead-key cleanup so the declared intent actually takes effect.
|
||||
deploymentStrategy:
|
||||
strategy:
|
||||
type: RollingUpdate
|
||||
rollingUpdate:
|
||||
maxSurge: 0
|
||||
|
|
|
|||
|
|
@ -1,96 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""Overlay patch — make authentik usable on OLD browsers (no modern-JS SPA).
|
||||
|
||||
authentik's modern flow SPA is ES2022 (static{} init blocks) that hard-fail on
|
||||
Safari/WebKit <= 16.3 (e.g. iPadOS <= 16.3) and render a COMPLETELY BLANK login.
|
||||
authentik ships a no-JS Simplified Flow Executor (SFE, ES5) but only serves it to
|
||||
IE / old-Edge / PKeyAuth, and the SFE itself canNOT render Identification-stage
|
||||
sources (social-login buttons) — authentik docs list "Sources" as unsupported.
|
||||
|
||||
This patch does TWO things, both guarded (assert the upstream anchor + verify the
|
||||
result) so the image build fails LOUDLY if upstream moves. RE-VERIFY on every
|
||||
authentik upgrade.
|
||||
|
||||
1. flows/views/interface.py::compat_needs_sfe() -> also return True for old
|
||||
Safari/WebKit: (a) Safari/Mobile Safari Version <= 16.3 (covers desktop-mode
|
||||
iPadOS which reports as Mac Safari), and (b) ANY iOS browser (Chrome/CriOS,
|
||||
Firefox/FxiOS, Edge — all share the system WebKit) on iOS <= 16.3. So old
|
||||
iPads get the SFE on EVERY browser, not just Safari.
|
||||
|
||||
2. flows/templates/if/flow-sfe.html -> inject static social-login <a> links
|
||||
(plain redirects to /source/oauth/login/<slug>/, work on ANY browser) so SFE
|
||||
users (who otherwise see only username/password) can use social login —
|
||||
required for accounts with no password (e.g. Google-only users like emo).
|
||||
"""
|
||||
import ast
|
||||
import glob
|
||||
import os
|
||||
|
||||
# --- Patch 1: compat_needs_sfe() UA gate -------------------------------------
|
||||
INTERFACE = "/authentik/flows/views/interface.py"
|
||||
ANCHOR = (
|
||||
' if "PKeyAuth" in ua["string"]:\n'
|
||||
" return True\n"
|
||||
" return False"
|
||||
)
|
||||
REPLACEMENT = (
|
||||
' if "PKeyAuth" in ua["string"]:\n'
|
||||
" return True\n"
|
||||
" # OVERLAY: old WebKit can't parse the modern ES2022 flow SPA (blank\n"
|
||||
" # login) -> serve the SFE (real authentik login). (a) desktop-mode\n"
|
||||
" # Safari/iPadOS reports as Mac Safari with Version<=16.3:\n"
|
||||
' if ua["user_agent"]["family"] in ("Safari", "Mobile Safari"):\n'
|
||||
" try:\n"
|
||||
' _maj = int(ua["user_agent"]["major"] or 0)\n'
|
||||
' _min = int(ua["user_agent"]["minor"] or 0)\n'
|
||||
" except (TypeError, ValueError):\n"
|
||||
" _maj = _min = 0\n"
|
||||
" if _maj and (_maj < 16 or (_maj == 16 and _min <= 3)):\n"
|
||||
" return True\n"
|
||||
" # (b) ANY iOS browser (Chrome/CriOS, Firefox/FxiOS, Edge) shares the\n"
|
||||
" # system WebKit, so iOS<=16.3 fails regardless of the browser family:\n"
|
||||
' if ua["os"]["family"] == "iOS":\n'
|
||||
" try:\n"
|
||||
' _omaj = int(ua["os"]["major"] or 0)\n'
|
||||
' _omin = int(ua["os"]["minor"] or 0)\n'
|
||||
" except (TypeError, ValueError):\n"
|
||||
" _omaj = _omin = 0\n"
|
||||
" if _omaj and (_omaj < 16 or (_omaj == 16 and _omin <= 3)):\n"
|
||||
" return True\n"
|
||||
" return False"
|
||||
)
|
||||
src = open(INTERFACE).read()
|
||||
assert "def compat_needs_sfe" in src, "compat_needs_sfe() not found — upstream changed"
|
||||
assert src.count(ANCHOR) == 1, f"anchor not found exactly once in {INTERFACE}"
|
||||
src = src.replace(ANCHOR, REPLACEMENT)
|
||||
open(INTERFACE, "w").write(src)
|
||||
ast.parse(src)
|
||||
assert 'ua["os"]["family"] == "iOS"' in open(INTERFACE).read()
|
||||
for pyc in glob.glob("/authentik/flows/views/__pycache__/interface.*.pyc"):
|
||||
os.remove(pyc)
|
||||
|
||||
# --- Patch 2: social-login links on the SFE shell ----------------------------
|
||||
SFE_HTML = "/authentik/flows/templates/if/flow-sfe.html"
|
||||
HTML_ANCHOR = (
|
||||
" </main>\n"
|
||||
" <span class=\"mt-3 mb-0 text-muted text-center\">{% trans 'Powered by authentik' %}</span>"
|
||||
)
|
||||
HTML_REPLACEMENT = (
|
||||
" </main>\n"
|
||||
" <!-- OVERLAY: the SFE can't render Identification-stage sources, so add\n"
|
||||
" static social-login links (plain redirects, work on any browser).\n"
|
||||
" Re-verify slugs on source changes; shown on all SFE flows. -->\n"
|
||||
' <div class="form-signin w-100 m-auto pt-2 mt-2 border-top">\n'
|
||||
' <a class="btn btn-outline-secondary w-100 mb-2" href="/source/oauth/login/google/">Continue with Google</a>\n'
|
||||
' <a class="btn btn-outline-secondary w-100 mb-2" href="/source/oauth/login/github/">Continue with GitHub</a>\n'
|
||||
' <a class="btn btn-outline-secondary w-100 mb-2" href="/source/oauth/login/facebook/">Continue with Facebook</a>\n'
|
||||
" </div>\n"
|
||||
" <span class=\"mt-3 mb-0 text-muted text-center\">{% trans 'Powered by authentik' %}</span>"
|
||||
)
|
||||
html = open(SFE_HTML).read()
|
||||
assert html.count(HTML_ANCHOR) == 1, f"SFE html anchor not found exactly once in {SFE_HTML}"
|
||||
html = html.replace(HTML_ANCHOR, HTML_REPLACEMENT)
|
||||
open(SFE_HTML, "w").write(html)
|
||||
assert "Continue with Google" in open(SFE_HTML).read()
|
||||
|
||||
print("patch-compat-sfe: SFE for old Safari + all iOS<=16.3; social-login links added to SFE")
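The version gate that the REPLACEMENT string splices into `compat_needs_sfe()` can be read in isolation. The following is a minimal sketch, not the upstream authentik code: the function names are hypothetical, and the `ua` dict shape (string version components under `user_agent` and `os`) is assumed from the keys the patch itself indexes.

```python
def _to_int(value):
    """Parse a UA version component; missing or malformed values become 0."""
    try:
        return int(value or 0)
    except (TypeError, ValueError):
        return 0


def _at_most_16_3(major, minor):
    """True for a known version <= 16.3; an unknown (0) major never matches."""
    return bool(major) and (major < 16 or (major == 16 and minor <= 3))


def old_webkit_needs_sfe(ua):
    """Hypothetical standalone mirror of the two overlay branches."""
    # (a) Safari / Mobile Safari identified by browser version.
    agent = ua.get("user_agent", {})
    if agent.get("family") in ("Safari", "Mobile Safari"):
        if _at_most_16_3(_to_int(agent.get("major")), _to_int(agent.get("minor"))):
            return True
    # (b) Any browser on iOS, identified by OS version (shared system WebKit).
    os_info = ua.get("os", {})
    if os_info.get("family") == "iOS":
        if _at_most_16_3(_to_int(os_info.get("major")), _to_int(os_info.get("minor"))):
            return True
    return False
```

Treating an unparseable major version as "unknown" rather than "old" keeps the gate conservative: a browser whose version cannot be read still gets the normal flow instead of being forced onto the SFE.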
|
||||
|
|
@ -601,9 +601,6 @@ resource "kubernetes_config_map" "beadboard_config" {
|
|||
# Pulls the claude-agent-service bearer token from Vault so BeadBoard can
|
||||
# dispatch agent jobs via the in-cluster HTTP API.
|
||||
resource "kubernetes_manifest" "beadboard_agent_service_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -28,9 +28,6 @@ resource "kubernetes_namespace" "broker_sync" {
|
|||
# trading212_api_keys — JSON array of {account_id, account_type, api_key, name, currency}
|
||||
# imap_host, imap_user, imap_password, imap_directory — for InvestEngine + Schwab email ingest
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -22,7 +22,7 @@ resource "kubernetes_namespace" "calico_system" {
|
|||
name = "calico-system"
|
||||
labels = {
|
||||
name = "calico-system"
|
||||
# calico-system namespace is managed by tigera-operator — auto-update is
|
||||
# calico-system namespace is managed by tigera-operator — auto-update is
|
||||
# incompatible (operator reverts DaemonSet image from its Installation CR).
|
||||
# "keel.sh/enrolled" = "true"
|
||||
}
|
||||
|
|
@ -212,229 +212,3 @@ resource "kubectl_manifest" "whisker" {
|
|||
spec = { notifications = "Disabled" }
|
||||
})
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Gated public ingress for the Whisker UI (infra #57 / ADR-0014).
|
||||
#
|
||||
# whisker.viktorbarzin.me -> whisker:8081, Authentik-gated (auth="required":
|
||||
# Whisker ships NO own login — it's an admin observability UI, so Authentik
|
||||
# forward-auth is the only gate between strangers and the flow view). The
|
||||
# operator replicated `tls-secret` into calico-system already.
|
||||
#
|
||||
# TWO coupled pieces are required because the operator's own `whisker`
|
||||
# NetworkPolicy (owned by the Whisker CR above) sets policyTypes:[Ingress]
|
||||
# with NO ingress rules => default-deny on ingress to the whisker pod. The
|
||||
# additive NP below ORs in a Traefik allow (k8s NetworkPolicies are additive
|
||||
# across policies selecting the same pod), so we never edit the operator NP.
|
||||
module "ingress_whisker" {
|
||||
source = "../../modules/kubernetes/ingress_factory"
|
||||
dns_type = "proxied"
|
||||
namespace = "calico-system"
|
||||
name = "whisker"
|
||||
service_name = "whisker"
|
||||
port = 8081
|
||||
auth = "required"
|
||||
tls_secret_name = "tls-secret"
|
||||
extra_annotations = {
|
||||
"gethomepage.dev/enabled" = "true"
|
||||
"gethomepage.dev/name" = "Whisker"
|
||||
"gethomepage.dev/description" = "Calico flow observability (who-talks-to-whom)"
|
||||
"gethomepage.dev/icon" = "calico.png"
|
||||
"gethomepage.dev/group" = "Infrastructure"
|
||||
}
|
||||
}
|
||||
|
||||
# Additive NetworkPolicy: permit Traefik -> whisker:8081. ORs with the
|
||||
# operator's default-deny `whisker` NP (selecting the same pod) so Traefik
|
||||
# can reach the UI without touching the operator-owned policy.
|
||||
resource "kubernetes_network_policy_v1" "whisker_allow_traefik" {
|
||||
metadata {
|
||||
name = "whisker-allow-traefik"
|
||||
namespace = "calico-system"
|
||||
}
|
||||
spec {
|
||||
pod_selector {
|
||||
match_labels = {
|
||||
"app.kubernetes.io/name" = "whisker"
|
||||
}
|
||||
}
|
||||
policy_types = ["Ingress"]
|
||||
ingress {
|
||||
from {
|
||||
namespace_selector {
|
||||
match_labels = {
|
||||
"kubernetes.io/metadata.name" = "traefik"
|
||||
}
|
||||
}
|
||||
}
|
||||
ports {
|
||||
port = "8081"
|
||||
protocol = "TCP"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# Additive egress NetworkPolicy: permit whisker -> the kube-dns ClusterIP for DNS.
|
||||
#
|
||||
# ROOT CAUSE of the 2026-06-28 "Whisker UI empty" incident: the operator's own
|
||||
# `whisker` NetworkPolicy is policyTypes:[Ingress,Egress] and its egress allows
|
||||
# DNS only to the kube-dns *pods* (podSelector k8s-app=kube-dns). But
|
||||
# whisker-backend resolves `goldmane...svc` via the kube-dns *ClusterIP*
|
||||
# (10.96.0.10), and Calico drops UDP DNS to a ClusterIP under a podSelector-only
|
||||
# egress rule (verified: from whisker's netns, ClusterIP DNS = 100% timeout
|
||||
# while direct kube-dns pod-IP DNS = OK; a pod with no egress policy resolves
|
||||
# fine). whisker-backend resolves once in the brief startup window before the
|
||||
# policy programs, establishes its long-lived gRPC stream, and only re-resolves
|
||||
# when that stream breaks — at which point the blocked ClusterIP DNS wedges its
|
||||
# Go resolver and the UI goes empty (the durable aggregator, in its own
|
||||
# unrestricted namespace, is unaffected). k8s egress policies are additive, so
|
||||
# this ORs in an allow for the ClusterIP; the operator NP is left untouched.
|
||||
# (Empirically: adding this ipBlock rule flips ClusterIP DNS from 100% fail to
|
||||
# 100% ok.) See docs/runbooks/goldmane-flow-trail.md.
|
||||
resource "kubernetes_network_policy_v1" "whisker_allow_dns_clusterip" {
|
||||
metadata {
|
||||
name = "whisker-allow-dns-clusterip"
|
||||
namespace = "calico-system"
|
||||
}
|
||||
spec {
|
||||
pod_selector {
|
||||
match_labels = {
|
||||
"app.kubernetes.io/name" = "whisker"
|
||||
}
|
||||
}
|
||||
policy_types = ["Egress"]
|
||||
egress {
|
||||
# 10.96.0.10 is the kube-dns ClusterIP (cluster invariant — service CIDR
|
||||
# 10.96.0.0/12, DNS always .10; the same IP CoreDNS/Technitium configs pin).
|
||||
to {
|
||||
ip_block {
|
||||
cidr = "10.96.0.10/32"
|
||||
}
|
||||
}
|
||||
ports {
|
||||
port = "53"
|
||||
protocol = "UDP"
|
||||
}
|
||||
ports {
|
||||
port = "53"
|
||||
protocol = "TCP"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
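The comment above relies on NetworkPolicies being additive: traffic is allowed if any rule of any policy selecting the pod allows it. A toy model of that OR semantics, under loud assumptions: the operator's podSelector rule is approximated by a made-up kube-dns pod IP (`10.10.1.5`), the dict layout is hypothetical, and none of this reflects how Calico actually programs the dataplane.

```python
import ipaddress

DNS_PORTS = {(53, "UDP"), (53, "TCP")}

# Hypothetical stand-in for the operator policy: DNS only to the kube-dns pod IP.
OPERATOR_POLICY = {"egress": [{"cidrs": ["10.10.1.5/32"], "ports": DNS_PORTS}]}
# Stand-in for whisker_allow_dns_clusterip: DNS to the kube-dns ClusterIP.
CLUSTERIP_DNS_POLICY = {"egress": [{"cidrs": ["10.96.0.10/32"], "ports": DNS_PORTS}]}


def egress_allowed(policies, dest_ip, port, protocol):
    """Allowed if ANY rule of ANY selecting policy matches (policies are OR-ed)."""
    dest = ipaddress.ip_address(dest_ip)
    for policy in policies:
        for rule in policy["egress"]:
            in_cidr = any(dest in ipaddress.ip_network(c) for c in rule["cidrs"])
            if in_cidr and (port, protocol) in rule["ports"]:
                return True
    return False
```

In this model, adding `CLUSTERIP_DNS_POLICY` next to `OPERATOR_POLICY` turns ClusterIP DNS from denied to allowed without changing what the operator policy already permits, which is the property the additive resource depends on.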
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Whisker self-heal watchdog (ADR-0014; added 2026-06-28 after a live incident).
|
||||
#
|
||||
# BACKSTOP. The REAL fix is kubernetes_network_policy_v1.whisker_allow_dns_clusterip
|
||||
# above (it unblocks the root-cause ClusterIP DNS). This watchdog stays as
|
||||
# defense-in-depth: whisker-backend has NO operator liveness probe, so if its
|
||||
# long-lived goldmane gRPC stream ever wedges for any OTHER reason (the Go
|
||||
# resolver spams `failed to stream flows` / `code = Unavailable` and never
|
||||
# reconnects -> empty UI, while the durable aggregator in its own namespace is
|
||||
# unaffected), nothing else would restart it. Whisker is operator-managed
|
||||
# (Whisker CR) so we can't inject a probe; this is the supported-pattern
|
||||
# alternative. With the DNS fix in place it should rarely, if ever, fire.
|
||||
#
|
||||
# It restarts the pod ONLY when the wedged signature is present AND Goldmane is
|
||||
# Ready (so a real Goldmane outage doesn't cause restart-thrash). A fresh pod
|
||||
# reconnects cleanly. See docs/runbooks/goldmane-flow-trail.md.
|
||||
resource "kubernetes_service_account" "whisker_watchdog" {
|
||||
metadata {
|
||||
name = "whisker-watchdog"
|
||||
namespace = kubernetes_namespace.calico_system.metadata[0].name
|
||||
}
|
||||
}
|
||||
|
||||
# Namespaced Role (least privilege — only calico-system): read pod logs to
|
||||
# detect the wedge, delete the whisker pod to heal it.
|
||||
resource "kubernetes_role" "whisker_watchdog" {
|
||||
metadata {
|
||||
name = "whisker-watchdog"
|
||||
namespace = kubernetes_namespace.calico_system.metadata[0].name
|
||||
}
|
||||
rule {
|
||||
api_groups = [""]
|
||||
resources = ["pods"]
|
||||
verbs = ["get", "list", "delete"]
|
||||
}
|
||||
rule {
|
||||
api_groups = [""]
|
||||
resources = ["pods/log"]
|
||||
verbs = ["get"]
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_role_binding" "whisker_watchdog" {
|
||||
metadata {
|
||||
name = "whisker-watchdog"
|
||||
namespace = kubernetes_namespace.calico_system.metadata[0].name
|
||||
}
|
||||
role_ref {
|
||||
api_group = "rbac.authorization.k8s.io"
|
||||
kind = "Role"
|
||||
name = kubernetes_role.whisker_watchdog.metadata[0].name
|
||||
}
|
||||
subject {
|
||||
kind = "ServiceAccount"
|
||||
name = kubernetes_service_account.whisker_watchdog.metadata[0].name
|
||||
namespace = kubernetes_namespace.calico_system.metadata[0].name
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_cron_job_v1" "whisker_watchdog" {
|
||||
metadata {
|
||||
name = "whisker-watchdog"
|
||||
namespace = kubernetes_namespace.calico_system.metadata[0].name
|
||||
}
|
||||
spec {
|
||||
schedule = "*/10 * * * *"
|
||||
successful_jobs_history_limit = 1
|
||||
failed_jobs_history_limit = 1
|
||||
concurrency_policy = "Forbid"
|
||||
job_template {
|
||||
metadata {
|
||||
name = "whisker-watchdog"
|
||||
}
|
||||
spec {
|
||||
template {
|
||||
metadata {
|
||||
name = "whisker-watchdog"
|
||||
}
|
||||
spec {
|
||||
service_account_name = kubernetes_service_account.whisker_watchdog.metadata[0].name
|
||||
container {
|
||||
name = "watchdog"
|
||||
image = "bitnami/kubectl:latest"
|
||||
command = ["/bin/sh", "-c", <<-EOT
|
||||
set -eu
|
||||
NS=calico-system
|
||||
# Don't thrash if Goldmane itself is down — that's not a whisker bug.
|
||||
if ! kubectl -n "$NS" get pod -l k8s-app=goldmane \
|
||||
-o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}' 2>/dev/null | grep -q True; then
|
||||
echo "goldmane not Ready — skipping (not a whisker problem)"; exit 0
|
||||
fi
|
||||
ERRS=$(kubectl -n "$NS" logs -l k8s-app=whisker -c whisker-backend --since=11m --tail=500 2>/dev/null \
|
||||
| grep -cE 'failed to stream flows|failed to list filter hints|code = Unavailable|i/o timeout' || true)
|
||||
ERRS=$${ERRS:-0}
|
||||
if [ "$ERRS" -ge 10 ]; then
|
||||
echo "whisker-backend WEDGED: $ERRS goldmane-connection errors in 11m — restarting whisker pod"
|
||||
kubectl -n "$NS" delete pod -l k8s-app=whisker --ignore-not-found
|
||||
else
|
||||
echo "whisker-backend healthy: $ERRS goldmane-connection errors in 11m"
|
||||
fi
|
||||
EOT
|
||||
]
|
||||
}
|
||||
restart_policy = "Never"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
||||
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
|
||||
}
|
||||
}
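The restart decision in the watchdog's shell script reduces to two checks: Goldmane must be Ready, and the recent whisker-backend log must contain at least 10 lines matching a wedge signature. A minimal sketch of that logic, assuming the log lines and the readiness flag are already available (the function names are hypothetical; the patterns and threshold are copied from the script):

```python
import re

# Same alternation the script passes to `grep -cE`.
WEDGE_PATTERN = re.compile(
    r"failed to stream flows|failed to list filter hints|code = Unavailable|i/o timeout"
)
ERROR_THRESHOLD = 10


def count_wedge_errors(log_lines):
    """Count matching lines, not matches, which is what `grep -c` reports."""
    return sum(1 for line in log_lines if WEDGE_PATTERN.search(line))


def should_restart(goldmane_ready, log_lines, threshold=ERROR_THRESHOLD):
    """Restart only when Goldmane is Ready AND the wedge signature is present."""
    if not goldmane_ready:
        # A real Goldmane outage is not a whisker problem; avoid restart-thrash.
        return False
    return count_wedge_errors(log_lines) >= threshold
```

Checking Goldmane first means an upstream outage, which produces the same error lines, never triggers a pod delete.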
|
||||
|
|
|
|||
|
|
@ -19,9 +19,6 @@ resource "kubernetes_namespace" "changedetection" {
|
|||
}
|
||||
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -19,14 +19,14 @@ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do
|
|||
sleep 2
|
||||
done
|
||||
|
||||
# Both x11vnc and websockify run as supervised children of this entrypoint (PID
|
||||
# 1) so their logs land on container stdout and the `wait -n` at the end can catch
|
||||
# either one dying. `-noshm` skips MIT-SHM probes that fail across container
|
||||
# boundaries (each container has its own /dev/shm); `-noxdamage` skips XDAMAGE
|
||||
# which Xvfb doesn't expose; `-quiet` keeps the polling chatter out of pod logs.
|
||||
# websockify runs as PID 1; x11vnc is a child so its logs land on container stdout.
|
||||
# `-noshm` skips MIT-SHM probes that fail across container boundaries (each
|
||||
# container has its own /dev/shm); `-noxdamage` skips XDAMAGE which Xvfb
|
||||
# doesn't expose; `-quiet` keeps the polling chatter out of pod logs.
|
||||
echo "starting x11vnc -> :5900"
|
||||
x11vnc -display localhost:99 -nopw -listen 0.0.0.0 -rfbport 5900 \
|
||||
-forever -shared -noshm -noxdamage -quiet 2>&1 &
|
||||
X11VNC_PID=$!
|
||||
|
||||
for i in 1 2 3 4 5 6 7 8 9 10; do
|
||||
if echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
|
||||
|
|
@ -43,18 +43,4 @@ if ! echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
|
|||
fi
|
||||
|
||||
echo "starting websockify -> :6080"
|
||||
# Run websockify in the background (it was `exec`ed before) so BOTH it and x11vnc
|
||||
# are supervised. x11vnc attaches to the chrome-service container's Xvfb over
|
||||
# localhost:6099 (shared pod network); when that container restarts, x11vnc loses
|
||||
# its X connection and exits. Previously websockify was PID 1 and x11vnc was an
|
||||
# unsupervised child, so a dead x11vnc was never relaunched: :5900 stayed dead and
|
||||
# the noVNC view went black until a manual pod restart. Now if EITHER process
|
||||
# exits, `wait -n` returns and we exit non-zero so the kubelet restarts this
|
||||
# container, which re-waits for Xvfb and relaunches x11vnc — the bridge self-heals
|
||||
# across browser-container restarts. (Same supervision pattern as the
|
||||
# android-emulator stack's entrypoint.)
|
||||
websockify --web=/usr/share/novnc 6080 localhost:5900 &
|
||||
|
||||
wait -n || true
|
||||
echo "novnc: a supervised process (x11vnc or websockify) exited; exiting so the kubelet restarts this container." >&2
|
||||
exit 1
|
||||
exec websockify --web=/usr/share/novnc 6080 localhost:5900
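The difference between the two entrypoint variants in this hunk is whether both processes are supervised. The "exit when either child exits" pattern described in the comments (`wait -n`, then exit non-zero) can be sketched in Python as follows. This is an illustration of the pattern only, not part of the image; the function names and the polling approach are assumptions.

```python
import subprocess
import sys
import time


def wait_for_first_exit(procs, poll_interval=0.05):
    """Block until any supervised child exits; return its name (like `wait -n`)."""
    while True:
        for name, proc in procs.items():
            if proc.poll() is not None:
                return name
        time.sleep(poll_interval)


def supervise(commands):
    """Start every command, wait for the first exit, stop the rest, return the exited name."""
    procs = {name: subprocess.Popen(cmd) for name, cmd in commands.items()}
    try:
        return wait_for_first_exit(procs)
    finally:
        # Tear down the survivors so the caller can exit and be restarted cleanly.
        for proc in procs.values():
            if proc.poll() is None:
                proc.kill()
            proc.wait()
```

A caller would log the returned name and exit with a non-zero status, leaving the restart to the container runtime, as the supervised variant of the entrypoint does.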
|
||||
|
|
|
|||
|
|
@ -41,9 +41,6 @@ resource "kubernetes_namespace" "chrome_service" {
|
|||
# --- Secrets (single-key extract: api_bearer_token) ---
|
||||
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
@ -333,23 +330,15 @@ resource "kubernetes_deployment" "chrome_service" {
|
|||
container {
|
||||
name = "novnc"
|
||||
# Phase 3 cutover 2026-05-07 — Forgejo registry consolidation.
|
||||
# SHA-pinned (not :latest): Keel is OFF for this deployment
|
||||
# (keel.sh/policy=never, below) and :latest/IfNotPresent won't re-pull a
|
||||
# rebuilt image, so a new noVNC entrypoint only deploys when this digest
|
||||
# is bumped here. Bump after build-chrome-service-novnc.yml pushes a new
|
||||
# SHA tag — then WAIT for that apply pipeline to finish before pushing
|
||||
# anything else: Woodpecker cancel-previous SIGKILLs an in-flight apply
|
||||
# mid-run (memory id=1957), which is exactly how the 2026-06-27 apply got
|
||||
# killed. 2026-06-27: bumped to land the x11vnc-supervision self-heal fix
|
||||
# (noVNC went black after a browser-container restart; see
|
||||
# docs/architecture/chrome-service.md "x11vnc supervision").
|
||||
image = "ghcr.io/viktorbarzin/chrome-service-novnc:19d0f0933a8ec75be6cfa077db88e0f8c3760f40"
|
||||
image = "ghcr.io/viktorbarzin/chrome-service-novnc:latest"
|
||||
image_pull_policy = "IfNotPresent"
|
||||
# Cap RLIMIT_NOFILE before the entrypoint runs. Containerd grants pods
|
||||
# nofile=2^31; x11vnc sweeps the whole fd table on each client connect,
|
||||
# so every VNC connection hangs on "Connecting" until it times out
|
||||
# (fd-sweep bug, same as android-emulator). entrypoint.sh also sets this;
|
||||
# the wrapper keeps the cap deterministic even off a cached image.
|
||||
# (fd-sweep bug, same as android-emulator). entrypoint.sh now also sets
|
||||
# this, but the image is :latest/IfNotPresent so a rebuilt entrypoint
|
||||
# isn't guaranteed to be pulled — this wrapper applies the cap
|
||||
# deterministically on every rollout off the cached image.
|
||||
command = ["bash", "-c", "ulimit -n 65536; exec /entrypoint.sh"]
|
||||
port {
|
||||
name = "http"
|
||||
|
|
@ -359,13 +348,9 @@ resource "kubernetes_deployment" "chrome_service" {
|
|||
# x11vnc connects to the chrome-service container's Xvfb over
|
||||
# localhost TCP (shared pod network). Same uid 1000 as chrome
|
||||
# container so we can read MIT-MAGIC-COOKIE if Xvfb adds one.
|
||||
# 256Mi (was 96Mi): the 96Mi cap OOMKilled (exit 137) the sidecar under
|
||||
# ACTIVE VNC use — x11vnc + websockify framebuffer/encode buffers spike
|
||||
# well past idle (~37Mi) when a client streams the 1280x720 screen, so the
|
||||
# noVNC view froze/hung on connect. Bumped 2026-06-28.
|
||||
resources {
|
||||
requests = { cpu = "10m", memory = "64Mi" }
|
||||
limits = { memory = "256Mi" }
|
||||
requests = { cpu = "10m", memory = "32Mi" }
|
||||
limits = { memory = "96Mi" }
|
||||
}
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -1,95 +0,0 @@
|
|||
# emo's hands-off "homelab browser" credential + chrome-service port-forward RBAC.
|
||||
#
|
||||
# Access decision (2026-06-28, Viktor's explicit call): emo SHARES Viktor's single
|
||||
# chrome-service browser rather than getting an isolated instance. The noVNC half of
|
||||
# that grant is the Authentik allowlist in
|
||||
# stacks/authentik/admin-services-restriction.tf (CHROME_ALLOWED); THIS file is the
|
||||
# CLI half — it lets emo's `homelab browser` reach the headed Chrome over CDP.
|
||||
#
|
||||
# `homelab browser` shells out to `kubectl port-forward -n chrome-service svc/chrome-service`
|
||||
# (cli/browser.go). emo's normal kubeconfig is interactive-OIDC-only (kubelogin) and
|
||||
# can't authenticate a headless agent session, and his power-user tier has no
|
||||
# pods/portforward. So we mint a dedicated ServiceAccount with a long-lived token
|
||||
# (the dashboard-sa.tf pattern) that the devvm provisioner installs as emo's DEFAULT
|
||||
# kubeconfig context (scripts/t3-provision-users.sh install_browser_kubeconfig); his
|
||||
# personal OIDC login stays available as the `oidc@homelab` named context.
|
||||
#
|
||||
# TRADE-OFF (accepted): CDP access == full control of the shared browser, including
|
||||
# the persistent profile (browser.contexts[0]) where Viktor's warmed logins live.
|
||||
# CDP has no per-context auth, so this SA can reach Viktor's sessions. That is inherent
|
||||
# to sharing one browser (the isolated per-user instance was declined).
|
||||
# See docs/architecture/chrome-service.md "Multi-user access".
|
||||
|
||||
resource "kubernetes_service_account" "emo_browser" {
|
||||
metadata {
|
||||
name = "emo-browser"
|
||||
namespace = kubernetes_namespace.chrome_service.metadata[0].name
|
||||
}
|
||||
}
|
||||
|
||||
# Long-lived (non-expiring) token for the SA — the devvm provisioner reads this and
|
||||
# writes it into emo's kubeconfig. Same pattern as stacks/rbac/.../dashboard-sa.tf.
|
||||
resource "kubernetes_secret" "emo_browser_token" {
|
||||
metadata {
|
||||
name = "emo-browser-token"
|
||||
namespace = kubernetes_namespace.chrome_service.metadata[0].name
|
||||
annotations = {
|
||||
"kubernetes.io/service-account.name" = kubernetes_service_account.emo_browser.metadata[0].name
|
||||
}
|
||||
}
|
||||
type = "kubernetes.io/service-account-token"
|
||||
wait_for_service_account_token = true
|
||||
}
|
||||
|
||||
# The ONLY verb emo's SA lacks for `kubectl port-forward svc/chrome-service`: the
|
||||
# port-forward subresource. (get/list of pods + services + endpoints comes from the
|
||||
# cluster-read binding below.) Namespace-scoped to chrome-service.
|
||||
resource "kubernetes_role" "browser_portforward" {
|
||||
metadata {
|
||||
name = "chrome-service-portforward"
|
||||
namespace = kubernetes_namespace.chrome_service.metadata[0].name
|
||||
}
|
||||
rule {
|
||||
api_groups = [""]
|
||||
resources = ["pods/portforward"]
|
||||
verbs = ["create"]
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_role_binding" "emo_browser_portforward" {
|
||||
metadata {
|
||||
name = "emo-browser-portforward"
|
||||
namespace = kubernetes_namespace.chrome_service.metadata[0].name
|
||||
}
|
||||
role_ref {
|
||||
api_group = "rbac.authorization.k8s.io"
|
||||
kind = "Role"
|
||||
name = kubernetes_role.browser_portforward.metadata[0].name
|
||||
}
|
||||
subject {
|
||||
kind = "ServiceAccount"
|
||||
name = kubernetes_service_account.emo_browser.metadata[0].name
|
||||
namespace = kubernetes_namespace.chrome_service.metadata[0].name
|
||||
}
|
||||
}
|
||||
|
||||
# Cluster-wide read-only (NO secrets), mirroring emo's power-user OIDC access, bound
|
||||
# to the SA. Needed because the SA becomes emo's DEFAULT kubectl context, so without
|
||||
# this his everyday `kubectl get ...` would regress — AND port-forward itself needs
|
||||
# get/list on services + pods + endpoints (all covered by oidc-power-user-readonly).
|
||||
# That ClusterRole is defined in stacks/rbac (modules/rbac/main.tf); referenced by name.
|
||||
resource "kubernetes_cluster_role_binding" "emo_browser_readonly" {
|
||||
metadata {
|
||||
name = "emo-browser-readonly"
|
||||
}
|
||||
role_ref {
|
||||
api_group = "rbac.authorization.k8s.io"
|
||||
kind = "ClusterRole"
|
||||
name = "oidc-power-user-readonly"
|
||||
}
|
||||
subject {
|
||||
kind = "ServiceAccount"
|
||||
name = kubernetes_service_account.emo_browser.metadata[0].name
|
||||
namespace = kubernetes_namespace.chrome_service.metadata[0].name
|
||||
}
|
||||
}
|
||||
|
|
@ -49,9 +49,6 @@ resource "kubernetes_namespace" "ci_pipeline_health" {
|
|||
# billing on PRIVATE mirrors, which a future scoped read:packages rotation of
|
||||
# the alias could not do. Blast radius = this single-CronJob namespace.
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -38,9 +38,6 @@ resource "kubernetes_namespace" "claude_agent" {
|
|||
# --- Secrets ---
|
||||
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -57,9 +57,6 @@ resource "kubernetes_service_account" "breakglass" {
|
|||
# DENIED this path (see stacks/vault/main.tf) so the shared, prompt-injectable
|
||||
# pod can never read it.
|
||||
resource "kubernetes_manifest" "external_secret_ssh" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
@ -85,9 +82,6 @@ resource "kubernetes_manifest" "external_secret_ssh" {
|
|||
# Env secrets: the Anthropic OAuth token (shared with claude-agent-service —
|
||||
# same account) and the app bearer token (in-cluster/CLI fallback caller auth).
|
||||
resource "kubernetes_manifest" "external_secret_env" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -29,9 +29,6 @@ resource "kubernetes_namespace" "claude-memory" {
|
|||
}
|
||||
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
@ -60,9 +57,6 @@ resource "kubernetes_manifest" "external_secret" {
|
|||
|
||||
# DB credentials from Vault database engine (rotated every 24h)
|
||||
resource "kubernetes_manifest" "db_external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -5,9 +5,6 @@ variable "tls_secret_name" {
|
|||
variable "public_ip" { type = string }
|
||||
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -23,9 +23,6 @@ resource "kubernetes_namespace" "dawarich" {
|
|||
}
|
||||
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -745,10 +745,7 @@ resource "kubernetes_deployment" "phpmyadmin" {
|
|||
labels = {
|
||||
"app" = "phpmyadmin"
|
||||
tier = var.tier
|
||||
# ADR-0014 service identity: dbaas is a multi-Service namespace, so the
|
||||
# namespace alone can't attribute Goldmane flows. Value = the fronting
|
||||
# Service name (kubernetes_service.phpmyadmin is named "pma").
|
||||
"service-identity" = "pma"
|
||||
|
||||
}
|
||||
annotations = {
|
||||
"reloader.stakater.com/search" = "true"
|
||||
|
|
@ -765,10 +762,6 @@ resource "kubernetes_deployment" "phpmyadmin" {
|
|||
metadata {
|
||||
labels = {
|
||||
"app" = "phpmyadmin"
|
||||
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
|
||||
# disambiguating identity must live on the pod template (not just
|
||||
# the Deployment metadata above). Not in selector → no replace.
|
||||
"service-identity" = "pma"
|
||||
}
|
||||
}
|
||||
spec {
|
||||
|
|
@ -819,19 +812,8 @@ resource "kubernetes_deployment" "phpmyadmin" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [
|
||||
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
||||
# This Deployment is Keel-enrolled (keel.sh/policy=patch). Ignore the
|
||||
# attributes Keel/Kyverno mutate at runtime so `terragrunt apply` (incl.
|
||||
# the daily drift plan) doesn't fight them or revert the live image —
|
||||
# canonical KEEL/KYVERNO lifecycle guard, matches linkwarden/chrome-service.
|
||||
metadata[0].annotations["keel.sh/policy"],
|
||||
metadata[0].annotations["keel.sh/trigger"],
|
||||
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
|
||||
metadata[0].annotations["keel.sh/match-tag"],
|
||||
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
|
||||
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
|
||||
]
|
||||
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
}
|
||||
}
|
||||
|
||||
|
|
@ -1517,10 +1499,6 @@ resource "kubernetes_deployment" "pgadmin" {
|
|||
}
|
||||
labels = {
|
||||
tier = var.tier
|
||||
# ADR-0014 service identity: dbaas is a multi-Service namespace, so the
|
||||
# namespace alone can't attribute Goldmane flows. Value = the fronting
|
||||
# Service name (kubernetes_service.pgadmin is named "pgadmin").
|
||||
"service-identity" = "pgadmin"
|
||||
}
|
||||
}
|
||||
spec {
|
||||
|
|
@ -1536,10 +1514,6 @@ resource "kubernetes_deployment" "pgadmin" {
|
|||
metadata {
|
||||
labels = {
|
||||
app = "pgadmin"
|
||||
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
|
||||
# disambiguating identity must live on the pod template (not just
|
||||
# the Deployment metadata above). Not in selector → no replace.
|
||||
"service-identity" = "pgadmin"
|
||||
}
|
||||
}
|
||||
spec {
|
||||
|
|
@ -1594,20 +1568,8 @@ resource "kubernetes_deployment" "pgadmin" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [
|
||||
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
||||
# This Deployment is Keel-enrolled (keel.sh/policy=patch) and Keel has
|
||||
# bumped the live image (dpage/pgadmin4:9.16). Ignore the Keel/Kyverno
|
||||
# runtime-mutated attributes so `terragrunt apply` (incl. the daily drift
|
||||
# plan) doesn't revert the image to bare `dpage/pgadmin4` or strip Keel's
|
||||
# annotations — canonical guard, matches linkwarden/chrome-service.
|
||||
metadata[0].annotations["keel.sh/policy"],
|
||||
metadata[0].annotations["keel.sh/trigger"],
|
||||
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
|
||||
metadata[0].annotations["keel.sh/match-tag"],
|
||||
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
|
||||
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
|
||||
]
|
||||
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
}
|
||||
}
|
||||
resource "kubernetes_service" "pgadmin" {
|
||||
|
|
|
|||
|
|
@ -20,9 +20,6 @@ resource "kubernetes_namespace" "diun" {
|
|||
}
|
||||
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -20,9 +20,6 @@ resource "kubernetes_namespace" "ebooks" {
|
|||
|
||||
# ExternalSecrets for all three sources
|
||||
resource "kubernetes_manifest" "calibre_external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
@ -50,9 +47,6 @@ resource "kubernetes_manifest" "calibre_external_secret" {
|
|||
}
|
||||
|
||||
resource "kubernetes_manifest" "audiobookshelf_external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
@ -80,9 +74,6 @@ resource "kubernetes_manifest" "audiobookshelf_external_secret" {
|
|||
}
|
||||
|
||||
resource "kubernetes_manifest" "servarr_external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -33,9 +33,6 @@ resource "kubernetes_namespace" "f1-stream" {
|
|||
}
|
||||
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
@ -65,9 +62,6 @@ resource "kubernetes_manifest" "external_secret" {
|
|||
# Pull the chrome-service bearer token into this namespace as a separate
|
||||
# Secret so the verifier can reach the in-cluster Playwright pool.
|
||||
resource "kubernetes_manifest" "chrome_service_client_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -53,9 +53,6 @@ resource "kubernetes_namespace" "fire_planner" {
|
|||
# Seed before applying:
|
||||
# secret/fire-planner -> property `recompute_bearer_token`
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
@ -118,9 +115,6 @@ resource "kubernetes_manifest" "external_secret" {
|
|||
# Template builds the asyncpg DSN consumed by the FastAPI app + CronJob
|
||||
# as DB_CONNECTION_STRING.
|
||||
resource "kubernetes_manifest" "db_external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
@ -165,9 +159,6 @@ resource "kubernetes_manifest" "db_external_secret" {
|
|||
# pg-sync sidecar populates `daily_account_valuation` etc. hourly; the
|
||||
# fire-planner ingest reads those tables via this role.
|
||||
resource "kubernetes_manifest" "wealthfolio_sync_db_external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
@ -459,90 +450,6 @@ resource "kubernetes_cron_job_v1" "fire_planner_recompute" {
|
|||
]
|
||||
}
|
||||
|
||||
# Monthly FIRE-countdown target solve on the 2nd at 10:00 UTC (an hour after
|
||||
# recompute-all, so account_snapshot is fresh). Binary-searches each Case's FIRE
|
||||
# number per country at the 99% Guyton-Klinger bar and upserts fire_target, which
|
||||
# the wealth Grafana dashboard's "FIRE Countdown" section reads.
|
||||
resource "kubernetes_cron_job_v1" "fire_planner_fire_targets" {
|
||||
metadata {
|
||||
name = "fire-planner-fire-targets"
|
||||
namespace = kubernetes_namespace.fire_planner.metadata[0].name
|
||||
}
|
||||
spec {
|
||||
schedule = "0 10 2 * *"
|
||||
concurrency_policy = "Forbid"
|
||||
successful_jobs_history_limit = 3
|
||||
failed_jobs_history_limit = 5
|
||||
starting_deadline_seconds = 600
|
||||
|
||||
job_template {
|
||||
metadata {
|
||||
labels = local.labels
|
||||
}
|
||||
spec {
|
||||
backoff_limit = 1
|
||||
ttl_seconds_after_finished = 86400
|
||||
# The full country sweep is CPU-bound (binary search × ~22 cities ×
# 3 cases). Give it up to an hour (active_deadline_seconds) rather than
# letting it run forever.
|
||||
active_deadline_seconds = 3600
|
||||
template {
|
||||
metadata {
|
||||
labels = local.labels
|
||||
}
|
||||
spec {
|
||||
restart_policy = "OnFailure"
|
||||
image_pull_secrets {
|
||||
name = "registry-credentials"
|
||||
}
|
||||
image_pull_secrets {
|
||||
name = "ghcr-credentials"
|
||||
}
|
||||
container {
|
||||
name = "fire-targets"
|
||||
image = local.image
|
||||
# --horizon 72: Viktor retires at ~age 28 and plans to live to 100, so
# the portfolio must last 72 years (the previous 60y default only reached ≈ age 88).
|
||||
command = ["python", "-m", "fire_planner", "recompute-fire-targets",
|
||||
"--countries", "all", "--horizon", "72"]
|
||||
|
||||
env_from {
|
||||
secret_ref {
|
||||
name = "fire-planner-secrets"
|
||||
}
|
||||
}
|
||||
env_from {
|
||||
secret_ref {
|
||||
name = "fire-planner-db-creds"
|
||||
}
|
||||
}
|
||||
|
||||
resources {
|
||||
requests = {
|
||||
cpu = "500m"
|
||||
memory = "1Gi"
|
||||
}
|
||||
limits = {
|
||||
memory = "2Gi"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
lifecycle {
|
||||
# KYVERNO_LIFECYCLE_V1
|
||||
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
|
||||
}
|
||||
|
||||
depends_on = [
|
||||
kubernetes_manifest.external_secret,
|
||||
kubernetes_manifest.db_external_secret,
|
||||
]
|
||||
}
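The comment on this CronJob describes a binary search for each Case's FIRE number at a 99% success bar. A minimal Python sketch of that kind of search follows, assuming the success rate never decreases as the portfolio grows. The names `solve_fire_number` and `toy_success_rate`, the bounds, and the tolerance are hypothetical stand-ins for illustration, not fire_planner's actual API.

```python
from typing import Callable


def solve_fire_number(
    success_rate: Callable[[float], float],
    target: float = 0.99,
    lo: float = 0.0,
    hi: float = 100_000_000.0,
    tol: float = 1_000.0,
) -> float:
    """Return (within tol) the smallest portfolio whose success rate meets target.

    Assumes success_rate is non-decreasing in the portfolio size.
    """
    if success_rate(hi) < target:
        raise ValueError("upper bound never reaches the target success rate")
    # Invariant: success_rate(hi) >= target throughout the loop.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if success_rate(mid) >= target:
            hi = mid  # mid is large enough, so search the lower half
        else:
            lo = mid  # mid is too small, so search the upper half
    return hi


def toy_success_rate(portfolio: float, annual_spend: float = 40_000.0) -> float:
    # Toy stand-in for a simulation: success grows linearly until the
    # portfolio reaches 25x annual spending, then saturates at 1.0.
    return min(1.0, portfolio / (25 * annual_spend))


fire_number = solve_fire_number(toy_success_rate)
```

Bisection needs only about log2((hi - lo) / tol) evaluations per case, which matters when each evaluation is an expensive simulation.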
|
||||
|
||||
# Weekly refresh of the COL cache: walks col_snapshot for rows
|
||||
# expiring within 7 days, re-scrapes Numbeo + Expatistan, upserts. With
|
||||
# the user-chosen 1-year TTL, a healthy cache has 0 stale rows on most
|
||||
|
|
@ -662,53 +569,16 @@ module "ingress_api" {
|
|||
auth = "none"
|
||||
}
|
||||
|
||||
# ExternalSecret in the monitoring namespace mirroring the rotating
|
||||
# fire_planner DB password. Grafana mounts this via envFromSecrets in
|
||||
# monitoring/grafana_chart_values.yaml; the datasource ConfigMap below
|
||||
# references it as $__env{FIRE_PLANNER_PG_PASSWORD}. Reloader restarts
|
||||
# Grafana whenever ESO updates this secret (on the 7d static-role
|
||||
# rotation), so the provisioned datasource never goes stale — replaces
|
||||
# the old plan-time `data.kubernetes_secret` bake that broke weekly.
|
||||
# Mirrors the wealth-pg / payslips-pg pattern.
|
||||
resource "kubernetes_manifest" "grafana_fire_planner_pg_creds" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
metadata = {
|
||||
name = "grafana-fire-planner-pg-creds"
|
||||
namespace = "monitoring"
|
||||
}
|
||||
spec = {
|
||||
refreshInterval = "15m"
|
||||
secretStoreRef = {
|
||||
name = "vault-database"
|
||||
kind = "ClusterSecretStore"
|
||||
}
|
||||
target = {
|
||||
name = "grafana-fire-planner-pg-creds"
|
||||
template = {
|
||||
metadata = {
|
||||
annotations = {
|
||||
"reloader.stakater.com/match" = "true"
|
||||
}
|
||||
}
|
||||
data = {
|
||||
FIRE_PLANNER_PG_PASSWORD = "{{ .password }}"
|
||||
}
|
||||
}
|
||||
}
|
||||
data = [{
|
||||
secretKey = "password"
|
||||
remoteRef = {
|
||||
key = "static-creds/pg-fire-planner"
|
||||
property = "password"
|
||||
}
|
||||
}]
|
||||
}
|
||||
# Plan-time read of the ESO-created K8s Secret that holds the Grafana datasource
# password. First-apply gotcha: run
# `terragrunt apply -target=kubernetes_manifest.db_external_secret` first, so
# the Secret exists before this data source is planned.
|
||||
data "kubernetes_secret" "fire_planner_db_creds" {
|
||||
metadata {
|
||||
name = "fire-planner-db-creds"
|
||||
namespace = kubernetes_namespace.fire_planner.metadata[0].name
|
||||
}
|
||||
depends_on = [kubernetes_manifest.db_external_secret]
|
||||
}
|
||||
|
||||
# Grafana datasource for fire_planner PostgreSQL DB.
|
||||
|
|
@ -745,15 +615,12 @@ resource "kubernetes_config_map" "grafana_fire_planner_datasource" {
|
|||
timescaledb = false
|
||||
}
|
||||
secureJsonData = {
|
||||
- # Live env from grafana-fire-planner-pg-creds (above), injected into
- # Grafana via envFromSecrets; reloader refreshes it on rotation.
- password = "$__env{FIRE_PLANNER_PG_PASSWORD}"
+ password = data.kubernetes_secret.fire_planner_db_creds.data["DB_PASSWORD"]
|
||||
}
|
||||
editable = true
|
||||
}]
|
||||
})
|
||||
}
|
||||
- depends_on = [kubernetes_manifest.grafana_fire_planner_pg_creds]
|
||||
}
|
||||
|
||||
# CI retrigger 2026-05-16T13:42:57+00:00 — bulk enrollment apply (pipeline #689 killed)
|
||||
|
|
@ -794,9 +661,6 @@ variable "run_examples_bulk_ingest" {
|
|||
|
||||
# Reddit OAuth creds pulled from Vault secret/viktor.
|
||||
resource "kubernetes_manifest" "external_secret_examples_reddit" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
@ -837,9 +701,6 @@ resource "kubernetes_manifest" "external_secret_examples_reddit" {
|
|||
# claude-agent-service bearer pulled separately so its rotation cadence
|
||||
# is decoupled from the Reddit creds.
|
||||
resource "kubernetes_manifest" "external_secret_examples_claude" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -6,9 +6,6 @@
|
|||
# (stacks/authentik/email-secret.tf) — one credential, one rotation point. The
|
||||
# reloader annotation rolls the Forgejo pod if the password is ever rotated.
|
||||
resource "kubernetes_manifest" "forgejo_email_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -3,9 +3,6 @@ variable "tls_secret_name" {
|
|||
sensitive = true
|
||||
}
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -18,9 +18,6 @@ resource "kubernetes_namespace" "immich" {
|
|||
}
|
||||
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -57,19 +57,16 @@ resource "kubernetes_namespace" "goldmane_edge_aggregator" {
|
|||
# -----------------------------------------------------------------------------
|
||||
# The aggregator dials goldmane:7443 over mutual TLS. We mint a client cert
|
||||
# signed by the Tigera CA (the same CA that issues Goldmane's serving cert), so
|
||||
- # Goldmane requires mutual TLS on :7443 and verifies the client cert chains to
- # the Tigera CA — it does NOT authorize by client identity, so ANY Tigera-CA-
- # signed cert is accepted. Rather than copy the Tigera CA PRIVATE KEY into TF
- # state to mint our own (a needless CA-key exposure; the hashicorp/tls provider
- # is also incompatible with this repo's global generate-providers/lockfile
- # pattern), we REUSE the operator-minted, Tigera-CA-signed client cert
- # `whisker-backend-key-pair` (calico-system). We never touch the CA key.
- # Trade-off: if the operator rotates that cert, re-apply to re-sync (hardening
- # follow-up: mint an own-identity cert in-namespace if Whisker is ever removed).
- data "kubernetes_secret" "whisker_backend" {
+ # Goldmane trusts the client and the client trusts Goldmane's server cert via
+ # the published CA bundle.
+ #
+ # The Tigera CA private key lives in the `tigera-ca-private` Secret in
+ # tigera-operator (Opaque; verified keys: tls.crt + tls.key). The stack's apply
+ # identity needs RBAC get on that secret — see the Role/RoleBinding below.
+ data "kubernetes_secret" "tigera_ca" {
|
||||
metadata {
|
||||
- name = "whisker-backend-key-pair"
- namespace = "calico-system"
+ name = "tigera-ca-private"
+ namespace = "tigera-operator"
|
||||
}
|
||||
}
|
||||
|
||||
|
|
@ -96,11 +93,46 @@ resource "kubernetes_config_map" "tigera_ca_bundle" {
|
|||
data = data.kubernetes_config_map.tigera_ca_bundle.data
|
||||
}
|
||||
|
||||
- # Client cert + key for mTLS to goldmane:7443, mounted at TLS_CERT_PATH /
- # TLS_KEY_PATH defaults (/etc/goldmane-client-tls/tls.crt and .../tls.key).
- # Sourced verbatim from the operator's whisker-backend client key-pair (read
- # above) — already Tigera-CA-signed, which is all Goldmane verifies. No CA key
- # is touched and no cross-namespace CA RBAC is needed.
|
||||
# Client private key.
|
||||
resource "tls_private_key" "goldmane_client" {
|
||||
algorithm = "RSA"
|
||||
rsa_bits = 2048
|
||||
}
|
||||
|
||||
# CSR for the client cert. CN identifies the client; the service-DNS SAN mirrors
|
||||
# how Felix/whisker-backend present a client identity to Goldmane.
|
||||
resource "tls_cert_request" "goldmane_client" {
|
||||
private_key_pem = tls_private_key.goldmane_client.private_key_pem
|
||||
subject {
|
||||
common_name = "goldmane-edge-aggregator"
|
||||
organization = "goldmane-edge-aggregator"
|
||||
}
|
||||
dns_names = [
|
||||
"goldmane-edge-aggregator",
|
||||
"goldmane-edge-aggregator.goldmane-edge-aggregator.svc.cluster.local",
|
||||
]
|
||||
}
|
||||
|
||||
# Sign the CSR with the Tigera CA. 10-year validity (87600h): with
# early_renewal_hours = 720, an apply re-signs the cert once fewer than 30 days
# remain, and the long horizon avoids surprise mTLS outages from an unattended
# stack. The Tigera CA itself outlives this (operator-managed).
|
||||
resource "tls_locally_signed_cert" "goldmane_client" {
|
||||
cert_request_pem = tls_cert_request.goldmane_client.cert_request_pem
|
||||
ca_private_key_pem = data.kubernetes_secret.tigera_ca.data["tls.key"]
|
||||
ca_cert_pem = data.kubernetes_secret.tigera_ca.data["tls.crt"]
|
||||
|
||||
validity_period_hours = 87600 # 10y
|
||||
early_renewal_hours = 720 # re-sign on apply when <30d remain
|
||||
|
||||
allowed_uses = [
|
||||
"client_auth",
|
||||
"digital_signature",
|
||||
"key_encipherment",
|
||||
]
|
||||
}
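The two hour values on this resource are easier to check as dates. The Python sketch below converts `validity_period_hours = 87600` and `early_renewal_hours = 720` into an expiry date and the start of the renewal window. It assumes the tls provider treats the certificate as due for replacement once the early-renewal window begins; the function names and the issue date are illustrative only.

```python
from datetime import datetime, timedelta, timezone

VALIDITY_HOURS = 87600       # validity_period_hours from the resource above
EARLY_RENEWAL_HOURS = 720    # early_renewal_hours from the resource above


def renewal_window(issued_at: datetime) -> tuple[datetime, datetime]:
    """Return (renew_from, expires_at) for a cert issued at issued_at."""
    expires_at = issued_at + timedelta(hours=VALIDITY_HOURS)
    renew_from = expires_at - timedelta(hours=EARLY_RENEWAL_HOURS)
    return renew_from, expires_at


def needs_resign(issued_at: datetime, now: datetime) -> bool:
    """True once 'now' falls inside the early-renewal window (assumed semantics)."""
    renew_from, _ = renewal_window(issued_at)
    return now >= renew_from


# Illustrative issue date, not taken from the diff.
issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
renew_from, expires_at = renewal_window(issued)
validity_days = VALIDITY_HOURS // 24          # 3650 days, i.e. 10 x 365 (leap days ignored)
renewal_days = EARLY_RENEWAL_HOURS // 24      # 30 days
```

Note that re-signing only happens when an apply actually runs inside that final 30-day window, so a stack that is never applied in that period would still let the certificate expire.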
|
||||
|
||||
# The minted client cert + key, mounted at TLS_CERT_PATH / TLS_KEY_PATH defaults
|
||||
# (/etc/goldmane-client-tls/tls.crt and .../tls.key).
|
||||
resource "kubernetes_secret" "goldmane_client_tls" {
|
||||
metadata {
|
||||
name = "goldmane-client-tls"
|
||||
|
|
@ -108,8 +140,47 @@ resource "kubernetes_secret" "goldmane_client_tls" {
|
|||
}
|
||||
type = "Opaque"
|
||||
data = {
- "tls.crt" = data.kubernetes_secret.whisker_backend.data["tls.crt"]
- "tls.key" = data.kubernetes_secret.whisker_backend.data["tls.key"]
+ "tls.crt" = tls_locally_signed_cert.goldmane_client.cert_pem
+ "tls.key" = tls_private_key.goldmane_client.private_key_pem
|
||||
}
|
||||
}
|
||||
|
||||
# Narrow RBAC so this stack's apply identity can `get` the Tigera CA private key
# in tigera-operator (ESO/Reloader are unaffected). The data source above reads
# it at apply time; this Role/RoleBinding documents and grants that access
# rather than relying on cluster-admin. The subject is the same SA the other
# Tier-1 stacks apply as (claude-agent/terraform-state for headless, the human
# OIDC identity interactively). Both are cluster-admin today, so this is
# belt-and-braces least-privilege intent for when apply identities tighten.
|
||||
resource "kubernetes_role" "read_tigera_ca" {
|
||||
metadata {
|
||||
name = "goldmane-edge-aggregator-read-tigera-ca"
|
||||
namespace = "tigera-operator"
|
||||
}
|
||||
rule {
|
||||
api_groups = [""]
|
||||
resources = ["secrets"]
|
||||
resource_names = ["tigera-ca-private"]
|
||||
verbs = ["get"]
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_role_binding" "read_tigera_ca" {
|
||||
metadata {
|
||||
name = "goldmane-edge-aggregator-read-tigera-ca"
|
||||
namespace = "tigera-operator"
|
||||
}
|
||||
role_ref {
|
||||
api_group = "rbac.authorization.k8s.io"
|
||||
kind = "Role"
|
||||
name = kubernetes_role.read_tigera_ca.metadata[0].name
|
||||
}
|
||||
# The headless apply identity (claude-agent-service runs Tier-1 applies as the
|
||||
# `terraform-state` Vault K8s role in the claude-agent namespace).
|
||||
subject {
|
||||
kind = "ServiceAccount"
|
||||
name = "default"
|
||||
namespace = "claude-agent"
|
||||
}
|
||||
}
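To make the scope of the Role above concrete, here is a deliberately simplified Python model of how a single RBAC rule with `resource_names` matches a request. It is a sketch under stated assumptions: it ignores wildcards, subresources, and rule aggregation, and the real decision is made by the Kubernetes API server, not by code like this. `rule_allows` and `TIGERA_CA_RULE` are hypothetical names.

```python
from typing import Dict, List

Rule = Dict[str, List[str]]

# Mirrors the rule block of the Role above.
TIGERA_CA_RULE: Rule = {
    "apiGroups": [""],
    "resources": ["secrets"],
    "resourceNames": ["tigera-ca-private"],
    "verbs": ["get"],
}


def rule_allows(rule: Rule, verb: str, resource: str, name: str, api_group: str = "") -> bool:
    """Simplified match: every listed dimension must contain the requested value.

    An empty or missing resourceNames list means "any name" in this model.
    """
    names = rule.get("resourceNames", [])
    return (
        api_group in rule["apiGroups"]
        and resource in rule["resources"]
        and verb in rule["verbs"]
        and (not names or name in names)
    )


can_read_ca = rule_allows(TIGERA_CA_RULE, "get", "secrets", "tigera-ca-private")
can_read_other = rule_allows(TIGERA_CA_RULE, "get", "secrets", "some-other-secret")
can_list = rule_allows(TIGERA_CA_RULE, "list", "secrets", "")
```

In this model only a `get` on the one named Secret is allowed; reading any other Secret in the namespace, or listing Secrets, is denied.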
|
||||
|
||||
|
|
@ -156,11 +227,6 @@ resource "kubernetes_job" "db_init" {
|
|||
timeouts {
|
||||
create = "2m"
|
||||
}
|
||||
lifecycle {
|
||||
# KYVERNO_LIFECYCLE_V1: Kyverno injects dns_config (ndots=2); ignore it so
|
||||
# this idempotent Job isn't replaced (Jobs are immutable) on every apply.
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
}
|
||||
}
|
||||
|
||||
# ExternalSecret projecting the Vault-rotated (7-day) credential into a K8s
|
||||
|
|
@ -168,9 +234,6 @@ resource "kubernetes_job" "db_init" {
|
|||
# place in the CNPG connection allowlist are added in stacks/vault/main.tf
|
||||
# (see this stack's terragrunt.hcl note). remoteRef key: static-creds/pg-goldmane-edges.
|
||||
resource "kubernetes_manifest" "db_external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
@ -213,9 +276,6 @@ resource "kubernetes_manifest" "db_external_secret" {
|
|||
# into this namespace as SLACK_WEBHOOK_URL via an ExternalSecret (no new
|
||||
# webhook). The digest CronJob defaults to #security.
|
||||
resource "kubernetes_manifest" "slack_external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
@ -235,7 +295,7 @@ resource "kubernetes_manifest" "slack_external_secret" {
|
|||
data = [{
|
||||
secretKey = "SLACK_WEBHOOK_URL"
|
||||
remoteRef = {
|
||||
- key = "viktor"
+ key = "monitoring"
|
||||
property = "alertmanager_slack_api_url"
|
||||
}
|
||||
}]
|
||||
|
|
@ -456,12 +516,7 @@ resource "kubernetes_cron_job_v1" "digest" {
|
|||
}
|
||||
env {
|
||||
name = "SLACK_CHANNEL"
|
||||
- # Posts to #alerts. The dedicated #security channel was abandoned
- # 2026-06-25 — the shared alertmanager_slack_api_url webhook's
- # Slack app isn't a member of it (channel override 404s), so all
- # Slack (incl. alertmanager's security-lane alerts) consolidated
- # to #alerts. See docs/runbooks/goldmane-flow-trail.md.
- value = "#alerts"
+ value = "#security"
|
||||
}
|
||||
|
||||
resources {
|
||||
|
|
|
|||
|
|
@ -5,9 +5,6 @@ variable "tls_secret_name" {
|
|||
variable "nfs_server" { type = string }
|
||||
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -208,9 +208,6 @@ module "ingress" {
|
|||
}
|
||||
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -250,9 +250,6 @@ module "ingress_test" {
|
|||
}
|
||||
|
||||
resource "kubernetes_manifest" "external_secret_db" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
@ -287,9 +284,6 @@ resource "kubernetes_manifest" "external_secret_db" {
|
|||
}
|
||||
|
||||
resource "kubernetes_manifest" "external_secret_kv" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -37,9 +37,6 @@ module "tls_secret" {
|
|||
# --- Secrets (ESO from Vault) ---
|
||||
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -1,155 +0,0 @@
|
|||
# Immich photo-frame for Emo (emil.barzin@gmail.com) — a second instance cloned
|
||||
# from the London frame in frame.tf, scoped to Emo's Immich account + Sofia
|
||||
# weather. Served at highlights-immich-emo.viktorbarzin.me and shown on Emo's
|
||||
# Portal Mini (Sofia) via the portal-immich-frame app.
|
||||
# API key: Vault secret/immich -> frame_api_key_emo (minted on Emo's account).
|
||||
|
||||
resource "kubernetes_config_map" "frame_config_emo" {
|
||||
metadata {
|
||||
name = "config-emo"
|
||||
namespace = "immich"
|
||||
|
||||
labels = {
|
||||
app = "frame-config-emo"
|
||||
}
|
||||
annotations = {
|
||||
"reloader.stakater.com/match" = "true"
|
||||
}
|
||||
}
|
||||
|
||||
data = {
|
||||
"Settings.yml" = <<-EOF
|
||||
General:
|
||||
Layout: single
|
||||
Interval: 45
|
||||
ImageZoom: true
|
||||
ShowAlbumName: false
|
||||
ShowProgressBar: false
|
||||
ClockFormat: "HH:mm"
|
||||
PhotoDateFormat: "dd/MM/yyyy"
|
||||
WeatherApiKey: ${data.vault_kv_secret_v2.secrets.data["frame_weather_api_key"]}
|
||||
UnitSystem: metric
|
||||
WeatherLatLong: "42.6977,23.3219"
|
||||
Language: en
|
||||
Accounts:
|
||||
- ImmichServerUrl: http://immich.viktorbarzin.me
|
||||
ApiKey: ${data.vault_kv_secret_v2.secrets.data["frame_api_key_emo"]}
|
||||
ImagesFromDays: 730
|
||||
EOF
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
resource "kubernetes_deployment" "immich-frame-emo" {
|
||||
metadata {
|
||||
name = "immich-frame-emo"
|
||||
namespace = "immich"
|
||||
annotations = {
|
||||
"reloader.stakater.com/search" = "true"
|
||||
}
|
||||
labels = {
|
||||
tier = local.tiers.gpu
|
||||
}
|
||||
}
|
||||
|
||||
spec {
|
||||
replicas = 1
|
||||
selector {
|
||||
match_labels = {
|
||||
app = "immich-frame-emo"
|
||||
}
|
||||
}
|
||||
strategy {
|
||||
type = "RollingUpdate"
|
||||
}
|
||||
template {
|
||||
metadata {
|
||||
labels = {
|
||||
app = "immich-frame-emo"
|
||||
}
|
||||
annotations = {
|
||||
"dependency.kyverno.io/wait-for" = "immich-server.immich:2283"
|
||||
}
|
||||
}
|
||||
spec {
|
||||
container {
|
||||
image = "ghcr.io/immichframe/immichframe:v1.0.32.0"
|
||||
name = "immich-frame-emo"
|
||||
resources {
|
||||
requests = {
|
||||
cpu = "10m"
|
||||
memory = "64Mi"
|
||||
}
|
||||
limits = {
|
||||
memory = "128Mi"
|
||||
}
|
||||
}
|
||||
port {
|
||||
container_port = 8080
|
||||
protocol = "TCP"
|
||||
name = "http"
|
||||
}
|
||||
volume_mount {
|
||||
name = "config"
|
||||
mount_path = "/app/Config"
|
||||
read_only = true
|
||||
}
|
||||
}
|
||||
volume {
|
||||
name = "config"
|
||||
config_map {
|
||||
name = "config-emo"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [
|
||||
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
|
||||
metadata[0].annotations["keel.sh/policy"],
|
||||
metadata[0].annotations["keel.sh/trigger"],
|
||||
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
|
||||
metadata[0].annotations["keel.sh/match-tag"],
|
||||
metadata[0].annotations["kubernetes.io/change-cause"],
|
||||
metadata[0].annotations["deployment.kubernetes.io/revision"],
|
||||
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
|
||||
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE
|
||||
]
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
resource "kubernetes_service" "immich-frame-emo" {
|
||||
metadata {
|
||||
name = "immich-frame-emo"
|
||||
namespace = "immich"
|
||||
labels = {
|
||||
"app" = "immich-frame-emo"
|
||||
}
|
||||
}
|
||||
|
||||
spec {
|
||||
selector = {
|
||||
app = "immich-frame-emo"
|
||||
}
|
||||
port {
|
||||
port = 80
|
||||
target_port = 8080
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
module "ingress_emo" {
|
||||
source = "../../modules/kubernetes/ingress_factory"
|
||||
# Photo-frame kiosk display on Emo's Portal — headless browser pulling images
|
||||
# via an Immich API key (no user login). Forward-auth would 302 the device to
|
||||
# Authentik with no way to complete login.
|
||||
# auth = "none": photo-frame kiosk; headless browser with API key; no user login.
|
||||
auth = "none"
|
||||
dns_type = "proxied"
|
||||
namespace = "immich"
|
||||
name = "highlights-immich-emo"
|
||||
tls_secret_name = var.tls_secret_name
|
||||
service_name = "immich-frame-emo"
|
||||
}
|
||||
|
|
@ -162,9 +162,6 @@ resource "kubernetes_resource_quota" "immich" {
|
|||
}
|
||||
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -20,9 +20,6 @@ resource "kubernetes_namespace" "insta2spotify" {
|
|||
}
|
||||
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -35,14 +35,6 @@ resource "kubernetes_namespace" "instagram_poster" {
|
|||
# - immich_tag_instagram (optional — auto-resolved if missing)
|
||||
# - immich_tag_posted (optional — auto-resolved if missing)
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
# The external-secrets controller takes server-side-apply ownership of
|
||||
# .spec.refreshInterval, so a plain TF apply conflicts. force_conflicts lets
|
||||
# TF win (values match, so it's stable) — same pattern as grafana/woodpecker/
|
||||
# traefik/k8s-version-upgrade. Surfaced 2026-06-24 by the first IG apply since
|
||||
# the ESO v1 migration (the scale-to-0 push).
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
@ -147,11 +139,6 @@ resource "kubernetes_manifest" "external_secret" {
|
|||
# ESO refreshes the K8s Secret every 15m. `reloader.stakater.com/match`
|
||||
# bounces the pod when the password changes.
|
||||
resource "kubernetes_manifest" "benchmark_db_external_secret" {
|
||||
# See external_secret above — ESO owns .spec.refreshInterval; force_conflicts
|
||||
# lets the TF apply win instead of erroring on the field-manager conflict.
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
@ -240,11 +227,7 @@ resource "kubernetes_deployment" "instagram_poster" {
|
|||
}
|
||||
|
||||
spec {
|
||||
- # Scaled to 0 (2026-06-24): Instagram Graph integration is unused and its
- # ExternalSecret is dead (missing ig_graph_long_lived_token /
- # ig_business_account_id in Vault secret/instagram-poster). Set back to 1
- # after minting a Meta long-lived token and populating those keys.
- replicas = 0
+ replicas = 1
|
||||
# RWO PVC — cannot rolling-update.
|
||||
strategy {
|
||||
type = "Recreate"
|
||||
|
|
|
|||
|
|
@ -41,9 +41,6 @@ resource "kubernetes_namespace" "job_hunter" {
|
|||
# digest_to_address — where the weekly digest goes
|
||||
# digest_from_address — From: header for the digest
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
@ -108,9 +105,6 @@ resource "kubernetes_manifest" "external_secret" {
|
|||
# DB credentials from Vault database engine (7-day rotation).
|
||||
# Template builds the asyncpg DSN consumed by the FastAPI app as DB_CONNECTION_STRING.
|
||||
resource "kubernetes_manifest" "db_external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
@ -331,9 +325,6 @@ resource "kubernetes_service" "job_hunter" {
|
|||
# references it as $__env{JOB_HUNTER_PG_PASSWORD}. Reloader restarts
|
||||
# Grafana whenever ESO updates this secret (every 7d on rotation).
|
||||
resource "kubernetes_manifest" "grafana_job_hunter_db_external_secret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -5,9 +5,6 @@
|
|||
# -----------------------------------------------------------------------------
|
||||
|
||||
resource "kubernetes_manifest" "oauth2_proxy_externalsecret" {
|
||||
field_manager {
|
||||
force_conflicts = true
|
||||
}
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1"
|
||||
kind = "ExternalSecret"
|
||||
|
|
|
|||
|
|
@ -5,11 +5,9 @@
|
|||
<main>
|
||||
<h1>Kubernetes Access Portal</h1>
|
||||
|
||||
- <div class="callout info">
- <strong>Fastest way in:</strong> open the <a href="https://t3.viktorbarzin.me">web terminal</a> or the
- <a href="https://k8s.viktorbarzin.me">dashboard</a> and sign in — no install, no VPN needed. Prefer your
- own machine? The <a href="/onboarding#path-laptop">local-setup guide</a> covers VPN + kubectl, and the
- <a href="/onboarding">Getting Started page</a> compares all three access paths.
+ <div class="callout warning">
+ <strong>VPN Required</strong> — The cluster is on a private network. You need Headscale VPN access before kubectl will work.
+ <a href="/onboarding">See the Getting Started guide</a> for VPN setup instructions.
|
||||
</div>
|
||||
|
||||
<section>
|
||||
|
|
@ -28,7 +26,6 @@
|
|||
<p><strong>Assigned namespaces:</strong> {data.namespaces.join(', ')}</p>
|
||||
|
||||
<h3>Quick Commands</h3>
|
||||
- <p>Run these as-is in the <a href="https://t3.viktorbarzin.me">web terminal</a> — it's already signed in as you.</p>
|
||||
<pre>
|
||||
# Check your pods
|
||||
kubectl get pods -n {data.namespaces[0]}
|
||||
|
|
@ -50,23 +47,16 @@ vault write kubernetes/creds/{data.namespaces[0]}-deployer \
|
|||
|
||||
<section>
|
||||
<h2>Get Started</h2>
|
||||
- <h3>No setup — start now</h3>
- <ol>
- <li><a href="https://t3.viktorbarzin.me">Open the web terminal</a> — a ready shell with kubectl, Vault and your repos already set up</li>
- <li><a href="https://k8s.viktorbarzin.me">Open the dashboard</a> — point-and-click view of your workloads</li>
- </ol>
- <h3>On your own machine</h3>
|
||||
<ol>
|
||||
{#if data.role === 'namespace-owner'}
|
||||
- <li><a href="/onboarding?role=namespace-owner#path-laptop">Follow the namespace-owner setup</a> (VPN, kubectl, Vault, encrypted state)</li>
+ <li><a href="/onboarding?role=namespace-owner">Complete the namespace-owner onboarding guide</a></li>
|
||||
{:else}
|
||||
- <li><a href="/onboarding#path-laptop">Follow the local setup</a> (VPN, kubectl, git)</li>
+ <li><a href="/onboarding">Complete the onboarding guide</a> (VPN, kubectl, git)</li>
|
||||
{/if}
|
||||
<li><a href="/setup">Install kubectl and kubelogin</a></li>
|
||||
<li><a href="/download">Download your kubeconfig</a></li>
|
||||
<li>Run <code>kubectl get namespaces</code> to verify access</li>
|
||||
</ol>
|
||||
- <p><a href="/onboarding">Compare all three access paths →</a></p>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
|
|
@ -101,12 +91,12 @@ vault write kubernetes/creds/{data.namespaces[0]}-deployer \
|
|||
border-radius: 6px;
|
||||
margin: 1rem 0;
|
||||
}
|
||||
- .callout.info {
- background: #e8f4fd;
- border-left: 4px solid #2196f3;
+ .callout.warning {
+ background: #fff3cd;
+ border-left: 4px solid #ffc107;
|
||||
}
|
||||
.callout a {
|
||||
color: #0d47a1;
|
||||
color: #856404;
|
||||
font-weight: 600;
|
||||
}
|
||||
</style>
|
||||
|
|
|
|||
|
|
@ -5,123 +5,22 @@
|
|||
|
||||
<main class="content">
|
||||
<h1>Getting Started</h1>
|
||||
<p>
|
||||
Welcome! There are three ways to reach the home Kubernetes cluster. Pick the one that fits —
|
||||
the first two need <strong>zero setup</strong> and open right in your browser.
|
||||
</p>
|
||||
|
||||
<section>
|
||||
<h2>Three ways in</h2>
|
||||
<table>
|
||||
<thead><tr><th>Path</th><th>Best for</th><th>Setup</th></tr></thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td><a href="#path-terminal"><strong>A — Web terminal</strong></a></td>
|
||||
<td>Just want to start working now</td>
|
||||
<td>None — opens in your browser</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><a href="#path-dashboard"><strong>B — Web dashboard</strong></a></td>
|
||||
<td>Click around, watch your app, read logs</td>
|
||||
<td>None — opens in your browser</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><a href="#path-laptop"><strong>C — Your own machine</strong></a></td>
|
||||
<td>kubectl / Terraform locally, full control</td>
|
||||
<td>VPN + one-line installer</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<div class="callout info">
|
||||
<strong>Not sure?</strong> Start with the <a href="#path-terminal">web terminal (Path A)</a>.
|
||||
Everything is already installed and your repos are already cloned — you can run your first
|
||||
<code>kubectl</code> command within a minute, from any device.
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section id="path-terminal" class="path">
|
||||
<h2>Path A — Web terminal <span class="badge rec">Recommended</span> <span class="badge none">No setup</span></h2>
|
||||
<p>
|
||||
A full terminal that runs in your browser — nothing to install, works from any device
|
||||
(even a tablet). It drops you into your own account on the shared workstation, with every
|
||||
tool already set up.
|
||||
</p>
|
||||
<ol>
|
||||
<li>Open <a href="https://t3.viktorbarzin.me" target="_blank">t3.viktorbarzin.me</a></li>
|
||||
<li>Sign in with your Authentik account (the same SSO login as this portal)</li>
|
||||
<li>You land in a ready-to-use shell. Try it:
|
||||
<pre>kubectl get pods -n YOUR_NAMESPACE</pre>
|
||||
</li>
|
||||
</ol>
|
||||
<div class="callout info">
|
||||
<strong>Already done for you</strong> on the workstation:
|
||||
<ul>
|
||||
<li><code>kubectl</code> + your kubeconfig, scoped to your namespaces (no login dance)</li>
|
||||
<li><code>vault</code>, <code>terragrunt</code>, <code>terraform</code>, <code>sops</code>, <code>kubeseal</code></li>
|
||||
<li>Your repos cloned under <code>~/code</code> — the <code>infra</code> repo plus your own project repos</li>
|
||||
<li>Claude Code, ready to pair with you on changes</li>
|
||||
</ul>
|
||||
</div>
|
||||
<div class="callout warning">
|
||||
<strong>No access yet?</strong> The workstation is provisioned per person. If
|
||||
<code>t3.viktorbarzin.me</code> says you're not authorized, ask Viktor to add you
|
||||
(<a href="mailto:vbarzin@gmail.com">vbarzin@gmail.com</a> or Slack).
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section id="path-dashboard" class="path">
|
||||
<h2>Path B — Web dashboard <span class="badge none">No setup</span></h2>
|
||||
<p>
|
||||
A point-and-click view of the cluster — browse your pods, read logs, restart a deployment,
|
||||
check events. Nothing to install.
|
||||
</p>
|
||||
<ol>
|
||||
<li>Open <a href="https://k8s.viktorbarzin.me" target="_blank">k8s.viktorbarzin.me</a></li>
|
||||
<li>Sign in with your Authentik account</li>
|
||||
<li>
|
||||
You're dropped straight into the Kubernetes Dashboard, already authenticated as you —
|
||||
<strong>no token to paste</strong>. The portal injects your personal access token for you.
|
||||
</li>
|
||||
</ol>
|
||||
<div class="callout info">
|
||||
Scoped to your namespace(s): you can see and manage your own workloads, but not other
|
||||
tenants'. This path uses a per-user token that does <em>not</em> depend on CLI login, so it
|
||||
keeps working even if <code>kubectl</code> OIDC login is having a bad day — making it the
|
||||
reliable fallback for Path C.
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section id="path-laptop" class="path c">
|
||||
<h2>Path C — From your own machine</h2>
|
||||
<p>
|
||||
For running <code>kubectl</code>, <code>vault</code> and Terraform locally. This is the most
|
||||
powerful path and the one to use for infrastructure changes — it just needs a bit more setup
|
||||
because the cluster API lives on a private network.
|
||||
</p>
|
||||
<p>Welcome! Follow these steps to get access to the home Kubernetes cluster.</p>
|
||||
|
||||
<div class="role-tabs">
|
||||
<a href="/onboarding?role=general#path-laptop" class:active={!showNamespaceOwner}>General User</a>
|
||||
<a href="/onboarding?role=namespace-owner#path-laptop" class:active={showNamespaceOwner}>Namespace Owner</a>
|
||||
<a href="/onboarding" class:active={!showNamespaceOwner}>General User</a>
|
||||
<a href="/onboarding?role=namespace-owner" class:active={showNamespaceOwner}>Namespace Owner</a>
|
||||
</div>
|
||||
<p class="prereq">
|
||||
{#if showNamespaceOwner}
|
||||
Namespace owner — you'll also set up Vault and encrypted Terraform state so you can deploy
|
||||
your own app stacks.
|
||||
{:else}
|
||||
General user — VPN, kubectl and git access. (Managing your own app stack? Switch to the
|
||||
<strong>Namespace Owner</strong> tab above.)
|
||||
{/if}
|
||||
</p>
|
||||
|
||||
<section>
|
||||
<h3>Step 1 — Join the VPN</h3>
|
||||
<p>The cluster API is on a private network (<code>10.0.20.0/24</code>), so you need VPN access first.</p>
|
||||
<h2>Step 0 — Join the VPN</h2>
|
||||
<p>The cluster is on a private network (<code>10.0.20.0/24</code>). You need VPN access first.</p>
|
||||
<ol>
|
||||
<li>Install <a href="https://tailscale.com/download" target="_blank">Tailscale</a> for your OS</li>
|
||||
<li>Run this in your terminal:
|
||||
<pre>tailscale login --login-server https://headscale.viktorbarzin.me</pre>
|
||||
</li>
|
||||
<li>A browser window opens with a registration URL</li>
|
||||
<li>A browser window will open with a registration URL</li>
|
||||
<li>Send that URL to Viktor via email (<a href="mailto:vbarzin@gmail.com">vbarzin@gmail.com</a>) or Slack</li>
|
||||
<li>Wait for approval (usually within a few hours)</li>
|
||||
<li>Once approved, test: <pre>ping 10.0.20.100</pre></li>
|
||||
|
|
@ -129,49 +28,62 @@
|
|||
</section>
|
||||
|
||||
<section>
|
||||
<h3>Step 2 — Install the tools</h3>
|
||||
<p>Run one of these to install everything automatically (kubectl, kubelogin, vault, terragrunt, terraform, kubeseal) and write your kubeconfig to <code>~/.kube/config-home</code>:</p>
|
||||
<h4>macOS</h4>
|
||||
<p class="prereq">Requires <a href="https://brew.sh" target="_blank">Homebrew</a>. Install it first if you don't have it.</p>
|
||||
<pre>bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=mac)</pre>
|
||||
<h4>Linux</h4>
|
||||
<pre>bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=linux)</pre>
|
||||
<h4>Windows</h4>
|
||||
<p>Use <a href="https://learn.microsoft.com/en-us/windows/wsl/install" target="_blank">WSL2</a> and follow the Linux instructions.</p>
|
||||
<h2>Step 1 — Log in to the portal</h2>
|
||||
<p>Visit <a href="https://k8s-portal.viktorbarzin.me">k8s-portal.viktorbarzin.me</a> and sign in with your Authentik account.</p>
|
||||
<p>If you don't have an account yet, ask Viktor to create one.</p>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h3>Step 3 — Verify access</h3>
|
||||
<p>Run this. The first time, it opens your browser for SSO login:</p>
|
||||
<pre>kubectl get {showNamespaceOwner ? 'pods -n YOUR_NAMESPACE' : 'namespaces'}</pre>
|
||||
<p>You should see your resources (or an empty list if you haven't deployed anything yet).</p>
|
||||
<div class="callout warning">
|
||||
<strong>Browser login loops, or kubectl says "Unauthorized"?</strong> Command-line SSO
|
||||
(OIDC) can occasionally be unavailable. When that happens, use the
|
||||
<a href="#path-dashboard">web dashboard (Path B)</a> or the
|
||||
<a href="#path-terminal">web terminal (Path A)</a> — both authenticate a different way and
|
||||
keep working — and let Viktor know.
|
||||
</div>
|
||||
<p class="prereq">Connection error instead? Make sure the VPN is up: <code>tailscale status</code>.</p>
|
||||
<h2>Step 2 — Set up kubectl</h2>
|
||||
<p>Run one of these commands in your terminal to install everything automatically:</p>
|
||||
<h3>macOS</h3>
|
||||
<p class="prereq">Requires <a href="https://brew.sh" target="_blank">Homebrew</a>. Install it first if you don't have it.</p>
|
||||
<pre>bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=mac)</pre>
|
||||
<h3>Linux</h3>
|
||||
<pre>bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=linux)</pre>
|
||||
<h3>Windows</h3>
|
||||
<p>Use <a href="https://learn.microsoft.com/en-us/windows/wsl/install" target="_blank">WSL2</a> and follow the Linux instructions.</p>
|
||||
</section>
|
||||
|
||||
{#if showNamespaceOwner}
|
||||
<section>
|
||||
<h3>Step 4 — Log into Vault</h3>
|
||||
<h2>Step 3 — Log into Vault</h2>
|
||||
<p>Vault manages your secrets and issues dynamic Kubernetes credentials.</p>
|
||||
<pre>vault login -method=oidc</pre>
|
||||
<p>This opens your browser for Authentik SSO. After login, your token is saved to <code>~/.vault-token</code>.</p>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h3>Step 5 — Clone the infra repo</h3>
|
||||
<h2>Step 4 — Verify kubectl access</h2>
|
||||
<p>Run this command. It will open your browser for OIDC login the first time:</p>
|
||||
<pre>kubectl get pods -n YOUR_NAMESPACE</pre>
|
||||
<p>You should see an empty list (no resources) or your running pods.</p>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h2>Step 5 — Clone the infra repo</h2>
|
||||
<pre>git clone https://github.com/ViktorBarzin/infra.git
|
||||
cd infra</pre>
|
||||
<p>This is where all the infrastructure configuration lives. Terraform state is committed as encrypted files.</p>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h3>Step 6 — Decrypt your state</h3>
|
||||
<h2>Step 6 — Install tools</h2>
|
||||
<p>You need <code>sops</code> and <code>terragrunt</code> to work with infrastructure state:</p>
|
||||
<h3>macOS</h3>
|
||||
<pre>brew install sops terragrunt</pre>
|
||||
<h3>Linux</h3>
|
||||
<pre># sops
|
||||
curl -LO https://github.com/getsops/sops/releases/latest/download/sops-v3.9.4.linux.amd64
|
||||
sudo mv sops-*.linux.amd64 /usr/local/bin/sops && sudo chmod +x /usr/local/bin/sops
|
||||
|
||||
# terragrunt
|
||||
curl -LO https://github.com/gruntwork-io/terragrunt/releases/latest/download/terragrunt_linux_amd64
|
||||
sudo mv terragrunt_linux_amd64 /usr/local/bin/terragrunt && sudo chmod +x /usr/local/bin/terragrunt</pre>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h2>Step 7 — Decrypt your state</h2>
|
||||
<p>Terraform state is encrypted with SOPS. Your Vault login gives you access to <strong>only your stacks</strong>.</p>
|
||||
<pre># Make sure you're logged into Vault
|
||||
vault login -method=oidc
|
||||
|
|
@ -220,7 +132,7 @@ cd stacks/YOUR_NAMESPACE
|
|||
</section>
|
||||
|
||||
<section>
|
||||
<h3>Step 7 — Create your first app stack</h3>
|
||||
<h2>Step 8 — Create your first app stack</h2>
|
||||
<ol>
|
||||
<li>Copy the template: <pre>cp -r stacks/_template stacks/myapp
|
||||
mv stacks/myapp/main.tf.example stacks/myapp/main.tf</pre></li>
|
||||
|
|
@ -241,7 +153,7 @@ git push</pre>
|
|||
</section>
|
||||
|
||||
<section>
|
||||
<h3>Architecture Overview</h3>
|
||||
<h2>Architecture Overview</h2>
|
||||
<p>Here's how your changes flow through the system:</p>
|
||||
|
||||
<div class="diagram">
|
||||
|
|
@ -292,18 +204,31 @@ git push</pre>
|
|||
</section>
|
||||
{:else}
|
||||
<section>
|
||||
<h3>Step 4 — Clone the repo</h3>
|
||||
<h2>Step 3 — Verify access</h2>
|
||||
<p>Run this command. It will open your browser for login the first time:</p>
|
||||
<pre>kubectl get namespaces</pre>
|
||||
<p>You should see output like:</p>
|
||||
<pre class="output">NAME STATUS AGE
|
||||
default Active 200d
|
||||
kube-system Active 200d
|
||||
monitoring Active 200d
|
||||
...</pre>
|
||||
<p>If you get a connection error, make sure your VPN is connected (<code>tailscale status</code>).</p>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h2>Step 4 — Clone the repo</h2>
|
||||
<pre>git clone https://github.com/ViktorBarzin/infra.git
|
||||
cd infra</pre>
|
||||
<p>This is where all the infrastructure configuration lives.</p>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h3>Step 5 — Your first change</h3>
|
||||
<h2>Step 5 — Your first change</h2>
|
||||
<ol>
|
||||
<li>Create a branch: <pre>git checkout -b my-first-change</pre></li>
|
||||
<li>Edit a service file (e.g., change an image tag in <code>stacks/echo/main.tf</code>)</li>
|
||||
<li>Commit and push: <pre>git add . && git commit -m "my first change" && git push -u origin my-first-change</pre></li>
|
||||
<li>Commit and push: <pre>git add . && git commit -m "my first change" && git push -u origin my-first-change</pre></li>
|
||||
<li>Open a Pull Request on GitHub</li>
|
||||
<li>Viktor reviews and merges</li>
|
||||
<li>Woodpecker CI automatically applies the change to the cluster</li>
|
||||
|
|
@ -311,29 +236,19 @@ cd infra</pre>
|
|||
</ol>
|
||||
</section>
|
||||
{/if}
|
||||
</section>
|
||||
</main>
|
||||
|
||||
<style>
|
||||
.content { max-width: 768px; margin: 2rem auto; padding: 0 1rem; font-family: system-ui, -apple-system, sans-serif; line-height: 1.6; }
|
||||
.content h1 { border-bottom: 1px solid #e0e0e0; padding-bottom: 0.5rem; }
|
||||
.content h2 { margin-top: 2rem; color: #333; }
|
||||
.content h3 { color: #444; margin: 1.25rem 0 0.25rem; }
|
||||
.content h4 { color: #666; margin: 0.75rem 0 0.25rem; }
|
||||
.content h3 { color: #666; margin: 1rem 0 0.25rem; }
|
||||
.content pre { background: #1e1e1e; color: #d4d4d4; padding: 1rem; border-radius: 6px; overflow-x: auto; }
|
||||
.content pre.output { background: #f5f5f5; color: #333; }
|
||||
.content code { background: #f0f0f0; padding: 2px 6px; border-radius: 3px; }
|
||||
.content .prereq { font-size: 0.9rem; color: #666; font-style: italic; }
|
||||
section { margin: 2rem 0; }
|
||||
section section { margin: 1.25rem 0; }
|
||||
|
||||
.path { border-left: 4px solid #4fc3f7; padding-left: 1.25rem; scroll-margin-top: 4rem; }
|
||||
.path.c { border-left-color: #bbb; }
|
||||
|
||||
.badge { display: inline-block; font-size: 0.65rem; font-weight: 700; text-transform: uppercase; letter-spacing: 0.5px; padding: 0.15rem 0.5rem; border-radius: 4px; vertical-align: middle; margin-left: 0.4rem; }
|
||||
.badge.rec { background: #d4f8d4; color: #1b5e20; }
|
||||
.badge.none { background: #e3f2fd; color: #0d47a1; }
|
||||
|
||||
.role-tabs { display: flex; gap: 0; margin: 1.5rem 0 0.5rem; border-bottom: 2px solid #e0e0e0; }
|
||||
.role-tabs { display: flex; gap: 0; margin: 1.5rem 0; border-bottom: 2px solid #e0e0e0; }
|
||||
.role-tabs a { padding: 0.5rem 1.5rem; text-decoration: none; color: #666; border-bottom: 2px solid transparent; margin-bottom: -2px; }
|
||||
.role-tabs a.active { color: #333; border-bottom-color: #333; font-weight: 600; }
|
||||
table { border-collapse: collapse; width: 100%; margin: 0.5rem 0; }
|
||||
|
|
@ -343,7 +258,6 @@ cd infra</pre>
|
|||
.callout { padding: 1rem; border-radius: 6px; margin: 1rem 0; }
|
||||
.callout.info { background: #e8f4fd; border-left: 4px solid #2196f3; }
|
||||
.callout.warning { background: #fff3cd; border-left: 4px solid #ffc107; }
|
||||
.callout ul { margin: 0.5rem 0 0; padding-left: 1.25rem; }
|
||||
|
||||
.diagram { background: #fafafa; border: 1px solid #e0e0e0; border-radius: 8px; padding: 1.5rem; margin: 1.5rem 0; }
|
||||
.diagram h3 { margin: 0 0 1rem 0; color: #333; font-size: 0.95rem; text-transform: uppercase; letter-spacing: 0.5px; }
|
||||
|
|
|
|||
|
|
@ -2,19 +2,6 @@
|
|||
<h1>Service Catalog</h1>
|
||||
<p>70+ services running on the cluster. Here are the most commonly used:</p>
|
||||
|
||||
<section>
|
||||
<h2>Cluster Access</h2>
|
||||
<table>
|
||||
<thead><tr><th>Service</th><th>URL</th><th>Description</th></tr></thead>
|
||||
<tbody>
|
||||
<tr><td>Web Terminal</td><td><a href="https://t3.viktorbarzin.me">t3.viktorbarzin.me</a></td><td>Browser shell on the shared workstation — kubectl, Vault & your repos preinstalled (zero setup)</td></tr>
|
||||
<tr><td>Kubernetes Dashboard</td><td><a href="https://k8s.viktorbarzin.me">k8s.viktorbarzin.me</a></td><td>Point-and-click view of your workloads, auto-authenticated (zero setup)</td></tr>
|
||||
<tr><td>Access Portal</td><td><a href="https://k8s-portal.viktorbarzin.me">k8s-portal.viktorbarzin.me</a></td><td>This portal — onboarding, kubeconfig download, setup script</td></tr>
|
||||
<tr><td>Vault</td><td><a href="https://vault.viktorbarzin.me">vault.viktorbarzin.me</a></td><td>Secrets & dynamic credentials — <code>vault login -method=oidc</code></td></tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h2>Core Services</h2>
|
||||
<table>
|
||||
|
|
@ -35,7 +22,7 @@
|
|||
<tbody>
|
||||
<tr><td>Nextcloud</td><td><a href="https://nextcloud.viktorbarzin.me">nextcloud.viktorbarzin.me</a></td><td>File storage, calendar, contacts</td></tr>
|
||||
<tr><td>Immich</td><td><a href="https://immich.viktorbarzin.me">immich.viktorbarzin.me</a></td><td>Photo library (Google Photos alternative)</td></tr>
|
||||
<tr><td>Vaultwarden</td><td><a href="https://vaultwarden.viktorbarzin.me">vaultwarden.viktorbarzin.me</a></td><td>Password manager</td></tr>
|
||||
<tr><td>Vaultwarden</td><td><a href="https://vault.viktorbarzin.me">vault.viktorbarzin.me</a></td><td>Password manager</td></tr>
|
||||
<tr><td>Paperless-ngx</td><td><a href="https://pdf.viktorbarzin.me">pdf.viktorbarzin.me</a></td><td>Document management</td></tr>
|
||||
<tr><td>Navidrome</td><td><a href="https://music.viktorbarzin.me">music.viktorbarzin.me</a></td><td>Music streaming</td></tr>
|
||||
<tr><td>Tandoor</td><td><a href="https://recipes.viktorbarzin.me">recipes.viktorbarzin.me</a></td><td>Recipe manager</td></tr>
|
||||
|
|
|
|||
|
|
@ -11,26 +11,6 @@
|
|||
</ol>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h2>Browser login loops, or kubectl says "Unauthorized"</h2>
|
||||
<p>Command-line SSO (OIDC) login can occasionally be unavailable. You don't have to wait for it — these authenticate a different way and keep working:</p>
|
||||
<ul>
|
||||
<li><a href="https://k8s.viktorbarzin.me">Web dashboard</a> — auto-authenticated, no token to paste</li>
|
||||
<li><a href="https://t3.viktorbarzin.me">Web terminal</a> — its kubectl is already wired up</li>
|
||||
</ul>
|
||||
<p>Let Viktor know so the CLI login path gets fixed.</p>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h2>Don't want to set up a local machine at all?</h2>
|
||||
<p>Skip the VPN and CLI install entirely:</p>
|
||||
<ul>
|
||||
<li><a href="https://t3.viktorbarzin.me">t3.viktorbarzin.me</a> — a browser shell with everything preinstalled</li>
|
||||
<li><a href="https://k8s.viktorbarzin.me">k8s.viktorbarzin.me</a> — a point-and-click dashboard</li>
|
||||
</ul>
|
||||
<p>Both just need your Authentik login. See the <a href="/onboarding">Getting Started</a> guide.</p>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h2>"Forbidden" or "Permission denied"</h2>
|
||||
<p>You may not have access to that namespace. Your access is scoped to specific namespaces.</p>
|
||||
|
|
|
|||
|
|
@ -483,49 +483,31 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
|
|||
exit 0
|
||||
fi
|
||||
|
||||
echo "K8s upgrade available: v$RUNNING -> v$TARGET ($KIND)"
|
||||
slack "K8s upgrade available: v$RUNNING → v$TARGET ($KIND)"
|
||||
|
||||
if [ "$DRY_RUN" = "true" ]; then
|
||||
slack "DRY_RUN — target v$TARGET detected, not spawning preflight Job"
|
||||
slack "DRY_RUN — not spawning preflight Job"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# 7. Spawn Job 0 (preflight) via envsubst on the job-template
|
||||
# Idempotency: deterministic name reconciles via `apply`.
|
||||
JOB_NAME="k8s-upgrade-preflight-$${TARGET//./-}"
|
||||
MASTER_JOB="k8s-upgrade-master-$${TARGET//./-}"
|
||||
ANNOUNCE=yes # Slack the spawn? Suppressed for silent nightly re-evaluations of a standing gate refusal.
|
||||
|
||||
# Idempotency + nightly re-evaluation:
|
||||
# - FAILED preflight (transient gate abort, e.g. a spurious
|
||||
# critical alert / unhealthy node) -> delete + re-spawn, announced.
|
||||
# - COMPLETE preflight but NO master Job spawned -> the compat
|
||||
# gate REFUSED the target (blocked/held now Complete cleanly
|
||||
# rather than Failing). Re-spawn SILENTLY so the gate re-checks
|
||||
# nightly (the refusal may have cleared: addon upgraded / matrix
|
||||
# updated / upstream shipped) WITHOUT nightly Slack noise for a
|
||||
# standing refusal — the morning report (+ K8sUpgradeBlocked for
|
||||
# actionable) is the signal.
|
||||
# - Otherwise (Active, or Complete with the chain advanced) -> skip.
|
||||
# The old "Failed-only re-spawn" left a refused-but-Complete preflight
|
||||
# skipped until its 7d TTL — too slow now that refusals Complete
|
||||
# instead of Failing (2026-06-28). Deterministic names; `apply`
|
||||
# reconciles. (Stuck-pipeline history: a transient critical alert
|
||||
# wedged 1.34.9 for 5 days, 2026-06-17 — hence Failed always re-spawns.)
|
||||
# Retry-on-failure idempotency: skip only if an existing preflight
|
||||
# Job is Active/Complete. A *Failed* preflight (aborted on a
|
||||
# transient gate, e.g. a spurious critical alert) is deleted and
|
||||
# re-spawned — otherwise its deterministic name + 7d TTL wedges
|
||||
# the entire pipeline until it ages out. (Stuck-pipeline fix
|
||||
# 2026-06-17: a transient critical alert wedged 1.34.9 for 5 days.)
|
||||
if /usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" >/dev/null 2>&1; then
|
||||
JOB_FAILED=$(/usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" \
|
||||
-o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' 2>/dev/null || true)
|
||||
JOB_COMPLETE=$(/usr/local/bin/kubectl -n k8s-upgrade get job "$JOB_NAME" \
|
||||
-o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' 2>/dev/null || true)
|
||||
if [ "$JOB_FAILED" = "True" ]; then
|
||||
slack "Preflight Job $JOB_NAME exists but FAILED — deleting and re-spawning"
|
||||
/usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true
|
||||
elif [ "$JOB_COMPLETE" = "True" ] && ! /usr/local/bin/kubectl -n k8s-upgrade get job "$MASTER_JOB" >/dev/null 2>&1; then
|
||||
echo "Preflight $JOB_NAME Complete + no master Job (gate refused) — silent nightly re-evaluate"
|
||||
/usr/local/bin/kubectl -n k8s-upgrade delete job "$JOB_NAME" --wait=true >/dev/null 2>&1 || true
|
||||
ANNOUNCE=no
|
||||
else
|
||||
echo "Preflight Job $JOB_NAME already exists (active / chain advanced) — skipping"
|
||||
slack "Preflight Job $JOB_NAME already exists (active/complete) — skipping"
|
||||
exit 0
|
||||
fi
|
||||
fi
|
||||
|
|
@ -539,9 +521,7 @@ resource "kubernetes_cron_job_v1" "k8s_version_check" {
|
|||
< /template/job-template.yaml \
|
||||
| /usr/local/bin/kubectl apply -f -
|
||||
|
||||
if [ "$ANNOUNCE" = "yes" ]; then
|
||||
slack "Spawned $JOB_NAME (target=v$TARGET kind=$KIND)"
|
||||
fi
|
||||
EOT
|
||||
]
|
||||
env {
|
||||
|
|
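Reviewer note: the three-way re-evaluation described in the shell comment of the `k8s_version_check` hunk (Failed, Complete with no master Job, otherwise skip) can be restated as a small pure function. This is a minimal sketch, not the CronJob's code: `Action`, `preflight_action` and `should_announce` are hypothetical names, and the boolean inputs are assumed to mirror the `kubectl get job` probes in the script.

```python
from enum import Enum


class Action(Enum):
    SPAWN = "spawn"                    # no preflight Job exists yet
    RESPAWN_ANNOUNCED = "respawn"      # Failed preflight: delete, re-spawn, Slack it
    RESPAWN_SILENT = "respawn-silent"  # gate refused: delete, re-spawn, no Slack
    SKIP = "skip"                      # Active, or Complete with the chain advanced


def preflight_action(job_exists, job_failed, job_complete, master_exists):
    """Hypothetical restatement of the shell branches.

    job_failed / job_complete stand for the Job condition status being "True";
    master_exists stands for the master Job of the same target being present.
    """
    if not job_exists:
        return Action.SPAWN
    if job_failed:
        return Action.RESPAWN_ANNOUNCED
    if job_complete and not master_exists:
        return Action.RESPAWN_SILENT
    return Action.SKIP


def should_announce(action):
    # Mirrors ANNOUNCE=yes/no: only the silent re-evaluation suppresses
    # the "Spawned ..." Slack message after the apply.
    return action in (Action.SPAWN, Action.RESPAWN_ANNOUNCED)
```

Keeping the decision separate from the `kubectl` calls makes the branch order easy to check: Failed is tested first, so a failed preflight always re-spawns regardless of whether a master Job exists.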
|
|||
|
|
@ -1,5 +1,5 @@
|
|||
{
|
||||
"_comment": "Addon -> highest k8s minor each addon version supports. The preflight compat-gate (compat-gate.py) reads the RUNNING version of each addon and blocks a k8s upgrade whose target minor exceeds what that running version supports — so the chain auto-halts + alerts instead of breaking on an unsupported addon. Keep current; sources are the addons' own k8s compat matrices (last refreshed 2026-06-19 for the 1.34->1.36 catch-up). max_k8s keys are addon-version floors (major.minor); value is the highest k8s minor that floor supports. An addon entry may also set \"pinned\": true (+ \"pin_reason\") to mark it deliberately held: the gate classifies its block as PINNED/held (quiet — no alert, nightly report only) even if a supporting version exists, for upgrades coupled to other work we're not ready for (e.g. gpu-operator's NVIDIA-driver/Ubuntu coupling). A block with NO supporting version in the matrix is WAITING (also quiet); a block a newer matrix version would clear is ACTIONABLE (alerts).",
|
||||
"_comment": "Addon -> highest k8s minor each addon version supports. The preflight compat-gate (compat-gate.py) reads the RUNNING version of each addon and blocks a k8s upgrade whose target minor exceeds what that running version supports — so the chain auto-halts + alerts instead of breaking on an unsupported addon. Keep current; sources are the addons' own k8s compat matrices (last refreshed 2026-06-19 for the 1.34->1.36 catch-up). max_k8s keys are addon-version floors (major.minor); value is the highest k8s minor that floor supports.",
|
||||
"addons": [
|
||||
{
|
||||
"name": "calico",
|
||||
|
|
@ -48,9 +48,7 @@
|
|||
"max_k8s": {
|
||||
"25.10": "1.35",
|
||||
"26.3": "1.36"
|
||||
},
|
||||
"pinned": true,
|
||||
"pin_reason": "26.3 needs a newer NVIDIA driver image + Ubuntu/kernel; held until the driver/OS path is ready. Unpin = delete pinned + pin_reason."
|
||||
}
|
||||
}
|
||||
],
|
||||
"containerd_min": {
|
||||
|
|
|
|||
|
|
@ -14,20 +14,9 @@ classes of blocker:
|
|||
3. containerd — every node's containerd >= the target's floor, if the matrix
|
||||
declares one (e.g. the 1.7.x -> k8s 1.37 cliff)
|
||||
|
||||
Each reason line is tagged with its class so the caller can act differently:
|
||||
[ACTIONABLE] a newer addon version (present in the matrix) supports the
|
||||
target — upgrading it clears the block. Also covers removed-API
|
||||
/ containerd blocks and the unreadable-version fail-safe.
|
||||
[WAITING] no released addon version supports the target yet — only an
|
||||
upstream release can clear it (e.g. kyverno/ESO behind a new k8s).
|
||||
[PINNED] a supporting version exists but the addon is deliberately held
|
||||
(matrix `pinned: true`, e.g. gpu-operator's driver/OS coupling).
|
||||
|
||||
Exit 0 = safe, proceed.
|
||||
Exit 2 = BLOCKED, actionable — >=1 blocker, none held. Caller pushes
|
||||
k8s_upgrade_blocked=1 (-> K8sUpgradeBlocked alert) and halts.
|
||||
Exit 4 = HELD — >=1 waiting-upstream/pinned blocker (held wins over actionable).
|
||||
Caller pushes k8s_upgrade_held=1 (no alert; nightly report only) and halts.
|
||||
Exit 2 = BLOCKED — prints one human reason per line (caller pushes
|
||||
k8s_upgrade_blocked=1, Slacks the reasons, and halts the chain).
|
||||
Exit 3 = the gate itself errored — caller treats as a block (fail safe).
|
||||
|
||||
Read-only: kubectl get + one Prometheus query. No mutations. PROM is overridable
|
||||
|
|
@ -73,20 +62,6 @@ def running_minor():
|
|||
return min(minors) if minors else None
|
||||
|
||||
|
||||
def _addon_resolution(a, tgt, running_ver):
|
||||
"""For a BLOCKING addon, decide whether a newer matrix version would clear
|
||||
the block. Returns ("actionable", hint) when some version key has
|
||||
max_k8s >= target AND is newer than the running version (upgrading it clears
|
||||
the block); otherwise ("waiting", hint) — nothing released supports the
|
||||
target yet, so only an upstream release can clear it."""
|
||||
sufficient = [floor for floor, mk in a["max_k8s"].items()
|
||||
if minor(mk) and minor(mk) >= tgt and minor(floor) > minor(running_ver)]
|
||||
if sufficient:
|
||||
best = min(sufficient, key=minor) # smallest sufficient upgrade
|
||||
return "actionable", f"upgrade {a['name']} to >= {best}"
|
||||
return "waiting", f"no released {a['name']} version supports k8s {tgt[0]}.{tgt[1]} yet"
|
||||
|
||||
|
||||
def check_addons(matrix, tgt, running):
|
||||
# A target at or below the RUNNING minor (a patch, or a same/lower minor)
|
||||
# crosses into no new k8s minor, so every installed addon is already
|
||||
|
|
@ -102,36 +77,25 @@ def check_addons(matrix, tgt, running):
|
|||
"-o", "jsonpath={.spec.template.spec.containers[*].image}"])
|
||||
m = re.search(a["image_re"], img or "")
|
||||
if not m:
|
||||
# Fail safe: can't read the running version → block; a human must
|
||||
# look (ACTIONABLE), never upgrade blind.
|
||||
reasons.append(f"[ACTIONABLE] addon {a['name']}: could not read running "
|
||||
f"version (img='{img or 'not found'}') — refusing to upgrade blind")
|
||||
# Fail safe: if we can't read the running version, don't upgrade blind.
|
||||
reasons.append(f"addon {a['name']}: could not read running version "
|
||||
f"(img='{img or 'not found'}') — refusing to upgrade blind")
|
||||
continue
|
||||
running_ver = m.group(1) # e.g. "3.26"
|
||||
running = m.group(1) # e.g. "3.26"
|
||||
# max_k8s maps an addon-version floor -> highest supported k8s minor.
|
||||
# Pick the highest floor that is <= the running version.
|
||||
max_k8s = None
|
||||
for floor, mk in sorted(a["max_k8s"].items(), key=lambda kv: minor(kv[0]), reverse=True):
|
||||
if minor(running_ver) >= minor(floor):
|
||||
if minor(running) >= minor(floor):
|
||||
max_k8s = mk
|
||||
break
|
||||
if max_k8s is None:
|
||||
reasons.append(f"[ACTIONABLE] addon {a['name']} v{running_ver}: below the lowest "
|
||||
f"version in the compat matrix — unknown k8s support")
|
||||
reasons.append(f"addon {a['name']} v{running}: below the lowest version "
|
||||
f"in the compat matrix — unknown k8s support")
|
||||
continue
|
||||
if tgt > minor(max_k8s):
|
||||
base = (f"addon {a['name']} v{running_ver} supports k8s <= {max_k8s}; "
|
||||
f"target {tgt[0]}.{tgt[1]} exceeds it")
|
||||
# A deliberately-pinned addon is HELD even if a newer version exists
|
||||
# (e.g. gpu-operator 26.3 supports 1.36 but its driver/OS coupling
|
||||
# means we don't take it yet) — the pin overrides actionable.
|
||||
if a.get("pinned"):
|
||||
why = a.get("pin_reason", "deliberately pinned")
|
||||
reasons.append(f"[PINNED] {base} — pinned ({why}); holding")
|
||||
else:
|
||||
kind, hint = _addon_resolution(a, tgt, running_ver)
|
||||
tag = "ACTIONABLE" if kind == "actionable" else "WAITING"
|
||||
reasons.append(f"[{tag}] {base} — {hint}")
|
||||
reasons.append(f"addon {a['name']} v{running} supports k8s <= {max_k8s}; "
|
||||
f"target {tgt[0]}.{tgt[1]} exceeds it — upgrade {a['name']} first")
|
||||
return reasons
|
||||
|
||||
|
||||
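Reviewer note: the floor lookup and the ACTIONABLE/WAITING/PINNED tagging in `check_addons` can be exercised in isolation. The sketch below is illustrative only. The `minor` helper is not shown in this diff, so its body here is an assumption (parse `major.minor` into an integer tuple); `supported_k8s` and `classify` are hypothetical names, and the sample entry reuses the `25.10`/`26.3` values from the matrix hunk with an assumed `name`.

```python
def minor(version):
    """Assumed helper: parse 'major.minor[.patch]' into (int, int), or None."""
    parts = version.lstrip("v").split(".")
    try:
        return int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        return None


def supported_k8s(max_k8s, running_ver):
    """k8s minor supported by the highest floor <= running_ver, else None."""
    floors = sorted(max_k8s.items(), key=lambda kv: minor(kv[0]), reverse=True)
    for floor, mk in floors:
        if minor(running_ver) >= minor(floor):
            return mk
    return None


def classify(addon, target, running_ver):
    """Tag a blocking addon; assumes the caller already found it blocks."""
    if addon.get("pinned"):
        return "PINNED"  # the pin overrides an available upgrade
    tgt = minor(target)
    newer_ok = [floor for floor, mk in addon["max_k8s"].items()
                if minor(mk) >= tgt and minor(floor) > minor(running_ver)]
    return "ACTIONABLE" if newer_ok else "WAITING"


# Sample entry: version keys from the matrix hunk, name assumed for illustration.
gpu = {"name": "gpu-operator",
       "max_k8s": {"25.10": "1.35", "26.3": "1.36"},
       "pinned": True}
```

Comparing tuples rather than strings matters here: as strings, `"25.10"` would sort below `"25.9"`, which would select the wrong floor.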
|
|
@ -145,11 +109,11 @@ def check_removed_apis(tgt):
|
|||
rr = lbl.get("removed_release", "")
|
||||
if rr and minor(rr) and tgt >= minor(rr):
|
||||
g = lbl.get("group") or "core"
|
||||
reasons.append(f"[ACTIONABLE] deprecated API {g}/{lbl.get('version')} "
|
||||
reasons.append(f"deprecated API {g}/{lbl.get('version')} "
|
||||
f"{lbl.get('resource')} is in use and is removed in "
|
||||
f"k8s {rr} (target {tgt[0]}.{tgt[1]}) — migrate callers first")
|
||||
except Exception as e:
|
||||
reasons.append(f"[ACTIONABLE] removed-API check could not query Prometheus ({e}) — "
|
||||
reasons.append(f"removed-API check could not query Prometheus ({e}) — "
|
||||
f"refusing to upgrade blind")
|
||||
return reasons
|
||||
|
||||
|
|
@ -168,28 +132,11 @@ def check_containerd(matrix, tgt):
|
|||
name, _, ver = line.partition(" ")
|
||||
cv = ver.replace("containerd://", "")
|
||||
if minor(cv) and minor(cv) < minor(floor):
|
||||
reasons.append(f"[ACTIONABLE] node {name} containerd {cv} < required {floor} "
|
||||
reasons.append(f"node {name} containerd {cv} < required {floor} "
|
||||
f"for k8s {tgt[0]}.{tgt[1]} — bump containerd first")
|
||||
return reasons
|
||||
|
||||
|
||||
def held_reason(r):
|
||||
"""True for a blocker the cluster cannot act on now: no released version
|
||||
supports the target (WAITING) or the addon is deliberately pinned (PINNED).
|
||||
These are quiet (no alert) — only an upstream release / a manual unpin clears
|
||||
them, so a nightly 'needs attention' alert would be crying wolf."""
|
||||
return r.startswith("[WAITING]") or r.startswith("[PINNED]")
|
||||
|
||||
|
||||
def exit_code(reasons):
|
||||
"""Map reasons to the gate verdict: 0 safe · 2 actionable block · 4 held.
|
||||
Held WINS over actionable on a mix — if anything is waiting/pinned the target
|
||||
can't proceed yet, so acting on the actionable blockers would be premature."""
|
||||
if not reasons:
|
||||
return 0
|
||||
return 4 if any(held_reason(r) for r in reasons) else 2
|
||||
|
||||
|
||||
def main():
|
||||
if len(sys.argv) < 2:
|
||||
print("usage: compat-gate.py <target-k8s-version> (matrix JSON on stdin)")
|
||||
|
|
@ -211,9 +158,9 @@ def main():
|
|||
if reasons:
|
||||
for r in reasons:
|
||||
print(r)
|
||||
else:
|
||||
sys.exit(2)
|
||||
print(f"compat-gate OK: cluster is safe to upgrade to {sys.argv[1]}")
|
||||
sys.exit(exit_code(reasons))
|
||||
sys.exit(0)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
|
|
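The hunks above call `_addon_resolution(a, tgt, running_ver)` without showing its body. The following is only a sketch of how such a classifier could work, assuming the `max_k8s` map shape used in the test matrices (addon version -> highest supported k8s minor). The names `parse_minor` and `resolve_addon` are hypothetical and are not the functions in `compat-gate.py`; the sketch also assumes the caller has already established that the running addon's ceiling is below the target.

```python
def parse_minor(version):
    """'1.36', 'v1.36.2' or '3.32' -> (major, minor) tuple; None if unparseable."""
    parts = version.lstrip("v").split(".")
    try:
        return int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        return None


def resolve_addon(addon, target):
    """Classify an addon block as 'pinned', 'actionable' or 'waiting'.

    Hypothetical helper: `addon` is one matrix entry, `target` a (major, minor)
    k8s tuple. Returns (kind, hint).
    """
    name = addon["name"]
    # A deliberate pin overrides everything else, even if a newer version exists.
    if addon.get("pinned"):
        why = addon.get("pin_reason", "deliberately pinned")
        return "pinned", f"pinned ({why}); holding"

    # Collect every addon version whose k8s ceiling covers the target.
    supporting = []
    for ver, ceiling in addon.get("max_k8s", {}).items():
        v, c = parse_minor(ver), parse_minor(ceiling)
        if v and c and c >= target:
            supporting.append(v)

    if supporting:
        lowest = min(supporting)  # smallest upgrade that clears the block
        return "actionable", f"upgrade {name} to >= {lowest[0]}.{lowest[1]}"
    return "waiting", f"no released {name} version supports k8s {target[0]}.{target[1]} yet"
```

With a calico entry of `{"3.30": "1.35", "3.32": "1.36"}` and target `(1, 36)` this returns an actionable hint naming 3.32, while the kyverno matrix from the tests (ceiling 1.35) returns a waiting verdict.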
|||
|
|
@ -69,29 +69,6 @@ def fmt_age(seconds):
|
|||
return f"{seconds / 86400:.1f}d ago"
|
||||
|
||||
|
||||
def _render_reasons(blocker_reasons):
|
||||
"""Group compat-gate reason lines by their [ACTIONABLE]/[WAITING]/[PINNED]
|
||||
tag into labelled sections, stripping the tag from each bullet. Untagged
|
||||
lines (older reason format) fall back to a generic 'Blockers' list. PURE.
|
||||
Returns a list of message lines."""
|
||||
lines = [r.strip() for r in (blocker_reasons or "").splitlines() if r.strip()]
|
||||
out, shown = [], set()
|
||||
for title, tag in (("Action needed", "[ACTIONABLE]"),
|
||||
("Waiting on upstream", "[WAITING]"),
|
||||
("Pinned (held by us)", "[PINNED]")):
|
||||
sub = [l for l in lines if l.startswith(tag)]
|
||||
if sub:
|
||||
out.append(f"{title}:")
|
||||
for l in sub:
|
||||
shown.add(l)
|
||||
out.append(f" • {l[len(tag):].strip()}")
|
||||
rest = [l for l in lines if l not in shown]
|
||||
if rest:
|
||||
out.append("Blockers:")
|
||||
out.extend(f" • {l}" for l in rest)
|
||||
return out
|
||||
|
||||
|
||||
def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
|
||||
"""Build the Slack message text from gathered facts. PURE.
|
||||
|
||||
|
|
@ -121,7 +98,6 @@ def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
|
|||
|
||||
avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
|
||||
blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
|
||||
held = any(val == 1 for _, val in select(metrics, "k8s_upgrade_held"))
|
||||
|
||||
if avail:
|
||||
lbl = avail[0][0]
|
||||
|
|
@ -129,12 +105,7 @@ def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
|
|||
kind = lbl.get("kind", "?")
|
||||
tgt_line = f"Detected target: *{target}* ({kind})"
|
||||
if blocked:
|
||||
# actionable block — an addon upgrade would clear it (K8sUpgradeBlocked fired)
|
||||
headline = f"🔴 BLOCKED (action needed) — {target}"
|
||||
elif held:
|
||||
# waiting on upstream and/or a pinned addon — nothing to do but wait;
|
||||
# intentionally NO alert, this nightly line is the only signal
|
||||
headline = f"⏸️ HELD — {target} not yet upgradable"
|
||||
headline = f"🔴 BLOCKED — compat gate refused {target}"
|
||||
elif len(versions) == 1 and target == versions[0]:
|
||||
headline = f"🟢 UPGRADED — all nodes now on {target}"
|
||||
else:
|
||||
|
|
@ -149,8 +120,12 @@ def compose_report(now_ts, nodes, metrics, blocker_reasons, jobs):
|
|||
|
||||
msg = [f"*[k8s-upgrade nightly]* {headline}", node_line, run_line, tgt_line]
|
||||
|
||||
if (blocked or held) and blocker_reasons:
|
||||
msg.extend(_render_reasons(blocker_reasons))
|
||||
if blocked and blocker_reasons:
|
||||
msg.append("Blockers (live):")
|
||||
for r in blocker_reasons.splitlines():
|
||||
r = r.strip()
|
||||
if r:
|
||||
msg.append(f" • {r}")
|
||||
|
||||
if jobs:
|
||||
msg.append("Chain jobs (recent):")
|
||||
|
|
@ -238,8 +213,7 @@ def main():
|
|||
|
||||
avail = [(lbl, val) for lbl, val in select(metrics, "k8s_upgrade_available") if val == 1]
|
||||
blocked = any(val == 1 for _, val in select(metrics, "k8s_upgrade_blocked"))
|
||||
held = any(val == 1 for _, val in select(metrics, "k8s_upgrade_held"))
|
||||
reasons = get_blocker_reasons(avail[0][0].get("target", "")) if (avail and (blocked or held)) else None
|
||||
reasons = get_blocker_reasons(avail[0][0].get("target", "")) if (avail and blocked) else None
|
||||
|
||||
msg = compose_report(now_ts, nodes, metrics, reasons, jobs)
|
||||
post_slack(msg)
|
||||
|
|
|
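The report code above reads the `k8s_upgrade_blocked` and `k8s_upgrade_held` gauges through `select(metrics, ...)`, whose implementation is not part of this diff. As a rough illustration only, the sketch below parses Prometheus text-format samples and applies the same precedence as the headline logic (blocked is checked before held). `parse_gauges` and `gate_state` are hypothetical names, and the parser deliberately ignores escaped quotes inside label values.

```python
import re

# name{label="value",...} value   (labels optional)
LINE_RE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_gauges(text):
    """Return a list of (metric_name, labels_dict, float_value) samples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and # TYPE / # HELP
            continue
        m = LINE_RE.match(line)
        if not m:
            continue
        name, labels, value = m.groups()
        samples.append((name, dict(LABEL_RE.findall(labels or "")), float(value)))
    return samples


def gate_state(samples):
    """'blocked' if the blocked gauge is 1, else 'held' if held is 1, else 'clear'."""
    def is_set(metric):
        return any(v == 1 for n, _, v in samples if n == metric)

    if is_set("k8s_upgrade_blocked"):
        return "blocked"
    if is_set("k8s_upgrade_held"):
        return "held"
    return "clear"
```

Fed the `METRICS_HELD` fixture from the tests (held=1, blocked=0), `gate_state` would report `held`, which corresponds to the quiet "HELD" headline rather than the red "BLOCKED" one.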
|||
|
|
@ -95,121 +95,3 @@ def test_running_minor_from_kubectl(monkeypatch):
|
|||
# oldest kubelet wins (mirrors the detector): node2 on 1.33 is the floor.
|
||||
monkeypatch.setattr(cg, "kget", lambda args: "v1.34.9\nv1.33.5\nv1.34.9")
|
||||
assert cg.running_minor() == (1, 33)
|
||||
|
||||
|
||||
# --- block classification: actionable / waiting-upstream / pinned ----------
|
||||
# A block is ACTIONABLE if a newer addon version in the matrix supports the
|
||||
# target (we can upgrade to clear it), WAITING if no released version supports
|
||||
# the target yet (only upstream can clear it), or PINNED if a version exists but
|
||||
# we deliberately hold the addon. Held (waiting|pinned) is quiet; actionable
|
||||
# alerts.
|
||||
KYVERNO_MATRIX = {
|
||||
"addons": [{
|
||||
"name": "kyverno",
|
||||
"namespace": "kyverno",
|
||||
"kind": "deployment",
|
||||
"resource": "kyverno-admission-controller",
|
||||
"image_re": r"kyverno:v(\d+\.\d+)",
|
||||
"max_k8s": {"1.16": "1.34", "1.18": "1.35"},
|
||||
}]
|
||||
}
|
||||
GPU_MATRIX = {
|
||||
"addons": [{
|
||||
"name": "gpu-operator",
|
||||
"namespace": "nvidia",
|
||||
"kind": "deployment",
|
||||
"resource": "gpu-operator",
|
||||
"image_re": r"gpu-operator:v(\d+\.\d+)",
|
||||
"max_k8s": {"25.10": "1.35", "26.3": "1.36"},
|
||||
"pinned": True,
|
||||
"pin_reason": "needs newer NVIDIA driver + Ubuntu release",
|
||||
}]
|
||||
}
|
||||
|
||||
|
||||
def test_actionable_when_higher_version_supports_target(monkeypatch):
|
||||
# calico 3.30 (ceiling 1.35), target 1.36, matrix has 3.32 -> 1.36:
|
||||
# upgrading calico WOULD clear it -> ACTIONABLE, with a remediation hint.
|
||||
_img(monkeypatch, "quay.io/calico/node:v3.30.7")
|
||||
reasons = cg.check_addons(CALICO_MATRIX, (1, 36), (1, 35))
|
||||
assert len(reasons) == 1, reasons
|
||||
assert reasons[0].startswith("[ACTIONABLE]"), reasons
|
||||
assert "3.32" in reasons[0] and "calico" in reasons[0]
|
||||
|
||||
|
||||
def test_waiting_when_no_version_supports_target(monkeypatch):
|
||||
# kyverno 1.18 is the matrix ceiling (k8s 1.35); target 1.36 has NO
|
||||
# supporting version -> WAITING on upstream (nothing to upgrade to).
|
||||
_img(monkeypatch, "kyverno/kyverno:v1.18.1")
|
||||
reasons = cg.check_addons(KYVERNO_MATRIX, (1, 36), (1, 35))
|
||||
assert len(reasons) == 1, reasons
|
||||
assert reasons[0].startswith("[WAITING]"), reasons
|
||||
assert "kyverno" in reasons[0]
|
||||
|
||||
|
||||
def test_pinned_addon_is_held_not_actionable(monkeypatch):
|
||||
# gpu-operator 25.10, target 1.36; 26.3 supports 1.36 BUT the entry is
|
||||
# pinned -> classified PINNED (held), never ACTIONABLE.
|
||||
_img(monkeypatch, "nvcr.io/nvidia/gpu-operator:v25.10.0")
|
||||
reasons = cg.check_addons(GPU_MATRIX, (1, 36), (1, 35))
|
||||
assert len(reasons) == 1, reasons
|
||||
assert reasons[0].startswith("[PINNED]"), reasons
|
||||
assert "gpu-operator" in reasons[0]
|
||||
|
||||
|
||||
def test_unreadable_addon_tagged_actionable(monkeypatch):
|
||||
# fail-safe block on an unreadable image is ACTIONABLE (a human must look).
|
||||
_img(monkeypatch, "")
|
||||
reasons = cg.check_addons(ESO_MATRIX, (1, 35), (1, 34))
|
||||
assert reasons and reasons[0].startswith("[ACTIONABLE]"), reasons
|
||||
|
||||
|
||||
def test_existing_reasons_are_tagged(monkeypatch):
|
||||
# the legacy "ceiling below target, newer version exists" case is ACTIONABLE.
|
||||
_img(monkeypatch, "external-secrets/external-secrets:v0.12.1")
|
||||
reasons = cg.check_addons(ESO_MATRIX, (1, 35), (1, 34))
|
||||
assert reasons[0].startswith("[ACTIONABLE]"), reasons
|
||||
|
||||
|
||||
def test_held_reason_classifier():
|
||||
assert cg.held_reason("[WAITING] x")
|
||||
assert cg.held_reason("[PINNED] x")
|
||||
assert not cg.held_reason("[ACTIONABLE] x")
|
||||
assert not cg.held_reason("untagged")
|
||||
|
||||
|
||||
def test_exit_code_mapping():
|
||||
assert cg.exit_code([]) == 0
|
||||
assert cg.exit_code(["[ACTIONABLE] x"]) == 2
|
||||
assert cg.exit_code(["[WAITING] x"]) == 4
|
||||
assert cg.exit_code(["[PINNED] x"]) == 4
|
||||
# held wins on a mix: an upstream/pinned wait can't be cleared by acting now
|
||||
assert cg.exit_code(["[ACTIONABLE] x", "[WAITING] y"]) == 4
|
||||
|
||||
|
||||
def test_real_matrix_136_is_held(monkeypatch):
|
||||
"""Regression guard on the SHIPPED addon-compat.json: at today's running
|
||||
versions a 1.36 jump must be HELD (exit 4) — calico ACTIONABLE (3.32 in the
|
||||
matrix), ESO+kyverno WAITING (no 1.36 release), gpu-operator PINNED. Catches
|
||||
a matrix edit that silently turns the quiet held state into a nightly alert."""
|
||||
import json as _json
|
||||
matrix = _json.loads((HERE / "addon-compat.json").read_text())
|
||||
running_imgs = {
|
||||
"calico-system": "quay.io/calico/node:v3.30.7",
|
||||
"external-secrets": "ghcr.io/external-secrets/external-secrets:v2.6.0",
|
||||
"kyverno": "ghcr.io/kyverno/kyverno:v1.18.1",
|
||||
"nvidia": "nvcr.io/nvidia/gpu-operator:v25.10.0",
|
||||
}
|
||||
|
||||
def fake_kget(args):
|
||||
ns = args[args.index("-n") + 1] if "-n" in args else ""
|
||||
return running_imgs.get(ns, "")
|
||||
|
||||
monkeypatch.setattr(cg, "kget", fake_kget)
|
||||
reasons = cg.check_addons(matrix, (1, 36), (1, 35))
|
||||
pick = lambda name: next(r for r in reasons if name in r)
|
||||
assert pick("calico").startswith("[ACTIONABLE]"), reasons
|
||||
assert pick("external-secrets").startswith("[WAITING]"), reasons
|
||||
assert pick("kyverno").startswith("[WAITING]"), reasons
|
||||
assert pick("gpu-operator").startswith("[PINNED]"), reasons
|
||||
assert cg.exit_code(reasons) == 4 # held wins
|
||||
|
|
|
|||
|
|
@ -79,41 +79,3 @@ def test_compose_includes_recent_jobs():
|
|||
jobs = [{"name": "k8s-upgrade-preflight-1-35-6", "status": "Failed", "age_s": 3600}]
|
||||
out = nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, "x", jobs)
|
||||
assert "k8s-upgrade-preflight-1-35-6: Failed" in out
|
||||
|
||||
|
||||
# --- held (waiting-upstream / pinned) vs actionable-blocked rendering -------
|
||||
METRICS_HELD = f"""# TYPE k8s_upgrade_available gauge
|
||||
k8s_upgrade_available{{instance="",job="k8s-version-check",kind="minor",running="1.35.6",target="1.36.2"}} 1
|
||||
k8s_upgrade_held{{instance="",job="k8s-version-upgrade"}} 1
|
||||
k8s_upgrade_blocked{{instance="",job="k8s-version-upgrade"}} 0
|
||||
k8s_version_check_last_run_timestamp{{instance="",job="k8s-version-check"}} {LAST_RUN}
|
||||
"""
|
||||
NODES_135 = [(f"k8s-node{i}", "v1.35.6") for i in range(7)]
|
||||
|
||||
|
||||
def test_compose_held_headline_and_grouped_reasons():
|
||||
m = nr.parse_metrics(METRICS_HELD)
|
||||
reasons = (
|
||||
"[WAITING] addon kyverno v1.18 supports k8s <= 1.35; target 1.36 exceeds it — no released kyverno version supports k8s 1.36 yet\n"
|
||||
"[PINNED] addon gpu-operator v25.10 supports k8s <= 1.35; target 1.36 exceeds it — pinned (driver/OS); holding\n"
|
||||
"[ACTIONABLE] addon calico v3.30 supports k8s <= 1.35; target 1.36 exceeds it — upgrade calico to >= 3.32"
|
||||
)
|
||||
out = nr.compose_report(LAST_RUN + 30000, NODES_135, m, reasons, [])
|
||||
# held headline, NOT a red actionable block
|
||||
assert "⏸️ HELD" in out and "1.36.2" in out
|
||||
assert "🔴 BLOCKED" not in out
|
||||
# grouped by class
|
||||
assert "Waiting on upstream" in out and "kyverno" in out
|
||||
assert "Pinned" in out and "gpu-operator" in out
|
||||
# the lone actionable piece is still listed so eventual scope is visible
|
||||
assert "calico" in out
|
||||
# tags are stripped from the rendered bullets (no raw "[WAITING]")
|
||||
assert "[WAITING]" not in out
|
||||
|
||||
|
||||
def test_compose_blocked_groups_actionable():
|
||||
m = nr.parse_metrics(METRICS_BLOCKED) # blocked=1
|
||||
reasons = "[ACTIONABLE] addon calico v3.30 supports k8s <= 1.35; target 1.36 exceeds it — upgrade calico to >= 3.32"
|
||||
out = nr.compose_report(LAST_RUN + 30000, NODES_UNIFORM, m, reasons, [])
|
||||
assert "🔴 BLOCKED" in out
|
||||
assert "Action needed" in out and "calico" in out
|
||||
|
|
|
|||
|
|
@ -37,12 +37,6 @@ KUBECTL=kubectl
|
|||
JOB_TEMPLATE=/template/job-template.yaml
|
||||
UPDATE_K8S_SH=/scripts/update_k8s.sh
|
||||
|
||||
# Set to 1 by record_blocked/record_held when the compat-gate refuses the
|
||||
# target. spawn_next() then declines to advance the chain — but the Job still
|
||||
# exits 0, because a gate refusal is a DECISION, not a failure (no Failed Job,
|
||||
# no K8sUpgradeChainJobFailed). Signalling is via the gauges those recorders push.
|
||||
HALT_CHAIN=0
|
||||
|
||||
# SSH targets are node InternalIPs, resolved live from `kubectl get nodes` (see
|
||||
# ssh_target() below) — the pipeline has NO dependency on node DNS records
|
||||
# (`k8s-node<N>.viktorbarzin.lan`). This is what lets a freshly-joined node be
|
||||
|
|
@ -94,31 +88,17 @@ push() {
|
|||
| curl -sS --data-binary @- "$PG" || echo "warn: pushgateway push failed"
|
||||
}
|
||||
|
||||
# Compat-gate verdict recorders. A gate refusal is a DECISION, not a crash: the
|
||||
# Job Completes cleanly and the chain simply doesn't advance (spawn_next checks
|
||||
# HALT_CHAIN). The two outcomes differ only in how they're signalled:
|
||||
# - record_blocked: ACTIONABLE — a newer addon version would clear it.
|
||||
# k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (fires once via
|
||||
# alert-on-change). "upgrade when we can, alert when we can't."
|
||||
# - record_held: WAITING-ON-UPSTREAM or PINNED — nothing to do but wait.
|
||||
# k8s_upgrade_held=1 -> NO alert; the nightly report's ⏸️ line is the
|
||||
# only signal. This is what stops the nightly cry-wolf for unactionable
|
||||
# blocks (kyverno/ESO behind upstream, gpu-operator pinned).
|
||||
# Neither Slacks per-run: the reasons are in the nightly report (it re-runs
|
||||
# compat-gate), and per-run Slack was itself a nightly-noise source.
|
||||
record_blocked() {
|
||||
# Auto-upgrade safety: a preflight compat-gate refusal is a BLOCK, not a crash —
|
||||
# the cluster simply isn't ready for this target yet (an addon / in-use API /
|
||||
# containerd is too old). Record it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
|
||||
# alert), Slack the reasons, and halt so a human clears the blocker (or a later
|
||||
# run proceeds once it's cleared). This is the "upgrade when we can, alert when
|
||||
# we can't" contract.
|
||||
block() {
|
||||
push k8s_upgrade_blocked 1
|
||||
push k8s_upgrade_held 0
|
||||
HALT_CHAIN=1
|
||||
echo "BLOCKED (action needed) preflight v$TARGET_VERSION:" >&2
|
||||
printf '%s\n' "$1" >&2
|
||||
}
|
||||
record_held() {
|
||||
push k8s_upgrade_held 1
|
||||
push k8s_upgrade_blocked 0
|
||||
HALT_CHAIN=1
|
||||
echo "HELD (not yet upgradable — waiting upstream / pinned) preflight v$TARGET_VERSION:" >&2
|
||||
printf '%s\n' "$1" >&2
|
||||
slack "BLOCKED preflight (target v$TARGET_VERSION) — auto-upgrade halted, needs attention:\n$1"
|
||||
echo "BLOCKED: $1" >&2
|
||||
exit 1
|
||||
}
|
||||
|
||||
halt_on_alert_query() {
|
||||
|
|
@ -276,10 +256,6 @@ case "$PHASE" in
|
|||
esac
|
||||
|
||||
spawn_next() {
|
||||
if [ "${HALT_CHAIN:-0}" = "1" ]; then
|
||||
echo "Chain halted by compat-gate (blocked/held) — not spawning next phase."
|
||||
return 0
|
||||
fi
|
||||
[ -z "$NEXT_PHASE" ] && { echo "End of chain."; return 0; }
|
||||
|
||||
local job_name="k8s-upgrade-${NEXT_PHASE}-${TARGET_VERSION//./-}"
|
||||
|
|
@ -339,37 +315,15 @@ phase_preflight() {
|
|||
# 0. Auto-upgrade compat gate (compat-gate.py): refuse the upgrade if a critical
|
||||
# addon, an in-use deprecated API, or a node's containerd is too old for the
|
||||
# target. Runs FIRST — before any mutation (etcd snapshot, drains) — so a
|
||||
# refusal is cheap. The gate CLASSIFIES the refusal (exit code):
|
||||
# 0 safe -> proceed
|
||||
# 2 actionable -> record_blocked (a newer addon version would clear it)
|
||||
# 4 held -> record_held (waiting on upstream / a pinned addon)
|
||||
# 3/other err -> fail-safe: treat as actionable block
|
||||
# blocked/held push the gauge DEFINITIVELY (one value per run — no pre-reset
|
||||
# flap that would re-notify the alert nightly) and set HALT_CHAIN so the Job
|
||||
# Completes cleanly without advancing the chain. This is what makes
|
||||
# unattended minor upgrades safe AND quiet: proceed when supported, alert
|
||||
# only when there's something to do, hold silently when there isn't.
|
||||
# block is cheap. Reset the blocked gauge for this run; block() sets it to 1
|
||||
# only on a refusal. This is what makes unattended minor upgrades safe: the
|
||||
# chain proceeds when the cluster supports the target and halts+alerts when
|
||||
# it doesn't (e.g. Calico/ESO/kyverno behind, or a removed API still in use).
|
||||
push k8s_upgrade_blocked 0
|
||||
local gate_out gate_rc=0
|
||||
gate_out=$(python3 /scripts/compat-gate.py "$TARGET_VERSION" < /scripts/addon-compat.json 2>&1) || gate_rc=$?
|
||||
case "$gate_rc" in
|
||||
0)
|
||||
push k8s_upgrade_blocked 0
|
||||
push k8s_upgrade_held 0
|
||||
if [ "$gate_rc" -ne 0 ]; then block "$gate_out"; fi
|
||||
echo "compat-gate passed for v$TARGET_VERSION"
|
||||
;;
|
||||
4)
|
||||
record_held "$gate_out"
|
||||
return 0
|
||||
;;
|
||||
2)
|
||||
record_blocked "$gate_out"
|
||||
return 0
|
||||
;;
|
||||
*)
|
||||
record_blocked "gate ERROR (rc=$gate_rc) — failing safe as an actionable block:"$'\n'"$gate_out"
|
||||
return 0
|
||||
;;
|
||||
esac
|
||||
|
||||
# 1. All nodes Ready + no pressure
|
||||
local bad_nodes
|
||||
|
|
@ -462,39 +416,6 @@ phase_preflight() {
|
|||
fi
|
||||
fi
|
||||
|
||||
# 4b. apiserver-OIDC drift check (backstop for the rbac stack's kubeadm-config
|
||||
# reconciliation). A `kubeadm upgrade` REGENERATES the apiserver manifest from
|
||||
# kubeadm-config; if kubeadm-config still carries the legacy single-issuer
|
||||
# --oidc-* args instead of --authentication-config, the regenerated apiserver
|
||||
# loses structured multi-issuer auth → kubectl + dashboard SSO break AFTER the
|
||||
# upgrade. This is RECOVERABLE (the apiserver does NOT crash — verified by an
|
||||
# isolated repro 2026-06-24; the chain's post-master restore.sh re-adds the flag,
|
||||
# and the rbac stack reconciles kubeadm-config so it won't recur) — so this is an
|
||||
# ALERT, not a block. (NB the 2026-06-24 stall was NOT this — it was etcd IO
|
||||
# starvation; see docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md.)
|
||||
# Skip on an at-target master (resume — no apiserver regen).
|
||||
if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
|
||||
local apiserver_diff
|
||||
apiserver_diff=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" "sudo kubeadm upgrade diff v$TARGET_VERSION 2>/dev/null" || true)
|
||||
if echo "$apiserver_diff" | grep -qE '^-[[:space:]].*--authentication-config'; then
|
||||
slack "WARN preflight — kubeadm upgrade will DROP --authentication-config (kubeadm-config OIDC drift). SSO breaks post-upgrade until restore.sh re-adds it; re-apply the rbac stack to reconcile kubeadm-config. Proceeding (recoverable, not a crash)."
|
||||
fi
|
||||
fi
|
||||
|
||||
# 4c. Reclaim kubeadm scratch on master. `kubeadm upgrade apply` dumps a full
|
||||
# ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
|
||||
# every etcd upgrade and NEVER cleans it up — 145 dirs / 28GB had accumulated by
|
||||
# 2026-06-24, pushing master root fs to 73% (image-GC churn + extra write IO on
|
||||
# the shared HDD where etcd lives — a contributor to the etcd IO starvation that
|
||||
# stalled that run, see post-mortem). Real etcd backups go to NFS, so these are
|
||||
# throwaway. Prune ones >3 days old (keeps a short rollback window). Best-effort;
|
||||
# never aborts the chain.
|
||||
if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
|
||||
ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" \
|
||||
"sudo find /etc/kubernetes/tmp -maxdepth 1 -type d \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) -mtime +3 -exec rm -rf {} + 2>/dev/null; echo -n 'master root after prune: '; df -h / | awk 'NR==2{print \$5\" used, \"\$4\" free\"}'" \
|
||||
|| echo "kubeadm-scratch prune skipped (ssh/df failed) — non-fatal"
|
||||
fi
|
||||
|
||||
# 5. Push in-flight + started_timestamp metrics + ns annotations
|
||||
$KUBECTL annotate ns "$NS" \
|
||||
"viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \
|
||||
|
|
@ -823,8 +744,6 @@ phase_postflight() {
|
|||
push k8s_upgrade_in_flight 0
|
||||
push k8s_upgrade_snapshot_taken 0
|
||||
push k8s_upgrade_started_timestamp 0
|
||||
push k8s_upgrade_blocked 0
|
||||
push k8s_upgrade_held 0
|
||||
|
||||
slack ":white_check_mark: K8s upgrade complete: cluster on v$TARGET_VERSION (pod-ready ratio $ratio)"
|
||||
}
|
||||
|
|
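The preflight `case "$gate_rc"` block above dispatches on the gate's exit code (0 safe, 2 actionable, 4 held, anything else treated as an actionable block). The same decision table can be written as a small pure function, which may be easier to unit-test than the shell. This is a sketch in Python rather than the script's own shell, and `dispatch_gate` is a hypothetical name; the gauge names and values are taken from the `record_blocked` / `record_held` recorders shown in the diff.

```python
def dispatch_gate(rc):
    """Map a compat-gate exit code to (verdict, gauges_to_push, halt_chain).

    Hypothetical helper mirroring the shell case statement:
      0     -> proceed, both gauges cleared, chain continues
      4     -> held (waiting on upstream / pinned), chain halts, no alert gauge
      2     -> blocked (actionable), chain halts
      other -> gate error, failed safe as an actionable block
    """
    if rc == 0:
        return "proceed", {"k8s_upgrade_blocked": 0, "k8s_upgrade_held": 0}, False
    if rc == 4:
        return "held", {"k8s_upgrade_held": 1, "k8s_upgrade_blocked": 0}, True
    # rc == 2 and every unexpected code end up here.
    return "blocked", {"k8s_upgrade_blocked": 1, "k8s_upgrade_held": 0}, True


if __name__ == "__main__":
    for code in (0, 2, 3, 4):
        print(code, dispatch_gate(code))
```

Each verdict writes a definite value to both gauges in a single step, matching the comment above about avoiding a pre-reset that would make the alert re-notify every night.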
|
|||
Some files were not shown because too many files have changed in this diff