Compare commits: wizard/gol...master (30 commits)
| Author | SHA1 | Date | |
|---|---|---|---|
| | 041aedc486 | | |
| | 7988a690ed | | |
| | 6415f77fed | | |
| | b371ae6eee | | |
| | 51dc5d031c | | |
| | 82a7b2585b | | |
| | 006f97ef58 | | |
| | 7b4a8ba867 | | |
| | 19d0f0933a | | |
| | abb15cd49d | | |
| | fc83595f5e | | |
| | fd33d1a447 | | |
| | 196d0db4bd | | |
| | 5d33327c30 | | |
| | 1bca799bb4 | | |
| | d105713ae7 | | |
| | 6f1951af93 | | |
| | 8121d8a4ac | | |
| | ebc8b6588f | | |
| | 6c5288998f | | |
| | 306cdd4cb3 | | |
| | 9c68d147e0 | | |
| | 60a1cb9a25 | | |
| | c6bba1da6e | | |
| | b858561bd0 | | |
| | a7704f46a6 | | |
| | aa510e3600 | | |
| | 53834deb24 | | |
| | 8dd9a3978d | | |
| | 65b2df1222 | | |
111 changed files with 6091 additions and 4140 deletions
|
|
@ -16,6 +16,7 @@
|
||||||
**ALL infrastructure changes MUST go through Terraform/Terragrunt.** Never use `kubectl apply/edit/patch/set`, `helm install/upgrade`, or any manual cluster mutation as the final state.
|
**ALL infrastructure changes MUST go through Terraform/Terragrunt.** Never use `kubectl apply/edit/patch/set`, `helm install/upgrade`, or any manual cluster mutation as the final state.
|
||||||
|
|
||||||
- **No exceptions for "quick fixes"** — even one-line changes must be in `.tf` files and applied via `scripts/tg apply`
|
- **No exceptions for "quick fixes"** — even one-line changes must be in `.tf` files and applied via `scripts/tg apply`
|
||||||
|
- **Apply locally OR let CI do it — but ALWAYS commit.** You don't have to wait for CI: with apply access you MAY run the apply yourself (`scripts/tg apply <stack>` / `homelab tf apply <stack>`), but **from the main checkout, never a worktree** (git-crypt'd `*.tfvars` come through as ciphertext under the worktree filter-bypass, so a worktree apply reads garbage). **Every applied change MUST be committed and pushed to `master` the same session** — the repo is the source of truth, so applied-but-uncommitted HCL is drift that the next CI apply / daily drift-detection will try to revert. Order either way: apply locally then commit + push (CI's changed-stack apply then no-ops), or commit + push and let CI apply. Never apply an uncommitted edit; never leave a committed change unapplied.
|
||||||
- **kubectl is for read-only operations and temporary debugging only** (get, describe, logs, exec, port-forward)
|
- **kubectl is for read-only operations and temporary debugging only** (get, describe, logs, exec, port-forward)
|
||||||
- **If a resource isn't in Terraform yet**, evaluate whether it can be added before making manual changes. If manual change is unavoidable (e.g., emergency), document it immediately and create the Terraform resource in the same session
|
- **If a resource isn't in Terraform yet**, evaluate whether it can be added before making manual changes. If manual change is unavoidable (e.g., emergency), document it immediately and create the Terraform resource in the same session
|
||||||
- **kubectl scale/patch during migrations is acceptable** as a transient step, but the final state must be in Terraform and applied via `scripts/tg apply`
|
- **kubectl scale/patch during migrations is acceptable** as a transient step, but the final state must be in Terraform and applied via `scripts/tg apply`
|
||||||
|
|
@ -233,7 +234,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
|
||||||
- **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
|
- **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
|
||||||
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).
|
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).
|
||||||
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
|
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
|
||||||
- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → `#security` Slack). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
|
- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → posts to `#alerts` via the `slack-security` receiver, which keeps its `[SECURITY]` styling; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
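The walling-off check described in the bullet above can be sketched in Go as a small predicate. This is only an illustration of the decision, not the blackbox-exporter module itself: the function name `wallsOff` is hypothetical, and treating any 3xx status (rather than only 302) as a candidate redirect is an assumption made for the sketch.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// authentikHost is the host named in the guard's description.
const authentikHost = "authentik.viktorbarzin.me"

// wallsOff (hypothetical helper) reports whether a response looks like a
// public carve-out being redirected to Authentik: a 3xx status whose
// Location header points at the Authentik host. Redirects to other hosts
// and non-redirect statuses (for example 404) are not treated as failures.
func wallsOff(status int, location string) bool {
	if status < 300 || status > 399 || location == "" {
		return false
	}
	u, err := url.Parse(location)
	if err != nil {
		return false
	}
	return strings.EqualFold(u.Hostname(), authentikHost)
}

func main() {
	// A carve-out that regressed and now bounces to the login flow.
	fmt.Println(wallsOff(302, "https://authentik.viktorbarzin.me/application/o/authorize/"))
	// A legitimate redirect elsewhere stays green.
	fmt.Println(wallsOff(301, "https://example.org/new-path"))
	// A plain 404 with no Location header stays green.
	fmt.Println(wallsOff(404, ""))
}
```

The same predicate doubles as the manual pre-check: a URL is safe to add to the target map only while it returns false here.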
|
||||||
|
|
||||||
## Security Posture (Wave 1 — locked 2026-05-18)
|
## Security Posture (Wave 1 — locked 2026-05-18)
|
||||||
|
|
||||||
|
|
@ -241,9 +242,10 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se
|
||||||
|
|
||||||
- **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming.
|
- **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming.
|
||||||
- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.)
|
- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.)
|
||||||
- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
|
- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts` (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out). Severity labels carried in the alert (critical/warning/info). No paging. The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`.
|
||||||
- **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
|
- **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
|
||||||
- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
|
- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
|
||||||
|
- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
|
||||||
- **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
|
- **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
|
||||||
|
|
||||||
## Storage & Backup Architecture
|
## Storage & Backup Architecture
|
||||||
|
|
|
||||||
|
|
@ -13,6 +13,8 @@
|
||||||
| authentik | Identity provider (SSO) | authentik |
|
| authentik | Identity provider (SSO) | authentik |
|
||||||
| cloudflared | Cloudflare tunnel | cloudflared |
|
| cloudflared | Cloudflare tunnel | cloudflared |
|
||||||
| authelia | Auth middleware (may be merged into ebooks or removed) | platform |
|
| authelia | Auth middleware (may be merged into ebooks or removed) | platform |
|
||||||
|
| goldmane | Calico 3.30 OSS flow aggregator (`goldmane.calico-system.svc:7443`, gRPC/mTLS). Stamps identity (ns/pod/workload/labels + allow-deny) on every flow from Felix into a ~60-min in-memory ring buffer — no etcd/API writes. East-west "who-talks-to-whom" source (ADR-0014). Enabled via operator CR (`kubectl_manifest.goldmane`). | calico |
|
||||||
|
| whisker | Calico 3.30 OSS live flow-observability UI (`whisker.calico-system.svc:8081`) at `whisker.viktorbarzin.me` (Authentik-gated, `auth=required` — no own login; additive NP ORs Traefik past the operator default-deny). ~60-min live view of Goldmane flows, NOT history. Enabled via operator CR (`kubectl_manifest.whisker`). | calico |
|
||||||
| monitoring | Prometheus/Grafana/Loki stack | monitoring |
|
| monitoring | Prometheus/Grafana/Loki stack | monitoring |
|
||||||
|
|
||||||
## Storage & Security (Tier: cluster)
|
## Storage & Security (Tier: cluster)
|
||||||
|
|
@ -37,6 +39,7 @@
|
||||||
## Active Use
|
## Active Use
|
||||||
| Service | Description | Stack |
|
| Service | Description | Stack |
|
||||||
|---------|-------------|-------|
|
|---------|-------------|-------|
|
||||||
|
| goldmane-edge-aggregator | Durable who-talks-to-whom audit trail (ADR-0014 / #58). Go service: `aggregate` Deployment streams Goldmane's gRPC `Flows.Stream` (mTLS) and upserts the low-cardinality namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`) into CNPG DB `goldmane_edges`; `goldmane-edges-digest` CronJob posts first-seen edges daily to `#alerts` (the `#security` channel was abandoned 2026-06-25 — shared webhook's app isn't a member of it). mTLS client cert REUSES the operator's `whisker-backend-key-pair` (re-apply if rotated). Tier-4-aux. Image `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (private). Runbook: [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md). | goldmane-edge-aggregator |
|
||||||
| mailserver | Email (docker-mailserver) | mailserver |
|
| mailserver | Email (docker-mailserver) | mailserver |
|
||||||
| shadowsocks | Proxy | shadowsocks |
|
| shadowsocks | Proxy | shadowsocks |
|
||||||
| webhook_handler | Webhook processing | webhook_handler |
|
| webhook_handler | Webhook processing | webhook_handler |
|
||||||
|
|
@ -161,3 +164,4 @@ procedures) are documented in `infra/docs/runbooks/`:
|
||||||
| pfSense + Unbound DNS | [pfsense-unbound.md](../../docs/runbooks/pfsense-unbound.md) |
|
| pfSense + Unbound DNS | [pfsense-unbound.md](../../docs/runbooks/pfsense-unbound.md) |
|
||||||
| Mailserver PROXY-protocol / HAProxy | [mailserver-pfsense-haproxy.md](../../docs/runbooks/mailserver-pfsense-haproxy.md) |
|
| Mailserver PROXY-protocol / HAProxy | [mailserver-pfsense-haproxy.md](../../docs/runbooks/mailserver-pfsense-haproxy.md) |
|
||||||
| Technitium apply flow | [technitium-apply.md](../../docs/runbooks/technitium-apply.md) |
|
| Technitium apply flow | [technitium-apply.md](../../docs/runbooks/technitium-apply.md) |
|
||||||
|
| Goldmane flow trail (east-west who-talks-to-whom) | [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md) |
|
||||||
|
|
|
||||||
|
|
@ -11,8 +11,8 @@ description: |
|
||||||
There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
|
There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
|
||||||
Always use Home Assistant for smart home control.
|
Always use Home Assistant for smart home control.
|
||||||
author: Claude Code
|
author: Claude Code
|
||||||
version: 2.0.0
|
version: 2.1.0
|
||||||
date: 2026-02-07
|
date: 2026-06-24
|
||||||
---
|
---
|
||||||
|
|
||||||
# Home Assistant Control
|
# Home Assistant Control
|
||||||
|
|
@ -395,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
|
||||||
## ha-london Knowledge Map
|
## ha-london Knowledge Map
|
||||||
|
|
||||||
### Overview
|
### Overview
|
||||||
- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
|
- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
|
||||||
- **Location**: London, UK
|
- **Location**: London, UK
|
||||||
- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
|
- **Platform**: Raspberry Pi 4, HA OS
|
||||||
- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
|
- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
|
||||||
- **Config path**: `/config/` (requires `sudo` for file access)
|
- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
|
||||||
|
- **Config path**: `/config/`
|
||||||
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
|
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
|
||||||
- **Zone**: London (home)
|
- **Zone**: London (home)
|
||||||
|
|
||||||
|
### Dashboards (redesigned 2026-06-24)
|
||||||
|
**Glossary** (HA terms — keep distinct):
|
||||||
|
- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
|
||||||
|
- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
|
||||||
|
- **Card** = a widget inside a view.
|
||||||
|
|
||||||
|
- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
|
||||||
|
- **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
|
||||||
|
- **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
|
||||||
|
- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.)
|
||||||
|
- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
|
||||||
|
|
||||||
### Key Systems
|
### Key Systems
|
||||||
|
|
||||||
#### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
|
#### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
|
||||||
|
|
@ -424,10 +437,15 @@ Named plugs with power/energy tracking:
|
||||||
- PM1.0/2.5/4.0/10 particulate sensors
|
- PM1.0/2.5/4.0/10 particulate sensors
|
||||||
- VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
|
- VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
|
||||||
|
|
||||||
#### 3. Cowboy E-Bike
|
#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
|
||||||
- `sensor.bike_state_of_charge`: Battery %
|
Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
|
||||||
- `sensor.bike_total_distance`: Total km
|
- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
|
||||||
- `sensor.bike_total_co2_saved`: CO2 saved (grams)
|
- `sensor.classic_performance_remaining_range`: Range km
|
||||||
|
- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
|
||||||
|
- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
|
||||||
|
- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
|
||||||
|
- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
|
||||||
|
- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
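For scripts that still carry the old `sensor.bike_*` IDs, the renames listed above can be captured in a small lookup, together with a guard for the `unknown` state the bike reports while asleep. This is a sketch under assumptions: the helper names `currentEntity` and `parseReading` are hypothetical, and only the three renames given above are mapped.

```go
package main

import (
	"fmt"
	"strconv"
)

// renamedEntities maps the retired entity IDs to their replacements,
// exactly as listed in the notes above.
var renamedEntities = map[string]string{
	"sensor.bike_state_of_charge": "sensor.classic_performance_remaining_battery",
	"sensor.bike_total_distance":  "sensor.classic_performance_mileage",
	"sensor.bike_total_co2_saved": "sensor.classic_performance_saved_co2",
}

// currentEntity (hypothetical helper) returns the replacement ID for a
// retired entity, or the input unchanged when no rename is known.
func currentEntity(id string) string {
	if renamed, ok := renamedEntities[id]; ok {
		return renamed
	}
	return id
}

// parseReading (hypothetical helper) converts a state string to a number.
// It reports false for "unknown"/"unavailable" or non-numeric states, which
// is what the live battery/range/mileage sensors show while the bike sleeps.
func parseReading(state string) (float64, bool) {
	if state == "" || state == "unknown" || state == "unavailable" {
		return 0, false
	}
	v, err := strconv.ParseFloat(state, 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

func main() {
	fmt.Println(currentEntity("sensor.bike_state_of_charge"))
	for _, state := range []string{"87", "unknown"} {
		if v, ok := parseReading(state); ok {
			fmt.Printf("battery %.0f%%\n", v)
		} else {
			fmt.Println("no live reading (bike may be asleep)")
		}
	}
}
```

Returning an explicit `ok` flag keeps a parked bike from being mistaken for a 0% battery in any card or automation built on these values.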
|
||||||
|
|
||||||
#### 4. Uptime Monitoring (UptimeRobot)
|
#### 4. Uptime Monitoring (UptimeRobot)
|
||||||
- `sensor.blog`: blog uptime
|
- `sensor.blog`: blog uptime
|
||||||
|
|
@ -446,12 +464,17 @@ Named plugs with power/energy tracking:
|
||||||
- Scripts: `script.start_netflix`, `script.start_stremio`
|
- Scripts: `script.start_netflix`, `script.start_stremio`
|
||||||
- Scene: `scene.night` (turns off Livia + Michelle plugs)
|
- Scene: `scene.night` (turns off Livia + Michelle plugs)
|
||||||
|
|
||||||
### Custom Components
|
### Custom Components (HACS integrations)
|
||||||
- **cowboy**: Cowboy e-bike integration (HACS)
|
- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
|
||||||
- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
|
- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
|
||||||
|
|
||||||
|
### HACS frontend cards (plugins)
|
||||||
|
- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
|
||||||
|
|
||||||
### Integrations
|
### Integrations
|
||||||
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
|
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
|
||||||
|
- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
|
||||||
|
- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
|
||||||
|
|
||||||
### AI / Voice Assistants
|
### AI / Voice Assistants
|
||||||
- 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
|
- 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
|
||||||
|
|
@ -466,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL
|
||||||
- Anca arrival/departure notifications
|
- Anca arrival/departure notifications
|
||||||
- Night scene: turns off Livia + Michelle
|
- Night scene: turns off Livia + Michelle
|
||||||
|
|
||||||
### Docker Setup
|
### Platform (HAOS — ignore any legacy `docker run` snippet)
|
||||||
```bash
|
ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker).
|
||||||
docker run -d --name homeassistant --privileged \
|
|
||||||
-e TZ=Europe/London \
|
|
||||||
-v /home/pi/docker/homeAssistant:/config \
|
|
||||||
-v /run/dbus:/run/dbus:ro \
|
|
||||||
--network=host --restart=unless-stopped \
|
|
||||||
homeassistant/home-assistant:2025.9
|
|
||||||
```
|
|
||||||
|
|
||||||
### SSH Access
|
### SSH Access
|
||||||
```bash
|
```bash
|
||||||
|
|
|
||||||
|
|
@ -273,8 +273,11 @@ To land a finished change from such a clone:
|
||||||
Slack audit feed; a no-op CI apply on a docs-only commit is harmless.
|
Slack audit feed; a no-op CI apply on a docs-only commit is harmless.
|
||||||
4. Leave the clone on clean `master` so auto-refresh keeps working.
|
4. Leave the clone on clean `master` so auto-refresh keeps working.
|
||||||
5. Tell the user in plain language what happened. Stack changes are
|
5. Tell the user in plain language what happened. Stack changes are
|
||||||
auto-applied by CI — verify the live result with the user's read-only
|
auto-applied by CI on push — or, with apply access, applied locally yourself
|
||||||
kubectl before saying "it's live".
|
(`scripts/tg apply`, from the main checkout, not a worktree); either path is
|
||||||
|
fine, but the change must always be committed here, never applied
|
||||||
|
uncommitted. Verify the live result with the user's read-only kubectl before
|
||||||
|
saying "it's live".
|
||||||
|
|
||||||
If a push to `master` is rejected by branch protection (user not on the
|
If a push to `master` is rejected by branch protection (user not on the
|
||||||
whitelist — e.g. new users before Viktor grants it), fall back to a
|
whitelist — e.g. new users before Viktor grants it), fall back to a
|
||||||
|
|
|
||||||
|
|
@ -125,7 +125,7 @@ How a **Service** is named in flow/audit data — its **namespace** is the prima
|
||||||
_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
|
_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
|
||||||
|
|
||||||
**Goldmane / Whisker**:
|
**Goldmane / Whisker**:
|
||||||
Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). Durable history requires emitting Goldmane flows to **Loki**; the in-memory buffer alone is not an audit trail.
|
Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#alerts` digest (the `#security` channel was abandoned 2026-06-25). As-built: `docs/runbooks/goldmane-flow-trail.md`.
|
||||||
_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
|
_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
|
||||||
|
|
||||||
### Storage
|
### Storage
|
||||||
|
|
|
||||||
122
cli/cmd_vault.go
|
|
@ -15,7 +15,7 @@ import (
|
||||||
// Identity is the kernel UID; per-user creds live in that user's isolated Vault
|
// Identity is the kernel UID; per-user creds live in that user's isolated Vault
|
||||||
// path (secret/workstation/claude-users/<user>) read via their scoped token, and
|
// path (secret/workstation/claude-users/<user>) read via their scoped token, and
|
||||||
// decryption is done by the official `bw` CLI. See
|
// decryption is done by the official `bw` CLI. See
|
||||||
// docs/superpowers/specs/2026-06-24-homelab-vault-design.md.
|
// docs/runbooks/homelab-vault-onboarding.md.
|
||||||
func vaultCommands() []Command {
|
func vaultCommands() []Command {
|
||||||
return []Command{
|
return []Command{
|
||||||
{Path: []string{"vault", "setup"}, Tier: TierWrite,
|
{Path: []string{"vault", "setup"}, Tier: TierWrite,
|
||||||
|
|
@ -51,7 +51,7 @@ func vaultHelp() string {
|
||||||
homelab vault lock lock / log out the local bw session
|
homelab vault lock lock / log out the local bw session
|
||||||
|
|
||||||
Creds live only in your own Vault path; the admin never sees them. Identity is
|
Creds live only in your own Vault path; the admin never sees them. Identity is
|
||||||
your unix UID. Security model: docs/superpowers/specs/2026-06-24-homelab-vault-design.md
|
your unix UID. Security model: docs/runbooks/homelab-vault-onboarding.md
|
||||||
(note: anything running as your user can decrypt your vault — the accepted no-HITL trade).
|
(note: anything running as your user can decrypt your vault — the accepted no-HITL trade).
|
||||||
`
|
`
|
||||||
}
|
}
|
||||||
|
|
@ -128,6 +128,53 @@ func loadCreds(run cmdRunner, user string) (vwCreds, error) {
|
||||||
var vaultCurrentUser = func() string { return os.Getenv("USER") }
|
var vaultCurrentUser = func() string { return os.Getenv("USER") }
|
||||||
var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) }
|
var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) }
|
||||||
|
|
||||||
|
// scopedTokenPath is where claude-auth-sync keeps the user's scoped Vault token.
|
||||||
|
// MUST match CAS_VAULT_TOKEN_FILE in scripts/workstation/claude-auth-sync.sh.
|
||||||
|
func scopedTokenPath(home string) string {
|
||||||
|
return home + "/.config/claude-auth-sync/vault-token"
|
||||||
|
}
|
||||||
|
|
||||||
|
// vaultTokenSource decides which Vault token the `vault` child processes should
|
||||||
|
// use. Precedence: an explicit $VAULT_TOKEN, then a native ~/.vault-token (what
|
||||||
|
// admins carry), then the per-user scoped token claude-auth-sync maintains at
|
||||||
|
// scopedTokenPath(HOME) (policy workstation-claude-<user>, which grants exactly
|
||||||
|
// the create/read/update this tool needs on the user's own path). Returns the
|
||||||
|
// token to export — "" when nothing must be exported because the vault CLI reads
|
||||||
|
// the ambient credential natively — plus a source tag for tests/logging.
|
||||||
|
func vaultTokenSource(envToken string, haveVaultTokenFile bool, scopedToken string) (token, source string) {
|
||||||
|
switch {
|
||||||
|
case envToken != "":
|
||||||
|
return "", "env"
|
||||||
|
case haveVaultTokenFile:
|
||||||
|
return "", "file"
|
||||||
|
default:
|
||||||
|
if t := strings.TrimSpace(scopedToken); t != "" {
|
||||||
|
return t, "scoped"
|
||||||
|
}
|
||||||
|
return "", "none"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// fileNonEmpty reports whether path exists and has content.
|
||||||
|
func fileNonEmpty(path string) bool {
|
||||||
|
fi, err := os.Stat(path)
|
||||||
|
return err == nil && fi.Size() > 0
|
||||||
|
}
|
||||||
|
|
||||||
|
// ensureVaultToken wires vaultTokenSource to the real environment: when the user
|
||||||
|
// has no ambient Vault credential, it exports the claude-auth-sync scoped token
|
||||||
|
// so the `vault` child processes authenticate as workstation-claude-<user>. It
|
||||||
|
// is idempotent and safe for admins, whose explicit $VAULT_TOKEN / ~/.vault-token
|
||||||
|
// take precedence and are left untouched.
|
||||||
|
func ensureVaultToken() {
|
||||||
|
home := os.Getenv("HOME")
|
||||||
|
scoped, _ := os.ReadFile(scopedTokenPath(home))
|
||||||
|
tok, src := vaultTokenSource(os.Getenv("VAULT_TOKEN"), home != "" && fileNonEmpty(home+"/.vault-token"), string(scoped))
|
||||||
|
if src == "scoped" {
|
||||||
|
os.Setenv("VAULT_TOKEN", tok)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
// bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately
|
// bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately
|
||||||
// do NOT inherit the full parent env (keeps stray secrets out of the child).
|
// do NOT inherit the full parent env (keeps stray secrets out of the child).
|
||||||
func bwBaseEnv(appdata string) []string {
|
func bwBaseEnv(appdata string) []string {
|
||||||
|
|
@ -443,6 +490,7 @@ func runList(run cmdRunner, user, uid, search string) ([]string, error) {
|
||||||
|
|
||||||
func vaultList(args []string) error {
|
func vaultList(args []string) error {
|
||||||
hardenProcess()
|
hardenProcess()
|
||||||
|
ensureVaultToken()
|
||||||
search := ""
|
search := ""
|
||||||
for i := 0; i < len(args); i++ {
|
for i := 0; i < len(args); i++ {
|
||||||
if args[i] == "--search" && i+1 < len(args) {
|
if args[i] == "--search" && i+1 < len(args) {
|
||||||
|
|
@ -477,6 +525,7 @@ func vaultSearch(args []string) error {
|
||||||
|
|
||||||
func vaultCode(args []string) error {
|
func vaultCode(args []string) error {
|
||||||
hardenProcess()
|
hardenProcess()
|
||||||
|
ensureVaultToken()
|
||||||
if len(args) == 0 {
|
if len(args) == 0 {
|
||||||
return fmt.Errorf("usage: homelab vault code <name>")
|
return fmt.Errorf("usage: homelab vault code <name>")
|
||||||
}
|
}
|
||||||
|
|
@ -516,6 +565,7 @@ func statusSummary(run cmdRunner, user, uid string) string {
|
||||||
|
|
||||||
func vaultStatus(args []string) error {
|
func vaultStatus(args []string) error {
|
||||||
hardenProcess()
|
hardenProcess()
|
||||||
|
ensureVaultToken()
|
||||||
uid := vaultCurrentUID()
|
uid := vaultCurrentUID()
|
||||||
unlock, err := withUserLock(uid)
|
unlock, err := withUserLock(uid)
|
||||||
if err != nil {
|
if err != nil {
|
||||||
|
|
@ -542,32 +592,61 @@ func vaultLock(args []string) error {
|
||||||
return nil // lock/logout best-effort; never error the caller
|
return nil // lock/logout best-effort; never error the caller
|
||||||
}
|
}
|
||||||
|
|
||||||
// vaultPatchPublicArgs writes the non-secret identifiers via argv. Neither the
|
// kvWriteVerb selects the KV write semantics. merge=true → `kv patch -method=rw`
|
||||||
|
// (read-modify-write: needs only read+update, NOT the `patch` capability the
|
||||||
|
// scoped workstation-claude-<user> policy lacks, and preserves co-located keys
|
||||||
|
// such as claude-auth-sync's claude_ai_oauth_json). merge=false → `kv put`
|
||||||
|
// (creates the path on first use, before any sibling keys exist).
|
||||||
|
func kvWriteVerb(merge bool) []string {
|
||||||
|
if merge {
|
||||||
|
return []string{"kv", "patch", "-method=rw"}
|
||||||
|
}
|
||||||
|
return []string{"kv", "put"}
|
||||||
|
}
|
||||||
|
|
||||||
|
// vaultWritePublicArgs writes the non-secret identifiers via argv. Neither the
|
||||||
// email nor the API client_id is a usable credential on its own.
|
// email nor the API client_id is a usable credential on its own.
|
||||||
func vaultPatchPublicArgs(user, email, clientID string) []string {
|
func vaultWritePublicArgs(merge bool, user, email, clientID string) []string {
|
||||||
return []string{"kv", "patch", vwCredsPath(user),
|
return append(kvWriteVerb(merge), vwCredsPath(user),
|
||||||
"vaultwarden_email=" + email,
|
"vaultwarden_email="+email,
|
||||||
"vaultwarden_client_id=" + clientID,
|
"vaultwarden_client_id="+clientID,
|
||||||
}
|
)
|
||||||
}
|
}
|
||||||
|
|
||||||
// vaultPatchSecretArgs writes ONE secret value via the `key=-` stdin form, so
|
// vaultWriteSecretArgs writes ONE secret value via the `key=-` stdin form, so the
|
||||||
// the value never appears in argv (ps / /proc/<pid>/cmdline). The value is fed
|
// value never appears in argv (ps / /proc/<pid>/cmdline). Fed on stdin by
|
||||||
// on stdin by realRunnerStdin.
|
// realRunnerStdin.
|
||||||
func vaultPatchSecretArgs(user, key string) []string {
|
func vaultWriteSecretArgs(merge bool, user, key string) []string {
|
||||||
return []string{"kv", "patch", vwCredsPath(user), key + "=-"}
|
return append(kvWriteVerb(merge), vwCredsPath(user), key+"=-")
|
||||||
}
|
}
|
||||||
|
|
||||||
// writeCreds stores all four fields in the user's Vault path. The two real
|
// credsPathExists reports whether the user's KV path already holds data. Used to
|
||||||
// secrets (master password, API client_secret) go via stdin — never argv.
|
// pick create (`kv put`) vs merge (`kv patch -method=rw`) for the first write:
|
||||||
func writeCreds(user string, c vwCreds) error {
|
// claude-auth-sync usually creates the path first (Claude OAuth backup), but a
|
||||||
if _, err := realRunner("vault", vaultPatchPublicArgs(user, c.Email, c.ClientID), nil); err != nil {
|
// user could run `homelab vault setup` before that ever happens.
|
||||||
|
func credsPathExists(run cmdRunner, user string) bool {
|
||||||
|
_, err := run("vault", []string{"kv", "get", "-format=json", vwCredsPath(user)}, nil)
|
||||||
|
return err == nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// cmdRunnerStdin is realRunnerStdin's shape, injected so writeCreds is testable.
|
||||||
|
type cmdRunnerStdin func(name string, argv, envv []string, stdin string) (string, error)
|
||||||
|
|
||||||
|
// writeCreds stores all four fields in the user's Vault path using only the
|
||||||
|
// capabilities the scoped policy grants (create/read/update — NOT `patch`). The
|
||||||
|
// first (public) write creates the path when absent; the two real secrets then
|
||||||
|
// merge in via read-modify-write so the public keys — and any claude-auth-sync
|
||||||
|
// keys already present — survive. Secret values travel on stdin, never argv.
|
||||||
|
func writeCreds(run cmdRunner, runStdin cmdRunnerStdin, user string, c vwCreds) error {
|
||||||
|
merge := credsPathExists(run, user)
|
||||||
|
if _, err := run("vault", vaultWritePublicArgs(merge, user, c.Email, c.ClientID), nil); err != nil {
|
||||||
return err
|
return err
|
||||||
}
|
}
|
||||||
if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
|
// The path now exists regardless of the branch above → merge the secrets in.
|
||||||
|
if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
|
||||||
return err
|
return err
|
||||||
}
|
}
|
||||||
if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
|
if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
|
||||||
return err
|
return err
|
||||||
}
|
}
|
||||||
return nil
|
return nil
|
||||||
|
|
@ -593,6 +672,7 @@ func promptLine(prompt string) (string, error) {
|
||||||
|
|
||||||
func vaultSetup(args []string) error {
|
func vaultSetup(args []string) error {
|
||||||
hardenProcess()
|
hardenProcess()
|
||||||
|
ensureVaultToken()
|
||||||
fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.")
|
fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.")
|
||||||
fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.")
|
fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.")
|
||||||
email, err := promptLine("Vaultwarden email: ")
|
email, err := promptLine("Vaultwarden email: ")
|
||||||
|
|
@ -615,7 +695,7 @@ func vaultSetup(args []string) error {
|
||||||
return fmt.Errorf("all fields are required")
|
return fmt.Errorf("all fields are required")
|
||||||
}
|
}
|
||||||
c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret}
|
c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret}
|
||||||
if err := writeCreds(vaultCurrentUser(), c); err != nil {
|
if err := writeCreds(realRunner, realRunnerStdin, vaultCurrentUser(), c); err != nil {
|
||||||
return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err)
|
return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err)
|
||||||
}
|
}
|
||||||
fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…")
|
fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…")
|
||||||
|
|
@ -634,6 +714,7 @@ func vaultSetup(args []string) error {
|
||||||
|
|
||||||
func vaultGet(args []string) error {
|
func vaultGet(args []string) error {
|
||||||
hardenProcess()
|
hardenProcess()
|
||||||
|
ensureVaultToken()
|
||||||
o, err := parseGetArgs(args)
|
o, err := parseGetArgs(args)
|
||||||
if err != nil {
|
if err != nil {
|
||||||
return err
|
return err
|
||||||
|
|
@ -660,4 +741,3 @@ func vaultGet(args []string) error {
|
||||||
emitSecret(val)
|
emitSecret(val)
|
||||||
return nil
|
return nil
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -233,12 +233,96 @@ func TestStatusSummaryUnconfigured(t *testing.T) {
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
func TestVaultPatchPublicArgs(t *testing.T) {
|
func TestEnsureVaultTokenSetsScopedFallback(t *testing.T) {
|
||||||
got := vaultPatchPublicArgs("emo", "e@x.me", "user.ci")
|
dir := t.TempDir()
|
||||||
want := []string{"kv", "patch", "secret/workstation/claude-users/emo",
|
cfg := dir + "/.config/claude-auth-sync"
|
||||||
|
if err := os.MkdirAll(cfg, 0o700); err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK\n"), 0o600); err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
t.Setenv("HOME", dir)
|
||||||
|
t.Setenv("VAULT_TOKEN", "") // no ambient token
|
||||||
|
|
||||||
|
ensureVaultToken()
|
||||||
|
if got := os.Getenv("VAULT_TOKEN"); got != "SCOPED-TOK" {
|
||||||
|
t.Fatalf("VAULT_TOKEN = %q, want scoped fallback to be exported", got)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestEnsureVaultTokenKeepsExplicitEnv(t *testing.T) {
|
||||||
|
dir := t.TempDir()
|
||||||
|
cfg := dir + "/.config/claude-auth-sync"
|
||||||
|
if err := os.MkdirAll(cfg, 0o700); err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
if err := os.WriteFile(cfg+"/vault-token", []byte("SCOPED-TOK"), 0o600); err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
t.Setenv("HOME", dir)
|
||||||
|
t.Setenv("VAULT_TOKEN", "ADMIN-TOK")
|
||||||
|
|
||||||
|
ensureVaultToken()
|
||||||
|
if got := os.Getenv("VAULT_TOKEN"); got != "ADMIN-TOK" {
|
||||||
|
t.Fatalf("VAULT_TOKEN = %q, must not override an explicit token", got)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestScopedTokenPath(t *testing.T) {
|
||||||
|
if got := scopedTokenPath("/home/emo"); got != "/home/emo/.config/claude-auth-sync/vault-token" {
|
||||||
|
t.Fatalf("scopedTokenPath = %q", got)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestVaultTokenSource(t *testing.T) {
|
||||||
|
// Precedence: explicit $VAULT_TOKEN > ~/.vault-token (vault CLI native) >
|
||||||
|
// the claude-auth-sync per-user scoped token. This is what lets a non-admin
|
||||||
|
// workstation user (no ambient token) reach their own Vault path.
|
||||||
|
cases := []struct {
|
||||||
|
name string
|
||||||
|
env string
|
||||||
|
haveVaultToken bool
|
||||||
|
scoped string
|
||||||
|
wantTok, wantSrc string
|
||||||
|
}{
|
||||||
|
{"explicit env wins", "abc", true, "S", "", "env"},
|
||||||
|
{"vault-token file used natively", "", true, "S", "", "file"},
|
||||||
|
{"scoped fallback for non-admin", "", false, "S-TOK", "S-TOK", "scoped"},
|
||||||
|
{"scoped value is trimmed", "", false, " S-TOK\n", "S-TOK", "scoped"},
|
||||||
|
{"whitespace-only scoped is no token", "", false, " \n", "", "none"},
|
||||||
|
{"nothing configured", "", false, "", "", "none"},
|
||||||
|
}
|
||||||
|
for _, c := range cases {
|
||||||
|
tok, src := vaultTokenSource(c.env, c.haveVaultToken, c.scoped)
|
||||||
|
if tok != c.wantTok || src != c.wantSrc {
|
||||||
|
t.Errorf("%s: vaultTokenSource(%q,%v,%q) = (%q,%q), want (%q,%q)",
|
||||||
|
c.name, c.env, c.haveVaultToken, c.scoped, tok, src, c.wantTok, c.wantSrc)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestKvWriteVerb(t *testing.T) {
|
||||||
|
// merge=true → read-modify-write patch (needs only read+update, NOT the
|
||||||
|
// `patch` capability the scoped workstation policy lacks).
|
||||||
|
if got := kvWriteVerb(true); !reflect.DeepEqual(got, []string{"kv", "patch", "-method=rw"}) {
|
||||||
|
t.Fatalf("kvWriteVerb(true) = %v", got)
|
||||||
|
}
|
||||||
|
// merge=false → put (creates the path on first use)
|
||||||
|
if got := kvWriteVerb(false); !reflect.DeepEqual(got, []string{"kv", "put"}) {
|
||||||
|
t.Fatalf("kvWriteVerb(false) = %v", got)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestVaultWritePublicArgs(t *testing.T) {
|
||||||
|
got := vaultWritePublicArgs(true, "emo", "e@x.me", "user.ci")
|
||||||
|
want := []string{"kv", "patch", "-method=rw", "secret/workstation/claude-users/emo",
|
||||||
"vaultwarden_email=e@x.me", "vaultwarden_client_id=user.ci"}
|
"vaultwarden_email=e@x.me", "vaultwarden_client_id=user.ci"}
|
||||||
if !reflect.DeepEqual(got, want) {
|
if !reflect.DeepEqual(got, want) {
|
||||||
t.Fatalf("vaultPatchPublicArgs = %v", got)
|
t.Fatalf("vaultWritePublicArgs(merge) = %v", got)
|
||||||
|
}
|
||||||
|
if got := vaultWritePublicArgs(false, "emo", "e@x.me", "user.ci"); got[0] != "kv" || got[1] != "put" {
|
||||||
|
t.Fatalf("vaultWritePublicArgs(create) must use `kv put`, got %v", got)
|
||||||
}
|
}
|
||||||
for _, a := range got {
|
for _, a := range got {
|
||||||
if strings.Contains(a, "master_password") || strings.Contains(a, "client_secret") {
|
if strings.Contains(a, "master_password") || strings.Contains(a, "client_secret") {
|
||||||
|
|
@ -247,12 +331,12 @@ func TestVaultPatchPublicArgs(t *testing.T) {
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) {
|
func TestVaultWriteSecretArgsNoValueInArgv(t *testing.T) {
|
||||||
for _, key := range []string{"vaultwarden_master_password", "vaultwarden_client_secret"} {
|
for _, key := range []string{"vaultwarden_master_password", "vaultwarden_client_secret"} {
|
||||||
got := vaultPatchSecretArgs("emo", key)
|
got := vaultWriteSecretArgs(true, "emo", key)
|
||||||
want := []string{"kv", "patch", "secret/workstation/claude-users/emo", key + "=-"}
|
want := []string{"kv", "patch", "-method=rw", "secret/workstation/claude-users/emo", key + "=-"}
|
||||||
if !reflect.DeepEqual(got, want) {
|
if !reflect.DeepEqual(got, want) {
|
||||||
t.Fatalf("vaultPatchSecretArgs(%q) = %v", key, got)
|
t.Fatalf("vaultWriteSecretArgs(%q) = %v", key, got)
|
||||||
}
|
}
|
||||||
if got[len(got)-1] != key+"=-" {
|
if got[len(got)-1] != key+"=-" {
|
||||||
t.Fatalf("secret value must be read from stdin (`%s=-`), got %v", key, got)
|
t.Fatalf("secret value must be read from stdin (`%s=-`), got %v", key, got)
|
||||||
|
|
@ -260,6 +344,90 @@ func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) {
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// recStdin records a stdin-bearing call for assertions.
|
||||||
|
type recStdin struct {
|
||||||
|
argv []string
|
||||||
|
stdin string
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestWriteCredsCreatesThenMerges: when the path is ABSENT the first (public)
|
||||||
|
// write must `kv put` (create), and the two secrets must merge via patch -rw
|
||||||
|
// with values on stdin only — never the buggy plain `kv patch` (needs `patch`).
|
||||||
|
func TestWriteCredsCreatesThenMerges(t *testing.T) {
|
||||||
|
var calls [][]string
|
||||||
|
var stdinCalls []recStdin
|
||||||
|
run := func(name string, argv, envv []string) (string, error) {
|
||||||
|
calls = append(calls, append([]string{name}, argv...))
|
||||||
|
if len(argv) >= 2 && argv[0] == "kv" && argv[1] == "get" {
|
||||||
|
return "", fmt.Errorf("no value found") // path absent
|
||||||
|
}
|
||||||
|
return "", nil
|
||||||
|
}
|
||||||
|
runStdin := func(name string, argv, envv []string, stdin string) (string, error) {
|
||||||
|
stdinCalls = append(stdinCalls, recStdin{append([]string{name}, argv...), stdin})
|
||||||
|
return "", nil
|
||||||
|
}
|
||||||
|
c := vwCreds{Email: "e@x.me", MasterPassword: "PW", ClientID: "user.ci", ClientSecret: "CS"}
|
||||||
|
if err := writeCreds(run, runStdin, "emo", c); err != nil {
|
||||||
|
t.Fatalf("writeCreds: %v", err)
|
||||||
|
}
|
||||||
|
var sawPut, sawPlainPatch bool
|
||||||
|
for _, cl := range calls {
|
||||||
|
j := strings.Join(cl, " ")
|
||||||
|
if strings.Contains(j, "kv put") {
|
||||||
|
sawPut = true
|
||||||
|
}
|
||||||
|
if strings.Contains(j, "kv patch") && !strings.Contains(j, "-method=rw") {
|
||||||
|
sawPlainPatch = true
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if !sawPut {
|
||||||
|
t.Fatalf("path absent → public write must be `kv put`; calls=%v", calls)
|
||||||
|
}
|
||||||
|
if sawPlainPatch {
|
||||||
|
t.Fatalf("must never use plain `kv patch` (needs `patch` capability); calls=%v", calls)
|
||||||
|
}
|
||||||
|
if len(stdinCalls) != 2 {
|
||||||
|
t.Fatalf("want 2 stdin secret writes, got %d", len(stdinCalls))
|
||||||
|
}
|
||||||
|
for _, sc := range stdinCalls {
|
||||||
|
if !strings.Contains(strings.Join(sc.argv, " "), "kv patch -method=rw") {
|
||||||
|
t.Errorf("secret write must use patch -method=rw: %v", sc.argv)
|
||||||
|
}
|
||||||
|
for _, a := range sc.argv {
|
||||||
|
if strings.Contains(a, "PW") || strings.Contains(a, "CS") {
|
||||||
|
t.Errorf("secret leaked into argv: %v", sc.argv)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if stdinCalls[0].stdin != "PW" || stdinCalls[1].stdin != "CS" {
|
||||||
|
t.Errorf("stdin values wrong: %q,%q", stdinCalls[0].stdin, stdinCalls[1].stdin)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestWriteCredsMergesWhenPresent: when the path EXISTS, every write must merge
|
||||||
|
// (patch -rw) — a `kv put` would wipe sibling keys (e.g. claude_ai_oauth_json).
|
||||||
|
func TestWriteCredsMergesWhenPresent(t *testing.T) {
|
||||||
|
var calls [][]string
|
||||||
|
run := func(name string, argv, envv []string) (string, error) {
|
||||||
|
calls = append(calls, append([]string{name}, argv...))
|
||||||
|
return "{}", nil // get succeeds → path exists
|
||||||
|
}
|
||||||
|
runStdin := func(name string, argv, envv []string, stdin string) (string, error) {
|
||||||
|
calls = append(calls, append([]string{name}, argv...))
|
||||||
|
return "", nil
|
||||||
|
}
|
||||||
|
c := vwCreds{Email: "e@x.me", MasterPassword: "PW", ClientID: "user.ci", ClientSecret: "CS"}
|
||||||
|
if err := writeCreds(run, runStdin, "emo", c); err != nil {
|
||||||
|
t.Fatalf("writeCreds: %v", err)
|
||||||
|
}
|
||||||
|
for _, cl := range calls {
|
||||||
|
if strings.Contains(strings.Join(cl, " "), "kv put") {
|
||||||
|
t.Fatalf("path exists → must NOT `kv put` (wipes siblings): %v", cl)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
// TestNoSecretInArgvAcrossFlow is the load-bearing security test: across the
|
// TestNoSecretInArgvAcrossFlow is the load-bearing security test: across the
|
||||||
// whole get flow (vault reads, bw config/status/login/unlock/get) NO secret
|
// whole get flow (vault reads, bw config/status/login/unlock/get) NO secret
|
||||||
// value may appear in any command's argv — secrets travel via env/stdin only.
|
// value may appear in any command's argv — secrets travel via env/stdin only.
|
||||||
|
|
|
||||||
|
|
@ -5,6 +5,14 @@ exists to answer the question that drove the whole CLI — *which verbs are wort
|
||||||
adding next* — with data instead of one maintainer's habits (the earlier mining
|
adding next* — with data instead of one maintainer's habits (the earlier mining
|
||||||
covered a single user's ~51k commands, so the surface is shaped to that user).
|
covered a single user's ~51k commands, so the surface is shaped to that user).
|
||||||
|
|
||||||
|
> **Update (2026-06-26) — the cross-user privacy *norm* below is superseded by
|
||||||
|
> [ADR-0015](0015-os-is-the-authorization-boundary.md).** The prohibition this
|
||||||
|
> ADR leaned on ("reading another user's `~/.claude` is off-limits even for an
|
||||||
|
> owner in-session") no longer holds: the managed-settings policy now **defers
|
||||||
|
> to OS/sudo authorization**. The `usage top` telemetry design itself is
|
||||||
|
> unchanged and still current — only the "never read homes" framing in the
|
||||||
|
> third decision below is overtaken.
|
||||||
|
|
||||||
## Decisions
|
## Decisions
|
||||||
|
|
||||||
- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows
|
- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows
|
||||||
|
|
|
||||||
|
|
@ -27,3 +27,9 @@ As the Service count grows we want an audit-grade record of which Service talks
|
||||||
- **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
|
- **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
|
||||||
- **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
|
- **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
|
||||||
- **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
|
- **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
|
||||||
|
|
||||||
|
## As-built (2026-06-25)
|
||||||
|
|
||||||
|
Implemented across infra issues #57–#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (all Slack consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
|
||||||
|
|
||||||
|
Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.
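
The edge-set upsert described above can be sketched in memory as follows. This is a minimal illustration, not the aggregator's actual code: the `Flow`, `EdgeKey`, `Edge`, and `EdgeSet` types are hypothetical names, and a Go map stands in for the `goldmane_edges` table. Only the `edge(...)` columns and the two drop rules (self-edges, empty-namespace flows) come from the note above.

```go
package main

import (
	"fmt"
	"time"
)

// Flow is a hypothetical, trimmed-down view of one Goldmane flow record.
type Flow struct {
	SrcNS, DstNS, Action string
	Seen                 time.Time
}

// EdgeKey identifies one namespace-pair edge.
type EdgeKey struct {
	SrcNS, DstNS, Action string
}

// Edge mirrors the first_seen / last_seen / flow_count columns.
type Edge struct {
	FirstSeen, LastSeen time.Time
	FlowCount           int64
}

// EdgeSet is an in-memory stand-in for the goldmane_edges table.
type EdgeSet map[EdgeKey]*Edge

// Upsert folds one flow into the set. It returns false when the flow is
// dropped (self-edge or empty namespace) and true when it was counted.
func (s EdgeSet) Upsert(f Flow) bool {
	if f.SrcNS == "" || f.DstNS == "" || f.SrcNS == f.DstNS {
		return false
	}
	k := EdgeKey{SrcNS: f.SrcNS, DstNS: f.DstNS, Action: f.Action}
	e, ok := s[k]
	if !ok {
		// First sighting of this edge: both timestamps start at the flow time.
		s[k] = &Edge{FirstSeen: f.Seen, LastSeen: f.Seen, FlowCount: 1}
		return true
	}
	if f.Seen.Before(e.FirstSeen) {
		e.FirstSeen = f.Seen
	}
	if f.Seen.After(e.LastSeen) {
		e.LastSeen = f.Seen
	}
	e.FlowCount++
	return true
}

func main() {
	t0 := time.Date(2026, 6, 25, 12, 0, 0, 0, time.UTC)
	s := EdgeSet{}
	s.Upsert(Flow{SrcNS: "monitoring", DstNS: "dbaas", Action: "Allow", Seen: t0})
	s.Upsert(Flow{SrcNS: "monitoring", DstNS: "dbaas", Action: "Allow", Seen: t0.Add(time.Minute)})
	s.Upsert(Flow{SrcNS: "dbaas", DstNS: "dbaas", Action: "Allow", Seen: t0}) // self-edge: dropped
	s.Upsert(Flow{SrcNS: "", DstNS: "dbaas", Action: "Deny", Seen: t0})       // empty namespace: dropped
	for k, e := range s {
		fmt.Printf("%s -> %s [%s] flows=%d\n", k.SrcNS, k.DstNS, k.Action, e.FlowCount)
	}
}
```

Keying on `(src_ns, dst_ns, action)` keeps the stored set proportional to the number of distinct namespace pairs rather than to flow volume, which is presumably why the trail stores edges instead of raw flows.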
|
||||||
|
|
|
||||||
57
docs/adr/0015-os-is-the-authorization-boundary.md
Normal file
|
|
@ -0,0 +1,57 @@
|
||||||
|
# OS is the authorization boundary: agents defer to Unix/sudo, not a stricter in-policy rule
|
||||||
|
|
||||||
|
Supersedes the cross-user privacy *norm* that the devvm managed-settings policy
|
||||||
|
carried and that ADR-0011 leaned on ("never read another user's home /
|
||||||
|
`~/.claude`, off-limits even for an owner in-session"). ADR-0011's actual
|
||||||
|
subject — `usage top` telemetry and its emit design — is unchanged and still
|
||||||
|
current; only the privacy prohibition it referenced is superseded here.
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
The devvm managed-settings policy (`/etc/claude-code/managed-settings.json`,
|
||||||
|
`claudeMd`) carried two rules that were, in practice, *stricter than the OS*:
|
||||||
|
"you are not the admin, do not escalate privileges" and "never read another
|
||||||
|
user's home directory, credentials, tokens, or `~/.claude`." The OS told a
|
||||||
|
different story: `wizard` holds `(ALL) NOPASSWD: ALL` — full passwordless root.
|
||||||
|
The kernel had already granted total read access; the policy was layering an
|
||||||
|
artificial refusal on top of an authorization the OS already permits, and the
|
||||||
|
"not the admin" framing was factually wrong for a NOPASSWD-root user.
|
||||||
|
|
||||||
|
Two honest ways to resolve the inconsistency: tighten sudo to match the policy,
|
||||||
|
or loosen the policy to match the OS. The owner chose the latter on 2026-06-26,
|
||||||
|
for analytics/debugging across the shared box.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
- **Authorization follows the OS, not this policy.** Agents may access whatever
|
||||||
|
their OS user can access — directly or via `sudo` where they hold sudo rights
|
||||||
|
— and must not impose restrictions stricter than the OS. On this box that
|
||||||
|
includes other users' home directories and `~/.claude` for users who hold
|
||||||
|
broad sudo.
|
||||||
|
- **No separate prompt or carve-out** for OS-authorized access. The Unix
|
||||||
|
permission model + sudoers is the single source of truth for who may read
|
||||||
|
what. Other users' homes are mode `0750`, so a cross-home read necessarily transits
|
||||||
|
`sudo` and is therefore captured in the sudo/auth audit log.
|
||||||
|
- **Cluster/infra RBAC tiering is unchanged.** kubectl / Vault / infra access
|
||||||
|
stays scoped to each user's RBAC tier; "defer to the OS" is about OS-level
|
||||||
|
file access, not a licence to exceed cluster RBAC.
|
||||||
|
- **Scope is symmetric and multi-user.** The rule lives in the *shared*
|
||||||
|
managed-settings, so every user's agents defer to that user's own sudo grant.
|
||||||
|
Any user with broad sudo gets the same cross-home read capability over other
|
||||||
|
users' files. Accepted by the owner with that understanding; emo's and
|
||||||
|
ancamilea's `~/.claude` directories are now agent-readable by sudo-holders.
|
||||||
|
- **Takes effect in a fresh session.** managed-settings loads at session start;
|
||||||
|
the session that made the change keeps running under the old policy.
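
As a small worked check of the `0750` reasoning in the second bullet, the sketch below decodes the permission bits for the group and other classes. The helper names are hypothetical. It also makes one assumption explicit: the "necessarily transits `sudo`" conclusion holds only if the reading user is not a member of the home directory's owning group, because `0750` still grants that group read and search access.

```go
package main

import (
	"fmt"
	"os"
)

// canList reports whether the permission class selected by shift
// (6 = owner, 3 = group, 0 = other) holds both read and search (execute)
// permission on a directory with the given mode.
func canList(mode os.FileMode, shift uint) bool {
	bits := (mode.Perm() >> shift) & 0o7
	return bits&0o4 != 0 && bits&0o1 != 0
}

// groupCanList checks the group class; otherCanList checks everyone else.
func groupCanList(mode os.FileMode) bool { return canList(mode, 3) }
func otherCanList(mode os.FileMode) bool { return canList(mode, 0) }

func main() {
	home := os.FileMode(0o750)
	fmt.Println("group can list:", groupCanList(home)) // true
	fmt.Println("other can list:", otherCanList(home)) // false
}
```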
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- The privacy-preserving telemetry rationale in ADR-0011 (`usage top` as the
|
||||||
|
"cross-user analytics without reading homes" answer) remains useful but is no
|
||||||
|
longer the *only* sanctioned path; direct reads via `sudo` are now permitted.
|
||||||
|
- Larger blast radius: if an agent session running as a sudo-holder is
|
||||||
|
prompt-injected or otherwise compromised, it can now read every user's secrets
|
||||||
|
with no in-agent friction (sudo here is passwordless). The sudo/auth audit log
|
||||||
|
is the remaining accountability control.
|
||||||
|
- Reversible: restore the prior `claudeMd` bullets (backup kept at
|
||||||
|
`/etc/claude-code/managed-settings.json.bak-2026-06-26`) and start a fresh
|
||||||
|
session.
|
||||||
|
|
@ -205,6 +205,43 @@ healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts*
wrapper in `main.tf` (so it applies deterministically even though the image is
`:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix
as the android-emulator stack.

### noVNC black after a browser-container restart (x11vnc supervision)

A **distinct** failure from the fd-sweep gotcha above: the noVNC client *connects*
but the view is **black**, and the novnc container logs spew
`connecting to: localhost:5900` → `Failed to connect ... [Errno 111] Connection refused`
(x11vnc is **down**, not slow). Cause: `x11vnc` and `websockify` both run
in the **novnc** container, but x11vnc attaches to the **chrome-service** (browser)
container's Xvfb over `localhost:6099` (shared pod network). When the browser
container restarts — Chrome exits cleanly (exit 0, "Completed") or crashes — its
Xvfb vanishes and x11vnc loses its X connection and exits.

`entrypoint.sh` **supervises** x11vnc: it launches x11vnc and websockify as
background children and `wait -n`s on them, exiting non-zero if **either** dies, so
the kubelet restarts the novnc container, which re-waits for Xvfb on `:6099` and
relaunches x11vnc — the bridge **self-heals** across browser-container restarts.
(Before 2026-06-27, x11vnc was an unsupervised background child of an `exec`ed
websockify; a dead x11vnc was never relaunched, leaving `:5900` dead — a
`<defunct>` zombie — and the view black until a manual pod restart. Same
supervision pattern as the android-emulator stack's entrypoint.)
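The supervision idea described above can be sketched in Python. This is only an illustrative analogue of the shell `wait -n` approach, not the contents of `entrypoint.sh`; the function name `supervise` and the polling interval are assumptions made for the example.

```python
import subprocess
import time
from typing import List, Sequence


def supervise(commands: Sequence[Sequence[str]], poll_interval: float = 0.5) -> int:
    """Start every command and return 1 as soon as ANY child exits.

    Returning non-zero even when a child exits cleanly mirrors the behaviour
    described in the text: the container should fail so that its supervisor
    (here, the kubelet) restarts it and the whole bridge is rebuilt.
    """
    children: List[subprocess.Popen] = [subprocess.Popen(list(cmd)) for cmd in commands]
    try:
        while True:
            for child in children:
                # poll() returns None while the child is still running.
                if child.poll() is not None:
                    return 1
            time.sleep(poll_interval)
    finally:
        # Stop the survivors so no half-working bridge is left behind.
        for child in children:
            if child.poll() is None:
                child.terminate()
        for child in children:
            child.wait()
```

A caller would pass the two long-running commands and hand the return value to `sys.exit`, so that the death of either process ends the container with a non-zero status.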
|
||||||
|
|
||||||
|
**Diagnose:** `kubectl exec -c novnc -- ps aux | grep x11vnc` (a `<defunct>`/Z
entry = the bug); or the RFB-banner probe from a sibling container
(`python3 -c "import socket;s=socket.socket();s.settimeout(2);s.connect(('127.0.0.1',5900));print(s.recv(12))"`
— healthy returns `b'RFB 003.008\n'`, broken raises `ConnectionRefusedError`). **Immediate
recovery** (no image change): restart just the novnc container with
`kubectl exec -n chrome-service deploy/chrome-service -c novnc -- kill 1` — re-runs its entrypoint
and relaunches x11vnc **without** touching the browser session/in-flight CDP jobs.
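The RFB-banner probe above can also be written as a small reusable function. This is a minimal sketch that assumes the same host, port and 2-second timeout as the one-liner; the names `probe_rfb_banner` and `is_healthy` are hypothetical.

```python
import socket
from typing import Optional


def probe_rfb_banner(host: str = "127.0.0.1", port: int = 5900,
                     timeout: float = 2.0) -> Optional[bytes]:
    """Return up to 12 bytes sent by the server, or None if refused or timed out."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # A healthy VNC server speaks first and sends a 12-byte version banner.
            return sock.recv(12)
    except (ConnectionRefusedError, socket.timeout):
        return None


def is_healthy(banner: Optional[bytes]) -> bool:
    """True when the reply looks like an RFB protocol banner."""
    return banner is not None and banner.startswith(b"RFB ")
```

Treating a refused connection and a timeout the same way keeps the check to a single yes/no answer; a caller that needs to tell "down" from "slow" would handle the two exceptions separately.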

> **Deploying a rebuilt novnc entrypoint:** Keel is **off** for this deployment
> (`keel.sh/policy=never`, because the browser container's playwright image is
> version-pinned to f1-stream) and the image is `:latest`/`IfNotPresent`, so a
> rebuilt `:latest` will **not** redeploy on its own. After the
> `build-chrome-service-novnc.yml` GHA build pushes `:latest` + `:<sha>`,
> **SHA-pin** the novnc `image` in `main.tf` to the new `:<sha>` to force the pull
> and rollout (the novnc image is TF-managed — not in the deployment's
> `lifecycle.ignore_changes`).
- **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088
|
||||||
|
|
|
||||||
|
|
@ -286,7 +286,7 @@ Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (por
|
||||||
|
|
||||||
#### Security Alerts (Wave 1 — planned, beads `code-8ywc`)
|
#### Security Alerts (Wave 1 — planned, beads `code-8ywc`)
|
||||||
|
|
||||||
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
|
Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there). Same handling path as infra alerts; severity labels carried in the alert (critical/warning/info). The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
|
||||||
|
|
||||||
| # | Source | Event | Severity |
|
| # | Source | Event | Severity |
|
||||||
|---|---|---|---|
|
|---|---|---|---|
|
||||||
|
|
@ -318,9 +318,20 @@ IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-poli
|
||||||
Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out.
|
Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out.
|
||||||
|
|
||||||
- **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m).
|
- **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m).
|
||||||
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
|
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → posts to **`#alerts`** via the `slack-security` receiver, which keeps its `[SECURITY]` styling (Slack-only, no paging; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
|
||||||
- **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)
|
- **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)
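The wall-off check in the mechanism bullet boils down to one regex test on the `Location` header. The sketch below reuses the regex quoted above; the function name `is_walled_off` is hypothetical, and it assumes an unanchored "match anywhere in the header value" test, which is how the bullet describes the probe.

```python
import re
from typing import Optional

# Regex copied from the http_no_authentik_redirect description above.
AUTHENTIK_LOCATION = re.compile(
    r"authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize"
)


def is_walled_off(location: Optional[str]) -> bool:
    """True when a response's Location header points at Authentik."""
    return bool(location) and AUTHENTIK_LOCATION.search(location) is not None
```

A legitimate redirect (for example a short-link 302 to some other host) does not match, which is why the 301/302 status codes can stay in the allowed list without hiding a clobbered carve-out.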

#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)

Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both alerts route via the default `slack-warning` receiver → **`#alerts`**.

| Alert | Expr (abridged) | For | Severity |
|---|---|---|---|
| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning |
| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning |

The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`).
|
||||||
|
|
||||||
#### Backup Alerts
|
#### Backup Alerts
|
||||||
- **PostgreSQLBackupStale**: >36h since last backup
|
- **PostgreSQLBackupStale**: >36h since last backup
|
||||||
- **MySQLBackupStale**: >36h since last backup
|
- **MySQLBackupStale**: >36h since last backup
|
||||||
|
|
|
||||||
|
|
@ -541,7 +541,7 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1
|
||||||
|
|
||||||
**RBAC tiers:** `admin` (Viktor — cluster-admin, unlocked tree, secrets) · `power-user` (cluster-wide read-only, NO Secrets, via a dedicated `oidc-power-user-readonly` ClusterRole) · `namespace-owner` (admin in own namespace only). Each session acts as the user's **own** OIDC identity (kubelogin), never the admin's.
|
**RBAC tiers:** `admin` (Viktor — cluster-admin, unlocked tree, secrets) · `power-user` (cluster-wide read-only, NO Secrets, via a dedicated `oidc-power-user-readonly` ClusterRole) · `namespace-owner` (admin in own namespace only). Each session acts as the user's **own** OIDC identity (kubelogin), never the admin's.
|
||||||
|
|
||||||
**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.
|
**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **(2026-06-26: the managed `claudeMd` now defers OS-level file access to the OS/sudo — a user holding broad `sudo` may read other users' files incl. `~/.claude`; the mode-600 / no-symlink posture is unchanged but is no longer reinforced by an agent "never read other homes" rule. See [ADR-0015](../adr/0015-os-is-the-authorization-boundary.md).)** **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.
|
||||||
|
|
||||||
**Memory — homelab CLI hooks (rolled out 2026-06-21, deploy-fixed 2026-06-22):** the per-user `claude_memory` MCP was retired for the **homelab-memory hooks** — the reconcile's `install_memory` (re)installs four scripts into `~/.claude/hooks/` each run (`homelab-memory-recall.py` UserPromptSubmit recall, `auto-learn.py` Stop-hook extraction, `pre-compact-backup.sh`/`post-compact-recovery.sh`), wires them into `settings.json` if-absent + additive, and removes the old `claude_memory` MCP. **The provisioner binary itself now self-deploys from the repo** (step 0: `bash -n`-gated `install` + re-exec when `scripts/t3-provision-users.sh` differs from `/usr/local/bin/t3-provision-users`, guarded against re-exec loops / DRY_RUN mutation) — added after this very rollout sat committed-but-undeployed for a day (only the manual `setup-devvm.sh` had ever deployed the binary), so the hourly reconcile kept running the pre-memory version and emo/anca silently lost memory (recall + auto-learn never wired). A latent `set -e` abort in `install_memory` (a bare `[[ -d plugin-dir ]] && …` returning non-zero) was also fixed; it had killed the reconcile after the first user the first time it actually ran. The hooks need a `MEMORY_API_KEY` (or `CLAUDE_MEMORY_API_KEY`) in the user's `settings.json` env — the `homelab` CLI defaults the API URL, so **the key is the only hard requirement**; `install_memory` reuses an existing key and only WARNs if absent (it does NOT mint one — that's an admin Vault step, see Remaining). wizard + emo carry a key from their original MCP setup; **ancamilea is keyless → her memory no-ops until a key is minted.** (`auto-learn.py`'s passive store calls the API directly, so it additionally needs `*_API_URL` in env to avoid its local-SQLite fallback; recall + manual `homelab memory store` go through the URL-defaulting CLI and need only the key.)
|
**Memory — homelab CLI hooks (rolled out 2026-06-21, deploy-fixed 2026-06-22):** the per-user `claude_memory` MCP was retired for the **homelab-memory hooks** — the reconcile's `install_memory` (re)installs four scripts into `~/.claude/hooks/` each run (`homelab-memory-recall.py` UserPromptSubmit recall, `auto-learn.py` Stop-hook extraction, `pre-compact-backup.sh`/`post-compact-recovery.sh`), wires them into `settings.json` if-absent + additive, and removes the old `claude_memory` MCP. **The provisioner binary itself now self-deploys from the repo** (step 0: `bash -n`-gated `install` + re-exec when `scripts/t3-provision-users.sh` differs from `/usr/local/bin/t3-provision-users`, guarded against re-exec loops / DRY_RUN mutation) — added after this very rollout sat committed-but-undeployed for a day (only the manual `setup-devvm.sh` had ever deployed the binary), so the hourly reconcile kept running the pre-memory version and emo/anca silently lost memory (recall + auto-learn never wired). A latent `set -e` abort in `install_memory` (a bare `[[ -d plugin-dir ]] && …` returning non-zero) was also fixed; it had killed the reconcile after the first user the first time it actually ran. The hooks need a `MEMORY_API_KEY` (or `CLAUDE_MEMORY_API_KEY`) in the user's `settings.json` env — the `homelab` CLI defaults the API URL, so **the key is the only hard requirement**; `install_memory` reuses an existing key and only WARNs if absent (it does NOT mint one — that's an admin Vault step, see Remaining). wizard + emo carry a key from their original MCP setup; **ancamilea is keyless → her memory no-ops until a key is minted.** (`auto-learn.py`'s passive store calls the API directly, so it additionally needs `*_API_URL` in env to avoid its local-SQLite fallback; recall + manual `homelab memory store` go through the URL-defaulting CLI and need only the key.)
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -272,7 +272,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
|
||||||
|
|
||||||
The block below documents the locked design.
|
The block below documents the locked design.
|
||||||
|
|
||||||
Response model: **(I) Slack-only, daily skim.** All security alerts land in a new `#security` Slack channel via Alertmanager. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
|
Response model: **(I) Slack-only, daily skim.** All security alerts post to **`#alerts`** via Alertmanager (the `slack-security` receiver keeps its distinct `[SECURITY/<sev>]` title styling so security-lane alerts still stand out). The dedicated `#security` channel was abandoned (2026-06-25) — the shared `alertmanager_slack_api_url` incoming webhook's Slack app isn't a member of it, so a channel override there returns HTTP `404 channel_not_found`; everything consolidated to `#alerts`. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
|
||||||
|
|
||||||
#### Detection sources
|
#### Detection sources
|
||||||
|
|
||||||
|
|
@ -285,7 +285,7 @@ Response model: **(I) Slack-only, daily skim.** All security alerts land in a ne
|
||||||
|
|
||||||
#### Alert rules (16 total)
|
#### Alert rules (16 total)
|
||||||
|
|
||||||
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) inside the single `#security` channel.
|
Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it). Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) carried in the alert.
|
||||||
|
|
||||||
**K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):**
|
**K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):**
|
||||||
|
|
||||||
|
|
@ -364,6 +364,69 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
|
||||||
- Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
|
- Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
|
||||||
- Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).
|
- Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).

#### Deriving the per-namespace egress allowlist from the edge trail (Wave 1 W1.7)

The durable **east-west flow trail** (below) is now the preferred data source for
the *internal* (namespace-to-namespace) half of each Wave-1 egress allowlist —
faster and identity-stamped vs the original iptables-`LOG`→journald→Loki path
(ADR-0014: "Enforcement gains a better data source"). The unique observed
namespace pairs live in CNPG DB `goldmane_edges`, table `edge`. To derive the
namespaces a source is observed talking to (the `allow` set that seeds its
NetworkPolicy):

```sql
SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow' ORDER BY dst_ns;
```

The full SQL recipe (whole-cluster matrix, deny sanity-checks, the ≥7-day
observation caveat) is in
[runbooks/goldmane-flow-trail.md → Deriving the Wave-1 egress allowlist](../runbooks/goldmane-flow-trail.md#deriving-the-wave-1-egress-allowlist-from-the-edge-table-infra-62).
**External / public-internet egress is NOT in this table** (empty-namespace flows
are dropped) — for those destinations keep using the Calico flow-log observation
(the W1.6 snapshot, `wave1-egress-observation-2026-05-22.md`). This feeds the
existing observe-then-enforce effort (beads `code-8ywc`); **enforce-flips remain
out of scope** of the trail — it is observe-and-derive only.
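For the whole-cluster view, the same derivation can be done client-side on rows already fetched from the `edge` table. This is a sketch only: it assumes rows arrive as `(src_ns, dst_ns, action)` tuples, and the function name `derive_allowlists` and the namespaces in the usage note are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

EdgeRow = Tuple[str, str, str]  # (src_ns, dst_ns, action), assumed row shape


def derive_allowlists(rows: Iterable[EdgeRow]) -> Dict[str, List[str]]:
    """Map each source namespace to the sorted destinations it was ALLOWED to reach."""
    allowed: Dict[str, Set[str]] = defaultdict(set)
    for src_ns, dst_ns, action in rows:
        # Deny edges are useful for sanity checks but never seed an allowlist.
        if action == "allow":
            allowed[src_ns].add(dst_ns)
    return {src: sorted(dsts) for src, dsts in sorted(allowed.items())}
```

For example, rows `("web", "dbaas", "allow")`, `("web", "monitoring", "allow")` and `("web", "vault", "deny")` would yield `{"web": ["dbaas", "monitoring"]}`. The result is only the observed set, so the observation-window caveat above still applies before anything is enforced.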

### East-west flow observability (Goldmane / Whisker + edge trail) (ADR-0014)

The "who-talks-to-whom" data plane that succeeds raw iptables-`LOG` lines (which
carried no identity). **Service identity = the workload's namespace** (primary),
refined by a `service-identity` label in the few multi-Service namespaces
(`monitoring`, `kube-system`, `dbaas`). End-to-end trail, three layers:

1. **Calico Goldmane + Whisker** (`calico-system`) — Goldmane aggregates
identity-stamped flows (ns/pod/workload/labels + allow-deny + policy-trace)
streamed from Felix over gRPC into a **~60-min in-memory ring buffer** (no
etcd/API writes — the etcd-cost constraint that drove the design). **Whisker**
is its live web UI at `whisker.viktorbarzin.me` (Authentik-gated,
`auth = "required"` — Whisker has no own login; an additive NetworkPolicy ORs
Traefik past the operator's default-deny `whisker` NP). The ring buffer is
**not** a trail (lost on Goldmane restart). Enabled via operator CRs in
`stacks/calico/main.tf`; reversible toggle (Goldmane is OSS tech-preview).
2. **`goldmane-edge-aggregator`** (`stacks/goldmane-edge-aggregator`) — streams
Goldmane's gRPC `Flows.Stream` over **mTLS** and upserts the low-cardinality
namespace-pair edge set
(`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`) into CNPG DB
`goldmane_edges`. Self-edges and empty-namespace
(public-internet) flows are dropped — in-cluster relationships only. The mTLS
client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`**
(Goldmane verifies CA-chain only, not identity) rather than copying the CA
private key into TF state — **re-apply the stack if the operator rotates that
Secret**.
3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to
**`#alerts`** (reuses the alert-digest webhook). All Slack now consolidates to
`#alerts`; the `#security` channel was abandoned 2026-06-25 because that
webhook's Slack app isn't a member of it (a `#security` override 404s). See
runbook.
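The edge-folding step in layer 2 can be illustrated with an in-memory sketch. This is not the aggregator's code: the flow tuple shape, the `Edge` dataclass, the function name `fold_flows`, and the choice of `(src_ns, dst_ns, action)` as the unique key are all assumptions made for the example. In the real stack the result is upserted into the `edge` table rather than held in a dict.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Tuple

Flow = Tuple[str, str, str, datetime]  # (src_ns, dst_ns, action, seen_at), assumed shape
EdgeKey = Tuple[str, str, str]         # (src_ns, dst_ns, action), assumed unique key


@dataclass
class Edge:
    first_seen: datetime
    last_seen: datetime
    flow_count: int


def fold_flows(flows: Iterable[Flow], edges: Dict[EdgeKey, Edge]) -> Dict[EdgeKey, Edge]:
    """Fold a stream of flows into the low-cardinality namespace-pair edge set."""
    for src_ns, dst_ns, action, seen_at in flows:
        if not src_ns or not dst_ns:
            continue  # empty namespace means public internet: not an in-cluster edge
        if src_ns == dst_ns:
            continue  # self-edges carry no relationship information
        key = (src_ns, dst_ns, action)
        edge = edges.get(key)
        if edge is None:
            edges[key] = Edge(first_seen=seen_at, last_seen=seen_at, flow_count=1)
        else:
            edge.first_seen = min(edge.first_seen, seen_at)
            edge.last_seen = max(edge.last_seen, seen_at)
            edge.flow_count += 1
    return edges
```

Because only the namespace pair and action form the key, the stored set stays small no matter how many individual flows are seen, which is what makes a durable trail affordable compared with keeping raw flows.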
|
||||||
|
|
||||||
|
The trail is **attribution-grade, not cryptographic** (reconstructs events in a
trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model
limit; east-west stays plaintext, no mTLS between app pods). Health is covered by
the **`AggregatorDown`** + **`DigestFailing`** alerts and cluster-health check #48
(see monitoring.md). Full as-built, query recipes, and troubleshooting:
[runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md). Decision:
[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md); glossary
`CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
|
||||||
|
|
||||||
### TLS & HTTP/3
|
### TLS & HTTP/3
|
||||||
|
|
||||||
**Traefik** handles TLS termination:
|
**Traefik** handles TLS termination:
|
||||||
|
|
|
||||||
|
|
@ -0,0 +1,97 @@
|
||||||
|
# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)

> Filename kept for inbound links. The originally-suspected cause (kubeadm-config
> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
> drift was a real *separate* latent bug fixed in the same change.

**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
the master control-plane phase for the first time — preflight passed, etcd
snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the
kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
static-pod-hash window across all internal retries, then auto-rolled-back to
v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
No data loss; no user-facing outage (the master carries control-plane taints, so
no workloads were displaced).

**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
static pods, i.e. the first time the upgrade pushes real write-IO at etcd.

## Root cause — etcd IO starvation on the shared HDD

The new kube-apiserver could not establish/keep a working connection to etcd
during the upgrade because **etcd was IO-starved**. etcd's surviving container log
from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows:

- **1,180** `apply request took too long` warnings in 16 minutes;
- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms),
clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
to bring the new apiserver up.
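Counts like the ones above can be pulled from the container log with a short script. The sketch below rests on assumptions that are not confirmed by this document: that each log line ends in a JSON object, that the warning text sits in a `msg` field, and that the apply latency is in a `took` field formatted like `4.3s` or `150ms`. The function names are hypothetical.

```python
import json
import re
from typing import Iterable, List, Optional, Tuple

# Accepts simple duration strings such as "4.3s", "150ms" or "800µs" (assumed format).
_DURATION = re.compile(r"^(\d+(?:\.\d+)?)(µs|us|ms|s)$")
_UNIT_SECONDS = {"µs": 1e-6, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def parse_took(value: str) -> Optional[float]:
    """Convert a duration string to seconds, or None if the format is not recognised."""
    match = _DURATION.match(value)
    if match is None:
        return None
    return float(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def slow_applies(lines: Iterable[str], threshold_s: float = 1.0) -> Tuple[int, List[float]]:
    """Return (total slow-apply warnings, latencies in seconds at or above the threshold)."""
    total = 0
    slow: List[float] = []
    for line in lines:
        start = line.find("{")  # skip any runtime prefix before the JSON payload
        if start < 0:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        if record.get("msg") != "apply request took too long":
            continue
        total += 1
        took = parse_took(str(record.get("took", "")))
        if took is not None and took >= threshold_s:
            slow.append(took)
    return total, slow
```

Feeding it the lines of the log file gives the warning count and the list of multi-second applies, which is enough to confirm or rule out IO starvation before looking at configuration diffs.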
|
||||||
|
|
||||||
|
A reproduced 1.35.6 apiserver with no etcd dies with
`F instance.go:233 Error creating leases: error creating storage factory: context deadline exceeded`
— the same failure mode that an etcd with multi-second apply latency produces. etcd
lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
that spindle:

1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
2. kubeadm dumping a full **~400MB etcd DB backup** to
`/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
cleans them up), pushing master root fs to **73%**, above the 70% kubelet
image-GC threshold, so image GC churned during the drain too;
3. master-drain pod evictions.

### Correction — it was NOT the OIDC flag swap

`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
`--authentication-config` (structured multi-issuer OIDC) back to legacy
single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test
etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
were also ruled out.
|
||||||
|
|
||||||
|
## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
|
||||||
|
|
||||||
|
apiserver auth is configured in three places that must agree:
|
||||||
|
(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
|
||||||
|
+ `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest
|
||||||
|
(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM —
|
||||||
|
which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
|
||||||
|
the manifest from (3), so it would have reverted structured auth → **dashboard +
|
||||||
|
kubectl SSO break after a successful upgrade** (recoverable: the chain's
|
||||||
|
post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.
|
||||||
|
|
||||||
|
## Resolution

1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%.
2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps.
3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).

## Prevention (landed in this change)

| Gap | Fix |
|-----|-----|
| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
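
The scratch prune from the first table row can be sketched roughly as follows. This is a minimal illustration and not the actual `upgrade-step.sh` code: the function name `prune_kubeadm_scratch` and its arguments are hypothetical, and only the directory, the two name patterns, the 3-day threshold, and the best-effort behaviour come from the table.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a best-effort scratch prune; NOT the real upgrade-step.sh.
# Usage: prune_kubeadm_scratch [dir] [max_age_days]
prune_kubeadm_scratch() {
  local dir="${1:-/etc/kubernetes/tmp}"
  local max_age_days="${2:-3}"

  # Nothing to do if the scratch directory does not exist.
  [ -d "$dir" ] || return 0

  # Remove only top-level kubeadm scratch entries past the threshold.
  # find counts age in whole 24-hour periods, so "+3" matches entries
  # that are at least four full days old.
  find "$dir" -mindepth 1 -maxdepth 1 \
    \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) \
    -mtime "+${max_age_days}" -exec rm -rf -- {} + 2>/dev/null || true

  # Best-effort: never fail the caller.
  return 0
}
```

Calling `prune_kubeadm_scratch` with no arguments would target `/etc/kubernetes/tmp` with the 3-day threshold; always returning 0 mirrors the "never aborts" property, at the cost of hiding real deletion errors.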

## Lessons

- **Capture the failing component's own logs before concluding.** The `kubeadm
  upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
  applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is
  "what config changes," not "why it crashed."
- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
  2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
  backup copy + drain) onto that spindle. code-oflt is the real fix.
- **Tools that leave per-operation scratch must be reaped.** kubeadm's
  `/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
  GC'd; 28GB had silently accumulated.
- **Out-of-band control-plane edits must be written back to kubeadm-config** — else
  `kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).

@ -11,6 +11,11 @@ inference every six hours and backs up only the `claudeAiOauth` object to:
|
||||||
secret/workstation/claude-users/<os-user>
|
secret/workstation/claude-users/<os-user>
|
||||||
```
|
```
|
||||||
|
|
||||||
|
The backup **merges** into that path (`vault kv patch -method=rw`, falling back to
`kv put` only when the path does not exist yet), so keys that other tools
co-locate there — notably `homelab vault`'s `vaultwarden_*` credentials — survive.
A blind `kv put` here silently wiped them on every six-hourly run (fixed 2026-06-26).
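
The merge-then-fallback write described above can be sketched in shell as below. This is an assumption-laden illustration, not the tool's real code: the helper name `kv_merge_write` is hypothetical, and using a successful `vault kv get` as the "path exists" test is a simplification chosen for the sketch.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the merge-write; the real implementation may differ.
# Usage: kv_merge_write <path> key=value [key=value ...]
kv_merge_write() {
  local path="$1"; shift

  if vault kv get "$path" >/dev/null 2>&1; then
    # Path exists: read-modify-write merge that keeps the other keys.
    vault kv patch -method=rw "$path" "$@"
  else
    # Assumption: a failed read means the path does not exist yet. A real
    # implementation should tell "not found" apart from other read errors,
    # because `kv put` replaces every key stored at the path.
    vault kv put "$path" "$@"
  fi
}
```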
|
||||||
|
|
||||||
The user's unrelated `mcpOAuth` credentials never leave their home directory.
Each renewal service has a distinct 32-day periodic Vault token, mode `0600`, at
`~/.config/claude-auth-sync/vault-token`. Its policy can access only that user's
|
||||||
|
|
|
||||||
301
docs/runbooks/goldmane-flow-trail.md
Normal file
|
|
@ -0,0 +1,301 @@
# Goldmane Flow Trail — east-west "who-talks-to-whom" observability

> As-built runbook for the Calico Goldmane + Whisker flow plane and the
> `goldmane-edge-aggregator` durable audit trail. Design + rationale:
> [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
> Glossary: `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
> Implements infra issues #57 (Whisker ingress), #58 (aggregator), #61
> (monitoring), #62 (egress allowlist queries), #63 (these docs).

## What the trail is

Three layers turn raw east-west traffic into a queryable, durable record of
which Service talks to which. **Service identity = the workload's namespace**
(primary), refined by a `service-identity` label in the few multi-Service
namespaces (`monitoring`, `kube-system`, `dbaas`) — see ADR-0014.

| Layer | Component | Lifetime | Where it lives |
|---|---|---|---|
| **Live map** | Calico **Goldmane** + **Whisker** | ~60-min in-memory ring buffer (lost on Goldmane restart) | `calico-system`; Whisker UI at `whisker.viktorbarzin.me` |
| **Durable trail** | `goldmane-edge-aggregator` (`aggregate` mode) | persistent | CNPG Postgres DB `goldmane_edges`, table `edge` |
| **Notification** | `goldmane-edges-digest` CronJob (`digest` mode) | daily | Slack `#alerts` |

**Goldmane** aggregates identity-stamped flows (namespace / pod / workload /
labels + allow-deny + policy-trace) streamed from Felix (the existing
`calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer —
**nothing is written to etcd or the K8s API** (the etcd-cost constraint that
drove the whole design). **Whisker** is its live web UI. Because the ring
buffer is *not* a trail (a Goldmane restart loses the window), the
`goldmane-edge-aggregator` consumes Goldmane's gRPC `Flows.Stream` API over
mTLS and upserts the unique **namespace-pair edge set** into Postgres; a daily
CronJob posts first-seen edges to Slack.

The edge set is deliberately **low-cardinality** — one row per
`(src_ns, dst_ns, action)`, *not* per-pod or per-port — so the table stays
small no matter how much traffic flows.

## Where the data lives

### Whisker UI — live, ~60 min
- `https://whisker.viktorbarzin.me` (Authentik-gated — Whisker ships no login
  of its own; `auth = "required"`). Shows the live flow stream + a service graph for
  roughly the last hour. Use it for "what is talking right now"; it is **not**
  history.
- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
  (HTTP), both in `calico-system`.

### CNPG `goldmane_edges` — durable
- Postgres DB `goldmane_edges` on the CNPG cluster
  (`pg-cluster-rw.dbaas.svc.cluster.local:5432`). One table:

  ```
  edge(src_ns text, dst_ns text, action text,
       first_seen timestamptz, last_seen timestamptz, flow_count bigint,
       PRIMARY KEY (src_ns, dst_ns, action))
  ```

- `action` ∈ `allow` / `deny` / `pass` / `unspecified` (normalised Goldmane
  action).
- **Self-edges (`src_ns == dst_ns`) and empty-namespace flows** (host-endpoint
  / public-internet) are **dropped** — the trail is about in-cluster service
  relationships only. (Egress to the public internet is therefore NOT in this
  table; it lives in the Wave-1 Calico flow-log path — see security.md.)
- A **"new edge"** = a row whose `first_seen` falls inside the digest window.
- Role `goldmane_edges` (Vault-rotated, 7-day) owns the DB. The `edge` table
  is created idempotently by the aggregator at startup (canonical DDL also in
  the repo at `migrations/0001_edge.sql`).
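
To make the drop rules and the `(src_ns, dst_ns, action)` key concrete, here is a small shell sketch of the reduction from flows to edges. It is illustrative only: the `edge_set` function and its tab-separated input format are hypothetical, and the real aggregator consumes Goldmane's gRPC stream and upserts into Postgres rather than reading text.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: reduce flow records to the namespace-pair edge set.
# Input (stdin):  src_ns<TAB>dst_ns<TAB>action, one flow per line (assumed format).
# Output (stdout): src_ns<TAB>dst_ns<TAB>action<TAB>flow_count, sorted.
edge_set() {
  awk -F '\t' '
    $1 == "" || $2 == "" { next }   # empty namespace: host-endpoint / public internet
    $1 == $2             { next }   # self-edge: traffic inside one namespace
    { count[$1 FS $2 FS $3]++ }     # one key per (src_ns, dst_ns, action)
    END { for (k in count) print k FS count[k] }
  ' | sort
}
```

Feeding it two `a → b allow` flows plus a self-edge and an empty-namespace flow would yield a single `a b allow 2` row, which is why the table stays small regardless of traffic volume.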

### Slack `#alerts` — daily digest

> **Channel note (2026-06-25):** posts to **`#alerts`**. The dedicated `#security` channel was abandoned — the shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of it, so a channel override there returns HTTP `404 channel_not_found`. Everything now posts to `#alerts` (this digest plus alertmanager's `slack-security` receiver, which keeps its `[SECURITY]` styling so security-lane alerts still stand out there).

- CronJob `goldmane-edges-digest` (08:00 Europe/London) posts edges first seen
  in the last 24h. Quiet when there are none. Reuses the existing alert-digest
  Slack incoming webhook (Vault `secret/viktor` → `alertmanager_slack_api_url`)
  — no new webhook was created.

## How to enable / disable

### Goldmane + Whisker (the flow plane)
Operator CRs in **`stacks/calico/main.tf`** — NOT the Helm `goldmane`/`whisker`
flags (those stay `false`; the operator's own `installation`/`apiServer` are
operator-managed via the `goldmanes`/`whiskers.operator.tigera.io` CRDs):

- `kubectl_manifest.goldmane` (kind `Goldmane`) — creating it makes the operator
  re-render `calico-node` with the `FELIX_FLOWLOGSGOLDMANESERVER` env (the
  operator auto-wires Felix — **do NOT patch FelixConfiguration**), triggering a
  supervised `calico-node` DaemonSet roll. Yields `Deployment` + `Service
  goldmane:7443`.
- `kubectl_manifest.whisker` (kind `Whisker`, `depends_on` goldmane;
  `notifications = Disabled`). Yields `Deployment` + `Service whisker:8081`.

**To disable:** delete those two CRs and re-apply `stacks/calico`. Reversible
toggle (Goldmane is tech-preview in OSS Calico 3.30 — the main standing risk per
ADR-0014).

### Whisker public ingress (infra #57)
Also in `stacks/calico/main.tf`:
- `module "ingress_whisker"` (`ingress_factory`, `auth = "required"`,
  `dns_type = "proxied"`) → `whisker.viktorbarzin.me`.
- `kubernetes_network_policy_v1.whisker_allow_traefik` — **required alongside the
  ingress**: the operator's own `whisker` NetworkPolicy (owned by the Whisker CR)
  is `policyTypes: [Ingress]` with no rules = default-deny ingress to the pod.
  This additive NP ORs in an allow for `namespaceSelector
  kubernetes.io/metadata.name=traefik` on TCP 8081. Without it Traefik 502s.

### The aggregator + digest (the durable trail) — `stacks/goldmane-edge-aggregator`
A Tier-1 stack (PG state) mirroring the claude-memory pattern. Apply it with
`scripts/tg apply` from `stacks/goldmane-edge-aggregator/`. It provisions: the namespace,
the mTLS client material, the Postgres DB-init Job, the `DATABASE_URL`
ExternalSecret (Vault static role `pg-goldmane-edges`), the Slack ExternalSecret,
the `aggregate` Deployment, and the `digest` CronJob. **To disable the trail
without touching the flow plane:** scale `deployment/goldmane-edge-aggregator` to
0 (transient) or remove the stack (permanent) — Goldmane/Whisker keep running.

Image: `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (PRIVATE) — the
`goldmane-edge-aggregator` namespace must be in the `ghcr-credentials` Kyverno
allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`,
`local.ghcr_private_namespaces`) or image pulls fail with a 401. Code repo:
`~/code/goldmane-edge-aggregator` (see its `README.md` + `DEPLOY.md`).

## mTLS cert — the REUSE decision (cert-reuse gotcha)

The aggregator dials `goldmane:7443` over **mutual TLS**. Goldmane requires the
client cert to chain to the **Tigera CA**, but it does **NOT authorize by client
identity** — any Tigera-CA-signed cert is accepted.

Rather than copy the Tigera CA **private key** into Terraform state to mint our
own cert (a needless CA-key exposure; the `hashicorp/tls` provider also clashes
with this repo's global generate-providers/lockfile pattern), the stack
**REUSES the operator-minted, Tigera-CA-signed `whisker-backend-key-pair`
Secret** (`calico-system`), copying its `tls.crt`/`tls.key` into the
`goldmane-client-tls` Secret in the aggregator namespace. The CA *bundle* that
verifies Goldmane's serving cert (`tigera-ca-bundle` ConfigMap, key
`tigera-ca-bundle.crt`) is likewise copied verbatim (a ConfigMap can't be
cross-namespace-mounted).

> **GOTCHA — if the operator rotates `whisker-backend-key-pair`, re-apply
> `stacks/goldmane-edge-aggregator`** to re-sync the copied cert. Symptom of a
> stale copy: the `aggregate` pod logs TLS handshake / `Flows.Stream` failures
> and no `last_seen` updates land in the `edge` table. Hardening follow-up
> (noted in the stack): mint an own-identity cert in-namespace if Whisker is ever
> removed (which would delete the reused source Secret).

The Deployment leaves `GOLDMANE_HOST=goldmane.calico-system.svc.cluster.local:7443`
and the default cert/CA paths; the default ServerName (host sans port) is a SAN
on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` /
`GOLDMANE_TLS_INSECURE` override is needed.

## How to query who-talks-to-whom

`psql` into the DB (creds: Vault static role `static-creds/pg-goldmane-edges`, or
exec a CNPG pod). All queries are against the single `edge` table.

```sql
-- Everything talking to a namespace (inbound), most-active first
SELECT src_ns, action, flow_count, first_seen, last_seen
  FROM edge WHERE dst_ns = '<ns>' ORDER BY flow_count DESC;

-- Everything a namespace talks TO (outbound)
SELECT dst_ns, action, flow_count, first_seen, last_seen
  FROM edge WHERE src_ns = '<ns>' ORDER BY last_seen DESC;

-- New edges in the last 24h (what the digest reports)
SELECT src_ns, dst_ns, action, flow_count, first_seen
  FROM edge WHERE first_seen > now() - interval '24 hours'
  ORDER BY first_seen DESC;

-- Any DENIED edges (policy is dropping this pair)
SELECT src_ns, dst_ns, flow_count, last_seen
  FROM edge WHERE action = 'deny' ORDER BY last_seen DESC;

-- Full edge set as a graph adjacency list
SELECT src_ns, dst_ns, action, flow_count FROM edge ORDER BY src_ns, dst_ns;
```

For the **live** (sub-hour) view including pod/port detail, use the Whisker UI —
the `edge` table intentionally aggregates that away.

## Deriving the Wave-1 egress allowlist from the edge table (infra #62)

The durable edge set is a faster, identity-stamped data source for the existing
**observe-then-enforce** egress effort (beads `code-8ywc`; snapshot
`docs/architecture/wave1-egress-observation-2026-05-22.md`) than the original
iptables-`LOG` → journald → Loki path (ADR-0014 consequence: "Enforcement gains
a better data source"). It replaces the *internal* (namespace-to-namespace) leg
of the allowlist; **external/public-internet egress is NOT in this table** (empty
dst namespace, dropped) — for those destinations keep using the Calico flow-log
path described in security.md.

**Per-namespace internal egress allowlist** — the set of in-cluster namespaces a
given source is *observed* talking to with `action='allow'`:

```sql
-- Internal egress allowlist for one namespace (feeds its NetworkPolicy)
SELECT DISTINCT dst_ns
  FROM edge
  WHERE src_ns = '<ns>' AND action = 'allow'
  ORDER BY dst_ns;
```

```sql
-- Full internal egress matrix for all namespaces at once
SELECT src_ns, array_agg(DISTINCT dst_ns ORDER BY dst_ns) AS allowed_dst_ns
  FROM edge
  WHERE action = 'allow'
  GROUP BY src_ns
  ORDER BY src_ns;
```

```sql
-- Sanity: namespaces with a DENY edge already (policy is biting; investigate
-- before tightening further)
SELECT DISTINCT src_ns, dst_ns FROM edge WHERE action = 'deny';
```

**How this feeds enforcement (scope):** the derived `dst_ns` set is the
*internal* half of a namespace's egress allowlist — it tells you which
in-cluster namespaces to permit before flipping that namespace to default-deny.
The universal baseline (kube-dns :53, often dbaas :3306/:5432, redis :6379) and
the external destinations still come from the Wave-1 observation snapshot.
**Enforce-flips remain OUT OF SCOPE** here — this is observe-and-derive only;
the phased per-namespace default-deny rollout (starting `recruiter-responder`)
is tracked under `code-8ywc`. Cross-links:
[security.md → NetworkPolicy Default-Deny Egress](../architecture/security.md#networkpolicy-default-deny-egress-wave-1--observe-then-enforce-tier-34),
[wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md),
[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).

> **Caveat (same as the Wave-1 snapshot):** an edge only exists if it was
> *observed*. A weekly CronJob or a 7-day Vault rotation may not have fired yet —
> collect ≥7 days of edges before treating a namespace's `allow` set as
> complete. The `first_seen` column tells you how long an edge has been known;
> the digest surfaces brand-new ones daily.

## Monitoring & health (infra #61)

The aggregator pod has **no `/metrics` endpoint** — health is inferred from
kube-state-metrics. Three complementary signals (memory ids 6598, 6599;
see also [monitoring.md → Security Alerts](../architecture/monitoring.md#security-alerts-wave-1--planned-beads-code-8ywc)):

| Signal | What | Where |
|---|---|---|
| **`AggregatorDown`** | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` for 15m → warning | Prometheus alert group `Network Observability (Goldmane)` in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`; routes `slack-warning` → `#alerts` |
| **`DigestFailing`** | `kube_job_status_failed{...job_name=~"goldmane-edges-digest.*"} > 0` within 24h, for 30m → warning | same alert group → `#alerts` |
| **cluster-health #48** | `check_goldmane_aggregator` reads the Deployment's `Available` condition (missing or not-Available → FAIL) | `scripts/cluster_healthcheck.sh` (human / `--quiet` / `--json` modes; emits `goldmane_aggregator`) |

The two alert layers are deliberately complementary: `AggregatorDown` →
**no new edges land** in the DB; `DigestFailing` → **edges still land but nobody
is told**. A freshness probe (#61b) was intentionally skipped — `AggregatorDown`
is the agreed floor.

## Troubleshooting

**Whisker UI 502 / unreachable.** The additive
`kubernetes_network_policy_v1.whisker_allow_traefik` is missing or the
operator's default-deny `whisker` NP regenerated — re-apply `stacks/calico`. A
brand-new ingress host is also invisible to LAN split-horizon until the hourly
`technitium-ingress-dns-sync` runs (memory #5349); test meanwhile with
`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
(expect a 302 to Authentik — the gate working).

**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).
Common causes, in order:
1. **Stale mTLS cert** — the operator rotated `whisker-backend-key-pair`; re-apply
   `stacks/goldmane-edge-aggregator` (see cert-reuse gotcha above). Symptom: TLS
   handshake / `Flows.Stream` errors.
2. **Stale DB password** — the 7-day Vault rotation bounced the credential but
   the pod kept the old one. The Deployment carries
   `secret.reloader.stakater.com/reload: goldmane-edges-db-creds`; if it's not
   restarting on rotation, verify the Reloader annotation and the ExternalSecret.
3. **Goldmane restarted** — the in-memory window was lost (expected); the stream
   reconnects automatically and resumes upserting. No data loss in the DB
   (only the sub-hour live window in Whisker is gone).

**Digest never posts / `DigestFailing` firing.** Inspect the most recent
`goldmane-edges-digest-*` Job (`kubectl get jobs -n goldmane-edge-aggregator`;
`kubectl logs -n goldmane-edge-aggregator job/<name>`). The CronJob's
`ttl_seconds_after_finished=86400` GCs pods after a day, so check soon after a
failed run. With `SLACK_WEBHOOK_URL` empty the binary forces a dry-run (no post) —
verify the `goldmane-edges-slack` ExternalSecret resolved. For a dry run / smoke
test, run the image with `args: ["digest"]` + `DRY_RUN=1` to print the message
instead of POSTing.
> Known state (2026-06-25): the digest CronJob's first Job **failed** and it has
> never successfully posted (`lastSuccessfulTime` empty) — the digest leg is the
> live gap; `DigestFailing` is catching it. Edges still land in the DB via the
> `aggregate` Deployment; only the `#alerts` digest notification is affected.
> Investigation/fix belongs to the aggregator slice (#58/#60), not monitoring.

**No edges at all in the table.** Confirm Goldmane is enabled
(`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the
`FELIX_FLOWLOGSGOLDMANESERVER` env; confirm the `goldmane-edges-db-init` Job
completed; confirm the aggregator pod is `Running` and not `ImagePullBackOff`
(ghcr allowlist).

## Related
- [ADR-0014 — Service identity & east-west observability](../adr/0014-service-identity-and-east-west-observability.md)
- [security.md — NetworkPolicy Default-Deny Egress + east-west flow observability](../architecture/security.md)
- [monitoring.md — east-west flow observability + alerts](../architecture/monitoring.md)
- [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md)
- `CONTEXT.md` glossary — **Service identity**, **Goldmane / Whisker**
- Code: `~/code/goldmane-edge-aggregator` (`README.md`, `DEPLOY.md`); stacks
  `stacks/goldmane-edge-aggregator`, `stacks/calico`
121
docs/runbooks/homelab-vault-onboarding.md
Normal file
|
|
@ -0,0 +1,121 @@
# `homelab vault` onboarding (per-user Vaultwarden access)

## Scope

`homelab vault` gives each devvm roster user no-HITL access to **their own**
Vaultwarden vault (and any Organization Collection shared with their account)
from the command line. It shells out to the official `bw` CLI; the user's
Vaultwarden credentials live only in their isolated Vault path
`secret/workstation/claude-users/<os-user>` and are decrypted as that OS user —
the admin never sees them.

```text
homelab vault setup                one-time: store VW email + master password + API key
homelab vault status               configured / unlocked / reachable (no secrets)
homelab vault list [--search Q]    item names (no secrets)
homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
homelab vault code <name>          current TOTP code
homelab vault lock                 lock / log out the local bw session
```

## How auth works (why a non-admin can use it)

`homelab vault` runs `vault` as the calling user. It resolves a Vault token in
this order (`ensureVaultToken`, `cli/cmd_vault.go`):

1. an explicit `$VAULT_TOKEN`, then
2. a native `~/.vault-token` (what admins carry), then
3. the per-user **scoped token** that `claude-auth-sync` maintains at
   `~/.config/claude-auth-sync/vault-token` (policy `workstation-claude-<user>`).

That scoped policy grants exactly `create`/`read`/`update` on the user's own
`secret/workstation/claude-users/<user>` path — no `patch` capability — so the
tool writes with `vault kv patch -method=rw` (read-modify-write), falling back to
`kv put` only when the path does not exist yet. This preserves the
`claude_ai_oauth_json` key that [claude-auth-sync](claude-auth-renew-workstation.md)
co-locates there. (Both bugs that previously made this admin-only were fixed
2026-06-27.)
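
The token lookup order listed in this section can be rendered as a short shell sketch. This is only an illustration under stated assumptions: the real logic is the Go function `ensureVaultToken` and may differ in detail, the name `resolve_vault_token` is hypothetical, and treating an empty file as "no token" is a choice made for this sketch.

```bash
#!/usr/bin/env bash
# Hypothetical shell rendering of the lookup order; the real logic is Go
# (ensureVaultToken in cli/cmd_vault.go) and may differ in detail.
resolve_vault_token() {
  local native="$HOME/.vault-token"
  local scoped="$HOME/.config/claude-auth-sync/vault-token"

  if [ -n "${VAULT_TOKEN:-}" ]; then
    # 1. An explicit $VAULT_TOKEN wins.
    printf '%s\n' "$VAULT_TOKEN"
  elif [ -s "$native" ]; then
    # 2. Then a native ~/.vault-token (what admins carry).
    cat "$native"
  elif [ -s "$scoped" ]; then
    # 3. Finally the per-user scoped token kept by claude-auth-sync.
    cat "$scoped"
  else
    echo "no Vault token found" >&2
    return 1
  fi
}
```

A caller could then run `VAULT_TOKEN="$(resolve_vault_token)" vault kv get <path>`; a non-admin without the first two sources falls through to the scoped token.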

## Prerequisites (per user)

- The user is in `scripts/workstation/roster.yaml` and the **vault** stack has
  been applied → their `workstation-claude-<user>` policy exists.
- The user's workstation was provisioned (`setup-devvm.sh`) → their scoped Vault
  token exists at `~/.config/claude-auth-sync/vault-token`.
- `bw` is installed **system-wide** at `/usr/bin/bw` (see below).
- The user has a Vaultwarden account at `https://vaultwarden.viktorbarzin.me`
  (self-service signup is open; admin panel is disabled).

## One-time admin steps (devvm)

`bw` must be system-wide so every user resolves it (it is a Node script, and
`node` is already system-wide at `/usr/bin/node`). `setup-devvm.sh` installs it
to the npm `/usr` prefix; the guard checks the **system** path, not
`command -v bw` (an admin's own `~/.local/bin/bw` used to mask the system
install, leaving non-admins with no backend). To install on a running box:

```bash
sudo npm install -g --prefix /usr "@bitwarden/cli@^2024"
/usr/bin/bw --version   # call the system path directly; a personal ~/.local/bin/bw could mask it
```

After landing a `cli/` change, rebuild the binary so users pick it up:

```bash
sudo bash -c 'cd /home/wizard/code/infra/cli && \
  go build -ldflags "-X main.version=$(git -C /home/wizard/code/infra describe --tags --always 2>/dev/null || echo dev)" \
  -o /usr/local/bin/homelab .'
```

(or just re-run `scripts/workstation/setup-devvm.sh` as root, which rebuilds it.)

## User onboarding

The user runs these as themselves. The master password / API key are entered
interactively (never on the command line) and stored only in the user's Vault
path.

1. In the Vaultwarden web vault → **Settings → Security → Keys → View API key**,
   copy the `client_id` (`user.xxxx`) and `client_secret`.
2. Configure:

   ```bash
   homelab vault setup    # prompts: VW email, API client_id/secret, master password
   homelab vault status   # → "vault: configured, unlocked, reachable ✓"
   homelab vault list     # item names (own vault + any shared Collections)
   ```

## Shared-Collection access (sharing passwords with a user)

`homelab vault` surfaces Organization Collection items automatically once the
user's Vaultwarden account is a confirmed member. These steps are done by the
vault owner in the **Vaultwarden web UI** (they need the owner's master
password — not an infra/Terraform operation):

1. Create or reuse an **Organization** and a **Collection** of shared logins.
2. **Invite** the user's Vaultwarden account to the Organization, granting
   **"Can view"** on that Collection (least privilege).
3. The user accepts the email invite and confirms membership.
4. The user runs `homelab vault list` — the shared items now appear alongside
   their own (a `homelab vault status` sync picks them up).

## Security model (the no-HITL trade)

Identity is the kernel UID. Anything running as the user can decrypt the user's
vault — this is the accepted trade for no-human-in-the-loop fetches. Secrets
never appear in `argv` (passed via env or stdin), core dumps are disabled, TOTP
fetches are logged to syslog/Loki, and on a TTY values go to the clipboard
(auto-clearing) rather than scrollback. The admin's Vault token is never used by
a non-admin: each user authenticates with their own scoped token.

## Verification

```bash
# the scoped token carries the right policy
VAULT_TOKEN="$(sudo cat /home/<user>/.config/claude-auth-sync/vault-token)" \
  vault token lookup -format=json | jq '.data.display_name, .data.policies'
# → "token-devvm-claude-auth-<user>", [..., "workstation-claude-<user>"]

sudo -u <user> -i bw --version   # /usr/bin/bw resolves for the user
sudo -u <user> -i homelab vault status
```
@ -41,6 +41,8 @@ Job 0 — preflight (pinned: k8s-node1)
├── halt-on-alert (kured-style ignore-list)
├── 24h-quiet baseline (no Ready transitions <24h ago)
├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume)
├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block)
├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups)
├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
├── Trigger backup-etcd Job, wait, verify snapshot byte count
├── SSH master: containerd skew fix (if master < workers)
|
@ -222,22 +224,34 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names
|
||||||
|
|
||||||
## Common Operations
|
## Common Operations
|
||||||
|
|
||||||
### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19)
|
### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24)
|
||||||
|
|
||||||
`kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
|
`kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
|
||||||
and drops the `--authentication-config` flag**, silently disabling apiserver
|
from kubeadm-config**. apiserver auth uses a structured multi-issuer
|
||||||
OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get
|
`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to
|
||||||
401). This used to require a manual re-apply after **every** control-plane bump.
|
still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade
|
||||||
|
reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does
|
||||||
|
NOT crash on this — verified by isolated repro; it's recoverable via the restore
|
||||||
|
script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue —
|
||||||
|
etcd IO starvation**, not this drift; post-mortem:
|
||||||
|
`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`.
|
||||||
|
|
||||||
**Now automated:** the `rbac` stack publishes its OIDC restore script to the
|
**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now
|
||||||
`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's
|
**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting
|
||||||
`phase_master` re-runs it on master immediately after `kubeadm upgrade apply`
|
`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of
|
||||||
(while tigera-operator is still quiesced, so the flag-add apiserver restart can't
|
its remote script. So kubeadm regenerates a **correct** manifest and the apiserver
|
||||||
crashloop the operator). It's idempotent, health-gates `/livez` with
|
upgrades with a pure image bump — `kubeadm upgrade diff <target>` shows only the
|
||||||
auto-rollback, and is **non-fatal** — a failure only lags SSO until the next rbac
|
image change. Zero live impact (the CM is read only during an upgrade).
|
||||||
apply (the version upgrade itself already succeeded). So a chain-driven
|
|
||||||
control-plane bump no longer breaks SSO. The master phase self-skips when master
|
**Backstops:**
|
||||||
is already at target, so this only runs when master was actually upgraded.
|
- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does
|
||||||
|
NOT block — the drift only breaks SSO, which is recoverable) if
|
||||||
|
`--authentication-config` would still be dropped.
|
||||||
|
- The `rbac` stack still publishes its restore script to the
|
||||||
|
`kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on
|
||||||
|
master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with
|
||||||
|
auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also*
|
||||||
|
re-reconciles kubeadm-config. Self-skips when master is already at target.
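
The preflight backstop can be approximated as a filter over the diff text. This is a minimal sketch, not the chain's actual check 4b: it assumes `kubeadm upgrade diff` emits unified-diff lines where a removed manifest line starts with `-`, and the function name `auth_config_dropped` is hypothetical.

```shell
# Hypothetical helper, not the chain's real preflight code.
# Reads unified-diff text on stdin and succeeds (exit 0) when the diff
# removes --authentication-config without adding it back.
auth_config_dropped() {
  local line removed=0 added=0
  while IFS= read -r line; do
    case "$line" in
      ---*|+++*) continue ;;                       # skip file header lines
      -*--authentication-config*) removed=1 ;;     # flag is being removed
      +*--authentication-config*) added=1 ;;       # flag is (re)added
    esac
  done
  (( removed == 1 && added == 0 ))
}
```

A possible use, with `$target` standing in for the target version: `kubeadm upgrade diff "$target" | auth_config_dropped && echo "WARN: --authentication-config would be dropped"`. Returning a status rather than exiting keeps the check non-blocking, which matches the alert-only behaviour described above.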
|
||||||
|
|
||||||
**Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the
|
**Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the
|
||||||
chain logged `WARN: --authentication-config absent after re-apply`:
|
chain logged `WARN: --authentication-config absent after re-apply`:
|
||||||
|
|
|
||||||
|
|
@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
|
||||||
[[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
|
[[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
|
||||||
KUBECTL=""
|
KUBECTL=""
|
||||||
JSON_RESULTS=()
|
JSON_RESULTS=()
|
||||||
TOTAL_CHECKS=47
|
TOTAL_CHECKS=48
|
||||||
|
|
||||||
# Parallel execution settings. Each check function is self-contained — it
|
# Parallel execution settings. Each check function is self-contained — it
|
||||||
# only reads cluster state and mutates the in-memory counters / JSON_RESULTS
|
# only reads cluster state and mutates the in-memory counters / JSON_RESULTS
|
||||||
|
|
@ -3156,6 +3156,44 @@ PYEOF
|
||||||
esac
|
esac
|
||||||
}
|
}
|
||||||
|
|
||||||
|
# --- 48. Goldmane edge-aggregator availability ---
|
||||||
|
#
|
||||||
|
# The goldmane-edge-aggregator Deployment (ADR-0014 / infra #58) streams Calico
|
||||||
|
# Goldmane flows into the goldmane_edges CNPG DB — the durable who-talks-to-whom
|
||||||
|
# trail. The pod has NO /metrics endpoint, so its liveness can't be scraped;
|
||||||
|
# this check reads the Deployment's Available condition directly so the trail
|
||||||
|
# silently dying surfaces in the health board (mirrors the AggregatorDown
|
||||||
|
# Prometheus alert). Missing Deployment / not-Available -> FAIL.
|
||||||
|
check_goldmane_aggregator() {
|
||||||
|
section 48 "Goldmane Edge-Aggregator"
|
||||||
|
local ns="goldmane-edge-aggregator" dep="goldmane-edge-aggregator"
|
||||||
|
local avail desired ready
|
||||||
|
|
||||||
|
# Existence probe first; an absent Deployment is a hard fail (the trail isn't deployed).
|
||||||
|
if ! $KUBECTL get deploy "$dep" -n "$ns" >/dev/null 2>&1; then
|
||||||
|
[[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
|
||||||
|
fail "Deployment $ns/$dep not found — who-talks-to-whom edge trail is not running"
|
||||||
|
json_add "goldmane_aggregator" "FAIL" "deployment missing"
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
avail=$($KUBECTL get deploy "$dep" -n "$ns" \
|
||||||
|
-o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)
|
||||||
|
ready=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.status.readyReplicas}' 2>/dev/null)
|
||||||
|
desired=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.spec.replicas}' 2>/dev/null)
|
||||||
|
ready=${ready:-0}
|
||||||
|
desired=${desired:-0}
|
||||||
|
|
||||||
|
if [[ "$avail" == "True" ]]; then
|
||||||
|
pass "Edge-aggregator Available ($ready/$desired ready)"
|
||||||
|
json_add "goldmane_aggregator" "PASS" "${ready}/${desired} ready"
|
||||||
|
else
|
||||||
|
[[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
|
||||||
|
fail "Edge-aggregator NOT Available ($ready/$desired ready) — edge trail has stopped recording"
|
||||||
|
json_add "goldmane_aggregator" "FAIL" "${ready}/${desired} ready; Available=${avail:-unknown}"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
# --- Summary ---
|
# --- Summary ---
|
||||||
print_summary() {
|
print_summary() {
|
||||||
if [[ "$JSON" == true ]]; then
|
if [[ "$JSON" == true ]]; then
|
||||||
|
|
@ -3224,7 +3262,7 @@ main() {
|
||||||
check_monitoring_prom_am check_monitoring_vault check_monitoring_css
|
check_monitoring_prom_am check_monitoring_vault check_monitoring_css
|
||||||
check_external_replicas check_external_divergence check_pve_thermals
|
check_external_replicas check_external_divergence check_pve_thermals
|
||||||
check_pve_load check_external_traefik_5xx check_ha_status_dashboard
|
check_pve_load check_external_traefik_5xx check_ha_status_dashboard
|
||||||
check_immich_search check_csi_ghost_drift
|
check_immich_search check_csi_ghost_drift check_goldmane_aggregator
|
||||||
)
|
)
|
||||||
|
|
||||||
# Auto-fix mutates cluster state inside individual checks — keep that
|
# Auto-fix mutates cluster state inside individual checks — keep that
|
||||||
|
|
|
||||||
|
|
@ -28,5 +28,61 @@ ok "accept own scoped Vault token" cas_vault_identity_ok token-devvm-claude-auth
|
||||||
no "reject another user's token" cas_vault_identity_ok token-devvm-claude-auth-anca default,workstation-claude-anca
|
no "reject another user's token" cas_vault_identity_ok token-devvm-claude-auth-anca default,workstation-claude-anca
|
||||||
no "reject wrong policy" cas_vault_identity_ok token-devvm-claude-auth-emo default,workstation-claude-anca
|
no "reject wrong policy" cas_vault_identity_ok token-devvm-claude-auth-emo default,workstation-claude-anca
|
||||||
|
|
||||||
|
# --- Regression: cas_backup must MERGE into the shared Vault path, preserving
|
||||||
|
# sibling keys that other tools co-locate there (e.g. `homelab vault`'s
|
||||||
|
# vaultwarden_* creds) — NOT overwrite the whole KV document. A blind `kv put`
|
||||||
|
# wiped them every 6h (claude-auth-sync clobber, 2026-06-26).
|
||||||
|
fakebin="$tmp/bin"; mkdir -p "$fakebin"
|
||||||
|
store="$tmp/vault-store.json"
|
||||||
|
cat > "$fakebin/vault" <<'FAKE'
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
# Minimal KV-v2 fake backed by $VAULT_FAKE_STORE (a flat JSON object).
|
||||||
|
[[ "$1" == kv ]] || { echo '{}'; exit 0; } # token lookup etc. -> ignore
|
||||||
|
op="$2"; shift 2
|
||||||
|
store="$VAULT_FAKE_STORE"
|
||||||
|
case "$op" in
|
||||||
|
get)
|
||||||
|
for a in "$@"; do [[ "$a" == -field=* ]] && field="${a#-field=}"; done
|
||||||
|
if [[ "$*" == *-format=json* ]]; then
|
||||||
|
[[ -f "$store" ]] || { echo "No value found"; exit 2; }
|
||||||
|
jq -n --argjson d "$(cat "$store")" '{data:{data:$d}}'; exit 0
|
||||||
|
fi
|
||||||
|
[[ -f "$store" ]] || exit 2 # bare get == existence check
|
||||||
|
if [[ -n "${field:-}" ]]; then
|
||||||
|
v="$(jq -r --arg k "$field" '.[$k] // empty' "$store")"; [[ -n "$v" ]] || exit 1
|
||||||
|
printf '%s' "$v"; exit 0
|
||||||
|
fi
|
||||||
|
exit 0 ;;
|
||||||
|
put) echo '{}' > "$store" ;; # full replace
|
||||||
|
patch) [[ -f "$store" ]] || { echo "No value found"; exit 2; } ;; # merge (rw)
|
||||||
|
*) exit 1 ;;
|
||||||
|
esac
|
||||||
|
for a in "$@"; do
|
||||||
|
case "$a" in
|
||||||
|
-*|secret/*) continue ;; # flags + the path arg
|
||||||
|
*=*) k="${a%%=*}"; v="${a#*=}"
|
||||||
|
t="$(mktemp)"; jq --arg k "$k" --arg v "$v" '.[$k]=$v' "$store" > "$t" && mv "$t" "$store" ;;
|
||||||
|
esac
|
||||||
|
done
|
||||||
|
exit 0
|
||||||
|
FAKE
|
||||||
|
chmod +x "$fakebin/vault"
|
||||||
|
|
||||||
|
CAS_VAULT_PATH="secret/workstation/claude-users/test"
|
||||||
|
CAS_CREDENTIALS="$tmp/credentials.json"
|
||||||
|
CAS_STATE_DIR="$tmp/state"
|
||||||
|
_oldpath="$PATH"; PATH="$fakebin:$PATH"; export VAULT_FAKE_STORE="$store"
|
||||||
|
|
||||||
|
printf '{"vaultwarden_master_password":"keep-me"}\n' > "$store" # pretend `homelab vault setup` ran
|
||||||
|
ok "backup succeeds (existing doc)" cas_backup
|
||||||
|
eq "merge preserves sibling key" keep-me "$(jq -r '.vaultwarden_master_password' "$store")"
|
||||||
|
eq "merge writes claude oauth" access "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")"
|
||||||
|
|
||||||
|
rm -f "$store" # fresh user: no doc yet
|
||||||
|
ok "backup succeeds (creates doc)" cas_backup
|
||||||
|
eq "create writes claude oauth" access "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")"
|
||||||
|
|
||||||
|
PATH="$_oldpath"; unset VAULT_FAKE_STORE
|
||||||
|
|
||||||
printf '\n%d passed, %d failed\n' "$pass" "$fail"
|
printf '\n%d passed, %d failed\n' "$pass" "$fail"
|
||||||
(( fail == 0 ))
|
(( fail == 0 ))
|
||||||
|
|
|
||||||
|
|
@ -82,7 +82,17 @@ cas_backup() {
|
||||||
return 1
|
return 1
|
||||||
}
|
}
|
||||||
expires="$(jq -r '.expiresAt' <<<"$oauth")"
|
expires="$(jq -r '.expiresAt' <<<"$oauth")"
|
||||||
vault kv put "$CAS_VAULT_PATH" \
|
# MERGE into the shared path so sibling keys other tools co-locate there
|
||||||
|
# (e.g. `homelab vault`'s vaultwarden_* creds) survive. `kv patch -method=rw`
|
||||||
|
# is read+update (needs no `patch` capability) but requires the secret to
|
||||||
|
# already exist, so create it with `kv put` on the very first backup only.
|
||||||
|
local -a write_cmd
|
||||||
|
if vault kv get "$CAS_VAULT_PATH" >/dev/null 2>&1; then
|
||||||
|
write_cmd=(vault kv patch -method=rw "$CAS_VAULT_PATH")
|
||||||
|
else
|
||||||
|
write_cmd=(vault kv put "$CAS_VAULT_PATH")
|
||||||
|
fi
|
||||||
|
"${write_cmd[@]}" \
|
||||||
claude_ai_oauth_json="$oauth" \
|
claude_ai_oauth_json="$oauth" \
|
||||||
credential_expires_at_ms="$expires" \
|
credential_expires_at_ms="$expires" \
|
||||||
backed_up_at="$(date -Is)" >/dev/null || {
|
backed_up_at="$(date -Is)" >/dev/null || {
|
||||||
|
|
|
||||||
|
|
@ -19,13 +19,29 @@ unpinned-CLI dependencies out of the hourly **root** reconcile.
|
||||||
|
|
||||||
- `mattpocock/skills` (https://github.com/mattpocock/skills) — all except `find-skills`
|
- `mattpocock/skills` (https://github.com/mattpocock/skills) — all except `find-skills`
|
||||||
- `vercel-labs/skills` (https://github.com/vercel-labs/skills) — `find-skills`
|
- `vercel-labs/skills` (https://github.com/vercel-labs/skills) — `find-skills`
|
||||||
|
- **homelab-local, emo-PERSONALIZED** — `cluster-health` here is an
|
||||||
|
**emo-specific variant**, not a copy of the canonical skill. It started as a
|
||||||
|
copy of this repo's `.claude/skills/cluster-health/` but was rewritten on
|
||||||
|
2026-06-26 to focus on ha-sofia + emo's Sofia devices (emo is the only entry
|
||||||
|
in `SKILL_USERS`, a read-only power-user). The canonical admin skill
|
||||||
|
(`.claude/skills/cluster-health/`) is the full 47-check version and is left
|
||||||
|
untouched. **Do NOT `cp -a` the canonical copy over this one** — that would
|
||||||
|
clobber the personalization. Maintain the two independently.
|
||||||
|
|
||||||
## Refreshing
|
## Refreshing
|
||||||
|
|
||||||
Re-snapshot from a current install and commit the diff:
|
Re-snapshot the upstream skills from a current install and commit the diff:
|
||||||
|
|
||||||
```sh
|
```sh
|
||||||
cp -a ~/.agents/skills/. scripts/workstation/claude-skills/
|
cp -a ~/.agents/skills/. scripts/workstation/claude-skills/
|
||||||
```
|
```
|
||||||
|
|
||||||
Snapshot taken 2026-06-23.
|
`cluster-health` is hand-maintained (emo variant) — it is **not** covered by the
|
||||||
|
`cp -a` above and must **not** be overwritten from `.claude/skills/`. Edit it in
|
||||||
|
place here when emo's needs change, then refresh his live copy (the provisioner's
|
||||||
|
`install_skills()` is if-absent, so it won't update an existing `~/.agents/skills`
|
||||||
|
copy — `cp` the new `SKILL.md` to `/home/emo/.agents/skills/cluster-health/` and
|
||||||
|
`chown emo:emo`, or remove emo's copy and re-run the reconcile).
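
The manual refresh described above could be wrapped in a small function. This is only a sketch: the name `refresh_skill_copy` is hypothetical, the destination directory is assumed to exist already, and changing ownership to another user needs root.

```shell
# Hypothetical helper: push an edited SKILL.md over a user's live copy.
# Arguments: source file, existing destination directory, owner as user:group.
refresh_skill_copy() {
  local src="$1" dest_dir="$2" owner="$3"
  cp "$src" "$dest_dir/SKILL.md" || return 1   # overwrite the live copy
  chown "$owner" "$dest_dir/SKILL.md"          # hand ownership back to the user
}
```

For emo that would be `refresh_skill_copy scripts/workstation/claude-skills/cluster-health/SKILL.md /home/emo/.agents/skills/cluster-health emo:emo`, run from the repo root.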
|
||||||
|
|
||||||
|
Snapshot taken 2026-06-23 (upstream); `cluster-health` vendored 2026-06-26,
|
||||||
|
personalized for emo 2026-06-26.
|
||||||
|
|
|
||||||
146
scripts/workstation/claude-skills/cluster-health/SKILL.md
Normal file
|
|
@ -0,0 +1,146 @@
---
name: cluster-health
description: |
  Personalized for emo. Check whether the homelab Kubernetes cluster is
  affecting ha-sofia or the Sofia smart-home devices it runs (Tuya devices,
  the MPPT ATS, lights, climate, security, irrigation). Use when:
  (1) "is ha-sofia ok", "are my devices / the ATS / the lights down",
  (2) "is the cluster affecting Sofia / my devices",
  (3) "check the cluster", "cluster health", "is everything running",
  (4) a device on the Барзини → Статус dashboard looks offline.
  Runs the cluster-wide healthcheck read-only and triages it by what
  ha-sofia actually depends on; the rest of the cluster is the admin's area.
author: Claude Code
version: 3.0.0-emo
date: 2026-06-26
---

# Cluster Health — personalized for emo (ha-sofia focus)

## What you actually care about

You care about **ha-sofia** and the **Sofia smart-home devices** it runs —
the Tuya devices, the **MPPT ATS**, and the lights / climate / security /
irrigation on your **Барзини → Статус** dashboard. The wider Kubernetes
cluster matters to you **only when it's breaking something ha-sofia or your
devices depend on.** Anything else is the admin's (wizard's) area — note it in
one line and move on; don't chase it.

You have **read-only** cluster access. You can SEE everything but change
nothing — so when something on your chain is broken, the job is to confirm it
and hand it off, not to repair it.

## How ha-sofia depends on the cluster

ha-sofia itself runs at the house (HAOS at https://ha-sofia.viktorbarzin.me) —
**not** in the cluster. The cluster reaches it through exactly two things:

1. **tuya-bridge** (namespace `tuya-bridge`) — the REST API ha-sofia calls for
   every Tuya device **and the MPPT ATS**. If it's unhealthy, your Tuya devices +
   ATS stop responding. **This is the #1 thing to check.**
2. **The path that carries ha-sofia ⇄ tuya-bridge and keeps ha-sofia
   reachable**: cloudflared (tunnel) → Traefik (LB) → the ingress + TLS cert
   for `tuya-bridge.viktorbarzin.me` and `ha-sofia.viktorbarzin.me`, plus
   Technitium DNS. If any of these break, ha-sofia can't reach tuya-bridge and
   you can't reach ha-sofia remotely.

Everything else in the cluster is unrelated to you unless it's hosting one of
those pods.

## Step 1 — run the healthcheck (read-only, with your HA token)

Your account can't read Vault, so load your own ha-sofia token first (it was
minted for you and lives at `~/.config/cluster-health/haos_token`). Then run
the script from YOUR clone, read-only:

```bash
cd /home/emo/code
export HOME_ASSISTANT_SOFIA_TOKEN="$(cat ~/.config/cluster-health/haos_token)"
bash scripts/cluster_healthcheck.sh --no-fix --quiet
# machine-readable instead:
# bash scripts/cluster_healthcheck.sh --no-fix --quiet --json | tee /tmp/cluster-health.json
```

- **Never pass `--fix`** — it deletes pods (a write); you're read-only and it
  will fail.
- Exit codes: `0` healthy, `1` warnings, `2` failures.
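
The documented exit codes can be turned into a one-line verdict. A minimal sketch, assuming only the three codes listed above; the function name `describe_health_exit` is hypothetical and not part of the healthcheck script.

```shell
# Hypothetical wrapper: map the healthcheck's documented exit codes
# (0 healthy, 1 warnings, 2 failures) to a short verdict.
describe_health_exit() {
  case "$1" in
    0) echo "healthy" ;;
    1) echo "warnings" ;;
    2) echo "failures" ;;
    *) echo "unexpected exit code: $1" ;;   # anything else is undocumented
  esac
}
```

It would be called right after the run, for example `bash scripts/cluster_healthcheck.sh --no-fix --quiet; describe_health_exit "$?"`.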

With the token exported, the **ha-sofia checks run for you**:
26 Entity Availability · 27 Integration Health · 28 Automation Status ·
29 System Resources · **45 Status Dashboard** — your Барзини → Статус view,
classifying every device tile as OK / ⚠️ / Offline across Сигурност, Мрежа &
IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна. Check 30 also
covers the **tuya** exporter.

## Step 2 — triage the output by relevance to YOU

Read the PASS/WARN/FAIL summary, then split the WARN/FAIL items in two:

- **On your chain → this is what matters.** Anything touching: `tuya-bridge`,
  `cloudflared`, `traefik`, DNS (check 21), the TLS cert / ingress for your two
  hosts (checks 12, 22, 31, 32), or a **node** hosting those pods — plus all the
  **ha-sofia** checks (26–29, 45) and the **tuya** exporter (30).
- **Not on your chain → one line, then drop it.** Summarise as "N unrelated
  cluster issues (admin's area)" and don't investigate.
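
The split by check number can be sketched as a lookup. This covers only the numbered checks named in the list above (a failing node or pod still needs a manual look), and both function names are hypothetical.

```shell
# Hypothetical helper: is this check number on the ha-sofia chain?
# Set taken from the triage list: 12, 21, 22, 31, 32 (DNS / TLS / ingress),
# 26-29 and 45 (ha-sofia checks), 30 (tuya exporter).
on_chain_check() {
  case "$1" in
    12|21|22|26|27|28|29|30|31|32|45) return 0 ;;
    *) return 1 ;;
  esac
}

# Print each on-chain check number, then summarise the rest in one line.
triage_checks() {
  local n unrelated=0
  for n in "$@"; do
    if on_chain_check "$n"; then
      echo "on-chain: check $n"
    else
      unrelated=$((unrelated + 1))
    fi
  done
  echo "$unrelated unrelated cluster issues (admin's area)"
}
```

For example, `triage_checks 21 14 45 36` lists checks 21 and 45 as on-chain and folds 14 and 36 into the summary line.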

## Step 3 — read-only checks for your chain

All of these work with your read-only access:

```bash
# tuya-bridge — your devices + the ATS
kubectl get pods -n tuya-bridge
kubectl rollout status deploy/tuya-bridge -n tuya-bridge
kubectl logs -n tuya-bridge deploy/tuya-bridge --tail=50

# the reachability path ha-sofia uses
kubectl get pods -n cloudflared
kubectl get pods -n traefik
kubectl get ingress -A | grep -Ei 'tuya-bridge|ha-sofia'

# whole external path in one shot (DNS + tunnel + Traefik + cert):
curl -sI --max-time 10 https://tuya-bridge.viktorbarzin.me | head -1
# reachable -> HTTP/2 200 / 401 / 403 (any HTTP response = path is up)
# broken -> curl: timeout / could not resolve host
```

The fastest **device-level** signal is your own dashboard: open
**https://ha-sofia.viktorbarzin.me → Барзини → Статус**. If devices show
Offline / Разкачен / ⚠️ **but tuya-bridge is healthy**, the problem is at the
house (device power / Wi-Fi / the Sofia TP-Link network) — **not** the cluster.
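
The rule for the `curl` probe in Step 3 (any HTTP response means the path is up) can be written as a small classifier. A sketch only; the function name `path_verdict` is hypothetical.

```shell
# Hypothetical helper: interpret the first line of `curl -sI` output.
# Any HTTP status line means DNS, the tunnel, Traefik and the cert all
# answered; an empty string means curl failed before getting a response.
path_verdict() {
  case "$1" in
    HTTP/*) echo "path up" ;;
    *)      echo "path broken" ;;
  esac
}
```

Usage would look like `path_verdict "$(curl -sI --max-time 10 https://tuya-bridge.viktorbarzin.me | head -1)"`. The status code itself is deliberately ignored, since a 401 or 403 still proves the path works.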

## Step 4 — if something on your chain is broken

You can't fix the cluster (read-only), so **capture + hand off**:

```bash
kubectl describe pod -n tuya-bridge <pod>
kubectl logs -n tuya-bridge <pod> --previous --tail=200
```

Then file it for the admin with the **`/file-issue`** skill — e.g. *"ha-sofia
Tuya devices + ATS unresponsive; tuya-bridge pod CrashLooping"* with the output
above. cloudflared / Traefik / DNS outages are cluster-wide — the admin's
alerting is already firing, but file it so it's tracked from your side too.

## What will skip for you (expected — not failures)

A few checks need access your account doesn't have. They warn/skip — that's
normal, and **none of them are on your ha-sofia chain**:

- **Uptime Kuma (14)** — needs an admin password from Vault.
- **PVE host checks** — 36 (LVM snapshots), 43 (host thermals), 44 (host load),
  and the Proxmox CSI ghost-disk check — all need root SSH to the Proxmox host.
- **`--fix`** — pod deletion (a write); not available to you.

(The ha-sofia checks are **not** in this list — your token makes them work.)

## Your ha-sofia token

- Stored at `~/.config/cluster-health/haos_token` (yours, mode 600).
- It's a **dedicated** long-lived token, named `emo-cluster-health` under
  ha-sofia → your profile → **Long-Lived Access Tokens**. Revoking it there
  affects only you.
- It currently carries admin-level HA scope (Home Assistant only lets a token
  be minted for the account that created it, and it was minted via the admin
  account). If it ever stops working, tell wizard and a fresh one can be minted.
|
||||||
|
|
@ -1,4 +1,4 @@
|
||||||
{
|
{
|
||||||
"claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner). Do not attempt to escalate privileges or reach another user's resources.\n- Secrets are per-user. Never read another user's home directory, credentials, tokens, or ~/.claude secrets. Your own secrets live in your home at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. 
Org red-lines on top:\n - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n - Push rejected by branch protection (user not whitelisted) → fall back to a <os-user>/<topic> branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n - Keep every clone on a clean master when done; tell the user in plain words what happened.\n - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.\n",
|
"claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier for cluster/infra. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner) — don't try to exceed it. At the OS level, authorization follows the OS: where your user holds `sudo`, using it is permitted and audited.\n- OS access follows OS permissions, not a separate rule here. You may read whatever your OS user can read — directly or via `sudo` where you hold it — including other users' home directories, credentials, tokens, and `~/.claude`. Do not impose restrictions stricter than the OS; no extra prompt is required for OS-authorized access. Protect your own secrets at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. 
Org red-lines on top:\n - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n - Push rejected by branch protection (user not whitelisted) → fall back to a <os-user>/<topic> branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n - Keep every clone on a clean master when done; tell the user in plain words what happened.\n - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.\n",
|
||||||
"model": "claude-opus-4-8"
|
"model": "claude-opus-4-8"
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
@ -72,11 +72,14 @@ if [[ -n "$want_t3" && "$(t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# 2c) Bitwarden CLI — backs `homelab vault` (per-user no-HITL Vaultwarden access).
|
# 2c) Bitwarden CLI — backs `homelab vault` (per-user no-HITL Vaultwarden access).
|
||||||
# npm-global so every user's PATH resolves it. Pinned major; best-effort (a
|
# Install SYSTEM-WIDE (npm prefix /usr → /usr/bin/bw) so EVERY user's PATH
|
||||||
# failure only disables `homelab vault`, nothing else on the box).
|
# resolves it. The guard tests the SYSTEM path, NOT `command -v bw`: the
|
||||||
if ! command -v bw >/dev/null; then
|
# latter is satisfied by an admin's own ~/.local/bin/bw and would skip the
|
||||||
log "npm: installing @bitwarden/cli (homelab vault backend)"
|
# system install, leaving non-admins (emo, anca, …) with no backend. Pinned
|
||||||
npm install -g "@bitwarden/cli@^2024" >/dev/null 2>&1 || log "WARN: @bitwarden/cli install failed; homelab vault unavailable"
|
# major; best-effort (a failure only disables `homelab vault`).
|
||||||
|
if [ ! -x /usr/bin/bw ] && [ ! -x /usr/local/bin/bw ]; then
|
||||||
|
log "npm: installing @bitwarden/cli system-wide (homelab vault backend)"
|
||||||
|
npm install -g --prefix /usr "@bitwarden/cli@^2024" >/dev/null 2>&1 || log "WARN: @bitwarden/cli install failed; homelab vault unavailable"
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# 3) kubelogin (kubectl oidc-login) system-wide — NOT the apt 'kubelogin' (= Azure tool).
|
# 3) kubelogin (kubectl oidc-login) system-wide — NOT the apt 'kubelogin' (= Azure tool).
|
||||||
|
|
|
||||||
|
|
@ -5,6 +5,9 @@ variable "tls_secret_name" {
|
||||||
variable "nfs_server" { type = string }
|
variable "nfs_server" { type = string }
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -5,6 +5,9 @@ variable "tls_secret_name" {
|
||||||
variable "nfs_server" { type = string }
|
variable "nfs_server" { type = string }
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -42,6 +45,9 @@ data "kubernetes_secret" "eso_secrets" {
|
||||||
# DB credentials from Vault database engine (rotated automatically)
|
# DB credentials from Vault database engine (rotated automatically)
|
||||||
# Provides DATABASE_URL that auto-updates when password rotates
|
# Provides DATABASE_URL that auto-updates when password rotates
|
||||||
resource "kubernetes_manifest" "db_external_secret" {
|
resource "kubernetes_manifest" "db_external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -6,6 +6,9 @@
|
||||||
# are non-secret and live in values.yaml. The reloader annotation rolls the
|
# are non-secret and live in values.yaml. The reloader annotation rolls the
|
||||||
# authentik pods if the password ever changes.
|
# authentik pods if the password ever changes.
|
||||||
resource "kubernetes_manifest" "authentik_email_secret" {
|
resource "kubernetes_manifest" "authentik_email_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -601,6 +601,9 @@ resource "kubernetes_config_map" "beadboard_config" {
|
||||||
# Pulls the claude-agent-service bearer token from Vault so BeadBoard can
|
# Pulls the claude-agent-service bearer token from Vault so BeadBoard can
|
||||||
# dispatch agent jobs via the in-cluster HTTP API.
|
# dispatch agent jobs via the in-cluster HTTP API.
|
||||||
resource "kubernetes_manifest" "beadboard_agent_service_secret" {
|
resource "kubernetes_manifest" "beadboard_agent_service_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -28,6 +28,9 @@ resource "kubernetes_namespace" "broker_sync" {
|
||||||
# trading212_api_keys — JSON array of {account_id, account_type, api_key, name, currency}
|
# trading212_api_keys — JSON array of {account_id, account_type, api_key, name, currency}
|
||||||
# imap_host, imap_user, imap_password, imap_directory — for InvestEngine + Schwab email ingest
|
# imap_host, imap_user, imap_password, imap_directory — for InvestEngine + Schwab email ingest
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -212,3 +212,65 @@ resource "kubectl_manifest" "whisker" {
|
||||||
spec = { notifications = "Disabled" }
|
spec = { notifications = "Disabled" }
|
||||||
})
|
})
|
||||||
}
|
}
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Gated public ingress for the Whisker UI (infra #57 / ADR-0014).
|
||||||
|
#
|
||||||
|
# whisker.viktorbarzin.me -> whisker:8081, Authentik-gated (auth="required":
|
||||||
|
# Whisker ships with NO login of its own; it is an admin observability UI, so Authentik
|
||||||
|
# forward-auth is the only gate between strangers and the flow view). The
|
||||||
|
# operator replicated `tls-secret` into calico-system already.
|
||||||
|
#
|
||||||
|
# TWO coupled pieces are required because the operator's own `whisker`
|
||||||
|
# NetworkPolicy (owned by the Whisker CR above) sets policyTypes:[Ingress]
|
||||||
|
# with NO ingress rules => default-deny on ingress to the whisker pod. The
|
||||||
|
# additive NP below ORs in a Traefik allow (k8s NetworkPolicies are additive
|
||||||
|
# across policies selecting the same pod), so we never edit the operator NP.
|
||||||
|
module "ingress_whisker" {
|
||||||
|
source = "../../modules/kubernetes/ingress_factory"
|
||||||
|
dns_type = "proxied"
|
||||||
|
namespace = "calico-system"
|
||||||
|
name = "whisker"
|
||||||
|
service_name = "whisker"
|
||||||
|
port = 8081
|
||||||
|
auth = "required"
|
||||||
|
tls_secret_name = "tls-secret"
|
||||||
|
extra_annotations = {
|
||||||
|
"gethomepage.dev/enabled" = "true"
|
||||||
|
"gethomepage.dev/name" = "Whisker"
|
||||||
|
"gethomepage.dev/description" = "Calico flow observability (who-talks-to-whom)"
|
||||||
|
"gethomepage.dev/icon" = "calico.png"
|
||||||
|
"gethomepage.dev/group" = "Infrastructure"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
# Additive NetworkPolicy: permit Traefik -> whisker:8081. ORs with the
|
||||||
|
# operator's default-deny `whisker` NP (selecting the same pod) so Traefik
|
||||||
|
# can reach the UI without touching the operator-owned policy.
|
||||||
|
resource "kubernetes_network_policy_v1" "whisker_allow_traefik" {
|
||||||
|
metadata {
|
||||||
|
name = "whisker-allow-traefik"
|
||||||
|
namespace = "calico-system"
|
||||||
|
}
|
||||||
|
spec {
|
||||||
|
pod_selector {
|
||||||
|
match_labels = {
|
||||||
|
"app.kubernetes.io/name" = "whisker"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
policy_types = ["Ingress"]
|
||||||
|
ingress {
|
||||||
|
from {
|
||||||
|
namespace_selector {
|
||||||
|
match_labels = {
|
||||||
|
"kubernetes.io/metadata.name" = "traefik"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
ports {
|
||||||
|
port = "8081"
|
||||||
|
protocol = "TCP"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
|
||||||
|
|
@ -19,6 +19,9 @@ resource "kubernetes_namespace" "changedetection" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -19,14 +19,14 @@ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do
|
||||||
sleep 2
|
sleep 2
|
||||||
done
|
done
|
||||||
|
|
||||||
# websockify runs as PID 1; x11vnc is a child so its logs land on container stdout
|
# Both x11vnc and websockify run as supervised children of this entrypoint (PID
|
||||||
# `-noshm` skips MIT-SHM probes that fail across container boundaries (each
|
# 1) so their logs land on container stdout and the `wait -n` at the end can catch
|
||||||
# container has its own /dev/shm); `-noxdamage` skips XDAMAGE which Xvfb
|
# either one dying. `-noshm` skips MIT-SHM probes that fail across container
|
||||||
# doesn't expose; `-quiet` keeps the polling chatter out of pod logs.
|
# boundaries (each container has its own /dev/shm); `-noxdamage` skips XDAMAGE
|
||||||
|
# which Xvfb doesn't expose; `-quiet` keeps the polling chatter out of pod logs.
|
||||||
echo "starting x11vnc -> :5900"
|
echo "starting x11vnc -> :5900"
|
||||||
x11vnc -display localhost:99 -nopw -listen 0.0.0.0 -rfbport 5900 \
|
x11vnc -display localhost:99 -nopw -listen 0.0.0.0 -rfbport 5900 \
|
||||||
-forever -shared -noshm -noxdamage -quiet 2>&1 &
|
-forever -shared -noshm -noxdamage -quiet 2>&1 &
|
||||||
X11VNC_PID=$!
|
|
||||||
|
|
||||||
for i in 1 2 3 4 5 6 7 8 9 10; do
|
for i in 1 2 3 4 5 6 7 8 9 10; do
|
||||||
if echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
|
if echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
|
||||||
|
|
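The readiness check in this hunk opens bash's `/dev/tcp` pseudo-file to see whether x11vnc is accepting connections on :5900 yet. A minimal sketch of the same check as a reusable function, assuming a bash built with network redirections; `wait_for_port` and its parameters are hypothetical names and are not part of the entrypoint in this diff:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the original entrypoint): poll a TCP port until
# it accepts a connection or the attempt budget runs out.
wait_for_port() {
  local host=$1 port=$2 attempts=${3:-10} delay=${4:-1} i
  for ((i = 0; i < attempts; i++)); do
    # Redirecting to /dev/tcp/HOST/PORT makes bash attempt a TCP connect.
    # The subshell closes the socket again as soon as the probe finishes.
    if (echo > "/dev/tcp/${host}/${port}") 2>/dev/null; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}
```

A call such as `wait_for_port 127.0.0.1 5900 10 1 || exit 1` would roughly mirror the ten-iteration loop shown above. The one-second delay is an assumption, since the loop body's sleep is not visible in this hunk.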
@ -43,4 +43,18 @@ if ! echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
|
||||||
fi
|
fi
|
||||||
|
|
||||||
echo "starting websockify -> :6080"
|
echo "starting websockify -> :6080"
|
||||||
exec websockify --web=/usr/share/novnc 6080 localhost:5900
|
# Run websockify in the background (it was `exec`ed before) so BOTH it and x11vnc
|
||||||
|
# are supervised. x11vnc attaches to the chrome-service container's Xvfb over
|
||||||
|
# localhost:6099 (shared pod network); when that container restarts, x11vnc loses
|
||||||
|
# its X connection and exits. Previously websockify was PID 1 and x11vnc was an
|
||||||
|
# unsupervised child, so a dead x11vnc was never relaunched: :5900 stayed dead and
|
||||||
|
# the noVNC view went black until a manual pod restart. Now if EITHER process
|
||||||
|
# exits, `wait -n` returns and we exit non-zero so the kubelet restarts this
|
||||||
|
# container, which re-waits for Xvfb and relaunches x11vnc — the bridge self-heals
|
||||||
|
# across browser-container restarts. (Same supervision pattern as the
|
||||||
|
# android-emulator stack's entrypoint.)
|
||||||
|
websockify --web=/usr/share/novnc 6080 localhost:5900 &
|
||||||
|
|
||||||
|
wait -n || true
|
||||||
|
echo "novnc: a supervised process (x11vnc or websockify) exited; exiting so the kubelet restarts this container." >&2
|
||||||
|
exit 1
|
||||||
|
|
|
||||||
|
|
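The supervision change above relies on `wait -n`, which returns as soon as any one background job exits. A minimal sketch of that pattern in isolation, assuming bash 4.3 or newer; `supervise_any` is a hypothetical name, and stopping the surviving sibling is an extra step the entrypoint in this diff does not take (it simply exits and lets the kubelet restart the container):

```shell
#!/usr/bin/env bash
# Hypothetical helper: start each argument as a background command, block
# until the FIRST one exits, stop the rest, and report failure to the caller.
supervise_any() {
  local pids=() cmd status=0
  for cmd in "$@"; do
    bash -c "$cmd" &
    pids+=("$!")
  done
  # wait -n returns when any single background job finishes (bash >= 4.3).
  wait -n || status=$?
  # Stop whatever is still running so nothing is left half-alive.
  kill "${pids[@]}" 2>/dev/null || true
  wait 2>/dev/null || true
  echo "a supervised process exited (status ${status}); giving up" >&2
  # Always fail: in a container entrypoint this is what triggers a restart.
  return 1
}
```

In an entrypoint this could be used as `supervise_any "x11vnc ..." "websockify ..." || exit 1`, with the real command lines filled in. Returning non-zero even when the child exited cleanly matches the intent of `wait -n || true` followed by `exit 1` in the diff: either process ending is treated as a reason to restart.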
@ -41,6 +41,9 @@ resource "kubernetes_namespace" "chrome_service" {
|
||||||
# --- Secrets (single-key extract: api_bearer_token) ---
|
# --- Secrets (single-key extract: api_bearer_token) ---
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -330,15 +333,23 @@ resource "kubernetes_deployment" "chrome_service" {
|
||||||
container {
|
container {
|
||||||
name = "novnc"
|
name = "novnc"
|
||||||
# Phase 3 cutover 2026-05-07 — Forgejo registry consolidation.
|
# Phase 3 cutover 2026-05-07 — Forgejo registry consolidation.
|
||||||
image = "ghcr.io/viktorbarzin/chrome-service-novnc:latest"
|
# SHA-pinned (not :latest): Keel is OFF for this deployment
|
||||||
|
# (keel.sh/policy=never, below) and :latest/IfNotPresent won't re-pull a
|
||||||
|
# rebuilt image, so a new noVNC entrypoint only deploys when this digest
|
||||||
|
# is bumped here. Bump after build-chrome-service-novnc.yml pushes a new
|
||||||
|
# SHA tag — then WAIT for that apply pipeline to finish before pushing
|
||||||
|
# anything else: Woodpecker cancel-previous SIGKILLs an in-flight apply
|
||||||
|
# mid-run (memory id=1957), which is exactly how the 2026-06-27 apply got
|
||||||
|
# killed. 2026-06-27: bumped to land the x11vnc-supervision self-heal fix
|
||||||
|
# (noVNC went black after a browser-container restart; see
|
||||||
|
# docs/architecture/chrome-service.md "x11vnc supervision").
|
||||||
|
image = "ghcr.io/viktorbarzin/chrome-service-novnc:19d0f0933a8ec75be6cfa077db88e0f8c3760f40"
|
||||||
image_pull_policy = "IfNotPresent"
|
image_pull_policy = "IfNotPresent"
|
||||||
# Cap RLIMIT_NOFILE before the entrypoint runs. Containerd grants pods
|
# Cap RLIMIT_NOFILE before the entrypoint runs. Containerd grants pods
|
||||||
# nofile=2^31; x11vnc sweeps the whole fd table on each client connect,
|
# nofile=2^31; x11vnc sweeps the whole fd table on each client connect,
|
||||||
# so every VNC connection hangs on "Connecting" until it times out
|
# so every VNC connection hangs on "Connecting" until it times out
|
||||||
# (fd-sweep bug, same as android-emulator). entrypoint.sh now also sets
|
# (fd-sweep bug, same as android-emulator). entrypoint.sh also sets this;
|
||||||
# this, but the image is :latest/IfNotPresent so a rebuilt entrypoint
|
# the wrapper keeps the cap deterministic even off a cached image.
|
||||||
# isn't guaranteed to be pulled — this wrapper applies the cap
|
|
||||||
# deterministically on every rollout off the cached image.
|
|
||||||
command = ["bash", "-c", "ulimit -n 65536; exec /entrypoint.sh"]
|
command = ["bash", "-c", "ulimit -n 65536; exec /entrypoint.sh"]
|
||||||
port {
|
port {
|
||||||
name = "http"
|
name = "http"
|
||||||
|
|
|
||||||
|
|
@ -49,6 +49,9 @@ resource "kubernetes_namespace" "ci_pipeline_health" {
|
||||||
# billing on PRIVATE mirrors, which a future scoped read:packages rotation of
|
# billing on PRIVATE mirrors, which a future scoped read:packages rotation of
|
||||||
# the alias could not do. Blast radius = this single-CronJob namespace.
|
# the alias could not do. Blast radius = this single-CronJob namespace.
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -38,6 +38,9 @@ resource "kubernetes_namespace" "claude_agent" {
|
||||||
# --- Secrets ---
|
# --- Secrets ---
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -57,6 +57,9 @@ resource "kubernetes_service_account" "breakglass" {
|
||||||
# DENIED this path (see stacks/vault/main.tf) so the shared, prompt-injectable
|
# DENIED this path (see stacks/vault/main.tf) so the shared, prompt-injectable
|
||||||
# pod can never read it.
|
# pod can never read it.
|
||||||
resource "kubernetes_manifest" "external_secret_ssh" {
|
resource "kubernetes_manifest" "external_secret_ssh" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -82,6 +85,9 @@ resource "kubernetes_manifest" "external_secret_ssh" {
|
||||||
# Env secrets: the Anthropic OAuth token (shared with claude-agent-service —
|
# Env secrets: the Anthropic OAuth token (shared with claude-agent-service —
|
||||||
# same account) and the app bearer token (in-cluster/CLI fallback caller auth).
|
# same account) and the app bearer token (in-cluster/CLI fallback caller auth).
|
||||||
resource "kubernetes_manifest" "external_secret_env" {
|
resource "kubernetes_manifest" "external_secret_env" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -29,6 +29,9 @@ resource "kubernetes_namespace" "claude-memory" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -57,6 +60,9 @@ resource "kubernetes_manifest" "external_secret" {
|
||||||
|
|
||||||
# DB credentials from Vault database engine (rotated every 24h)
|
# DB credentials from Vault database engine (rotated every 24h)
|
||||||
resource "kubernetes_manifest" "db_external_secret" {
|
resource "kubernetes_manifest" "db_external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -5,6 +5,9 @@ variable "tls_secret_name" {
|
||||||
variable "public_ip" { type = string }
|
variable "public_ip" { type = string }
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -23,6 +23,9 @@ resource "kubernetes_namespace" "dawarich" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -745,7 +745,10 @@ resource "kubernetes_deployment" "phpmyadmin" {
|
||||||
labels = {
|
labels = {
|
||||||
"app" = "phpmyadmin"
|
"app" = "phpmyadmin"
|
||||||
tier = var.tier
|
tier = var.tier
|
||||||
|
# ADR-0014 service identity: dbaas is a multi-Service namespace, so the
|
||||||
|
# namespace alone can't attribute Goldmane flows. Value = the fronting
|
||||||
|
# Service name (kubernetes_service.phpmyadmin is named "pma").
|
||||||
|
"service-identity" = "pma"
|
||||||
}
|
}
|
||||||
annotations = {
|
annotations = {
|
||||||
"reloader.stakater.com/search" = "true"
|
"reloader.stakater.com/search" = "true"
|
||||||
|
|
@ -762,6 +765,10 @@ resource "kubernetes_deployment" "phpmyadmin" {
|
||||||
metadata {
|
metadata {
|
||||||
labels = {
|
labels = {
|
||||||
"app" = "phpmyadmin"
|
"app" = "phpmyadmin"
|
||||||
|
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
|
||||||
|
# disambiguating identity must live on the pod template (not just
|
||||||
|
# the Deployment metadata above). Not in selector → no replace.
|
||||||
|
"service-identity" = "pma"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
spec {
|
spec {
|
||||||
|
|
@ -812,8 +819,19 @@ resource "kubernetes_deployment" "phpmyadmin" {
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
lifecycle {
|
lifecycle {
|
||||||
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
ignore_changes = [
|
||||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
||||||
|
# This Deployment is Keel-enrolled (keel.sh/policy=patch). Ignore the
|
||||||
|
# attributes Keel/Kyverno mutate at runtime so `terragrunt apply` (incl.
|
||||||
|
# the daily drift plan) doesn't fight them or revert the live image —
|
||||||
|
# canonical KEEL/KYVERNO lifecycle guard, matches linkwarden/chrome-service.
|
||||||
|
metadata[0].annotations["keel.sh/policy"],
|
||||||
|
metadata[0].annotations["keel.sh/trigger"],
|
||||||
|
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
|
||||||
|
metadata[0].annotations["keel.sh/match-tag"],
|
||||||
|
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
|
||||||
|
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
|
||||||
|
]
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -1499,6 +1517,10 @@ resource "kubernetes_deployment" "pgadmin" {
|
||||||
}
|
}
|
||||||
labels = {
|
labels = {
|
||||||
tier = var.tier
|
tier = var.tier
|
||||||
|
# ADR-0014 service identity: dbaas is a multi-Service namespace, so the
|
||||||
|
# namespace alone can't attribute Goldmane flows. Value = the fronting
|
||||||
|
# Service name (kubernetes_service.pgadmin is named "pgadmin").
|
||||||
|
"service-identity" = "pgadmin"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
spec {
|
spec {
|
||||||
|
|
@ -1514,6 +1536,10 @@ resource "kubernetes_deployment" "pgadmin" {
|
||||||
metadata {
|
metadata {
|
||||||
labels = {
|
labels = {
|
||||||
app = "pgadmin"
|
app = "pgadmin"
|
||||||
|
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
|
||||||
|
# disambiguating identity must live on the pod template (not just
|
||||||
|
# the Deployment metadata above). Not in selector → no replace.
|
||||||
|
"service-identity" = "pgadmin"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
spec {
|
spec {
|
||||||
|
|
@ -1568,8 +1594,20 @@ resource "kubernetes_deployment" "pgadmin" {
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
lifecycle {
|
lifecycle {
|
||||||
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
ignore_changes = [
|
||||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
||||||
|
# This Deployment is Keel-enrolled (keel.sh/policy=patch) and Keel has
|
||||||
|
# bumped the live image (dpage/pgadmin4:9.16). Ignore the Keel/Kyverno
|
||||||
|
# runtime-mutated attributes so `terragrunt apply` (incl. the daily drift
|
||||||
|
# plan) doesn't revert the image to bare `dpage/pgadmin4` or strip Keel's
|
||||||
|
# annotations — canonical guard, matches linkwarden/chrome-service.
|
||||||
|
metadata[0].annotations["keel.sh/policy"],
|
||||||
|
metadata[0].annotations["keel.sh/trigger"],
|
||||||
|
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
|
||||||
|
metadata[0].annotations["keel.sh/match-tag"],
|
||||||
|
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
|
||||||
|
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
|
||||||
|
]
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
resource "kubernetes_service" "pgadmin" {
|
resource "kubernetes_service" "pgadmin" {
|
||||||
|
|
|
||||||
|
|
@ -20,6 +20,9 @@ resource "kubernetes_namespace" "diun" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -20,6 +20,9 @@ resource "kubernetes_namespace" "ebooks" {
|
||||||
|
|
||||||
# ExternalSecrets for all three sources
|
# ExternalSecrets for all three sources
|
||||||
resource "kubernetes_manifest" "calibre_external_secret" {
|
resource "kubernetes_manifest" "calibre_external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -47,6 +50,9 @@ resource "kubernetes_manifest" "calibre_external_secret" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "audiobookshelf_external_secret" {
|
resource "kubernetes_manifest" "audiobookshelf_external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -74,6 +80,9 @@ resource "kubernetes_manifest" "audiobookshelf_external_secret" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "servarr_external_secret" {
|
resource "kubernetes_manifest" "servarr_external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -33,6 +33,9 @@ resource "kubernetes_namespace" "f1-stream" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -62,6 +65,9 @@ resource "kubernetes_manifest" "external_secret" {
|
||||||
# Pull the chrome-service bearer token into this namespace as a separate
|
# Pull the chrome-service bearer token into this namespace as a separate
|
||||||
# Secret so the verifier can reach the in-cluster Playwright pool.
|
# Secret so the verifier can reach the in-cluster Playwright pool.
|
||||||
resource "kubernetes_manifest" "chrome_service_client_secret" {
|
resource "kubernetes_manifest" "chrome_service_client_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -53,6 +53,9 @@ resource "kubernetes_namespace" "fire_planner" {
|
||||||
# Seed before applying:
|
# Seed before applying:
|
||||||
# secret/fire-planner -> property `recompute_bearer_token`
|
# secret/fire-planner -> property `recompute_bearer_token`
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -115,6 +118,9 @@ resource "kubernetes_manifest" "external_secret" {
|
||||||
# Template builds the asyncpg DSN consumed by the FastAPI app + CronJob
|
# Template builds the asyncpg DSN consumed by the FastAPI app + CronJob
|
||||||
# as DB_CONNECTION_STRING.
|
# as DB_CONNECTION_STRING.
|
||||||
resource "kubernetes_manifest" "db_external_secret" {
|
resource "kubernetes_manifest" "db_external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -159,6 +165,9 @@ resource "kubernetes_manifest" "db_external_secret" {
|
||||||
# pg-sync sidecar populates `daily_account_valuation` etc. hourly; the
|
# pg-sync sidecar populates `daily_account_valuation` etc. hourly; the
|
||||||
# fire-planner ingest reads those tables via this role.
|
# fire-planner ingest reads those tables via this role.
|
||||||
resource "kubernetes_manifest" "wealthfolio_sync_db_external_secret" {
|
resource "kubernetes_manifest" "wealthfolio_sync_db_external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -661,6 +670,9 @@ variable "run_examples_bulk_ingest" {
|
||||||
|
|
||||||
# Reddit OAuth creds pulled from Vault secret/viktor.
|
# Reddit OAuth creds pulled from Vault secret/viktor.
|
||||||
resource "kubernetes_manifest" "external_secret_examples_reddit" {
|
resource "kubernetes_manifest" "external_secret_examples_reddit" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -701,6 +713,9 @@ resource "kubernetes_manifest" "external_secret_examples_reddit" {
|
||||||
# claude-agent-service bearer pulled separately so its rotation cadence
|
# claude-agent-service bearer pulled separately so its rotation cadence
|
||||||
# is decoupled from the Reddit creds.
|
# is decoupled from the Reddit creds.
|
||||||
resource "kubernetes_manifest" "external_secret_examples_claude" {
|
resource "kubernetes_manifest" "external_secret_examples_claude" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -6,6 +6,9 @@
|
||||||
# (stacks/authentik/email-secret.tf) — one credential, one rotation point. The
|
# (stacks/authentik/email-secret.tf) — one credential, one rotation point. The
|
||||||
# reloader annotation rolls the Forgejo pod if the password is ever rotated.
|
# reloader annotation rolls the Forgejo pod if the password is ever rotated.
|
||||||
resource "kubernetes_manifest" "forgejo_email_secret" {
|
resource "kubernetes_manifest" "forgejo_email_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -3,6 +3,9 @@ variable "tls_secret_name" {
|
||||||
sensitive = true
|
sensitive = true
|
||||||
}
|
}
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -18,6 +18,9 @@ resource "kubernetes_namespace" "immich" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -57,16 +57,19 @@ resource "kubernetes_namespace" "goldmane_edge_aggregator" {
|
||||||
# -----------------------------------------------------------------------------
|
# -----------------------------------------------------------------------------
|
||||||
# The aggregator dials goldmane:7443 over mutual TLS. We mint a client cert
|
# The aggregator dials goldmane:7443 over mutual TLS. We mint a client cert
|
||||||
# signed by the Tigera CA (the same CA that issues Goldmane's serving cert), so
|
# signed by the Tigera CA (the same CA that issues Goldmane's serving cert), so
|
||||||
# Goldmane trusts the client and the client trusts Goldmane's server cert via
|
# Goldmane requires mutual TLS on :7443 and verifies the client cert chains to
|
||||||
# the published CA bundle.
|
# the Tigera CA — it does NOT authorize by client identity, so ANY Tigera-CA-
|
||||||
#
|
# signed cert is accepted. Rather than copy the Tigera CA PRIVATE KEY into TF
|
||||||
# The Tigera CA private key lives in the `tigera-ca-private` Secret in
|
# state to mint our own (a needless CA-key exposure; the hashicorp/tls provider
|
||||||
# tigera-operator (Opaque; verified keys: tls.crt + tls.key). The stack's apply
|
# is also incompatible with this repo's global generate-providers/lockfile
|
||||||
# identity needs RBAC get on that secret — see the Role/RoleBinding below.
|
# pattern), we REUSE the operator-minted, Tigera-CA-signed client cert
|
||||||
data "kubernetes_secret" "tigera_ca" {
|
# `whisker-backend-key-pair` (calico-system). We never touch the CA key.
|
||||||
|
# Trade-off: if the operator rotates that cert, re-apply to re-sync (hardening
|
||||||
|
# follow-up: mint an own-identity cert in-namespace if Whisker is ever removed).
|
||||||
|
data "kubernetes_secret" "whisker_backend" {
|
||||||
metadata {
|
metadata {
|
||||||
name = "tigera-ca-private"
|
name = "whisker-backend-key-pair"
|
||||||
namespace = "tigera-operator"
|
namespace = "calico-system"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -93,46 +96,11 @@ resource "kubernetes_config_map" "tigera_ca_bundle" {
|
||||||
data = data.kubernetes_config_map.tigera_ca_bundle.data
|
data = data.kubernetes_config_map.tigera_ca_bundle.data
|
||||||
}
|
}
|
||||||
|
|
||||||
# Client private key.
|
# Client cert + key for mTLS to goldmane:7443, mounted at TLS_CERT_PATH /
|
||||||
resource "tls_private_key" "goldmane_client" {
|
# TLS_KEY_PATH defaults (/etc/goldmane-client-tls/tls.crt and .../tls.key).
|
||||||
algorithm = "RSA"
|
# Sourced verbatim from the operator's whisker-backend client key-pair (read
|
||||||
rsa_bits = 2048
|
# above) — already Tigera-CA-signed, which is all Goldmane verifies. No CA key
|
||||||
}
|
# is touched and no cross-namespace CA RBAC is needed.
|
||||||
|
|
||||||
# CSR for the client cert. CN identifies the client; the service-DNS SAN mirrors
|
|
||||||
# how Felix/whisker-backend present a client identity to Goldmane.
|
|
||||||
resource "tls_cert_request" "goldmane_client" {
|
|
||||||
private_key_pem = tls_private_key.goldmane_client.private_key_pem
|
|
||||||
subject {
|
|
||||||
common_name = "goldmane-edge-aggregator"
|
|
||||||
organization = "goldmane-edge-aggregator"
|
|
||||||
}
|
|
||||||
dns_names = [
|
|
||||||
"goldmane-edge-aggregator",
|
|
||||||
"goldmane-edge-aggregator.goldmane-edge-aggregator.svc.cluster.local",
|
|
||||||
]
|
|
||||||
}
|
|
||||||
|
|
||||||
# Sign the CSR with the Tigera CA. 10-year validity (87600h): re-apply rotates
|
|
||||||
# it well before expiry; a long horizon avoids surprise mTLS outages from an
|
|
||||||
# unattended stack. The Tigera CA itself outlives this (operator-managed).
|
|
||||||
resource "tls_locally_signed_cert" "goldmane_client" {
|
|
||||||
cert_request_pem = tls_cert_request.goldmane_client.cert_request_pem
|
|
||||||
ca_private_key_pem = data.kubernetes_secret.tigera_ca.data["tls.key"]
|
|
||||||
ca_cert_pem = data.kubernetes_secret.tigera_ca.data["tls.crt"]
|
|
||||||
|
|
||||||
validity_period_hours = 87600 # 10y
|
|
||||||
early_renewal_hours = 720 # re-sign on apply when <30d remain
|
|
||||||
|
|
||||||
allowed_uses = [
|
|
||||||
"client_auth",
|
|
||||||
"digital_signature",
|
|
||||||
"key_encipherment",
|
|
||||||
]
|
|
||||||
}
|
|
||||||
|
|
||||||
# The minted client cert + key, mounted at TLS_CERT_PATH / TLS_KEY_PATH defaults
|
|
||||||
# (/etc/goldmane-client-tls/tls.crt and .../tls.key).
|
|
||||||
resource "kubernetes_secret" "goldmane_client_tls" {
|
resource "kubernetes_secret" "goldmane_client_tls" {
|
||||||
metadata {
|
metadata {
|
||||||
name = "goldmane-client-tls"
|
name = "goldmane-client-tls"
|
||||||
|
|
@ -140,47 +108,8 @@ resource "kubernetes_secret" "goldmane_client_tls" {
|
||||||
}
|
}
|
||||||
type = "Opaque"
|
type = "Opaque"
|
||||||
data = {
|
data = {
|
||||||
"tls.crt" = tls_locally_signed_cert.goldmane_client.cert_pem
|
"tls.crt" = data.kubernetes_secret.whisker_backend.data["tls.crt"]
|
||||||
"tls.key" = tls_private_key.goldmane_client.private_key_pem
|
"tls.key" = data.kubernetes_secret.whisker_backend.data["tls.key"]
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
# Narrow RBAC so this stack's apply identity (and ESO/Reloader are unaffected)
|
|
||||||
# can `get` the Tigera CA private key in tigera-operator. The data source above
|
|
||||||
# reads it at apply time; this Role/RoleBinding documents + grants that access
|
|
||||||
# rather than relying on cluster-admin. The subject is the same SA the other
|
|
||||||
# Tier-1 stacks apply as (claude-agent/terraform-state for headless, the human
|
|
||||||
# OIDC identity interactively) — both are cluster-admin today, so this is
|
|
||||||
# belt-and-braces / least-privilege intent for when apply identities tighten.
|
|
||||||
resource "kubernetes_role" "read_tigera_ca" {
|
|
||||||
metadata {
|
|
||||||
name = "goldmane-edge-aggregator-read-tigera-ca"
|
|
||||||
namespace = "tigera-operator"
|
|
||||||
}
|
|
||||||
rule {
|
|
||||||
api_groups = [""]
|
|
||||||
resources = ["secrets"]
|
|
||||||
resource_names = ["tigera-ca-private"]
|
|
||||||
verbs = ["get"]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
resource "kubernetes_role_binding" "read_tigera_ca" {
|
|
||||||
metadata {
|
|
||||||
name = "goldmane-edge-aggregator-read-tigera-ca"
|
|
||||||
namespace = "tigera-operator"
|
|
||||||
}
|
|
||||||
role_ref {
|
|
||||||
api_group = "rbac.authorization.k8s.io"
|
|
||||||
kind = "Role"
|
|
||||||
name = kubernetes_role.read_tigera_ca.metadata[0].name
|
|
||||||
}
|
|
||||||
# The headless apply identity (claude-agent-service runs Tier-1 applies as the
|
|
||||||
# `terraform-state` Vault K8s role in the claude-agent namespace).
|
|
||||||
subject {
|
|
||||||
kind = "ServiceAccount"
|
|
||||||
name = "default"
|
|
||||||
namespace = "claude-agent"
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -227,6 +156,11 @@ resource "kubernetes_job" "db_init" {
|
||||||
timeouts {
|
timeouts {
|
||||||
create = "2m"
|
create = "2m"
|
||||||
}
|
}
|
||||||
|
lifecycle {
|
||||||
|
# KYVERNO_LIFECYCLE_V1: Kyverno injects dns_config (ndots=2); ignore it so
|
||||||
|
# this idempotent Job isn't replaced (Jobs are immutable) on every apply.
|
||||||
|
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
# ExternalSecret projecting the Vault-rotated (7-day) credential into a K8s
|
# ExternalSecret projecting the Vault-rotated (7-day) credential into a K8s
|
||||||
|
|
@ -234,6 +168,9 @@ resource "kubernetes_job" "db_init" {
|
||||||
# place in the CNPG connection allowlist are added in stacks/vault/main.tf
|
# place in the CNPG connection allowlist are added in stacks/vault/main.tf
|
||||||
# (see this stack's terragrunt.hcl note). remoteRef key: static-creds/pg-goldmane-edges.
|
# (see this stack's terragrunt.hcl note). remoteRef key: static-creds/pg-goldmane-edges.
|
||||||
resource "kubernetes_manifest" "db_external_secret" {
|
resource "kubernetes_manifest" "db_external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -276,6 +213,9 @@ resource "kubernetes_manifest" "db_external_secret" {
|
||||||
# into this namespace as SLACK_WEBHOOK_URL via an ExternalSecret (no new
|
# into this namespace as SLACK_WEBHOOK_URL via an ExternalSecret (no new
|
||||||
# webhook). The digest CronJob defaults to #security.
|
# webhook). The digest CronJob defaults to #security.
|
||||||
resource "kubernetes_manifest" "slack_external_secret" {
|
resource "kubernetes_manifest" "slack_external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -295,7 +235,7 @@ resource "kubernetes_manifest" "slack_external_secret" {
|
||||||
data = [{
|
data = [{
|
||||||
secretKey = "SLACK_WEBHOOK_URL"
|
secretKey = "SLACK_WEBHOOK_URL"
|
||||||
remoteRef = {
|
remoteRef = {
|
||||||
key = "monitoring"
|
key = "viktor"
|
||||||
property = "alertmanager_slack_api_url"
|
property = "alertmanager_slack_api_url"
|
||||||
}
|
}
|
||||||
}]
|
}]
|
||||||
|
|
@ -516,7 +456,12 @@ resource "kubernetes_cron_job_v1" "digest" {
|
||||||
}
|
}
|
||||||
env {
|
env {
|
||||||
name = "SLACK_CHANNEL"
|
name = "SLACK_CHANNEL"
|
||||||
value = "#security"
|
# Posts to #alerts. The dedicated #security channel was abandoned
|
||||||
|
# 2026-06-25 — the shared alertmanager_slack_api_url webhook's
|
||||||
|
# Slack app isn't a member of it (channel override 404s), so all
|
||||||
|
# Slack (incl. alertmanager's security-lane alerts) consolidated
|
||||||
|
# to #alerts. See docs/runbooks/goldmane-flow-trail.md.
|
||||||
|
value = "#alerts"
|
||||||
}
|
}
|
||||||
|
|
||||||
resources {
|
resources {
|
||||||
|
|
|
||||||
|
|
@ -5,6 +5,9 @@ variable "tls_secret_name" {
|
||||||
variable "nfs_server" { type = string }
|
variable "nfs_server" { type = string }
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -208,6 +208,9 @@ module "ingress" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -250,6 +250,9 @@ module "ingress_test" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret_db" {
|
resource "kubernetes_manifest" "external_secret_db" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -284,6 +287,9 @@ resource "kubernetes_manifest" "external_secret_db" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret_kv" {
|
resource "kubernetes_manifest" "external_secret_kv" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -37,6 +37,9 @@ module "tls_secret" {
|
||||||
# --- Secrets (ESO from Vault) ---
|
# --- Secrets (ESO from Vault) ---
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -162,6 +162,9 @@ resource "kubernetes_resource_quota" "immich" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -20,6 +20,9 @@ resource "kubernetes_namespace" "insta2spotify" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -35,6 +35,14 @@ resource "kubernetes_namespace" "instagram_poster" {
|
||||||
# - immich_tag_instagram (optional — auto-resolved if missing)
|
# - immich_tag_instagram (optional — auto-resolved if missing)
|
||||||
# - immich_tag_posted (optional — auto-resolved if missing)
|
# - immich_tag_posted (optional — auto-resolved if missing)
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
# The external-secrets controller takes server-side-apply ownership of
|
||||||
|
# .spec.refreshInterval, so a plain TF apply conflicts. force_conflicts lets
|
||||||
|
# TF win (values match, so it's stable) — same pattern as grafana/woodpecker/
|
||||||
|
# traefik/k8s-version-upgrade. Surfaced 2026-06-24 by the first IG apply since
|
||||||
|
# the ESO v1 migration (the scale-to-0 push).
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -139,6 +147,11 @@ resource "kubernetes_manifest" "external_secret" {
|
||||||
# ESO refreshes the K8s Secret every 15m. `reloader.stakater.com/match`
|
# ESO refreshes the K8s Secret every 15m. `reloader.stakater.com/match`
|
||||||
# bounces the pod when the password changes.
|
# bounces the pod when the password changes.
|
||||||
resource "kubernetes_manifest" "benchmark_db_external_secret" {
|
resource "kubernetes_manifest" "benchmark_db_external_secret" {
|
||||||
|
# See external_secret above — ESO owns .spec.refreshInterval; force_conflicts
|
||||||
|
# lets the TF apply win instead of erroring on the field-manager conflict.
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -227,7 +240,11 @@ resource "kubernetes_deployment" "instagram_poster" {
|
||||||
}
|
}
|
||||||
|
|
||||||
spec {
|
spec {
|
||||||
replicas = 1
|
# Scaled to 0 (2026-06-24): Instagram Graph integration is unused and its
|
||||||
|
# ExternalSecret is dead (missing ig_graph_long_lived_token /
|
||||||
|
# ig_business_account_id in Vault secret/instagram-poster). Set back to 1
|
||||||
|
# after minting a Meta long-lived token and populating those keys.
|
||||||
|
replicas = 0
|
||||||
# RWO PVC — cannot rolling-update.
|
# RWO PVC — cannot rolling-update.
|
||||||
strategy {
|
strategy {
|
||||||
type = "Recreate"
|
type = "Recreate"
|
||||||
|
|
|
||||||
|
|
@ -41,6 +41,9 @@ resource "kubernetes_namespace" "job_hunter" {
|
||||||
# digest_to_address — where the weekly digest goes
|
# digest_to_address — where the weekly digest goes
|
||||||
# digest_from_address — From: header for the digest
|
# digest_from_address — From: header for the digest
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -105,6 +108,9 @@ resource "kubernetes_manifest" "external_secret" {
|
||||||
# DB credentials from Vault database engine (7-day rotation).
|
# DB credentials from Vault database engine (7-day rotation).
|
||||||
# Template builds the asyncpg DSN consumed by the FastAPI app as DB_CONNECTION_STRING.
|
# Template builds the asyncpg DSN consumed by the FastAPI app as DB_CONNECTION_STRING.
|
||||||
resource "kubernetes_manifest" "db_external_secret" {
|
resource "kubernetes_manifest" "db_external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -325,6 +331,9 @@ resource "kubernetes_service" "job_hunter" {
|
||||||
# references it as $__env{JOB_HUNTER_PG_PASSWORD}. Reloader restarts
|
# references it as $__env{JOB_HUNTER_PG_PASSWORD}. Reloader restarts
|
||||||
# Grafana whenever ESO updates this secret (every 7d on rotation).
|
# Grafana whenever ESO updates this secret (every 7d on rotation).
|
||||||
resource "kubernetes_manifest" "grafana_job_hunter_db_external_secret" {
|
resource "kubernetes_manifest" "grafana_job_hunter_db_external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -5,6 +5,9 @@
|
||||||
# -----------------------------------------------------------------------------
|
# -----------------------------------------------------------------------------
|
||||||
|
|
||||||
resource "kubernetes_manifest" "oauth2_proxy_externalsecret" {
|
resource "kubernetes_manifest" "oauth2_proxy_externalsecret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -416,6 +416,39 @@ phase_preflight() {
|
||||||
fi
|
fi
|
||||||
fi
|
fi
|
||||||
|
|
||||||
|
# 4b. apiserver-OIDC drift check (backstop for the rbac stack's kubeadm-config
|
||||||
|
# reconciliation). A `kubeadm upgrade` REGENERATES the apiserver manifest from
|
||||||
|
# kubeadm-config; if kubeadm-config still carries the legacy single-issuer
|
||||||
|
# --oidc-* args instead of --authentication-config, the regenerated apiserver
|
||||||
|
# loses structured multi-issuer auth → kubectl + dashboard SSO break AFTER the
|
||||||
|
# upgrade. This is RECOVERABLE (the apiserver does NOT crash — verified by an
|
||||||
|
# isolated repro 2026-06-24; the chain's post-master restore.sh re-adds the flag,
|
||||||
|
# and the rbac stack reconciles kubeadm-config so it won't recur) — so this is an
|
||||||
|
# ALERT, not a block. (NB the 2026-06-24 stall was NOT this — it was etcd IO
|
||||||
|
# starvation; see docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md.)
|
||||||
|
# Skip on an at-target master (resume — no apiserver regen).
|
||||||
|
if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
|
||||||
|
local apiserver_diff
|
||||||
|
apiserver_diff=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" "sudo kubeadm upgrade diff v$TARGET_VERSION 2>/dev/null" || true)
|
||||||
|
if echo "$apiserver_diff" | grep -qE '^-[[:space:]].*--authentication-config'; then
|
||||||
|
slack "WARN preflight — kubeadm upgrade will DROP --authentication-config (kubeadm-config OIDC drift). SSO breaks post-upgrade until restore.sh re-adds it; re-apply the rbac stack to reconcile kubeadm-config. Proceeding (recoverable, not a crash)."
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
# 4c. Reclaim kubeadm scratch on master. `kubeadm upgrade apply` dumps a full
|
||||||
|
# ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
|
||||||
|
# every etcd upgrade and NEVER cleans it up — 145 dirs / 28GB had accumulated by
|
||||||
|
# 2026-06-24, pushing master root fs to 73% (image-GC churn + extra write IO on
|
||||||
|
# the shared HDD where etcd lives — a contributor to the etcd IO starvation that
|
||||||
|
# stalled that run, see post-mortem). Real etcd backups go to NFS, so these are
|
||||||
|
# throwaway. Prune ones >3 days old (keeps a short rollback window). Best-effort;
|
||||||
|
# never aborts the chain.
|
||||||
|
if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
|
||||||
|
ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" \
|
||||||
|
"sudo find /etc/kubernetes/tmp -maxdepth 1 -type d \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) -mtime +3 -exec rm -rf {} + 2>/dev/null; echo -n 'master root after prune: '; df -h / | awk 'NR==2{print \$5\" used, \"\$4\" free\"}'" \
|
||||||
|
|| echo "kubeadm-scratch prune skipped (ssh/df failed) — non-fatal"
|
||||||
|
fi
|
||||||
|
|
||||||
# 5. Push in-flight + started_timestamp metrics + ns annotations
|
# 5. Push in-flight + started_timestamp metrics + ns annotations
|
||||||
$KUBECTL annotate ns "$NS" \
|
$KUBECTL annotate ns "$NS" \
|
||||||
"viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \
|
"viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \
|
||||||
|
|
|
||||||
|
|
@ -304,6 +304,9 @@ resource "kubernetes_config_map" "kms_slack_notifier" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "kms_slack_external_secret" {
|
resource "kubernetes_manifest" "kms_slack_external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -29,6 +29,9 @@ resource "kubernetes_namespace" "linkwarden" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -57,6 +60,9 @@ resource "kubernetes_manifest" "external_secret" {
|
||||||
|
|
||||||
# DB credentials from Vault database engine (rotated every 24h)
|
# DB credentials from Vault database engine (rotated every 24h)
|
||||||
resource "kubernetes_manifest" "db_external_secret" {
|
resource "kubernetes_manifest" "db_external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -800,6 +800,9 @@ resource "kubernetes_service" "mailserver_proxy" {
|
||||||
# `EMAIL_MONITOR_IMAP_PASSWORD` so the CronJob can consume them via a single
|
# `EMAIL_MONITOR_IMAP_PASSWORD` so the CronJob can consume them via a single
|
||||||
# `env_from { secret_ref {} }` block.
|
# `env_from { secret_ref {} }` block.
|
||||||
resource "kubernetes_manifest" "email_roundtrip_monitor_secrets" {
|
resource "kubernetes_manifest" "email_roundtrip_monitor_secrets" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -25,6 +25,9 @@ resource "kubernetes_namespace" "matrix" {
|
||||||
# flipped to false. The token stays in Vault so registration can be re-opened
|
# flipped to false. The token stays in Vault so registration can be re-opened
|
||||||
# later (e.g. to add family) without regenerating it.
|
# later (e.g. to add family) without regenerating it.
|
||||||
resource "kubernetes_manifest" "secrets_external_secret" {
|
resource "kubernetes_manifest" "secrets_external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -130,6 +130,11 @@ resource "kubernetes_deployment" "blackbox_exporter" {
|
||||||
labels = {
|
labels = {
|
||||||
app = "blackbox-exporter"
|
app = "blackbox-exporter"
|
||||||
tier = var.tier
|
tier = var.tier
|
||||||
|
# ADR-0014 service identity: monitoring is a multi-Service namespace, so
|
||||||
|
# the namespace alone can't attribute Goldmane flows. Value = the
|
||||||
|
# fronting Service name (kubernetes_service.blackbox_exporter is named
|
||||||
|
# "blackbox-exporter").
|
||||||
|
"service-identity" = "blackbox-exporter"
|
||||||
}
|
}
|
||||||
annotations = {
|
annotations = {
|
||||||
"reloader.stakater.com/search" = "true"
|
"reloader.stakater.com/search" = "true"
|
||||||
|
|
@ -146,6 +151,10 @@ resource "kubernetes_deployment" "blackbox_exporter" {
|
||||||
metadata {
|
metadata {
|
||||||
labels = {
|
labels = {
|
||||||
app = "blackbox-exporter"
|
app = "blackbox-exporter"
|
||||||
|
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
|
||||||
|
# disambiguating identity must live on the pod template (not just
|
||||||
|
# the Deployment metadata above). Not in selector → no replace.
|
||||||
|
"service-identity" = "blackbox-exporter"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
spec {
|
spec {
|
||||||
|
|
|
||||||
|
|
@ -5,6 +5,11 @@ resource "kubernetes_deployment" "goflow2" {
|
||||||
labels = {
|
labels = {
|
||||||
app = "goflow2"
|
app = "goflow2"
|
||||||
tier = var.tier
|
tier = var.tier
|
||||||
|
# ADR-0014 service identity: monitoring is a multi-Service namespace, so
|
||||||
|
# the namespace alone can't attribute Goldmane flows. Value = the
|
||||||
|
# fronting Service name (kubernetes_service.goflow2 — the metrics svc; the
|
||||||
|
# goflow2-netflow NodePort is the same pod by another name).
|
||||||
|
"service-identity" = "goflow2"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
spec {
|
spec {
|
||||||
|
|
@ -18,6 +23,10 @@ resource "kubernetes_deployment" "goflow2" {
|
||||||
metadata {
|
metadata {
|
||||||
labels = {
|
labels = {
|
||||||
app = "goflow2"
|
app = "goflow2"
|
||||||
|
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
|
||||||
|
# disambiguating identity must live on the pod template (not just
|
||||||
|
# the Deployment metadata above). Not in selector → no replace.
|
||||||
|
"service-identity" = "goflow2"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
spec {
|
spec {
|
||||||
|
|
|
||||||
|
|
@ -71,6 +71,15 @@ resource "kubernetes_persistent_volume" "alertmanager_pv" {
|
||||||
# DB credentials from Vault database engine (rotated automatically)
|
# DB credentials from Vault database engine (rotated automatically)
|
||||||
# Provides GF_DATABASE_PASSWORD that auto-updates when password rotates
|
# Provides GF_DATABASE_PASSWORD that auto-updates when password rotates
|
||||||
resource "kubernetes_manifest" "grafana_db_creds" {
|
resource "kubernetes_manifest" "grafana_db_creds" {
|
||||||
|
# The external-secrets controller takes server-side-apply ownership of
|
||||||
|
# .spec.refreshInterval, so a plain TF apply conflicts ("conflict with
|
||||||
|
# external-secrets ... .spec.refreshInterval"). force_conflicts lets TF win
|
||||||
|
# (values match, so it's stable) — same pattern as the woodpecker/traefik/
|
||||||
|
# k8s-version-upgrade stacks. Surfaced 2026-06-24: the first monitoring apply
|
||||||
|
# in a while exposed this latent conflict (prior pushes were docs-only).
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -47,6 +47,10 @@ resource "kubernetes_deployment" "idrac-redfish" {
|
||||||
labels = {
|
labels = {
|
||||||
app = "idrac-redfish-exporter"
|
app = "idrac-redfish-exporter"
|
||||||
tier = var.tier
|
tier = var.tier
|
||||||
|
# ADR-0014 service identity: monitoring is a multi-Service namespace, so
|
||||||
|
# the namespace alone can't attribute Goldmane flows. Value = the
|
||||||
|
# fronting Service name (kubernetes_service.idrac-redfish-exporter).
|
||||||
|
"service-identity" = "idrac-redfish-exporter"
|
||||||
}
|
}
|
||||||
annotations = {
|
annotations = {
|
||||||
"reloader.stakater.com/search" = "true"
|
"reloader.stakater.com/search" = "true"
|
||||||
|
|
@ -63,6 +67,10 @@ resource "kubernetes_deployment" "idrac-redfish" {
|
||||||
metadata {
|
metadata {
|
||||||
labels = {
|
labels = {
|
||||||
app = "idrac-redfish-exporter"
|
app = "idrac-redfish-exporter"
|
||||||
|
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
|
||||||
|
# disambiguating identity must live on the pod template (not just
|
||||||
|
# the Deployment metadata above). Not in selector → no replace.
|
||||||
|
"service-identity" = "idrac-redfish-exporter"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
spec {
|
spec {
|
||||||
|
|
|
||||||
|
|
@ -60,9 +60,10 @@ alertmanager:
|
||||||
receiver: slack-warning
|
receiver: slack-warning
|
||||||
routes:
|
routes:
|
||||||
# Wave 1 security lane — matches alerts that set `lane = "security"`
|
# Wave 1 security lane — matches alerts that set `lane = "security"`
|
||||||
# (K2-K9, V1-V7, S1 from Loki ruler). Routes to dedicated #security
|
# (K2-K9, V1-V7, S1 from Loki ruler). Posts via the slack-security
|
||||||
# channel regardless of severity. Defined first + continue: false so
|
# receiver (distinct [SECURITY] styling) to #alerts; the dedicated
|
||||||
# security alerts never fall through to the generic #alerts channel.
|
# #security channel was abandoned 2026-06-25 (shared webhook can't reach
|
||||||
|
# it). continue: false so they get the security-styled receiver.
|
||||||
- receiver: slack-security
|
- receiver: slack-security
|
||||||
group_wait: 10s
|
group_wait: 10s
|
||||||
group_interval: 1m
|
group_interval: 1m
|
||||||
|
|
@ -235,7 +236,10 @@ alertmanager:
|
||||||
- name: slack-security
|
- name: slack-security
|
||||||
slack_configs:
|
slack_configs:
|
||||||
- send_resolved: true
|
- send_resolved: true
|
||||||
channel: "#security"
|
# #security was abandoned 2026-06-25 — the shared incoming webhook's
|
||||||
|
# Slack app isn't a member of it (channel override 404s). Security-lane
|
||||||
|
# alerts keep their distinct [SECURITY] styling but post to #alerts.
|
||||||
|
channel: "#alerts"
|
||||||
color: '{{ if eq .Status "firing" }}{{ if eq (index .Alerts 0).Labels.severity "critical" }}danger{{ else }}warning{{ end }}{{ else }}good{{ end }}'
|
color: '{{ if eq .Status "firing" }}{{ if eq (index .Alerts 0).Labels.severity "critical" }}danger{{ else }}warning{{ end }}{{ else }}good{{ end }}'
|
||||||
fallback: '{{ if eq .Status "firing" }}[SECURITY-{{ (index .Alerts 0).Labels.severity | toUpper }}]{{ else }}[RESOLVED]{{ end }}: {{ .GroupLabels.alertname }}'
|
fallback: '{{ if eq .Status "firing" }}[SECURITY-{{ (index .Alerts 0).Labels.severity | toUpper }}]{{ else }}[RESOLVED]{{ end }}: {{ .GroupLabels.alertname }}'
|
||||||
title: '{{ if eq .Status "firing" }}[SECURITY/{{ (index .Alerts 0).Labels.severity | toUpper }}]{{ else }}[RESOLVED]{{ end }} {{ .GroupLabels.alertname }} ({{ .Alerts | len }})'
|
title: '{{ if eq .Status "firing" }}[SECURITY/{{ (index .Alerts 0).Labels.severity | toUpper }}]{{ else }}[RESOLVED]{{ end }} {{ .GroupLabels.alertname }} ({{ .Alerts | len }})'
|
||||||
|
|
@ -253,6 +257,19 @@ alertmanager:
|
||||||
memory: 256Mi
|
memory: 256Mi
|
||||||
limits:
|
limits:
|
||||||
memory: 256Mi
|
memory: 256Mi
|
||||||
|
# kube-state-metrics idles ~45Mi but briefly spikes past the monitoring-namespace
|
||||||
|
# LimitRange default (256Mi) during a full object relist (450+ pods, 150+ jobs, all
|
||||||
|
# secrets/endpoints), so it gets OOMKilled. Each OOM blacks out KSM-derived series
|
||||||
|
# for ~5min and cascades into a wall of false "<svc>Down" criticals that self-resolve
|
||||||
|
# (storm 2026-06-26 08:42). Burstable: low request (minimal reservation) + a 512Mi
|
||||||
|
# limit to absorb the relist peak. No CPU limit (cluster-wide policy).
|
||||||
|
kube-state-metrics:
|
||||||
|
resources:
|
||||||
|
requests:
|
||||||
|
cpu: 100m
|
||||||
|
memory: 64Mi
|
||||||
|
limits:
|
||||||
|
memory: 512Mi
|
||||||
prometheus-node-exporter:
|
prometheus-node-exporter:
|
||||||
enabled: true
|
enabled: true
|
||||||
resources:
|
resources:
|
||||||
|
|
@@ -1450,6 +1467,49 @@ serverFiles:
               Remediation: right-size top reservers via Goldilocks (immich-server,
               frigate, prometheus, pg-cluster, paperless) or bump VM RAM on
               k8s-node2/k8s-node3 from 32GB → 48GB to match node1.
+      # Goldmane edge-aggregator (ADR-0014 / infra #58, #61): the durable
+      # who-talks-to-whom trail. The aggregator pod has NO /metrics endpoint,
+      # so its health is inferred from kube-state-metrics signals — the trail
+      # must not silently die. Two failure modes are covered:
+      # - the aggregate Deployment stops consuming Goldmane's flow stream
+      # (AggregatorDown) → no new edges ever land in the goldmane_edges DB
+      # - the daily digest CronJob can't post new edges to Slack
+      # (DigestFailing) → edges still land but nobody is told.
+      # A freshness probe (max(last_seen) staleness) is intentionally NOT here:
+      # AggregatorDown is the agreed floor and needs no extra moving parts.
+      - name: Network Observability (Goldmane)
+        rules:
+          # Deployment has <1 available replica for 15m. kube-state-metrics
+          # keeps `kube_deployment_status_replicas_available` (metric-keep list
+          # in serverFiles below). The 15m window rides out a normal rollout /
+          # node drain without paging; a genuinely-dead aggregator means the
+          # edge trail has stopped recording and stays down.
+          - alert: AggregatorDown
+            expr: |
+              kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1
+              and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
+            for: 15m
+            labels:
+              severity: warning
+            annotations:
+              summary: "goldmane-edge-aggregator has no available replica — the who-talks-to-whom edge trail has stopped recording"
+              description: "The aggregate Deployment streams Calico Goldmane flows into the goldmane_edges CNPG DB. With 0 replicas, no new namespace-pair edges are captured. `kubectl -n goldmane-edge-aggregator describe deploy goldmane-edge-aggregator` + check the goldmane svc (calico-system) is reachable."
+          # The goldmane-edges-digest CronJob has a failed Job that started in
+          # the last 24h. Mirrors the generic JobFailed shape but scoped to the
+          # digest so it routes here. `for: 30m` rides out the apply/scrape
+          # transient; the digest runs daily so a real failure won't self-heal
+          # until the next run — surface it same-day rather than waiting 24h.
+          - alert: DigestFailing
+            expr: |
+              kube_job_status_failed{namespace="goldmane-edge-aggregator", job_name=~"goldmane-edges-digest.*"} > 0
+              and on(namespace, job_name)
+              (time() - kube_job_status_start_time{namespace="goldmane-edge-aggregator", job_name=~"goldmane-edges-digest.*"}) < 86400
+            for: 30m
+            labels:
+              severity: warning
+            annotations:
+              summary: "goldmane-edges-digest CronJob failing — new edges captured but not posted to #alerts"
+              description: "The daily edge digest Job {{ $labels.job_name }} failed. Edges may still be landing in the goldmane_edges DB but no one is being notified of new namespace-pairs. `kubectl -n goldmane-edge-aggregator logs job/{{ $labels.job_name }}`."
       - name: Infrastructure Health
         rules:
           - alert: HomeAssistantDown
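The two Goldmane alert expressions above each combine a state check with a time guard. A rough sketch of that instantaneous logic in Python, assuming scalar inputs already scraped from the named series; the function names are hypothetical, and the `for:` hold duration (15m / 30m) is deliberately not modelled:

```python
def aggregator_down(available_replicas: float, prometheus_uptime_s: float) -> bool:
    """Mirror of the AggregatorDown expr: no available replica, and Prometheus
    itself has been up for more than 900s (suppresses noise right after a restart)."""
    return available_replicas < 1 and prometheus_uptime_s > 900


def digest_failing(failed_pods: float, now: float, job_start_time: float) -> bool:
    """Mirror of the DigestFailing expr: the Job reports failures and it
    started within the last 24h (86400s)."""
    return failed_pods > 0 and (now - job_start_time) < 86400


# A dead aggregator on a long-running Prometheus matches the expression.
print(aggregator_down(available_replicas=0, prometheus_uptime_s=3600))
# A failed digest Job from two days ago no longer matches.
print(digest_failing(failed_pods=1, now=200_000, job_start_time=0))
```

In the real rule the expression must stay true for the whole `for:` window before the alert fires, so a brief rollout gap does not page.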
@@ -3190,7 +3250,8 @@ serverFiles:
       # means blackbox's fail_if_header_matches caught a Location -> Authentik:
       # a path-scoped `auth = "none"` carve-out was clobbered (TF revert, deploy,
       # ingress_factory default flipping back to auth="required"). lane=security
-      # routes it to the #security Slack receiver (Slack-only, no paging).
+      # routes it to the slack-security receiver, which posts to #alerts
+      # (#security abandoned 2026-06-25; Slack-only, no paging).
       - name: Authentik Walling Off
         rules:
           - alert: AuthentikWallingOffPublicPath

@@ -22,6 +22,10 @@ resource "kubernetes_deployment" "pve_exporter" {
     namespace = kubernetes_namespace.monitoring.metadata[0].name
     labels = {
       tier = var.tier
+      # ADR-0014 service identity: monitoring is a multi-Service namespace, so
+      # the namespace alone can't attribute Goldmane flows. Value = the
+      # fronting Service name (kubernetes_service.proxmox-exporter).
+      "service-identity" = "proxmox-exporter"
     }
   }

@@ -37,6 +41,10 @@ resource "kubernetes_deployment" "pve_exporter" {
       metadata {
         labels = {
           app = "proxmox-exporter"
+          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
+          # disambiguating identity must live on the pod template (not just
+          # the Deployment metadata above). Not in selector → no replace.
+          "service-identity" = "proxmox-exporter"
         }
       }

@@ -31,6 +31,10 @@ resource "kubernetes_deployment" "snmp-exporter" {
     labels = {
       app = "snmp-exporter"
       tier = var.tier
+      # ADR-0014 service identity: monitoring is a multi-Service namespace, so
+      # the namespace alone can't attribute Goldmane flows. Value = the
+      # fronting Service name (kubernetes_service.snmp-exporter).
+      "service-identity" = "snmp-exporter"
     }
     annotations = {
       "reloader.stakater.com/search" = "true"
@@ -47,6 +51,10 @@ resource "kubernetes_deployment" "snmp-exporter" {
       metadata {
         labels = {
           app = "snmp-exporter"
+          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
+          # disambiguating identity must live on the pod template (not just
+          # the Deployment metadata above). Not in selector → no replace.
+          "service-identity" = "snmp-exporter"
         }
       }
       spec {

@@ -26,6 +26,9 @@ resource "kubernetes_namespace" "n8n" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -53,6 +56,9 @@ resource "kubernetes_manifest" "external_secret" {
 }

 resource "kubernetes_manifest" "external_secret_claude_agent" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -84,6 +90,9 @@ resource "kubernetes_manifest" "external_secret_claude_agent" {
 # Shared secrets for the Immich → Telegram → Postiz Instagram pipeline.
 # Workflows in stacks/n8n/workflows/instagram-*.json reference these env vars.
 resource "kubernetes_manifest" "external_secret_instagram_pipeline" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -19,6 +19,9 @@ resource "kubernetes_namespace" "navidrome" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -21,6 +21,9 @@ resource "kubernetes_namespace" "netbox" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -58,6 +58,9 @@ resource "kubernetes_namespace" "nextcloud_todos" {
 # DB user: created in dbaas (null_resource.pg_nextcloud_todos_db); password
 # managed via the Vault database engine — see static-creds/pg-nextcloud-todos.
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -97,6 +100,9 @@ resource "kubernetes_manifest" "external_secret" {
 # Pre-req in dbaas: CNPG cluster has DB `nextcloud_todos`, role
 # `nextcloud_todos`, and Vault role `static-creds/pg-nextcloud-todos`.
 resource "kubernetes_manifest" "db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -125,6 +125,9 @@ resource "kubernetes_namespace" "nextcloud" {
 # other enrolled workload (immich, freshrss) — is both correct and drift-free.

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -154,6 +157,9 @@ resource "kubernetes_manifest" "external_secret" {
 # DB credentials from Vault database engine (rotated every 24h)
 # Nextcloud Helm chart reads password at runtime via existingSecret reference
 resource "kubernetes_manifest" "db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -4,6 +4,9 @@ variable "tls_secret_name" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -24,6 +24,9 @@ resource "kubernetes_namespace" "onlyoffice" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -37,6 +37,9 @@ module "tls_secret" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -5,6 +5,9 @@ variable "tls_secret_name" {
 variable "nfs_server" { type = string }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -26,6 +26,9 @@ resource "kubernetes_namespace" "paperless_ai" {
 # api_key — M2M key between the Node UI and the Python RAG service.
 # custom_api_key — placeholder bearer for llama-swap (no auth, field required).
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -28,6 +28,9 @@ resource "kubernetes_namespace" "paperless-mcp" {
 # Paperless API token (MCP -> paperless). Synced from Vault to a K8s Secret
 # by ESO; the pod reads it via secret_key_ref.
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -34,6 +34,9 @@ resource "kubernetes_namespace" "paperless-ngx" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -74,7 +77,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
     annotations = {
       "resize.topolvm.io/threshold" = "10%"
       "resize.topolvm.io/increase" = "100%"
-      "resize.topolvm.io/storage_limit" = "5Gi"
+      "resize.topolvm.io/storage_limit" = "30Gi"
     }
   }
   spec {
@@ -183,6 +186,20 @@ resource "kubernetes_deployment" "paperless-ngx" {
             name = "PAPERLESS_OCR_USER_ARGS"
             value = "{\"invalidate_digital_signatures\": true}"
           }
+          # OCR language(s) used per document. bul+eng covers the Bulgarian
+          # (Cyrillic) + English document set being imported (e.g. emo's
+          # archive). Multiple langs => tesseract tries all; "+" not " ".
+          env {
+            name = "PAPERLESS_OCR_LANGUAGE"
+            value = "bul+eng"
+          }
+          # Language data packages installed at container start (space-
+          # separated). The image ships eng (+deu/fra/ita/spa); bul must be
+          # apt-installed here so OCR_LANGUAGE=bul+eng resolves.
+          env {
+            name = "PAPERLESS_OCR_LANGUAGES"
+            value = "bul eng"
+          }
           volume_mount {
             name = "data"
             mount_path = "/usr/src/paperless/data"

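The two OCR variables above use different separators ("+" for the languages tesseract should try, a space for the packages to install), which is easy to get wrong. A small sketch of a consistency check, assuming the shipped language set quoted in the comment (eng, deu, fra, ita, spa); `missing_ocr_packages` is a hypothetical helper, not something Paperless provides:

```python
# Languages the comment says the image already ships (taken from the hunk above).
SHIPPED = {"eng", "deu", "fra", "ita", "spa"}


def missing_ocr_packages(ocr_language: str, ocr_languages: str,
                         shipped: frozenset = frozenset(SHIPPED)) -> list:
    """Return language codes requested in PAPERLESS_OCR_LANGUAGE ('+'-separated)
    that are neither shipped nor listed in PAPERLESS_OCR_LANGUAGES (space-separated)."""
    wanted = [code for code in ocr_language.split("+") if code]
    available = set(ocr_languages.split()) | set(shipped)
    return [code for code in wanted if code not in available]


# The values from this hunk resolve cleanly.
print(missing_ocr_packages("bul+eng", "bul eng"))
# Dropping bul from the install list would leave it unresolved.
print(missing_ocr_packages("bul+eng", "eng"))
```

A check like this could run in CI before an apply, so a language added to one variable but not the other is caught before the container starts.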
@@ -58,6 +58,9 @@ resource "kubernetes_namespace" "payslip_ingest" {
 # - `actualbudget_budget_sync_id`
 # (same as Viktor's sync_id)
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -133,6 +136,9 @@ resource "kubernetes_manifest" "external_secret" {
 # DB credentials from Vault database engine (rotated every 7 days).
 # Template builds the asyncpg DSN consumed by the FastAPI app as DB_CONNECTION_STRING.
 resource "kubernetes_manifest" "db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -450,6 +456,9 @@ resource "kubernetes_cron_job_v1" "actualbudget_payroll_sync" {
 # references it as $__env{PAYSLIPS_PG_PASSWORD}. Reloader restarts
 # Grafana whenever ESO updates this secret (every 7d on rotation).
 resource "kubernetes_manifest" "grafana_payslips_db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -28,6 +28,9 @@ resource "kubernetes_namespace" "phpipam" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -57,6 +60,9 @@ resource "kubernetes_manifest" "external_secret" {
 }

 resource "kubernetes_manifest" "external_secret_pfsense_ssh" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -86,6 +92,9 @@ resource "kubernetes_manifest" "external_secret_pfsense_ssh" {
 }

 resource "kubernetes_manifest" "external_secret_admin" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -19,6 +19,9 @@ resource "kubernetes_namespace" "plotting-book" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -72,6 +72,9 @@ resource "kubernetes_persistent_volume_claim" "uploads" {
 # Helm-owned Secret resource intact. The chart's deployment already wires
 # this Secret in via `envFrom: secretRef: postiz-secrets`.
 resource "kubernetes_manifest" "external_secret_jwt" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -188,17 +191,18 @@ resource "kubernetes_service" "temporal" {
 }

 # ──────────────────────────────────────────────────────────────────────────────
-# Backup CronJob — nightly pg_dump of the bundled postiz-postgresql to NFS.
+# Backup CronJob — nightly pg_dump of the postiz database to NFS.
 #
-# The bundled PostgreSQL StatefulSet uses local-path storage on the K8s node
-# OS disk (chart default), which is NOT covered by Layer 1 (LVM thin
-# snapshots) or Layer 2 (sda file backup) of the 3-2-1 pipeline. A pg_dump
-# CronJob writing to /srv/nfs/postiz-backup/ closes the gap: dumps land on
-# Proxmox host NFS → covered by inotify-driven offsite sync to Synology.
-# Three databases are dumped: postiz (app data), temporal (workflow engine),
-# temporal_visibility (workflow search). Bitnami chart-default credentials
-# are used — same creds the Postiz pod itself uses, scoped to the postiz
-# namespace via ClusterIP-only Services.
+# Postiz's database lives on the SHARED CNPG cluster
+# (pg-cluster-rw.dbaas.svc.cluster.local/postiz) — the chart's bundled
+# PostgreSQL was dropped in the CNPG migration, so the old `postiz-postgresql`
+# host no longer resolves (this CronJob was failing on it for weeks —
+# BackupCronJobFailed; repointed 2026-06-26). The dump now connects via the
+# app's own DATABASE_URL (from the postiz-secrets Secret) so it always tracks
+# the live host + credentials. Dumps land on /srv/nfs/postiz-backup/ → covered
+# by inotify-driven offsite sync to Synology, closing the gap (CNPG data PVCs
+# live in dbaas, excluded from the LVM-snapshot leg). Only the postiz app DB is
+# dumped here; temporal's DBs are not.
 # ──────────────────────────────────────────────────────────────────────────────

 module "nfs_backup_host" {
@@ -248,10 +252,9 @@ resource "kubernetes_cron_job_v1" "postgres_backup" {
                 STATUS=0
                 for db in postiz; do
                   echo "Dumping $db..."
-                  if PGPASSWORD=postiz-password pg_dump -h postiz-postgresql -U postiz \
+                  if pg_dump -d "$DATABASE_URL" \
                     --format=custom --compress=6 \
-                    --file="$BACKUP_DIR/$db-$TIMESTAMP.dump" \
-                    "$db"; then
+                    --file="$BACKUP_DIR/$db-$TIMESTAMP.dump"; then
                     echo " OK: $db ($(du -h "$BACKUP_DIR/$db-$TIMESTAMP.dump" | cut -f1))"
                   else
                     echo " FAIL: $db" >&2
@@ -268,6 +271,18 @@ resource "kubernetes_cron_job_v1" "postgres_backup" {
                 exit $STATUS
               EOT
             ]
+            # Connect to the live CNPG database using the app's own
+            # DATABASE_URL (postgresql://postiz:…@pg-cluster-rw.dbaas…/postiz)
+            # instead of a hardcoded host/password — survives credential changes.
+            env {
+              name = "DATABASE_URL"
+              value_from {
+                secret_key_ref {
+                  name = "postiz-secrets"
+                  key = "DATABASE_URL"
+                }
+              }
+            }
             volume_mount {
               name = "backup"
               mount_path = "/backup"

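The backup change above swaps a hardcoded host and password for a single connection URL passed to `pg_dump -d`. A minimal sketch of the same idea in Python, assuming a made-up sample URL (the real value comes from the `postiz-secrets` Secret and is elided in the source); `pg_dump_argv` and `redacted_target` are hypothetical helpers for illustration:

```python
from urllib.parse import urlsplit


def pg_dump_argv(database_url: str, backup_dir: str, db: str, timestamp: str) -> list:
    """Build the pg_dump argument list used by the new script: the connection
    URL carries host, user, password and database, so no -h/-U/PGPASSWORD."""
    return [
        "pg_dump", "-d", database_url,
        "--format=custom", "--compress=6",
        f"--file={backup_dir}/{db}-{timestamp}.dump",
    ]


def redacted_target(database_url: str) -> str:
    """Return host/database for logging without leaking the credentials."""
    parts = urlsplit(database_url)
    return f"{parts.hostname}{parts.path}"


# Sample URL only; the password here is a placeholder, not a real credential.
url = "postgresql://postiz:example@pg-cluster-rw.dbaas.svc.cluster.local/postiz"
print(pg_dump_argv(url, "/backup", "postiz", "20260626")[-1])
print(redacted_target(url))
```

Reading the URL from the same Secret the application uses means a host or credential change only has to happen in one place, which is the point the comment in the hunk makes.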
@@ -207,6 +207,9 @@ resource "kubernetes_cluster_role_binding" "pve_snapshot_admin" {
 # Creates K8s Secret "proxmox-csi-encryption" in kube-system from Vault KV.
 # Referenced by the proxmox-lvm-encrypted StorageClass for node-stage and node-expand.
 resource "kubernetes_manifest" "external_secret_encryption" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -10,16 +10,29 @@
 # match the existing RBAC subjects (kind: User, name: <raw email>; group names
 # verbatim). Do NOT add a prefix or existing bindings break.
 #
-# DRIFT WARNING: this edits the kube-apiserver static-pod manifest on the single
-# master. A `kubeadm upgrade` regenerates that manifest and DROPS this flag (this
-# is exactly how OIDC silently broke before — the flag was wiped and the
-# content-hash trigger never re-fired). After any k8s control-plane upgrade,
-# re-apply the rbac stack to restore apiserver OIDC. See
-# docs/plans/2026-06-04-k8s-dashboard-sso-design.md.
+# DRIFT WARNING (and how it's now handled): apiserver auth lives in THREE places
+# that must stay in sync, because a `kubeadm upgrade` REGENERATES the static-pod
+# manifest from kubeadm-config:
+# 1. /etc/kubernetes/pki/auth-config.yaml — the structured authn file
+# 2. the live kube-apiserver static-pod manifest — references it via the flag
+# 3. the kubeadm-config ClusterConfiguration CM — what kubeadm regenerates from
+# Originally only (1)+(2) were managed, so every kubeadm upgrade rewrote the
+# manifest from the STALE CM, reverting --authentication-config to single-issuer
+# --oidc-* flags. The consequence is SSO breakage AFTER the upgrade: kubectl +
+# dashboard lose multi-issuer auth (the apiserver does NOT crash on this — verified
+# by an isolated repro 2026-06-24; the 2026-06-24 v1.35 upgrade *stall* was a
+# separate etcd IO-starvation issue, see
+# docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md). The
+# remote script below now ALSO reconciles (3) via `kubeadm init phase
+# upload-config`, so a future kubeadm upgrade regenerates a CORRECT manifest. The
+# k8s-version-upgrade chain additionally ALERTS (does not block — SSO drift is
+# recoverable) via `kubeadm upgrade diff` in preflight if --authentication-config
+# would still be dropped.
 #
 # SAFETY: the remote script health-gates on /livez and AUTO-ROLLS-BACK the
 # manifest from a timestamped backup if the apiserver does not recover, so a
-# malformed config cannot leave the single master down.
+# malformed config cannot leave the single master down. Reconciling kubeadm-config
+# is zero-impact on the running cluster (the CM is only read during an upgrade).

 variable "k8s_master_host" {
   type = string
@@ -97,12 +110,55 @@ locals {
     print('flag-inserted' if done else 'ANCHOR-NOT-FOUND')
   PY

+  # Reconciles the kubeadm-config ClusterConfiguration's apiServer.extraArgs:
+  # drops the stale single-issuer --oidc-* args and ensures --authentication-config
+  # is present (anchored after --authorization-mode). Stdlib-only (the master is
+  # only guaranteed python3, not pyyaml/yq). Idempotent; preserves all other
+  # fields (etcd args, audit args, extraVolumes) verbatim. Exits 3 if the
+  # authorization-mode anchor is missing (fail loud, leave the CM untouched).
+  kubeadm_oidc_reconcile_py = <<-PY
+    import sys
+    lines = sys.stdin.read().split('\n')
+    out, i, n = [], 0, len(lines)
+    have_authn = any('name: authentication-config' in l for l in lines)
+    inserted = have_authn
+    while i < n:
+        ln = lines[i]; s = ln.strip()
+        if s.startswith('- name: oidc-'):
+            i += 2 if (i + 1 < n and lines[i + 1].strip().startswith('value:')) else 1
+            continue
+        out.append(ln)
+        if (not inserted) and s == '- name: authorization-mode':
+            indent = ln[:len(ln) - len(ln.lstrip())]
+            if i + 1 < n and lines[i + 1].strip().startswith('value:'):
+                out.append(lines[i + 1]); i += 2
+            else:
+                i += 1
+            out.append(indent + '- name: authentication-config')
+            out.append(indent + ' value: /etc/kubernetes/pki/auth-config.yaml')
+            inserted = True
+            continue
+        i += 1
+    if not inserted:
+        sys.stderr.write('ANCHOR-NOT-FOUND: authorization-mode\n'); sys.exit(3)
+    sys.stdout.write('\n'.join(out))
+  PY
+
   # Whole remote operation, base64-embedded for byte-exact transfer (no
   # heredoc/escaping hazards across SSH).
   apiserver_auth_remote_script = <<-SH
     MANIFEST=/etc/kubernetes/manifests/kube-apiserver.yaml
     AUTHCFG=/etc/kubernetes/pki/auth-config.yaml
     TS=$(date +%s)
+    # Manifest backups MUST live OUTSIDE /etc/kubernetes/manifests/ — the kubelet
+    # treats EVERY file in that dir as a static pod, so a kube-apiserver.yaml.bak
+    # there becomes a SECOND apiserver static pod. On a kubeadm upgrade (when the
+    # real manifest's image changes) the two conflict, the kubelet flip-flops, the
+    # new apiserver never stabilises → kubeadm "static Pod hash did not change" →
+    # rollback. This stalled the 1.34->1.35 upgrade for days (root cause found
+    # 2026-06-26; the old `cp "$MANIFEST" "$MANIFEST.bak"` planted it on 2026-06-18).
+    BAKDIR=/etc/kubernetes/apiserver-oidc-bak
+    sudo install -d -m 700 "$BAKDIR"

     # 1. Write the structured AuthenticationConfiguration (hot-reloaded by the
     # apiserver on change; mounted into the pod via the existing pki hostPath).
@ -112,7 +168,7 @@ locals {
|
||||||
# 2. Ensure the apiserver references it. Only touch the manifest (→ restart)
|
# 2. Ensure the apiserver references it. Only touch the manifest (→ restart)
|
||||||
# when the flag is missing; otherwise the file write above hot-reloads.
|
# when the flag is missing; otherwise the file write above hot-reloads.
|
||||||
if ! sudo grep -q -- '--authentication-config=' "$MANIFEST"; then
|
if ! sudo grep -q -- '--authentication-config=' "$MANIFEST"; then
|
||||||
sudo cp "$MANIFEST" "$MANIFEST.bak.$TS"
|
sudo cp "$MANIFEST" "$BAKDIR/kube-apiserver.yaml.$TS"
|
||||||
sudo sed -i '/--oidc-issuer-url/d;/--oidc-client-id/d;/--oidc-username-claim/d;/--oidc-groups-claim/d' "$MANIFEST"
|
sudo sed -i '/--oidc-issuer-url/d;/--oidc-client-id/d;/--oidc-username-claim/d;/--oidc-groups-claim/d' "$MANIFEST"
|
||||||
echo '${base64encode(local.apiserver_flag_insert_py)}' | base64 -d | sudo python3 - "$MANIFEST"
|
echo '${base64encode(local.apiserver_flag_insert_py)}' | base64 -d | sudo python3 - "$MANIFEST"
|
||||||
fi
|
fi
|
||||||
|
|
@ -131,12 +187,36 @@ locals {
|
||||||
done
|
done
|
||||||
if [ "$ok" != "1" ]; then
|
if [ "$ok" != "1" ]; then
|
||||||
echo "kube-apiserver UNHEALTHY after change — rolling back"
|
echo "kube-apiserver UNHEALTHY after change — rolling back"
|
||||||
BAK=$(ls -t "$MANIFEST".bak.* 2>/dev/null | head -1)
|
BAK=$(ls -t "$BAKDIR"/kube-apiserver.yaml.* 2>/dev/null | head -1)
|
||||||
if [ -n "$BAK" ]; then sudo cp "$BAK" "$MANIFEST"; fi
|
if [ -n "$BAK" ]; then sudo cp "$BAK" "$MANIFEST"; fi
|
||||||
for i in $(seq 1 60); do sleep 2; if curl -sk https://localhost:6443/livez 2>/dev/null | grep -q '^ok'; then break; fi; done
|
for i in $(seq 1 60); do sleep 2; if curl -sk https://localhost:6443/livez 2>/dev/null | grep -q '^ok'; then break; fi; done
|
||||||
echo "rolled back to previous manifest"; exit 1
|
echo "rolled back to previous manifest"; exit 1
|
||||||
fi
|
fi
|
||||||
echo "kube-apiserver healthy with multi-issuer --authentication-config"
|
echo "kube-apiserver healthy with multi-issuer --authentication-config"
|
||||||
|
|
||||||
|
    # 5. Reconcile kubeadm-config so a FUTURE `kubeadm upgrade` regenerates the
    # apiserver manifest WITH --authentication-config instead of reverting to
    # the stale single-issuer --oidc-* flags. Without this, kubeadm rewrote the
    # manifest from kubeadm-config on every control-plane upgrade and the
    # regenerated apiserver crash-looped (the 2026-06-24 v1.35 upgrade stall).
    # Zero live impact (the CM is only read at upgrade time); idempotent;
    # best-effort (the chain's `kubeadm upgrade diff` preflight gate is the
    # backstop if this cannot run).
    KC="sudo kubectl --kubeconfig /etc/kubernetes/admin.conf"
    CC=$($KC -n kube-system get cm kubeadm-config -o jsonpath='{.data.ClusterConfiguration}' 2>/dev/null || true)
    if [ -n "$CC" ] && { echo "$CC" | grep -q 'oidc-issuer-url' || ! echo "$CC" | grep -q 'authentication-config'; }; then
      echo "Reconciling kubeadm-config (oidc-* -> authentication-config) so kubeadm upgrade keeps structured auth"
      echo '${base64encode(local.kubeadm_oidc_reconcile_py)}' | base64 -d > /tmp/reconcile_kubeadm_oidc.py
      if printf '%s' "$CC" | python3 /tmp/reconcile_kubeadm_oidc.py > /tmp/kubeadm-cc-new.yaml \
        && sudo kubeadm init phase upload-config kubeadm --config /tmp/kubeadm-cc-new.yaml; then
        echo "kubeadm-config reconciled: future control-plane upgrades keep --authentication-config"
      else
        echo "WARN: kubeadm-config reconcile failed; the upgrade-chain preflight gate will block the next upgrade"
      fi
      rm -f /tmp/reconcile_kubeadm_oidc.py /tmp/kubeadm-cc-new.yaml
    else
      echo "kubeadm-config already uses --authentication-config (no oidc drift)"
    fi
|
||||||
SH
|
SH
|
||||||
}
|
}
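The reconcile step added in this hunk can be exercised offline before it touches a cluster. The following is a minimal sketch, not the committed script: it wraps the same line-based logic in a function, assumes two-space indentation under each `- name:` entry (whitespace is not recoverable from this rendering), and uses a hypothetical sample `ClusterConfiguration`. The names `reconcile`, `needs_reconcile`, and `SAMPLE` are illustrative only.

```python
AUTH_CONFIG_PATH = "/etc/kubernetes/pki/auth-config.yaml"


def needs_reconcile(cc: str) -> bool:
    """Mirror the shell guard: a non-empty config that still carries
    oidc-issuer-url, or that lacks authentication-config, needs work."""
    return bool(cc) and ("oidc-issuer-url" in cc or "authentication-config" not in cc)


def reconcile(cc: str) -> str:
    """Drop every `- name: oidc-*` entry (plus its value line) and insert
    authentication-config right after the authorization-mode entry."""
    lines = cc.split("\n")
    out, i, n = [], 0, len(lines)
    # If the flag is already present, only the oidc-* cleanup runs.
    inserted = any("name: authentication-config" in l for l in lines)
    while i < n:
        ln = lines[i]
        s = ln.strip()
        if s.startswith("- name: oidc-"):
            # Skip the entry and, when present, the value line that follows it.
            i += 2 if (i + 1 < n and lines[i + 1].strip().startswith("value:")) else 1
            continue
        out.append(ln)
        if not inserted and s == "- name: authorization-mode":
            indent = ln[: len(ln) - len(ln.lstrip())]
            if i + 1 < n and lines[i + 1].strip().startswith("value:"):
                out.append(lines[i + 1])
                i += 2
            else:
                i += 1
            out.append(indent + "- name: authentication-config")
            out.append(indent + "  value: " + AUTH_CONFIG_PATH)
            inserted = True
            continue
        i += 1
    if not inserted:
        # The original script writes this to stderr and exits with status 3.
        raise ValueError("ANCHOR-NOT-FOUND: authorization-mode")
    return "\n".join(out)


# Hypothetical input; the issuer URL and client id are placeholders.
SAMPLE = """\
apiServer:
  extraArgs:
  - name: authorization-mode
    value: Node,RBAC
  - name: oidc-issuer-url
    value: https://issuer.example.invalid
  - name: oidc-client-id
    value: kubernetes
kind: ClusterConfiguration"""

if __name__ == "__main__":
    print(reconcile(SAMPLE))
```

Running the function a second time on its own output should return the text unchanged, which is the idempotence the step 5 comment relies on.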
|
||||||
|
|
||||||
|
|
@ -155,6 +235,14 @@ resource "null_resource" "apiserver_oidc_config" {
|
||||||
}
|
}
|
||||||
|
|
||||||
triggers = {
|
triggers = {
|
||||||
|
    # Intentionally hash ONLY the issuer config, NOT the remote script. CI applies
    # the rbac stack with no ssh_private_key (var defaults to ""), so a re-run of
    # this SSH provisioner in CI would fail — hence the null_resource must stay a
    # no-op on a plain CI apply. Script changes (e.g. the 2026-06-24 kubeadm-config
    # reconciliation) reach the cluster via the apiserver-oidc-restore ConfigMap
    # below (a plain k8s resource, no ssh) which the upgrade chain re-runs. To force
    # this provisioner to re-run after a script change, apply locally with
    # `-replace` + TF_VAR_ssh_private_key (see docs/runbooks/k8s-version-upgrade.md).
|
||||||
auth_config = sha256(local.apiserver_auth_config_yaml)
|
auth_config = sha256(local.apiserver_auth_config_yaml)
|
||||||
}
|
}
|
||||||
}
|
}
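The trigger comment in this resource explains why only the issuer configuration is hashed. A minimal Python sketch of that behaviour follows, assuming Terraform's `sha256` yields the hex digest of the UTF-8 string. The `triggers` helper and the sample values are hypothetical and only illustrate that a script edit leaves the trigger map unchanged, while a config edit changes it.

```python
import hashlib


def triggers(auth_config_yaml: str, remote_script: str) -> dict:
    """Build the trigger map the way the resource does: the remote script is
    accepted but deliberately ignored, so editing it never forces a re-run."""
    digest = hashlib.sha256(auth_config_yaml.encode("utf-8")).hexdigest()
    return {"auth_config": digest}


# Placeholder contents, not the real configuration or script.
config_v1 = "kind: AuthenticationConfiguration\njwt: []\n"
config_v2 = "kind: AuthenticationConfiguration\njwt: [{}]\n"
script_v1 = "echo step-4"
script_v2 = "echo step-4; echo step-5"

if __name__ == "__main__":
    # Script changed, config unchanged: same triggers, so the provisioner no-ops.
    print(triggers(config_v1, script_v1) == triggers(config_v1, script_v2))
    # Config changed: different triggers, so the provisioner is replaced and re-runs.
    print(triggers(config_v1, script_v1) == triggers(config_v2, script_v1))
```

The cost of this choice is the one the comment names: a script-only change needs a local `-replace` apply to reach the node through this resource.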
|
||||||
|
|
|
||||||
|
|
@ -7,6 +7,9 @@ variable "redis_host" { type = string }
|
||||||
variable "mysql_host" { type = string }
|
variable "mysql_host" { type = string }
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -36,6 +39,9 @@ resource "kubernetes_manifest" "external_secret" {
|
||||||
# DB credentials from Vault database engine (rotated automatically)
|
# DB credentials from Vault database engine (rotated automatically)
|
||||||
# Provides DB_CONNECTION_STRING that auto-updates when password rotates
|
# Provides DB_CONNECTION_STRING that auto-updates when password rotates
|
||||||
resource "kubernetes_manifest" "db_external_secret" {
|
resource "kubernetes_manifest" "db_external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -85,6 +91,9 @@ data "kubernetes_secret" "eso_secrets" {
|
||||||
# fresh node would also fail. ESO renders the dockerconfigjson server-side
|
# fresh node would also fail. ESO renders the dockerconfigjson server-side
|
||||||
# (Sprig `b64enc`) so the PAT never sits in K8s in cleartext.
|
# (Sprig `b64enc`) so the PAT never sits in K8s in cleartext.
|
||||||
resource "kubernetes_manifest" "dockerhub_pull_secret" {
|
resource "kubernetes_manifest" "dockerhub_pull_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -55,6 +55,9 @@ resource "kubernetes_namespace" "recruiter_responder" {
|
||||||
# Schema in CNPG: `recruiter_responder` (alembic creates on first migrate).
|
# Schema in CNPG: `recruiter_responder` (alembic creates on first migrate).
|
||||||
# DB user: created via Vault database engine — see static-creds/pg-recruiter-responder.
|
# DB user: created via Vault database engine — see static-creds/pg-recruiter-responder.
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -107,6 +110,9 @@ resource "kubernetes_manifest" "external_secret" {
|
||||||
# Pre-req in dbaas: CNPG cluster has DB `recruiter_responder`, role
|
# Pre-req in dbaas: CNPG cluster has DB `recruiter_responder`, role
|
||||||
# `recruiter_responder`, and Vault role `static-creds/pg-recruiter-responder`.
|
# `recruiter_responder`, and Vault role `static-creds/pg-recruiter-responder`.
|
||||||
resource "kubernetes_manifest" "db_external_secret" {
|
resource "kubernetes_manifest" "db_external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -41,6 +41,9 @@ module "tls_secret" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -25,6 +25,9 @@ resource "kubernetes_namespace" "rybbit" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -185,6 +185,9 @@ resource "kubernetes_service" "aiostreams" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "probe_secrets" {
|
resource "kubernetes_manifest" "probe_secrets" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -5,6 +5,9 @@ variable "tls_secret_name" {
|
||||||
variable "nfs_server" { type = string }
|
variable "nfs_server" { type = string }
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -21,6 +21,9 @@ resource "kubernetes_namespace" "shadowsocks" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -20,6 +20,9 @@ resource "kubernetes_namespace" "speedtest" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -16,6 +16,9 @@
|
||||||
# `secret/stem95su.rclone_conf`. A failed run surfaces as a failed Job.
|
# `secret/stem95su.rclone_conf`. A failed run surfaces as a failed Job.
|
||||||
|
|
||||||
resource "kubernetes_manifest" "rclone_external_secret" {
|
resource "kubernetes_manifest" "rclone_external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -58,6 +58,9 @@ resource "kubernetes_namespace" "t3_afk" {
|
||||||
# (wired into ~/.gitconfig insteadOf rewrites in the container command).
|
# (wired into ~/.gitconfig insteadOf rewrites in the container command).
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -22,6 +22,9 @@ resource "kubernetes_namespace" "tandoor" {
|
||||||
}
|
}
|
||||||
|
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
Some files were not shown because too many files have changed in this diff