WIP: goldmane-edge-aggregator deploy stack + vault role + ghcr allowlist (infra #58)

NOT APPLIED. Staged for a fresh-session finish (see memory runbook).

Contains:
- stacks/goldmane-edge-aggregator/{main.tf,terragrunt.hcl}: namespace, TF-minted mTLS client cert from tigera-ca-private, goldmane_edges PG DB-init Job, db + slack ExternalSecrets, aggregate Deployment + digest CronJob.
- stacks/vault/main.tf: pg-goldmane-edges static rotation role (Tier-0).
- stacks/kyverno/.../ghcr-credentials.tf: ns added to the private-image allowlist.

KNOWN BLOCKER: the stack uses the hashicorp/tls provider (cert minting) but the root terragrunt.hcl generate "k8s_providers" block doesn't declare it, and a second required_providers (the removed versions.tf) is illegal. FIX = add tls to that global block (mirrors proxmox/kubectl).

Then apply order: db_init (creates goldmane_edges role) -> kyverno -> vault (Tier-0, plan-review) -> stack ExternalSecrets (targeted, first-apply) -> stack full -> verify mTLS to goldmane:7443.

Vault KV secret/goldmane-edge-aggregator already created.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
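
Illustrative sketch of the KNOWN BLOCKER fix described in the message above: declaring the tls provider inside the single root-level generate "k8s_providers" block, so that no second required_providers block is needed in the stack. This is not taken from the repository. The generated file name, the `if_exists` mode, and the version constraint are assumptions, and the existing entries (the message mentions proxmox and kubectl) are left as a placeholder comment.

```hcl
# Root terragrunt.hcl (sketch only; attribute values are assumptions).
generate "k8s_providers" {
  path      = "providers.tf"          # assumed file name
  if_exists = "overwrite_terragrunt"  # assumed mode
  contents  = <<EOF
terraform {
  required_providers {
    # ...existing entries (e.g. proxmox, kubectl) stay unchanged...

    # New entry: needed by the stack's TF-minted mTLS client cert.
    tls = {
      source  = "hashicorp/tls"
      version = "~> 4.0" # assumed constraint; pin to whatever the repo uses
    }
  }
}
EOF
}
```

Keeping the declaration in the one generated block matches the message's reasoning: Terraform rejects a duplicate required_providers configuration within a module, which is why the per-stack versions.tf was removed.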
2026-06-24 13:01:37 +00:00
26 changed files with 4053 additions and 4916 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -243,8 +243,7 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se
 - **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.)
 - **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
 - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
+- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
 - **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (the shared webhook's Slack app isn't in `#security` → 404 channel_not_found; flip `SLACK_CHANNEL` back once invited — see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
 - **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
 ## Storage & Backup Architecture
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@@ -13,8 +13,6 @@
 | authentik | Identity provider (SSO) | authentik |
 | cloudflared | Cloudflare tunnel | cloudflared |
 | authelia | Auth middleware (may be merged into ebooks or removed) | platform |
 | goldmane | Calico 3.30 OSS flow aggregator (`goldmane.calico-system.svc:7443`, gRPC/mTLS). Stamps identity (ns/pod/workload/labels + allow-deny) on every flow from Felix into a ~60-min in-memory ring buffer — no etcd/API writes. East-west "who-talks-to-whom" source (ADR-0014). Enabled via operator CR (`kubectl_manifest.goldmane`). | calico |
 | whisker | Calico 3.30 OSS live flow-observability UI (`whisker.calico-system.svc:8081`) at `whisker.viktorbarzin.me` (Authentik-gated, `auth=required` — no own login; additive NP ORs Traefik past the operator default-deny). ~60-min live view of Goldmane flows, NOT history. Enabled via operator CR (`kubectl_manifest.whisker`). | calico |
 | monitoring | Prometheus/Grafana/Loki stack | monitoring |
 ## Storage & Security (Tier: cluster)
@@ -39,7 +37,6 @@
 ## Active Use
 | Service | Description | Stack |
 |---------|-------------|-------|
 | goldmane-edge-aggregator | Durable who-talks-to-whom audit trail (ADR-0014 / #58). Go service: `aggregate` Deployment streams Goldmane's gRPC `Flows.Stream` (mTLS) and upserts the low-cardinality namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`) into CNPG DB `goldmane_edges`; `goldmane-edges-digest` CronJob posts first-seen edges daily to `#security`. mTLS client cert REUSES the operator's `whisker-backend-key-pair` (re-apply if rotated). Tier-4-aux. Image `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (private). Runbook: [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md). | goldmane-edge-aggregator |
 | mailserver | Email (docker-mailserver) | mailserver |
 | shadowsocks | Proxy | shadowsocks |
 | webhook_handler | Webhook processing | webhook_handler |
@@ -164,4 +161,3 @@ procedures) are documented in `infra/docs/runbooks/`:
 | pfSense + Unbound DNS | [pfsense-unbound.md](../../docs/runbooks/pfsense-unbound.md) |
 | Mailserver PROXY-protocol / HAProxy | [mailserver-pfsense-haproxy.md](../../docs/runbooks/mailserver-pfsense-haproxy.md) |
 | Technitium apply flow | [technitium-apply.md](../../docs/runbooks/technitium-apply.md) |
 | Goldmane flow trail (east-west who-talks-to-whom) | [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md) |
--- a/.claude/skills/home-assistant/SKILL.md
+++ b/.claude/skills/home-assistant/SKILL.md
@@ -11,8 +11,8 @@ description: |
  There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
  Always use Home Assistant for smart home control.
 author: Claude Code
-version: 2.1.0
+version: 2.0.0
-date: 2026-06-24
+date: 2026-02-07
 ---
 # Home Assistant Control
@@ -395,27 +395,14 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
 ## ha-london Knowledge Map
 ### Overview
- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
+- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
 - **Location**: London, UK
- **Platform**: Raspberry Pi 4, HA OS
+- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
+- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
+- **Config path**: `/config/` (requires `sudo` for file access)
 - **Config path**: `/config/`
 - **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
 - **Zone**: London (home)
 ### Dashboards (redesigned 2026-06-24)
 **Glossary** (HA terms — keep distinct):
 - **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
 - **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
 - **Card** = a widget inside a view.
 - **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
  - **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
  - **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
 - **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.)
 - Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
 ### Key Systems
 #### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@@ -437,15 +424,10 @@ Named plugs with power/energy tracking:
 - PM1.0/2.5/4.0/10 particulate sensors
 - VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
-#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
+#### 3. Cowboy E-Bike
-Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
+- `sensor.bike_state_of_charge`: Battery %
- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
+- `sensor.bike_total_distance`: Total km
- `sensor.classic_performance_remaining_range`: Range km
+- `sensor.bike_total_co2_saved`: CO2 saved (grams)
 - `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
 - `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
 - Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
 - **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
 - Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
 #### 4. Uptime Monitoring (UptimeRobot)
 - `sensor.blog`: blog uptime
@@ -464,17 +446,12 @@ Bike named **"Classic Performance"** → entities are `sensor.classic_performanc
 - Scripts: `script.start_netflix`, `script.start_stremio`
 - Scene: `scene.night` (turns off Livia + Michelle plugs)
-### Custom Components (HACS integrations)
+### Custom Components
- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
+- **cowboy**: Cowboy e-bike integration (HACS)
- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
+- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
 ### HACS frontend cards (plugins)
 - **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
 ### Integrations
-ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
+ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
 - **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
 - **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
 ### AI / Voice Assistants
 - 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@@ -489,8 +466,15 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ook
 - Anca arrival/departure notifications
 - Night scene: turns off Livia + Michelle
-### Platform (HAOS — ignore any legacy `docker run` snippet)
+### Docker Setup
-ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker).
+```bash
 docker run -d --name homeassistant --privileged \
  -e TZ=Europe/London \
  -v /home/pi/docker/homeAssistant:/config \
  -v /run/dbus:/run/dbus:ro \
  --network=host --restart=unless-stopped \
  homeassistant/home-assistant:2025.9
 ```
 ### SSH Access
 ```bash
--- a/CONTEXT.md
+++ b/CONTEXT.md
@@ -125,7 +125,7 @@ How a **Service** is named in flow/audit data — its **namespace** is the prima
 _Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
 **Goldmane / Whisker**:
-Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#security` digest. As-built: `docs/runbooks/goldmane-flow-trail.md`.
+Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). Durable history requires emitting Goldmane flows to **Loki**; the in-memory buffer alone is not an audit trail.
 _Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
 ### Storage
--- a/docs/adr/0014-service-identity-and-east-west-observability.md
+++ b/docs/adr/0014-service-identity-and-east-west-observability.md
@@ -27,9 +27,3 @@ As the Service count grows we want an audit-grade record of which Service talks
 - **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
 - **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
 - **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
 ## As-built (2026-06-25)
 Implemented across infra issues #57–#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (the shared webhook can't reach `#security` — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
 Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.
--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@@ -321,17 +321,6 @@ Detects the inverse of the K-series alerts: a service that **must work WITHOUT A
 - **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
 - **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)
 #### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
 Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both route the default `slack-warning` receiver → **`#alerts`**.
 | Alert | Expr (abridged) | For | Severity |
 |---|---|---|---|
 | `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning |
 | `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning |
 The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`).
 #### Backup Alerts
 - **PostgreSQLBackupStale**: >36h since last backup
 - **MySQLBackupStale**: >36h since last backup
--- a/docs/architecture/security.md
+++ b/docs/architecture/security.md
@@ -364,67 +364,6 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
 - Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
 - Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).
 #### Deriving the per-namespace egress allowlist from the edge trail (Wave 1 W1.7)
 The durable **east-west flow trail** (below) is now the preferred data source for
 the *internal* (namespace-to-namespace) half of each Wave-1 egress allowlist —
 faster and identity-stamped vs the original iptables-`LOG`→journald→Loki path
 (ADR-0014: "Enforcement gains a better data source"). The unique observed
 namespace pairs live in CNPG DB `goldmane_edges`, table `edge`. To derive the
 namespaces a source is observed talking to (the `allow` set that seeds its
 NetworkPolicy):
 ```sql
 SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow' ORDER BY dst_ns;
 ```
 The full SQL recipe (whole-cluster matrix, deny sanity-checks, the ≥7-day
 observation caveat) is in
 [runbooks/goldmane-flow-trail.md → Deriving the Wave-1 egress allowlist](../runbooks/goldmane-flow-trail.md#deriving-the-wave-1-egress-allowlist-from-the-edge-table-infra-62).
 **External / public-internet egress is NOT in this table** (empty-namespace flows
 are dropped) — for those destinations keep using the Calico flow-log observation
 (the W1.6 snapshot, `wave1-egress-observation-2026-05-22.md`). This feeds the
 existing observe-then-enforce effort (beads `code-8ywc`); **enforce-flips remain
 out of scope** of the trail — it is observe-and-derive only.
 ### East-west flow observability (Goldmane / Whisker + edge trail) (ADR-0014)
 The "who-talks-to-whom" data plane that succeeds raw iptables-`LOG` lines (which
 carried no identity). **Service identity = the workload's namespace** (primary),
 refined by a `service-identity` label in the few multi-Service namespaces
 (`monitoring`, `kube-system`, `dbaas`). End-to-end trail, three layers:
 1. **Calico Goldmane + Whisker** (`calico-system`) — Goldmane aggregates
   identity-stamped flows (ns/pod/workload/labels + allow-deny + policy-trace)
   streamed from Felix over gRPC into a **~60-min in-memory ring buffer** (no
   etcd/API writes — the etcd-cost constraint that drove the design). **Whisker**
   is its live web UI at `whisker.viktorbarzin.me` (Authentik-gated,
   `auth = "required"` — Whisker has no own login; an additive NetworkPolicy ORs
   Traefik past the operator's default-deny `whisker` NP). The ring buffer is
   **not** a trail (lost on Goldmane restart). Enabled via operator CRs in
   `stacks/calico/main.tf`; reversible toggle (Goldmane is OSS tech-preview).
 2. **`goldmane-edge-aggregator`** (`stacks/goldmane-edge-aggregator`) — streams
   Goldmane's gRPC `Flows.Stream` over **mTLS** and upserts the low-cardinality
   namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,
   flow_count)`) into CNPG DB `goldmane_edges`. Self-edges and empty-namespace
   (public-internet) flows are dropped — in-cluster relationships only. The mTLS
   client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`**
   (Goldmane verifies CA-chain only, not identity) rather than copying the CA
   private key into TF state — **re-apply the stack if the operator rotates that
   Secret**.
 3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to
   **`#alerts`** (reuses the alert-digest webhook; a `#security` override 404s —
   that webhook's Slack app isn't a member of `#security`; see runbook).
 The trail is **attribution-grade, not cryptographic** (reconstructs events in a
 trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model
 limit; east-west stays plaintext, no mTLS between app pods). Health is covered by
 the **`AggregatorDown`** + **`DigestFailing`** alerts and cluster-health check #48
 (see monitoring.md). Full as-built, query recipes, and troubleshooting:
 [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md). Decision:
 [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md); glossary
 `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
 ### TLS & HTTP/3
 **Traefik** handles TLS termination:
--- a/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
+++ b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
@@ -1,97 +0,0 @@
 # Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
 > Filename kept for inbound links. The originally-suspected cause (kubeadm-config
 > OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
 > drift was a real *separate* latent bug fixed in the same change.
 **Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
 the master control-plane phase for the first time — preflight passed, etcd
 snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the
 kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
 static-pod-hash window across all internal retries, then auto-rolled-back to
 v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
 the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
 No data loss; no user-facing outage (the master carries control-plane taints, so
 no workloads were displaced).
 **Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
 first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
 static pods, i.e. the first time the upgrade pushes real write-IO at etcd.
 ## Root cause — etcd IO starvation on the shared HDD
 The new kube-apiserver could not establish/keep a working connection to etcd
 during the upgrade because **etcd was IO-starved**. etcd's surviving container log
 from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows:
 - **1,180** `apply request took too long` warnings in 16 minutes;
 - individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms),
  clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
  to bring the new apiserver up.
 A reproduced 1.35.6 apiserver with no etcd dies with
 `F instance.go:233 Error creating leases: error creating storage factory: context
 deadline exceeded` — the same failure mode a multi-second etcd produces. etcd
 lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
 shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
 that spindle:
 1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
 2. kubeadm dumping a full **~400MB etcd DB backup** to
   `/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
   etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
   cleans them up), pushing master root fs to **73%**, above the 70% kubelet
   image-GC threshold, so image GC churned during the drain too;
 3. master-drain pod evictions.
 ### Correction — it was NOT the OIDC flag swap
 `kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
 `--authentication-config` (structured multi-issuer OIDC) back to legacy
 single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
 was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
 those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
 (`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test
 etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
 the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
 were also ruled out.
 ## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
 apiserver auth is configured in three places that must agree:
 (1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
 + `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest
 (`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM —
 which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
 the manifest from (3), so it would have reverted structured auth → **dashboard +
 kubectl SSO break after a successful upgrade** (recoverable: the chain's
 post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.
 ## Resolution
 1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%.
 2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps.
 3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).
 ## Prevention (landed in this change)
 | Gap | Fix |
 |-----|-----|
 | kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
 | kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
 | etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
 ## Lessons
 - **Capture the failing component's own logs before concluding.** The `kubeadm
  upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
  applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is
  "what config changes," not "why it crashed."
 - **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
  2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
  backup copy + drain) onto that spindle. code-oflt is the real fix.
 - **Tools that leave per-operation scratch must be reaped.** kubeadm's
  `/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
  GC'd; 28GB had silently accumulated.
 - **Out-of-band control-plane edits must be written back to kubeadm-config** — else
  `kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).
--- a/docs/runbooks/goldmane-flow-trail.md
+++ b/docs/runbooks/goldmane-flow-trail.md
@ -1,301 +0,0 @@
 # Goldmane Flow Trail — east-west "who-talks-to-whom" observability
 > As-built runbook for the Calico Goldmane + Whisker flow plane and the
 > `goldmane-edge-aggregator` durable audit trail. Design + rationale:
 > [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
 > Glossary: `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
 > Implements infra issues #57 (Whisker ingress), #58 (aggregator), #61
 > (monitoring), #62 (egress allowlist queries), #63 (these docs).
 ## What the trail is
 Three layers turn raw east-west traffic into a queryable, durable record of
 which Service talks to which. **Service identity = the workload's namespace**
 (primary), refined by a `service-identity` label in the few multi-Service
 namespaces (`monitoring`, `kube-system`, `dbaas`) — see ADR-0014.
 | Layer | Component | Lifetime | Where it lives |
 |---|---|---|---|
 | **Live map** | Calico **Goldmane** + **Whisker** | ~60-min in-memory ring buffer (lost on Goldmane restart) | `calico-system`; Whisker UI at `whisker.viktorbarzin.me` |
 | **Durable trail** | `goldmane-edge-aggregator` (`aggregate` mode) | persistent | CNPG Postgres DB `goldmane_edges`, table `edge` |
 | **Notification** | `goldmane-edges-digest` CronJob (`digest` mode) | daily | Slack `#alerts` |
 **Goldmane** aggregates identity-stamped flows (namespace / pod / workload /
 labels + allow-deny + policy-trace) streamed from Felix (the existing
 `calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer —
 **nothing is written to etcd or the K8s API** (the etcd-cost constraint that
 drove the whole design). **Whisker** is its live web UI. Because the ring
 buffer is *not* a trail (a Goldmane restart loses the window), the
 `goldmane-edge-aggregator` consumes Goldmane's gRPC `Flows.Stream` API over
 mTLS and upserts the unique **namespace-pair edge set** into Postgres; a daily
 CronJob posts first-seen edges to Slack.
 The edge set is deliberately **low-cardinality** — one row per
 `(src_ns, dst_ns, action)`, *not* per-pod or per-port — so the table stays
 small no matter how much traffic flows.
 ## Where the data lives
 ### Whisker UI — live, ~60 min
 - `https://whisker.viktorbarzin.me` (Authentik-gated — Whisker ships no own
  login; `auth = "required"`). Shows the live flow stream + a service graph for
  roughly the last hour. Use it for "what is talking right now"; it is **not**
  history.
 - In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
  (HTTP), both in `calico-system`.
 ### CNPG `goldmane_edges` — durable
 - Postgres DB `goldmane_edges` on the CNPG cluster
  (`pg-cluster-rw.dbaas.svc.cluster.local:5432`). One table:
  ```
  edge(src_ns text, dst_ns text, action text,
       first_seen timestamptz, last_seen timestamptz, flow_count bigint,
       PRIMARY KEY (src_ns, dst_ns, action))
  ```
  - `action` ∈ `allow` / `deny` / `pass` / `unspecified` (normalised Goldmane
    action).
  - **Self-edges (`src_ns == dst_ns`) and empty-namespace flows** (host-endpoint
    / public-internet) are **dropped** — the trail is about in-cluster service
    relationships only. (Egress to the public internet is therefore NOT in this
    table; it lives in the Wave-1 Calico flow-log path — see security.md.)
  - A **"new edge"** = a row whose `first_seen` falls inside the digest window.
  - Role `goldmane_edges` (Vault-rotated, 7-day) owns the DB. The `edge` table
    is created idempotently by the aggregator at startup (canonical DDL also in
    the repo at `migrations/0001_edge.sql`).
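 The reduction from raw flows to the edge set can be sketched in shell. This is an
 illustration only, not the aggregator's code: the three-column `src_ns dst_ns action`
 input format and the `reduce_edges` name are assumptions, with `-` standing in for an
 empty namespace. The trailing count plays the role of `flow_count`; `first_seen` /
 `last_seen` are left out.
 ```shell
 # Hypothetical sketch: collapse flow records into unique namespace-pair edges.
 # Input (assumed format): one flow per line, "src_ns dst_ns action",
 # with "-" marking an empty namespace (host endpoint / public internet).
 reduce_edges() {
   awk '
     $1 == "-" || $2 == "-" { next }   # drop empty-namespace flows
     $1 == $2               { next }   # drop self-edges
     { count[$1 " " $2 " " $3]++ }     # one key per (src_ns, dst_ns, action)
     END { for (k in count) print k, count[k] }
   ' | sort
 }
 ```
 However many flows arrive, the output has at most one line per
 `(src_ns, dst_ns, action)` triple, which is why the table stays small.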
 ### Slack `#alerts` — daily digest
 > **Channel note (2026-06-25):** posts to **`#alerts`**, not `#security`. The shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of `#security`, so a channel override there returns HTTP `404 channel_not_found` (this almost certainly also breaks alertmanager's `slack-security` receiver — verify separately). To route the digest (and security alerts) to `#security`: invite that webhook's Slack app to `#security`, then set `SLACK_CHANNEL=#security` in `stacks/goldmane-edge-aggregator` and re-apply.
 - CronJob `goldmane-edges-digest` (08:00 Europe/London) posts edges first seen
  in the last 24h. Quiet when there are none. Reuses the existing alert-digest
  Slack incoming webhook (Vault `secret/viktor` → `alertmanager_slack_api_url`)
  — no new webhook was created.
 ## How to enable / disable
 ### Goldmane + Whisker (the flow plane)
 Operator CRs in **`stacks/calico/main.tf`** — NOT the Helm `goldmane`/`whisker`
 flags (those stay `false`; the operator's own `installation`/`apiServer` are
 operator-managed via the `goldmanes`/`whiskers.operator.tigera.io` CRDs):
 - `kubectl_manifest.goldmane` (kind `Goldmane`) — creating it makes the operator
  re-render `calico-node` with the `FELIX_FLOWLOGSGOLDMANESERVER` env (the
  operator auto-wires Felix — **do NOT patch FelixConfiguration**), triggering a
  supervised `calico-node` DaemonSet roll. Yields `Deployment` + `Service
  goldmane:7443`.
 - `kubectl_manifest.whisker` (kind `Whisker`, `depends_on` goldmane;
  `notifications = Disabled`). Yields `Deployment` + `Service whisker:8081`.
 **To disable:** delete those two CRs and re-apply `stacks/calico`. Reversible
 toggle (Goldmane is tech-preview in OSS Calico 3.30 — the main standing risk per
 ADR-0014).
 ### Whisker public ingress (infra #57)
 Also in `stacks/calico/main.tf`:
 - `module "ingress_whisker"` (`ingress_factory`, `auth = "required"`,
  `dns_type = "proxied"`) → `whisker.viktorbarzin.me`.
 - `kubernetes_network_policy_v1.whisker_allow_traefik` — **required alongside the
  ingress**: the operator's own `whisker` NetworkPolicy (owned by the Whisker CR)
  sets `policyTypes: [Ingress]` with no rules, which means default-deny ingress to the pod.
  This additive NP ORs in an allow for `namespaceSelector
  kubernetes.io/metadata.name=traefik` on TCP 8081. Without it Traefik 502s.
 ### The aggregator + digest (the durable trail) — `stacks/goldmane-edge-aggregator`
 A Tier-1 stack (PG state) mirroring the claude-memory pattern. Apply it by running
 `scripts/tg apply` from `stacks/goldmane-edge-aggregator/`. It provisions: the namespace,
 the mTLS client material, the Postgres DB-init Job, the `DATABASE_URL`
 ExternalSecret (Vault static role `pg-goldmane-edges`), the Slack ExternalSecret,
 the `aggregate` Deployment, and the `digest` CronJob. **To disable the trail
 without touching the flow plane:** scale `deployment/goldmane-edge-aggregator` to
 0 (transient) or remove the stack (permanent) — Goldmane/Whisker keep running.
 Image: `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (PRIVATE) — the
 `goldmane-edge-aggregator` namespace must be in the `ghcr-credentials` Kyverno
 allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`,
 `local.ghcr_private_namespaces`) or image pulls fail with HTTP 401. Code repo:
 `~/code/goldmane-edge-aggregator` (see its `README.md` + `DEPLOY.md`).
 ## mTLS cert — the REUSE decision (cert-reuse gotcha)
 The aggregator dials `goldmane:7443` over **mutual TLS**. Goldmane requires the
 client cert to chain to the **Tigera CA**, but it does **NOT authorize by client
 identity** — any Tigera-CA-signed cert is accepted.
 Rather than copy the Tigera CA **private key** into Terraform state to mint our
 own cert (a needless CA-key exposure; the `hashicorp/tls` provider also clashes
 with this repo's global generate-providers/lockfile pattern), the stack
 **REUSES the operator-minted, Tigera-CA-signed `whisker-backend-key-pair`
 Secret** (`calico-system`), copying its `tls.crt`/`tls.key` into the
 `goldmane-client-tls` Secret in the aggregator namespace. The CA *bundle* that
 verifies Goldmane's serving cert (`tigera-ca-bundle` ConfigMap, key
 `tigera-ca-bundle.crt`) is likewise copied verbatim (a ConfigMap can't be
 cross-namespace-mounted).
 > **GOTCHA — if the operator rotates `whisker-backend-key-pair`, re-apply
 > `stacks/goldmane-edge-aggregator`** to re-sync the copied cert. Symptom of a
 > stale copy: the `aggregate` pod logs TLS handshake / `Flows.Stream` failures
 > and no `last_seen` updates land in the `edge` table. Hardening follow-up
 > (noted in the stack): mint an own-identity cert in-namespace if Whisker is ever
 > removed (which would delete the reused source Secret).
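 A stale copy can also be detected directly by comparing the two Secrets. The sketch
 below is not part of the stack; the function names are hypothetical, and it assumes
 the copy is byte-identical to the source so the base64 `tls.crt` values can be compared
 as strings.
 ```shell
 # Hypothetical helper: succeed only when two non-empty cert blobs are identical.
 certs_in_sync() {
   [ -n "$1" ] && [ "$1" = "$2" ]
 }
 # Read the base64 tls.crt of a Secret: secret_cert <name> <namespace>.
 secret_cert() {
   kubectl get secret "$1" -n "$2" -o jsonpath='{.data.tls\.crt}'
 }
 # Compare the operator-minted source with the copy in the aggregator namespace.
 check_goldmane_client_cert() {
   local src copy
   src=$(secret_cert whisker-backend-key-pair calico-system)
   copy=$(secret_cert goldmane-client-tls goldmane-edge-aggregator)
   if certs_in_sync "$src" "$copy"; then
     echo "in sync"
   else
     echo "STALE: re-apply stacks/goldmane-edge-aggregator"
     return 1
   fi
 }
 ```
 Running `check_goldmane_client_cert` before digging through pod logs separates a
 rotated cert from other `Flows.Stream` failures.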
 The Deployment leaves `GOLDMANE_HOST=goldmane.calico-system.svc.cluster.local:7443`
 and the default cert/CA paths; the default ServerName (host sans port) is a SAN
 on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` /
 `GOLDMANE_TLS_INSECURE` override is needed.
 ## How to query who-talks-to-whom
 `psql` into the DB (creds: Vault static role `static-creds/pg-goldmane-edges`, or
 exec a CNPG pod). All queries are against the single `edge` table.
 ```sql
 -- Everything talking to a namespace (inbound), most-active first
 SELECT src_ns, action, flow_count, first_seen, last_seen
 FROM edge WHERE dst_ns = '<ns>' ORDER BY flow_count DESC;
 -- Everything a namespace talks TO (outbound)
 SELECT dst_ns, action, flow_count, first_seen, last_seen
 FROM edge WHERE src_ns = '<ns>' ORDER BY last_seen DESC;
 -- New edges in the last 24h (what the digest reports)
 SELECT src_ns, dst_ns, action, flow_count, first_seen
 FROM edge WHERE first_seen > now() - interval '24 hours'
 ORDER BY first_seen DESC;
 -- Any DENIED edges (policy is dropping this pair)
 SELECT src_ns, dst_ns, flow_count, last_seen
 FROM edge WHERE action = 'deny' ORDER BY last_seen DESC;
 -- Full edge set as a graph adjacency list
 SELECT src_ns, dst_ns, action, flow_count FROM edge ORDER BY src_ns, dst_ns;
 ```
 For the **live** (sub-hour) view including pod/port detail, use the Whisker UI —
 the `edge` table intentionally aggregates that away.
 ## Deriving the Wave-1 egress allowlist from the edge table (infra #62)
 The durable edge set is a faster, identity-stamped data source for the existing
 **observe-then-enforce** egress effort (beads `code-8ywc`; snapshot
 `docs/architecture/wave1-egress-observation-2026-05-22.md`) than the original
 iptables-`LOG` → journald → Loki path (ADR-0014 consequence: "Enforcement gains
 a better data source"). It replaces the *internal* (namespace-to-namespace) leg
 of the allowlist; **external/public-internet egress is NOT in this table** (empty
 dst namespace, dropped) — for those destinations keep using the Calico flow-log
 path described in security.md.
 **Per-namespace internal egress allowlist** — the set of in-cluster namespaces a
 given source is *observed* talking to with `action='allow'`:
 ```sql
 -- Internal egress allowlist for one namespace (feeds its NetworkPolicy)
 SELECT DISTINCT dst_ns
 FROM edge
 WHERE src_ns = '<ns>' AND action = 'allow'
 ORDER BY dst_ns;
 ```
 ```sql
 -- Full internal egress matrix for all namespaces at once
 SELECT src_ns, array_agg(DISTINCT dst_ns ORDER BY dst_ns) AS allowed_dst_ns
 FROM edge
 WHERE action = 'allow'
 GROUP BY src_ns
 ORDER BY src_ns;
 ```
 ```sql
 -- Sanity: namespaces with a DENY edge already (policy is biting; investigate
 -- before tightening further)
 SELECT DISTINCT src_ns, dst_ns FROM edge WHERE action = 'deny';
 ```
 **How this feeds enforcement (scope):** the derived `dst_ns` set is the
 *internal* half of a namespace's egress allowlist — it tells you which
 in-cluster namespaces to permit before flipping that namespace to default-deny.
 The universal baseline (kube-dns :53, often dbaas :3306/:5432, redis :6379) and
 the external destinations still come from the Wave-1 observation snapshot.
 **Enforce-flips remain OUT OF SCOPE** here — this is observe-and-derive only;
 the phased per-namespace default-deny rollout (starting `recruiter-responder`)
 is tracked under `code-8ywc`. Cross-links:
 [security.md → NetworkPolicy Default-Deny Egress](../architecture/security.md#networkpolicy-default-deny-egress-wave-1--observe-then-enforce-tier-34),
 [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md),
 [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
 > **Caveat (same as the Wave-1 snapshot):** an edge only exists if it was
 > *observed*. A weekly CronJob or a 7-day Vault rotation may not have fired yet —
 > collect ≥7 days of edges before treating a namespace's `allow` set as
 > complete. The `first_seen` column tells you how long an edge has been known;
 > the digest surfaces brand-new ones daily.
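 For scripted use, the per-namespace query can be wrapped so the namespace is never
 spliced into the SQL text. This is a sketch under assumptions: the function name is
 hypothetical and `DATABASE_URL` is assumed to point at the `goldmane_edges` DB.
 ```shell
 # Hypothetical wrapper: print the observed internal egress allowlist for one namespace.
 # The namespace is passed as a psql variable and expanded with :'ns', so psql quotes
 # it as a literal. The query is fed on stdin because psql does not interpolate
 # variables in -c commands.
 internal_egress_allowlist() {
   printf '%s\n' "SELECT DISTINCT dst_ns FROM edge WHERE src_ns = :'ns' AND action = 'allow' ORDER BY dst_ns;" \
     | psql "$DATABASE_URL" -X -At -v ns="$1"
 }
 ```
 Usage: `internal_egress_allowlist recruiter-responder` prints one `dst_ns` per line.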
 ## Monitoring & health (infra #61)
 The aggregator pod has **no `/metrics` endpoint** — health is inferred from
 kube-state-metrics. Three complementary signals (memory ids 6598, 6599;
 see also [monitoring.md → Security Alerts](../architecture/monitoring.md#security-alerts-wave-1--planned-beads-code-8ywc)):
 | Signal | What | Where |
 |---|---|---|
 | **`AggregatorDown`** | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` for 15m → warning | Prometheus alert group `Network Observability (Goldmane)` in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`; routes `slack-warning` → `#alerts` |
 | **`DigestFailing`** | `kube_job_status_failed{...job_name=~"goldmane-edges-digest.*"} > 0` within 24h, for 30m → warning | same alert group → `#alerts` |
 | **cluster-health #48** | `check_goldmane_aggregator` reads the Deployment's `Available` condition (missing or not-Available → FAIL) | `scripts/cluster_healthcheck.sh` (human / `--quiet` / `--json` modes; emits `goldmane_aggregator`) |
 The two alert layers are deliberately complementary: `AggregatorDown` →
 **no new edges land** in the DB; `DigestFailing` → **edges still land but nobody
 is told**. A freshness probe (#61b) was intentionally skipped — `AggregatorDown`
 is the agreed floor.
 ## Troubleshooting
 **Whisker UI 502 / unreachable.** The additive
 `kubernetes_network_policy_v1.whisker_allow_traefik` is missing or the
 operator's default-deny `whisker` NP regenerated — re-apply `stacks/calico`. A
 brand-new ingress host is also invisible to LAN split-horizon until the hourly
 `technitium-ingress-dns-sync` runs (memory #5349); test meanwhile with
 `curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
 (expect a 302 to Authentik — the gate working).
 **No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
 pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).
 Common causes, in order:
 1. **Stale mTLS cert** — the operator rotated `whisker-backend-key-pair`; re-apply
   `stacks/goldmane-edge-aggregator` (see cert-reuse gotcha above). Symptom: TLS
   handshake / `Flows.Stream` errors.
 2. **Stale DB password** — the 7-day Vault rotation bounced the credential but
   the pod kept the old one. The Deployment carries
   `secret.reloader.stakater.com/reload: goldmane-edges-db-creds`; if it's not
   restarting on rotation, verify the Reloader annotation and the ExternalSecret.
 3. **Goldmane restarted** — the in-memory window was lost (expected); the stream
   reconnects automatically and resumes upserting. No data loss in the DB
   (only the sub-hour live window in Whisker is gone).
 **Digest never posts / `DigestFailing` firing.** Inspect the most recent
 `goldmane-edges-digest-*` Job (`kubectl get jobs -n goldmane-edge-aggregator`;
 `kubectl logs -n goldmane-edge-aggregator job/<name>`). The CronJob's
 `ttl_seconds_after_finished=86400` garbage-collects finished Jobs (and their pods)
 after a day, so check soon after a failed run. With `SLACK_WEBHOOK_URL` empty the
 binary forces a dry-run (no post), so verify the `goldmane-edges-slack` ExternalSecret
 resolved. For a smoke test, run the image with `args: ["digest"]` + `DRY_RUN=1` to
 print the message instead of POSTing.
 > Known state (2026-06-25): the digest CronJob's first Job **failed** and it has
 > never successfully posted (`lastSuccessfulTime` empty) — the digest leg is the
 > live gap; `DigestFailing` is catching it. Edges still land in the DB via the
 > `aggregate` Deployment; only the Slack digest notification is affected.
 > Investigation/fix belongs to the aggregator slice (#58/#60), not monitoring.
 **No edges at all in the table.** Confirm Goldmane is enabled
 (`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the
 `FELIX_FLOWLOGSGOLDMANESERVER` env; confirm the `goldmane-edges-db-init` Job
 completed; confirm the aggregator pod is `Running` and not `ImagePullBackOff`
 (ghcr allowlist).
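 The three confirmations above can be collected in one pass. A minimal sketch, not an
 existing script: the function name is hypothetical, and it assumes `calico-node` runs
 in `calico-system` and that the `goldmane-edges-db-init` Job lives in the aggregator
 namespace.
 ```shell
 # Hypothetical triage helper for an empty edge table; prints one line per check.
 triage_no_edges() {
   local ns="goldmane-edge-aggregator" felix job avail
   # Felix env wired by the operator when the Goldmane CR exists.
   felix=$(kubectl get ds calico-node -n calico-system \
     -o jsonpath='{.spec.template.spec.containers[*].env[?(@.name=="FELIX_FLOWLOGSGOLDMANESERVER")].value}')
   # Number of successful completions of the DB-init Job (empty if none).
   job=$(kubectl get job goldmane-edges-db-init -n "$ns" -o jsonpath='{.status.succeeded}')
   # Same Available condition the cluster healthcheck reads.
   avail=$(kubectl get deploy goldmane-edge-aggregator -n "$ns" \
     -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')
   echo "felix_goldmane_server=${felix:-MISSING}"
   echo "db_init_succeeded=${job:-0}"
   echo "aggregator_available=${avail:-unknown}"
 }
 ```
 A `MISSING`, `0`, or non-`True` value points at the corresponding check in the
 paragraph above.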
 ## Related
 - [ADR-0014 — Service identity & east-west observability](../adr/0014-service-identity-and-east-west-observability.md)
 - [security.md — NetworkPolicy Default-Deny Egress + east-west flow observability](../architecture/security.md)
 - [monitoring.md — east-west flow observability + alerts](../architecture/monitoring.md)
 - [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md)
 - `CONTEXT.md` glossary — **Service identity**, **Goldmane / Whisker**
 - Code: `~/code/goldmane-edge-aggregator` (`README.md`, `DEPLOY.md`); stacks
  `stacks/goldmane-edge-aggregator`, `stacks/calico`
--- a/docs/runbooks/k8s-version-upgrade.md
+++ b/docs/runbooks/k8s-version-upgrade.md
@ -41,8 +41,6 @@ Job 0 — preflight       (pinned: k8s-node1)
  ├── halt-on-alert (kured-style ignore-list)
  ├── 24h-quiet baseline (no Ready transitions <24h ago)
  ├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume)
  ├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block)
  ├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups)
  ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
  ├── Trigger backup-etcd Job, wait, verify snapshot byte count
  ├── SSH master: containerd skew fix (if master < workers)
@ -224,34 +222,22 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names
 ## Common Operations
-### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24)
+### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19)
 `kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
-from kubeadm-config**. apiserver auth uses a structured multi-issuer
+and drops the `--authentication-config` flag**, silently disabling apiserver
-`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to
+OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get
-still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade
+401). This used to require a manual re-apply after **every** control-plane bump.
 reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does
 NOT crash on this — verified by isolated repro; it's recoverable via the restore
 script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue —
 etcd IO starvation**, not this drift; post-mortem:
 `docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`.
-**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now
+**Now automated:** the `rbac` stack publishes its OIDC restore script to the
-**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting
+`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's
-`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of
+`phase_master` re-runs it on master immediately after `kubeadm upgrade apply`
-its remote script. So kubeadm regenerates a **correct** manifest and the apiserver
+(while tigera-operator is still quiesced, so the flag-add apiserver restart can't
-upgrades with a pure image bump — `kubeadm upgrade diff <target>` shows only the
+crashloop the operator). It's idempotent, health-gates `/livez` with
-image change. Zero live impact (the CM is read only during an upgrade).
+auto-rollback, and is **non-fatal** — a failure only lags SSO until the next rbac
-
+apply (the version upgrade itself already succeeded). So a chain-driven
-**Backstops:**
+control-plane bump no longer breaks SSO. The master phase self-skips when master
- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does
+is already at target, so this only runs when master was actually upgraded.
  NOT block — the drift only breaks SSO, which is recoverable) if
  `--authentication-config` would still be dropped.
 - The `rbac` stack still publishes its restore script to the
  `kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on
  master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with
  auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also*
  re-reconciles kubeadm-config. Self-skips when master is already at target.
 **Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the
 chain logged `WARN: --authentication-config absent after re-apply`:
--- a/scripts/cluster_healthcheck.sh
+++ b/scripts/cluster_healthcheck.sh
@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
 [[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
 KUBECTL=""
 JSON_RESULTS=()
-TOTAL_CHECKS=48
+TOTAL_CHECKS=47
 # Parallel execution settings. Each check function is self-contained — it
 # only reads cluster state and mutates the in-memory counters / JSON_RESULTS
@ -3156,44 +3156,6 @@ PYEOF
    esac
 }
 # --- 48. Goldmane edge-aggregator availability ---
 #
 # The goldmane-edge-aggregator Deployment (ADR-0014 / infra #58) streams Calico
 # Goldmane flows into the goldmane_edges CNPG DB — the durable who-talks-to-whom
 # trail. The pod has NO /metrics endpoint, so its liveness can't be scraped;
 # this check reads the Deployment's Available condition directly so the trail
 # silently dying surfaces in the health board (mirrors the AggregatorDown
 # Prometheus alert). Missing Deployment / not-Available -> FAIL.
 check_goldmane_aggregator() {
    section 48 "Goldmane Edge-Aggregator"
    local ns="goldmane-edge-aggregator" dep="goldmane-edge-aggregator"
    local avail desired ready
    # One get; absent Deployment is a hard fail (the trail isn't deployed).
    if ! $KUBECTL get deploy "$dep" -n "$ns" >/dev/null 2>&1; then
        [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
        fail "Deployment $ns/$dep not found — who-talks-to-whom edge trail is not running"
        json_add "goldmane_aggregator" "FAIL" "deployment missing"
        return 0
    fi
    avail=$($KUBECTL get deploy "$dep" -n "$ns" \
        -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)
    ready=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.status.readyReplicas}' 2>/dev/null)
    desired=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.spec.replicas}' 2>/dev/null)
    ready=${ready:-0}
    desired=${desired:-0}
    if [[ "$avail" == "True" ]]; then
        pass "Edge-aggregator Available ($ready/$desired ready)"
        json_add "goldmane_aggregator" "PASS" "${ready}/${desired} ready"
    else
        [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
        fail "Edge-aggregator NOT Available ($ready/$desired ready) — edge trail has stopped recording"
        json_add "goldmane_aggregator" "FAIL" "${ready}/${desired} ready; Available=${avail:-unknown}"
    fi
 }
 # --- Summary ---
 print_summary() {
    if [[ "$JSON" == true ]]; then
@ -3262,7 +3224,7 @@ main() {
        check_monitoring_prom_am check_monitoring_vault check_monitoring_css
        check_external_replicas check_external_divergence check_pve_thermals
        check_pve_load check_external_traefik_5xx check_ha_status_dashboard
-        check_immich_search check_csi_ghost_drift check_goldmane_aggregator
+        check_immich_search check_csi_ghost_drift
    )
    # Auto-fix mutates cluster state inside individual checks — keep that
--- a/stacks/calico/main.tf
+++ b/stacks/calico/main.tf
@ -212,65 +212,3 @@ resource "kubectl_manifest" "whisker" {
    spec       = { notifications = "Disabled" }
  })
 }
 # ---------------------------------------------------------------------------
 # Gated public ingress for the Whisker UI (infra #57 / ADR-0014).
 #
 # whisker.viktorbarzin.me -> whisker:8081, Authentik-gated (auth="required":
 # Whisker ships NO own login — it's an admin observability UI, so Authentik
 # forward-auth is the only gate between strangers and the flow view). The
 # operator replicated `tls-secret` into calico-system already.
 #
 # TWO coupled pieces are required because the operator's own `whisker`
 # NetworkPolicy (owned by the Whisker CR above) sets policyTypes:[Ingress]
 # with NO ingress rules => default-deny on ingress to the whisker pod. The
 # additive NP below ORs in a Traefik allow (k8s NetworkPolicies are additive
 # across policies selecting the same pod), so we never edit the operator NP.
 module "ingress_whisker" {
  source          = "../../modules/kubernetes/ingress_factory"
  dns_type        = "proxied"
  namespace       = "calico-system"
  name            = "whisker"
  service_name    = "whisker"
  port            = 8081
  auth            = "required"
  tls_secret_name = "tls-secret"
  extra_annotations = {
    "gethomepage.dev/enabled"     = "true"
    "gethomepage.dev/name"        = "Whisker"
    "gethomepage.dev/description" = "Calico flow observability (who-talks-to-whom)"
    "gethomepage.dev/icon"        = "calico.png"
    "gethomepage.dev/group"       = "Infrastructure"
  }
 }
 # Additive NetworkPolicy: permit Traefik -> whisker:8081. ORs with the
 # operator's default-deny `whisker` NP (selecting the same pod) so Traefik
 # can reach the UI without touching the operator-owned policy.
 resource "kubernetes_network_policy_v1" "whisker_allow_traefik" {
  metadata {
    name      = "whisker-allow-traefik"
    namespace = "calico-system"
  }
  spec {
    pod_selector {
      match_labels = {
        "app.kubernetes.io/name" = "whisker"
      }
    }
    policy_types = ["Ingress"]
    ingress {
      from {
        namespace_selector {
          match_labels = {
            "kubernetes.io/metadata.name" = "traefik"
          }
        }
      }
      ports {
        port     = "8081"
        protocol = "TCP"
      }
    }
  }
 }
--- a/stacks/dbaas/modules/dbaas/main.tf
+++ b/stacks/dbaas/modules/dbaas/main.tf
@ -745,10 +745,7 @@ resource "kubernetes_deployment" "phpmyadmin" {
    labels = {
      "app" = "phpmyadmin"
      tier  = var.tier
-      # ADR-0014 service identity: dbaas is a multi-Service namespace, so the
+
      # namespace alone can't attribute Goldmane flows. Value = the fronting
      # Service name (kubernetes_service.phpmyadmin is named "pma").
      "service-identity" = "pma"
    }
    annotations = {
      "reloader.stakater.com/search" = "true"
@ -765,10 +762,6 @@ resource "kubernetes_deployment" "phpmyadmin" {
      metadata {
        labels = {
          "app" = "phpmyadmin"
          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
          # disambiguating identity must live on the pod template (not just
          # the Deployment metadata above). Not in selector → no replace.
          "service-identity" = "pma"
        }
      }
      spec {
@ -819,19 +812,8 @@ resource "kubernetes_deployment" "phpmyadmin" {
    }
  }
  lifecycle {
-    ignore_changes = [
+    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+    ignore_changes = [spec[0].template[0].spec[0].dns_config]
      # This Deployment is Keel-enrolled (keel.sh/policy=patch). Ignore the
      # attributes Keel/Kyverno mutate at runtime so `terragrunt apply` (incl.
      # the daily drift plan) doesn't fight them or revert the live image —
      # canonical KEEL/KYVERNO lifecycle guard, matches linkwarden/chrome-service.
      metadata[0].annotations["keel.sh/policy"],
      metadata[0].annotations["keel.sh/trigger"],
      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
      metadata[0].annotations["keel.sh/match-tag"],
      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
    ]
  }
 }
@ -1517,10 +1499,6 @@ resource "kubernetes_deployment" "pgadmin" {
    }
    labels = {
      tier = var.tier
      # ADR-0014 service identity: dbaas is a multi-Service namespace, so the
      # namespace alone can't attribute Goldmane flows. Value = the fronting
      # Service name (kubernetes_service.pgadmin is named "pgadmin").
      "service-identity" = "pgadmin"
    }
  }
  spec {
@ -1536,10 +1514,6 @@ resource "kubernetes_deployment" "pgadmin" {
      metadata {
        labels = {
          app = "pgadmin"
          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
          # disambiguating identity must live on the pod template (not just
          # the Deployment metadata above). Not in selector → no replace.
          "service-identity" = "pgadmin"
        }
      }
      spec {
@ -1594,20 +1568,8 @@ resource "kubernetes_deployment" "pgadmin" {
    }
  }
  lifecycle {
-    ignore_changes = [
+    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+    ignore_changes = [spec[0].template[0].spec[0].dns_config]
      # This Deployment is Keel-enrolled (keel.sh/policy=patch) and Keel has
      # bumped the live image (dpage/pgadmin4:9.16). Ignore the Keel/Kyverno
      # runtime-mutated attributes so `terragrunt apply` (incl. the daily drift
      # plan) doesn't revert the image to bare `dpage/pgadmin4` or strip Keel's
      # annotations — canonical guard, matches linkwarden/chrome-service.
      metadata[0].annotations["keel.sh/policy"],
      metadata[0].annotations["keel.sh/trigger"],
      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
      metadata[0].annotations["keel.sh/match-tag"],
      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
    ]
  }
 }
 resource "kubernetes_service" "pgadmin" {
--- a/stacks/goldmane-edge-aggregator/main.tf
+++ b/stacks/goldmane-edge-aggregator/main.tf
@ -57,19 +57,16 @@ resource "kubernetes_namespace" "goldmane_edge_aggregator" {
 # -----------------------------------------------------------------------------
 # The aggregator dials goldmane:7443 over mutual TLS. We mint a client cert
 # signed by the Tigera CA (the same CA that issues Goldmane's serving cert), so
-# Goldmane requires mutual TLS on :7443 and verifies the client cert chains to
-# the Tigera CA — it does NOT authorize by client identity, so ANY Tigera-CA-
-# signed cert is accepted. Rather than copy the Tigera CA PRIVATE KEY into TF
-# state to mint our own (a needless CA-key exposure; the hashicorp/tls provider
-# is also incompatible with this repo's global generate-providers/lockfile
-# pattern), we REUSE the operator-minted, Tigera-CA-signed client cert
-# `whisker-backend-key-pair` (calico-system). We never touch the CA key.
-# Trade-off: if the operator rotates that cert, re-apply to re-sync (hardening
-# follow-up: mint an own-identity cert in-namespace if Whisker is ever removed).
-data "kubernetes_secret" "whisker_backend" {
+# Goldmane trusts the client and the client trusts Goldmane's server cert via
+# the published CA bundle.
+#
+# The Tigera CA private key lives in the `tigera-ca-private` Secret in
+# tigera-operator (Opaque; verified keys: tls.crt + tls.key). The stack's apply
+# identity needs RBAC get on that secret — see the Role/RoleBinding below.
+data "kubernetes_secret" "tigera_ca" {
  metadata {
-    name      = "whisker-backend-key-pair"
-    namespace = "calico-system"
+    name      = "tigera-ca-private"
+    namespace = "tigera-operator"
  }
 }
@ -96,11 +93,46 @@ resource "kubernetes_config_map" "tigera_ca_bundle" {
  data = data.kubernetes_config_map.tigera_ca_bundle.data
 }
-# Client cert + key for mTLS to goldmane:7443, mounted at TLS_CERT_PATH /
-# TLS_KEY_PATH defaults (/etc/goldmane-client-tls/tls.crt and .../tls.key).
-# Sourced verbatim from the operator's whisker-backend client key-pair (read
-# above) — already Tigera-CA-signed, which is all Goldmane verifies. No CA key
-# is touched and no cross-namespace CA RBAC is needed.
+# Client private key.
+resource "tls_private_key" "goldmane_client" {
+  algorithm = "RSA"
+  rsa_bits  = 2048
+}
 # CSR for the client cert. CN identifies the client; the service-DNS SAN mirrors
 # how Felix/whisker-backend present a client identity to Goldmane.
 resource "tls_cert_request" "goldmane_client" {
  private_key_pem = tls_private_key.goldmane_client.private_key_pem
  subject {
    common_name  = "goldmane-edge-aggregator"
    organization = "goldmane-edge-aggregator"
  }
  dns_names = [
    "goldmane-edge-aggregator",
    "goldmane-edge-aggregator.goldmane-edge-aggregator.svc.cluster.local",
  ]
 }
 # Sign the CSR with the Tigera CA. 10-year validity (87600h): re-apply rotates
 # it well before expiry; a long horizon avoids surprise mTLS outages from an
 # unattended stack. The Tigera CA itself outlives this (operator-managed).
 resource "tls_locally_signed_cert" "goldmane_client" {
  cert_request_pem   = tls_cert_request.goldmane_client.cert_request_pem
  ca_private_key_pem = data.kubernetes_secret.tigera_ca.data["tls.key"]
  ca_cert_pem        = data.kubernetes_secret.tigera_ca.data["tls.crt"]
  validity_period_hours = 87600 # 10y
  early_renewal_hours   = 720   # re-sign on apply when <30d remain
  allowed_uses = [
    "client_auth",
    "digital_signature",
    "key_encipherment",
  ]
 }
 # The minted client cert + key, mounted at TLS_CERT_PATH / TLS_KEY_PATH defaults
 # (/etc/goldmane-client-tls/tls.crt and .../tls.key).
 resource "kubernetes_secret" "goldmane_client_tls" {
  metadata {
    name      = "goldmane-client-tls"
@ -108,8 +140,47 @@ resource "kubernetes_secret" "goldmane_client_tls" {
  }
  type = "Opaque"
  data = {
-    "tls.crt" = data.kubernetes_secret.whisker_backend.data["tls.crt"]
-    "tls.key" = data.kubernetes_secret.whisker_backend.data["tls.key"]
+    "tls.crt" = tls_locally_signed_cert.goldmane_client.cert_pem
+    "tls.key" = tls_private_key.goldmane_client.private_key_pem
  }
 }
 # Narrow RBAC so this stack's apply identity (and ESO/Reloader are unaffected)
 # can `get` the Tigera CA private key in tigera-operator. The data source above
 # reads it at apply time; this Role/RoleBinding documents + grants that access
 # rather than relying on cluster-admin. The subject is the same SA the other
 # Tier-1 stacks apply as (claude-agent/terraform-state for headless, the human
 # OIDC identity interactively) — both are cluster-admin today, so this is
 # belt-and-braces / least-privilege intent for when apply identities tighten.
 resource "kubernetes_role" "read_tigera_ca" {
  metadata {
    name      = "goldmane-edge-aggregator-read-tigera-ca"
    namespace = "tigera-operator"
  }
  rule {
    api_groups     = [""]
    resources      = ["secrets"]
    resource_names = ["tigera-ca-private"]
    verbs          = ["get"]
  }
 }
 resource "kubernetes_role_binding" "read_tigera_ca" {
  metadata {
    name      = "goldmane-edge-aggregator-read-tigera-ca"
    namespace = "tigera-operator"
  }
  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role.read_tigera_ca.metadata[0].name
  }
  # The headless apply identity (claude-agent-service runs Tier-1 applies as the
  # `terraform-state` Vault K8s role in the claude-agent namespace).
  subject {
    kind      = "ServiceAccount"
    name      = "default"
    namespace = "claude-agent"
  }
 }
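Reviewer's note: the validity figures in the `tls_locally_signed_cert` comment can be checked with a few lines of arithmetic. This is a minimal sketch and not part of the stack. It assumes `early_renewal_hours` behaves as the hashicorp/tls provider documents it (the cert is treated as expired that many hours before its real expiry, so the next apply re-signs it). `renewal_due` and the sample issue date are hypothetical.

```python
from datetime import datetime, timedelta, timezone

VALIDITY_HOURS = 87600       # validity_period_hours in the diff
EARLY_RENEWAL_HOURS = 720    # early_renewal_hours in the diff


def renewal_due(issued_at: datetime, now: datetime) -> bool:
    """True once `now` falls inside the early-renewal window before expiry."""
    not_after = issued_at + timedelta(hours=VALIDITY_HOURS)
    return now >= not_after - timedelta(hours=EARLY_RENEWAL_HOURS)


validity_days = VALIDITY_HOURS // 24             # 3650 days, i.e. 10 x 365
renewal_window_days = EARLY_RENEWAL_HOURS // 24  # 30 days

issued = datetime(2026, 6, 24, tzinfo=timezone.utc)  # hypothetical issue date

if __name__ == "__main__":
    print(validity_days, renewal_window_days)
    print(renewal_due(issued, issued + timedelta(days=3620)))
```

Under that assumption a re-sign only happens when someone runs an apply inside the last 30 days of the 10-year window, or later. A stack left unattended past expiry still breaks mTLS until the next apply.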
@ -156,11 +227,6 @@ resource "kubernetes_job" "db_init" {
  timeouts {
    create = "2m"
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: Kyverno injects dns_config (ndots=2); ignore it so
    # this idempotent Job isn't replaced (Jobs are immutable) on every apply.
    ignore_changes = [spec[0].template[0].spec[0].dns_config]
  }
 }
 # ExternalSecret projecting the Vault-rotated (7-day) credential into a K8s
@ -229,7 +295,7 @@ resource "kubernetes_manifest" "slack_external_secret" {
      data = [{
        secretKey = "SLACK_WEBHOOK_URL"
        remoteRef = {
-          key      = "viktor"
+          key      = "monitoring"
          property = "alertmanager_slack_api_url"
        }
      }]
@ -450,15 +516,7 @@ resource "kubernetes_cron_job_v1" "digest" {
              }
              env {
                name  = "SLACK_CHANNEL"
-                # The shared alertmanager_slack_api_url incoming webhook's Slack
-                # app is NOT a member of #security, so overriding the channel to
-                # it returns HTTP 404 channel_not_found (verified 2026-06-25).
-                # alertmanager's own slack-security receiver shares this webhook
-                # and almost certainly hits the same wall. Post to #alerts (the
-                # webhook's working channel, same as alert-digest) until the app
-                # is invited to #security, then flip this back. See
-                # docs/runbooks/goldmane-flow-trail.md.
-                value = "#alerts"
+                value = "#security"
              }
              resources {
--- a/stacks/instagram-poster/modules/instagram-poster/main.tf
+++ b/stacks/instagram-poster/modules/instagram-poster/main.tf
@ -35,14 +35,6 @@ resource "kubernetes_namespace" "instagram_poster" {
 #     - immich_tag_instagram      (optional — auto-resolved if missing)
 #     - immich_tag_posted         (optional — auto-resolved if missing)
 resource "kubernetes_manifest" "external_secret" {
  # The external-secrets controller takes server-side-apply ownership of
  # .spec.refreshInterval, so a plain TF apply conflicts. force_conflicts lets
  # TF win (values match, so it's stable) — same pattern as grafana/woodpecker/
  # traefik/k8s-version-upgrade. Surfaced 2026-06-24 by the first IG apply since
  # the ESO v1 migration (the scale-to-0 push).
  field_manager {
    force_conflicts = true
  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -147,11 +139,6 @@ resource "kubernetes_manifest" "external_secret" {
 # ESO refreshes the K8s Secret every 15m. `reloader.stakater.com/match`
 # bounces the pod when the password changes.
 resource "kubernetes_manifest" "benchmark_db_external_secret" {
  # See external_secret above — ESO owns .spec.refreshInterval; force_conflicts
  # lets the TF apply win instead of erroring on the field-manager conflict.
  field_manager {
    force_conflicts = true
  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -240,11 +227,7 @@ resource "kubernetes_deployment" "instagram_poster" {
  }
  spec {
-    # Scaled to 0 (2026-06-24): Instagram Graph integration is unused and its
-    # ExternalSecret is dead (missing ig_graph_long_lived_token /
-    # ig_business_account_id in Vault secret/instagram-poster). Set back to 1
-    # after minting a Meta long-lived token and populating those keys.
-    replicas = 0
+    replicas = 1
    # RWO PVC — cannot rolling-update.
    strategy {
      type = "Recreate"
--- a/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
+++ b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
@ -416,39 +416,6 @@ phase_preflight() {
    fi
  fi
  # 4b. apiserver-OIDC drift check (backstop for the rbac stack's kubeadm-config
  # reconciliation). A `kubeadm upgrade` REGENERATES the apiserver manifest from
  # kubeadm-config; if kubeadm-config still carries the legacy single-issuer
  # --oidc-* args instead of --authentication-config, the regenerated apiserver
  # loses structured multi-issuer auth → kubectl + dashboard SSO break AFTER the
  # upgrade. This is RECOVERABLE (the apiserver does NOT crash — verified by an
  # isolated repro 2026-06-24; the chain's post-master restore.sh re-adds the flag,
  # and the rbac stack reconciles kubeadm-config so it won't recur) — so this is an
  # ALERT, not a block. (NB the 2026-06-24 stall was NOT this — it was etcd IO
  # starvation; see docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md.)
  # Skip on an at-target master (resume — no apiserver regen).
  if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
    local apiserver_diff
    apiserver_diff=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" "sudo kubeadm upgrade diff v$TARGET_VERSION 2>/dev/null" || true)
    if echo "$apiserver_diff" | grep -qE '^-[[:space:]].*--authentication-config'; then
      slack "WARN preflight — kubeadm upgrade will DROP --authentication-config (kubeadm-config OIDC drift). SSO breaks post-upgrade until restore.sh re-adds it; re-apply the rbac stack to reconcile kubeadm-config. Proceeding (recoverable, not a crash)."
    fi
  fi
  # 4c. Reclaim kubeadm scratch on master. `kubeadm upgrade apply` dumps a full
  # ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
  # every etcd upgrade and NEVER cleans it up — 145 dirs / 28GB had accumulated by
  # 2026-06-24, pushing master root fs to 73% (image-GC churn + extra write IO on
  # the shared HDD where etcd lives — a contributor to the etcd IO starvation that
  # stalled that run, see post-mortem). Real etcd backups go to NFS, so these are
  # throwaway. Prune ones >3 days old (keeps a short rollback window). Best-effort;
  # never aborts the chain.
  if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
    ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" \
      "sudo find /etc/kubernetes/tmp -maxdepth 1 -type d \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) -mtime +3 -exec rm -rf {} + 2>/dev/null; echo -n 'master root after prune: '; df -h / | awk 'NR==2{print \$5\" used, \"\$4\" free\"}'" \
      || echo "kubeadm-scratch prune skipped (ssh/df failed) — non-fatal"
  fi
  # 5. Push in-flight + started_timestamp metrics + ns annotations
  $KUBECTL annotate ns "$NS" \
    "viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \
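Reviewer's note on step 4c: `find -mtime +3` counts whole 24-hour periods and drops the fraction, so a directory is pruned only once it is at least four full days old. That is slightly more conservative than ">3 days" suggests. The sketch below mirrors that selection rule in Python for illustration. `would_prune` is a hypothetical helper and the directory names in the usage lines are made up.

```python
import fnmatch

# Name patterns taken from the find command in step 4c.
PATTERNS = ("kubeadm-backup-*", "kubeadm-upgraded-manifests*")
SECONDS_PER_DAY = 86400


def would_prune(dirname: str, age_seconds: float, mtime_days: int = 3) -> bool:
    """Mirror `find -name ... -mtime +N`: whole days of age must exceed N."""
    name_matches = any(fnmatch.fnmatchcase(dirname, p) for p in PATTERNS)
    whole_days = int(age_seconds // SECONDS_PER_DAY)  # find ignores the fraction
    return name_matches and whole_days > mtime_days


if __name__ == "__main__":
    # 3.5 days old: kept. 4 days old: pruned.
    print(would_prune("kubeadm-backup-etcd-example", 3.5 * SECONDS_PER_DAY))
    print(would_prune("kubeadm-backup-etcd-example", 4 * SECONDS_PER_DAY))
```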
--- a/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf
+++ b/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf
@ -130,11 +130,6 @@ resource "kubernetes_deployment" "blackbox_exporter" {
    labels = {
      app  = "blackbox-exporter"
      tier = var.tier
      # ADR-0014 service identity: monitoring is a multi-Service namespace, so
      # the namespace alone can't attribute Goldmane flows. Value = the
      # fronting Service name (kubernetes_service.blackbox_exporter is named
      # "blackbox-exporter").
      "service-identity" = "blackbox-exporter"
    }
    annotations = {
      "reloader.stakater.com/search" = "true"
@ -151,10 +146,6 @@ resource "kubernetes_deployment" "blackbox_exporter" {
      metadata {
        labels = {
          app = "blackbox-exporter"
          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
          # disambiguating identity must live on the pod template (not just
          # the Deployment metadata above). Not in selector → no replace.
          "service-identity" = "blackbox-exporter"
        }
      }
      spec {
--- a/stacks/monitoring/modules/monitoring/goflow2.tf
+++ b/stacks/monitoring/modules/monitoring/goflow2.tf
@ -5,11 +5,6 @@ resource "kubernetes_deployment" "goflow2" {
    labels = {
      app  = "goflow2"
      tier = var.tier
      # ADR-0014 service identity: monitoring is a multi-Service namespace, so
      # the namespace alone can't attribute Goldmane flows. Value = the
      # fronting Service name (kubernetes_service.goflow2 — the metrics svc; the
      # goflow2-netflow NodePort is the same pod by another name).
      "service-identity" = "goflow2"
    }
  }
  spec {
@ -23,10 +18,6 @@ resource "kubernetes_deployment" "goflow2" {
      metadata {
        labels = {
          app = "goflow2"
          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
          # disambiguating identity must live on the pod template (not just
          # the Deployment metadata above). Not in selector → no replace.
          "service-identity" = "goflow2"
        }
      }
      spec {
--- a/stacks/monitoring/modules/monitoring/grafana.tf
+++ b/stacks/monitoring/modules/monitoring/grafana.tf
@ -71,15 +71,6 @@ resource "kubernetes_persistent_volume" "alertmanager_pv" {
 # DB credentials from Vault database engine (rotated automatically)
 # Provides GF_DATABASE_PASSWORD that auto-updates when password rotates
 resource "kubernetes_manifest" "grafana_db_creds" {
  # The external-secrets controller takes server-side-apply ownership of
  # .spec.refreshInterval, so a plain TF apply conflicts ("conflict with
  # external-secrets ... .spec.refreshInterval"). force_conflicts lets TF win
  # (values match, so it's stable) — same pattern as the woodpecker/traefik/
  # k8s-version-upgrade stacks. Surfaced 2026-06-24: the first monitoring apply
  # in a while exposed this latent conflict (prior pushes were docs-only).
  field_manager {
    force_conflicts = true
  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/monitoring/modules/monitoring/idrac.tf
+++ b/stacks/monitoring/modules/monitoring/idrac.tf
@ -47,10 +47,6 @@ resource "kubernetes_deployment" "idrac-redfish" {
    labels = {
      app  = "idrac-redfish-exporter"
      tier = var.tier
      # ADR-0014 service identity: monitoring is a multi-Service namespace, so
      # the namespace alone can't attribute Goldmane flows. Value = the
      # fronting Service name (kubernetes_service.idrac-redfish-exporter).
      "service-identity" = "idrac-redfish-exporter"
    }
    annotations = {
      "reloader.stakater.com/search" = "true"
@ -67,10 +63,6 @@ resource "kubernetes_deployment" "idrac-redfish" {
      metadata {
        labels = {
          app = "idrac-redfish-exporter"
          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
          # disambiguating identity must live on the pod template (not just
          # the Deployment metadata above). Not in selector → no replace.
          "service-identity" = "idrac-redfish-exporter"
        }
      }
      spec {
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@ -1450,49 +1450,6 @@ serverFiles:
                Remediation: right-size top reservers via Goldilocks (immich-server,
                frigate, prometheus, pg-cluster, paperless) or bump VM RAM on
                k8s-node2/k8s-node3 from 32GB → 48GB to match node1.
      # Goldmane edge-aggregator (ADR-0014 / infra #58, #61): the durable
      # who-talks-to-whom trail. The aggregator pod has NO /metrics endpoint,
      # so its health is inferred from kube-state-metrics signals — the trail
      # must not silently die. Two failure modes are covered:
      #   - the aggregate Deployment stops consuming Goldmane's flow stream
      #     (AggregatorDown) → no new edges ever land in the goldmane_edges DB
      #   - the daily digest CronJob can't post new edges to Slack
      #     (DigestFailing) → edges still land but nobody is told.
      # A freshness probe (max(last_seen) staleness) is intentionally NOT here:
      # AggregatorDown is the agreed floor and needs no extra moving parts.
      - name: Network Observability (Goldmane)
        rules:
          # Deployment has <1 available replica for 15m. kube-state-metrics
          # keeps `kube_deployment_status_replicas_available` (metric-keep list
          # in serverFiles below). The 15m window rides out a normal rollout /
          # node drain without paging; a genuinely-dead aggregator means the
          # edge trail has stopped recording and stays down.
          - alert: AggregatorDown
            expr: |
              kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1
              and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "goldmane-edge-aggregator has no available replica — the who-talks-to-whom edge trail has stopped recording"
              description: "The aggregate Deployment streams Calico Goldmane flows into the goldmane_edges CNPG DB. With 0 replicas, no new namespace-pair edges are captured. `kubectl -n goldmane-edge-aggregator describe deploy goldmane-edge-aggregator` + check the goldmane svc (calico-system) is reachable."
          # The goldmane-edges-digest CronJob has a failed Job that started in
          # the last 24h. Mirrors the generic JobFailed shape but scoped to the
          # digest so it routes here. `for: 30m` rides out the apply/scrape
          # transient; the digest runs daily so a real failure won't self-heal
          # until the next run — surface it same-day rather than waiting 24h.
          - alert: DigestFailing
            expr: |
              kube_job_status_failed{namespace="goldmane-edge-aggregator", job_name=~"goldmane-edges-digest.*"} > 0
              and on(namespace, job_name)
              (time() - kube_job_status_start_time{namespace="goldmane-edge-aggregator", job_name=~"goldmane-edges-digest.*"}) < 86400
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: "goldmane-edges-digest CronJob failing — new edges captured but not posted to #security"
              description: "The daily edge digest Job {{ $labels.job_name }} failed. Edges may still be landing in the goldmane_edges DB but no one is being notified of new namespace-pairs. `kubectl -n goldmane-edge-aggregator logs job/{{ $labels.job_name }}`."
      - name: Infrastructure Health
        rules:
          - alert: HomeAssistantDown
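Reviewer's note on `DigestFailing`: the expression fires for a digest Job that has at least one failed pod and started less than 24 hours ago. The sketch below restates that predicate in Python for a single Job sample. It is an illustration only: `JobSample` and `digest_failing` are hypothetical names, and the `for: 30m` hold period is not modelled.

```python
from dataclasses import dataclass

DAY_SECONDS = 86400


@dataclass
class JobSample:
    job_name: str
    failed: int        # value of kube_job_status_failed
    start_time: float  # value of kube_job_status_start_time (unix seconds)


def digest_failing(sample: JobSample, now: float) -> bool:
    """Failed digest Job whose start time is within the last 24h."""
    return (
        # PromQL regex matchers are fully anchored, so "prefix.*" is a prefix test.
        sample.job_name.startswith("goldmane-edges-digest")
        and sample.failed > 0
        and (now - sample.start_time) < DAY_SECONDS
    )


if __name__ == "__main__":
    now = 1_000_000.0
    recent = JobSample("goldmane-edges-digest-001", failed=1, start_time=now - 3600)
    print(digest_failing(recent, now))
```

The 24-hour bound is what lets the alert clear on its own once a failed Job ages out, even if the failed Job object is never deleted.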
--- a/stacks/monitoring/modules/monitoring/pve_exporter.tf
+++ b/stacks/monitoring/modules/monitoring/pve_exporter.tf
@ -22,10 +22,6 @@ resource "kubernetes_deployment" "pve_exporter" {
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      tier = var.tier
      # ADR-0014 service identity: monitoring is a multi-Service namespace, so
      # the namespace alone can't attribute Goldmane flows. Value = the
      # fronting Service name (kubernetes_service.proxmox-exporter).
      "service-identity" = "proxmox-exporter"
    }
  }
@ -41,10 +37,6 @@ resource "kubernetes_deployment" "pve_exporter" {
      metadata {
        labels = {
          app = "proxmox-exporter"
          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
          # disambiguating identity must live on the pod template (not just
          # the Deployment metadata above). Not in selector → no replace.
          "service-identity" = "proxmox-exporter"
        }
      }
--- a/stacks/monitoring/modules/monitoring/snmp_exporter.tf
+++ b/stacks/monitoring/modules/monitoring/snmp_exporter.tf
@ -31,10 +31,6 @@ resource "kubernetes_deployment" "snmp-exporter" {
    labels = {
      app  = "snmp-exporter"
      tier = var.tier
      # ADR-0014 service identity: monitoring is a multi-Service namespace, so
      # the namespace alone can't attribute Goldmane flows. Value = the
      # fronting Service name (kubernetes_service.snmp-exporter).
      "service-identity" = "snmp-exporter"
    }
    annotations = {
      "reloader.stakater.com/search" = "true"
@ -51,10 +47,6 @@ resource "kubernetes_deployment" "snmp-exporter" {
      metadata {
        labels = {
          app = "snmp-exporter"
          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
          # disambiguating identity must live on the pod template (not just
          # the Deployment metadata above). Not in selector → no replace.
          "service-identity" = "snmp-exporter"
        }
      }
      spec {
--- a/stacks/rbac/modules/rbac/apiserver-oidc.tf
+++ b/stacks/rbac/modules/rbac/apiserver-oidc.tf
@ -10,29 +10,16 @@
 # match the existing RBAC subjects (kind: User, name: <raw email>; group names
 # verbatim). Do NOT add a prefix or existing bindings break.
 #
-# DRIFT WARNING (and how it's now handled): apiserver auth lives in THREE places
-# that must stay in sync, because a `kubeadm upgrade` REGENERATES the static-pod
-# manifest from kubeadm-config:
-#   1. /etc/kubernetes/pki/auth-config.yaml         — the structured authn file
-#   2. the live kube-apiserver static-pod manifest  — references it via the flag
-#   3. the kubeadm-config ClusterConfiguration CM   — what kubeadm regenerates from
-# Originally only (1)+(2) were managed, so every kubeadm upgrade rewrote the
-# manifest from the STALE CM, reverting --authentication-config to single-issuer
-# --oidc-* flags. The consequence is SSO breakage AFTER the upgrade: kubectl +
-# dashboard lose multi-issuer auth (the apiserver does NOT crash on this — verified
-# by an isolated repro 2026-06-24; the 2026-06-24 v1.35 upgrade *stall* was a
-# separate etcd IO-starvation issue, see
-# docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md). The
-# remote script below now ALSO reconciles (3) via `kubeadm init phase
-# upload-config`, so a future kubeadm upgrade regenerates a CORRECT manifest. The
-# k8s-version-upgrade chain additionally ALERTS (does not block — SSO drift is
-# recoverable) via `kubeadm upgrade diff` in preflight if --authentication-config
-# would still be dropped.
+# DRIFT WARNING: this edits the kube-apiserver static-pod manifest on the single
+# master. A `kubeadm upgrade` regenerates that manifest and DROPS this flag (this
+# is exactly how OIDC silently broke before — the flag was wiped and the
+# content-hash trigger never re-fired). After any k8s control-plane upgrade,
+# re-apply the rbac stack to restore apiserver OIDC. See
+# docs/plans/2026-06-04-k8s-dashboard-sso-design.md.
 #
 # SAFETY: the remote script health-gates on /livez and AUTO-ROLLS-BACK the
 # manifest from a timestamped backup if the apiserver does not recover, so a
-# malformed config cannot leave the single master down. Reconciling kubeadm-config
-# is zero-impact on the running cluster (the CM is only read during an upgrade).
+# malformed config cannot leave the single master down.
 variable "k8s_master_host" {
  type    = string
@ -110,40 +97,6 @@ locals {
    print('flag-inserted' if done else 'ANCHOR-NOT-FOUND')
  PY
  # Reconciles the kubeadm-config ClusterConfiguration's apiServer.extraArgs:
  # drops the stale single-issuer --oidc-* args and ensures --authentication-config
  # is present (anchored after --authorization-mode). Stdlib-only (the master is
  # only guaranteed python3, not pyyaml/yq). Idempotent; preserves all other
  # fields (etcd args, audit args, extraVolumes) verbatim. Exits 3 if the
  # authorization-mode anchor is missing (fail loud, leave the CM untouched).
  kubeadm_oidc_reconcile_py = <<-PY
    import sys
    lines = sys.stdin.read().split('\n')
    out, i, n = [], 0, len(lines)
    have_authn = any('name: authentication-config' in l for l in lines)
    inserted = have_authn
    while i < n:
        ln = lines[i]; s = ln.strip()
        if s.startswith('- name: oidc-'):
            i += 2 if (i + 1 < n and lines[i + 1].strip().startswith('value:')) else 1
            continue
        out.append(ln)
        if (not inserted) and s == '- name: authorization-mode':
            indent = ln[:len(ln) - len(ln.lstrip())]
            if i + 1 < n and lines[i + 1].strip().startswith('value:'):
                out.append(lines[i + 1]); i += 2
            else:
                i += 1
            out.append(indent + '- name: authentication-config')
            out.append(indent + '  value: /etc/kubernetes/pki/auth-config.yaml')
            inserted = True
            continue
        i += 1
    if not inserted:
        sys.stderr.write('ANCHOR-NOT-FOUND: authorization-mode\n'); sys.exit(3)
    sys.stdout.write('\n'.join(out))
  PY
  # Whole remote operation, base64-embedded for byte-exact transfer (no
  # heredoc/escaping hazards across SSH).
  apiserver_auth_remote_script = <<-SH
@ -184,30 +137,6 @@ locals {
      echo "rolled back to previous manifest"; exit 1
    fi
    echo "kube-apiserver healthy with multi-issuer --authentication-config"
    # 5. Reconcile kubeadm-config so a FUTURE `kubeadm upgrade` regenerates the
    #    apiserver manifest WITH --authentication-config instead of reverting to
    #    the stale single-issuer --oidc-* flags. Without this, kubeadm rewrote the
    #    manifest from kubeadm-config on every control-plane upgrade and the
    #    regenerated apiserver crash-looped (the 2026-06-24 v1.35 upgrade stall).
    #    Zero live impact (the CM is only read at upgrade time); idempotent;
    #    best-effort (the chain's `kubeadm upgrade diff` preflight gate is the
    #    backstop if this cannot run).
    KC="sudo kubectl --kubeconfig /etc/kubernetes/admin.conf"
    CC=$($KC -n kube-system get cm kubeadm-config -o jsonpath='{.data.ClusterConfiguration}' 2>/dev/null || true)
    if [ -n "$CC" ] && { echo "$CC" | grep -q 'oidc-issuer-url' || ! echo "$CC" | grep -q 'authentication-config'; }; then
      echo "Reconciling kubeadm-config (oidc-* -> authentication-config) so kubeadm upgrade keeps structured auth"
      echo '${base64encode(local.kubeadm_oidc_reconcile_py)}' | base64 -d > /tmp/reconcile_kubeadm_oidc.py
      if printf '%s' "$CC" | python3 /tmp/reconcile_kubeadm_oidc.py > /tmp/kubeadm-cc-new.yaml \
         && sudo kubeadm init phase upload-config kubeadm --config /tmp/kubeadm-cc-new.yaml; then
        echo "kubeadm-config reconciled: future control-plane upgrades keep --authentication-config"
      else
        echo "WARN: kubeadm-config reconcile failed; the upgrade-chain preflight gate will block the next upgrade"
      fi
      rm -f /tmp/reconcile_kubeadm_oidc.py /tmp/kubeadm-cc-new.yaml
    else
      echo "kubeadm-config already uses --authentication-config (no oidc drift)"
    fi
  SH
 }
@ -226,14 +155,6 @@ resource "null_resource" "apiserver_oidc_config" {
  }
  triggers = {
    # Intentionally hash ONLY the issuer config, NOT the remote script. CI applies
    # the rbac stack with no ssh_private_key (var defaults to ""), so a re-run of
    # this SSH provisioner in CI would fail — hence the null_resource must stay a
    # no-op on a plain CI apply. Script changes (e.g. the 2026-06-24 kubeadm-config
    # reconciliation) reach the cluster via the apiserver-oidc-restore ConfigMap
    # below (a plain k8s resource, no ssh) which the upgrade chain re-runs. To force
    # this provisioner to re-run after a script change, apply locally with
    # `-replace` + TF_VAR_ssh_private_key (see docs/runbooks/k8s-version-upgrade.md).
    auth_config = sha256(local.apiserver_auth_config_yaml)
  }
 }
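Reviewer's note: the `kubeadm_oidc_reconcile_py` logic that appears in this diff can be exercised without a cluster. The sketch below wraps the same line-based rewrite in a function. It drops every `- name: oidc-*` entry (with its `value:` line) and inserts `authentication-config` right after `authorization-mode`. The sample ClusterConfiguration fragment and issuer URL are made up, and raising `ValueError` stands in for the script's `sys.exit(3)`.

```python
def reconcile(text: str) -> str:
    """Rewrite apiServer.extraArgs: remove oidc-* args, add authentication-config."""
    lines = text.split("\n")
    out, i, n = [], 0, len(lines)
    # Already present means nothing to insert (this keeps the rewrite idempotent).
    inserted = any("name: authentication-config" in l for l in lines)
    while i < n:
        ln = lines[i]
        s = ln.strip()
        if s.startswith("- name: oidc-"):
            # Skip the arg and its value line, if one follows.
            i += 2 if (i + 1 < n and lines[i + 1].strip().startswith("value:")) else 1
            continue
        out.append(ln)
        if not inserted and s == "- name: authorization-mode":
            indent = ln[: len(ln) - len(ln.lstrip())]
            if i + 1 < n and lines[i + 1].strip().startswith("value:"):
                out.append(lines[i + 1])
                i += 2
            else:
                i += 1
            out.append(indent + "- name: authentication-config")
            out.append(indent + "  value: /etc/kubernetes/pki/auth-config.yaml")
            inserted = True
            continue
        i += 1
    if not inserted:
        raise ValueError("ANCHOR-NOT-FOUND: authorization-mode")
    return "\n".join(out)


# Hypothetical input: a stale single-issuer configuration.
SAMPLE = "\n".join([
    "apiServer:",
    "  extraArgs:",
    "  - name: authorization-mode",
    "    value: Node,RBAC",
    "  - name: oidc-issuer-url",
    "    value: https://issuer.example",
    "  - name: oidc-client-id",
    "    value: kubernetes",
])

if __name__ == "__main__":
    print(reconcile(SAMPLE))
```

Running the function on its own output returns the text unchanged, which matches the "idempotent" claim in the script's comment.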
--- a/state/stacks/dbaas/terraform.tfstate.enc
+++ b/state/stacks/dbaas/terraform.tfstate.enc
--- a/state/stacks/vault/terraform.tfstate.enc
+++ b/state/stacks/vault/terraform.tfstate.enc