WIP: goldmane-edge-aggregator deploy stack + vault role + ghcr allowlist (infra #58)

NOT APPLIED. Staged for a fresh-session finish (see memory runbook).

Contains:
- stacks/goldmane-edge-aggregator/{main.tf,terragrunt.hcl}: namespace, TF-minted mTLS client cert from tigera-ca-private, goldmane_edges PG DB-init Job, db + slack ExternalSecrets, aggregate Deployment + digest CronJob.
- stacks/vault/main.tf: pg-goldmane-edges static rotation role (Tier-0).
- stacks/kyverno/.../ghcr-credentials.tf: ns added to the private-image allowlist.

KNOWN BLOCKER: the stack uses the hashicorp/tls provider (cert minting) but the root terragrunt.hcl generate "k8s_providers" block doesn't declare it, and a second required_providers (the removed versions.tf) is illegal. FIX = add tls to that global block (mirrors proxmox/kubectl).

Then apply order: db_init (creates goldmane_edges role) -> kyverno -> vault (Tier-0, plan-review) -> stack ExternalSecrets (targeted, first-apply) -> stack full -> verify mTLS to goldmane:7443.

Vault KV secret/goldmane-edge-aggregator already created.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
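
A minimal sketch of the FIX named under KNOWN BLOCKER, assuming the root terragrunt.hcl emits its providers through a single generate block. The output path, the if_exists value, the version constraint, and the elided existing entries are assumptions for illustration, not taken from this repo:

```hcl
# Root terragrunt.hcl (sketch). Only the tls entry is the proposed change;
# everything else stands in for whatever the real block already contains.
generate "k8s_providers" {
  path      = "providers.tf"         # assumed output filename
  if_exists = "overwrite_terragrunt" # assumed; keep the repo's existing value
  contents  = <<EOF
terraform {
  required_providers {
    # ... existing entries (proxmox, kubectl, ...) stay exactly as they are ...

    tls = {
      source  = "hashicorp/tls"
      version = "~> 4.0" # assumed constraint; pin to whatever the lockfile expects
    }
  }
}
EOF
}
```

With the provider declared once in the generated file, the stack needs no versions.tf of its own, which avoids the duplicate required_providers error described above.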
2026-06-24 13:01:37 +00:00
9 changed files with 2392 additions and 2626 deletions
--- a/.claude/skills/home-assistant/SKILL.md
+++ b/.claude/skills/home-assistant/SKILL.md
@@ -11,8 +11,8 @@ description: |
  There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
  Always use Home Assistant for smart home control.
 author: Claude Code
-version: 2.1.0
-date: 2026-06-24
+version: 2.0.0
+date: 2026-02-07
 ---

 # Home Assistant Control
@@ -395,27 +395,14 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
 ## ha-london Knowledge Map

 ### Overview
- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
+- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
 - **Location**: London, UK
- **Platform**: Raspberry Pi 4, HA OS
- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
- **Config path**: `/config/`
+- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
+- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
+- **Config path**: `/config/` (requires `sudo` for file access)
 - **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
 - **Zone**: London (home)

-### Dashboards (redesigned 2026-06-24)
-**Glossary** (HA terms — keep distinct):
- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
- **Card** = a widget inside a view.
-
- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
-  - **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
-  - **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.)
- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
-
 ### Key Systems

 #### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@@ -437,15 +424,10 @@ Named plugs with power/energy tracking:
 - PM1.0/2.5/4.0/10 particulate sensors
 - VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors

-#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
-Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
- `sensor.classic_performance_remaining_range`: Range km
- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
+#### 3. Cowboy E-Bike
+- `sensor.bike_state_of_charge`: Battery %
+- `sensor.bike_total_distance`: Total km
+- `sensor.bike_total_co2_saved`: CO2 saved (grams)

 #### 4. Uptime Monitoring (UptimeRobot)
 - `sensor.blog`: blog uptime
@@ -464,17 +446,12 @@ Bike named **"Classic Performance"** → entities are `sensor.classic_performanc
 - Scripts: `script.start_netflix`, `script.start_stremio`
 - Scene: `scene.night` (turns off Livia + Michelle plugs)

-### Custom Components (HACS integrations)
- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
-
-### HACS frontend cards (plugins)
- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
+### Custom Components
+- **cowboy**: Cowboy e-bike integration (HACS)
+- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)

 ### Integrations
-ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
+ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB

 ### AI / Voice Assistants
 - 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@@ -489,8 +466,15 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ook
 - Anca arrival/departure notifications
 - Night scene: turns off Livia + Michelle

-### Platform (HAOS — ignore any legacy `docker run` snippet)
-ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker).
+### Docker Setup
+```bash
+docker run -d --name homeassistant --privileged \
+  -e TZ=Europe/London \
+  -v /home/pi/docker/homeAssistant:/config \
+  -v /run/dbus:/run/dbus:ro \
+  --network=host --restart=unless-stopped \
+  homeassistant/home-assistant:2025.9
+```

 ### SSH Access
 ```bash
--- a/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
+++ b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
@@ -1,97 +0,0 @@
-# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
-
-> Filename kept for inbound links. The originally-suspected cause (kubeadm-config
-> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
-> drift was a real *separate* latent bug fixed in the same change.
-
-**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
-the master control-plane phase for the first time — preflight passed, etcd
-snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the
-kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
-static-pod-hash window across all internal retries, then auto-rolled-back to
-v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
-the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
-No data loss; no user-facing outage (the master carries control-plane taints, so
-no workloads were displaced).
-
-**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
-first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
-static pods, i.e. the first time the upgrade pushes real write-IO at etcd.
-
-## Root cause — etcd IO starvation on the shared HDD
-
-The new kube-apiserver could not establish/keep a working connection to etcd
-during the upgrade because **etcd was IO-starved**. etcd's surviving container log
-from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows:
-
- **1,180** `apply request took too long` warnings in 16 minutes;
- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms),
-  clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
-  to bring the new apiserver up.
-
-A reproduced 1.35.6 apiserver with no etcd dies with
-`F instance.go:233 Error creating leases: error creating storage factory: context
-deadline exceeded` — the same failure mode a multi-second etcd produces. etcd
-lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
-shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
-that spindle:
-
-1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
-2. kubeadm dumping a full **~400MB etcd DB backup** to
-   `/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
-   etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
-   cleans them up), pushing master root fs to **73%**, above the 70% kubelet
-   image-GC threshold, so image GC churned during the drain too;
-3. master-drain pod evictions.
-
-### Correction — it was NOT the OIDC flag swap
-
-`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
-`--authentication-config` (structured multi-issuer OIDC) back to legacy
-single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
-was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
-those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
-(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test
-etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
-the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
-were also ruled out.
-
-## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
-
-apiserver auth is configured in three places that must agree:
-(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
-+ `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest
-(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM —
-which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
-the manifest from (3), so it would have reverted structured auth → **dashboard +
-kubectl SSO break after a successful upgrade** (recoverable: the chain's
-post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.
-
-## Resolution
-
-1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%.
-2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps.
-3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).
-
-## Prevention (landed in this change)
-
-| Gap | Fix |
-|-----|-----|
-| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
-| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
-| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
-
-## Lessons
-
- **Capture the failing component's own logs before concluding.** The `kubeadm
-  upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
-  applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is
-  "what config changes," not "why it crashed."
- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
-  2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
-  backup copy + drain) onto that spindle. code-oflt is the real fix.
- **Tools that leave per-operation scratch must be reaped.** kubeadm's
-  `/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
-  GC'd; 28GB had silently accumulated.
- **Out-of-band control-plane edits must be written back to kubeadm-config** — else
-  `kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).
--- a/docs/runbooks/k8s-version-upgrade.md
+++ b/docs/runbooks/k8s-version-upgrade.md
@@ -41,8 +41,6 @@ Job 0 — preflight       (pinned: k8s-node1)
  ├── halt-on-alert (kured-style ignore-list)
  ├── 24h-quiet baseline (no Ready transitions <24h ago)
  ├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume)
-  ├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block)
-  ├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups)
  ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
  ├── Trigger backup-etcd Job, wait, verify snapshot byte count
  ├── SSH master: containerd skew fix (if master < workers)
@@ -224,34 +222,22 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names

 ## Common Operations

-### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24)
+### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19)

 `kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
-from kubeadm-config**. apiserver auth uses a structured multi-issuer
-`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to
-still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade
-reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does
-NOT crash on this — verified by isolated repro; it's recoverable via the restore
-script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue —
-etcd IO starvation**, not this drift; post-mortem:
-`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`.
+and drops the `--authentication-config` flag**, silently disabling apiserver
+OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get
+401). This used to require a manual re-apply after **every** control-plane bump.

-**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now
-**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting
-`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of
-its remote script. So kubeadm regenerates a **correct** manifest and the apiserver
-upgrades with a pure image bump — `kubeadm upgrade diff <target>` shows only the
-image change. Zero live impact (the CM is read only during an upgrade).
-
-**Backstops:**
- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does
-  NOT block — the drift only breaks SSO, which is recoverable) if
-  `--authentication-config` would still be dropped.
- The `rbac` stack still publishes its restore script to the
-  `kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on
-  master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with
-  auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also*
-  re-reconciles kubeadm-config. Self-skips when master is already at target.
+**Now automated:** the `rbac` stack publishes its OIDC restore script to the
+`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's
+`phase_master` re-runs it on master immediately after `kubeadm upgrade apply`
+(while tigera-operator is still quiesced, so the flag-add apiserver restart can't
+crashloop the operator). It's idempotent, health-gates `/livez` with
+auto-rollback, and is **non-fatal** — a failure only lags SSO until the next rbac
+apply (the version upgrade itself already succeeded). So a chain-driven
+control-plane bump no longer breaks SSO. The master phase self-skips when master
+is already at target, so this only runs when master was actually upgraded.

 **Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the
 chain logged `WARN: --authentication-config absent after re-apply`:
--- a/stacks/goldmane-edge-aggregator/main.tf
+++ b/stacks/goldmane-edge-aggregator/main.tf
@@ -57,19 +57,16 @@ resource "kubernetes_namespace" "goldmane_edge_aggregator" {
 # -----------------------------------------------------------------------------
 # The aggregator dials goldmane:7443 over mutual TLS. We mint a client cert
 # signed by the Tigera CA (the same CA that issues Goldmane's serving cert), so
-# Goldmane requires mutual TLS on :7443 and verifies the client cert chains to
-# the Tigera CA — it does NOT authorize by client identity, so ANY Tigera-CA-
-# signed cert is accepted. Rather than copy the Tigera CA PRIVATE KEY into TF
-# state to mint our own (a needless CA-key exposure; the hashicorp/tls provider
-# is also incompatible with this repo's global generate-providers/lockfile
-# pattern), we REUSE the operator-minted, Tigera-CA-signed client cert
-# `whisker-backend-key-pair` (calico-system). We never touch the CA key.
-# Trade-off: if the operator rotates that cert, re-apply to re-sync (hardening
-# follow-up: mint an own-identity cert in-namespace if Whisker is ever removed).
-data "kubernetes_secret" "whisker_backend" {
+# Goldmane trusts the client and the client trusts Goldmane's server cert via
+# the published CA bundle.
+#
+# The Tigera CA private key lives in the `tigera-ca-private` Secret in
+# tigera-operator (Opaque; verified keys: tls.crt + tls.key). The stack's apply
+# identity needs RBAC get on that secret — see the Role/RoleBinding below.
+data "kubernetes_secret" "tigera_ca" {
  metadata {
-    name      = "whisker-backend-key-pair"
-    namespace = "calico-system"
+    name      = "tigera-ca-private"
+    namespace = "tigera-operator"
  }
 }

@@ -96,11 +93,46 @@ resource "kubernetes_config_map" "tigera_ca_bundle" {
  data = data.kubernetes_config_map.tigera_ca_bundle.data
 }

-# Client cert + key for mTLS to goldmane:7443, mounted at TLS_CERT_PATH /
-# TLS_KEY_PATH defaults (/etc/goldmane-client-tls/tls.crt and .../tls.key).
-# Sourced verbatim from the operator's whisker-backend client key-pair (read
-# above) — already Tigera-CA-signed, which is all Goldmane verifies. No CA key
-# is touched and no cross-namespace CA RBAC is needed.
+# Client private key.
+resource "tls_private_key" "goldmane_client" {
+  algorithm = "RSA"
+  rsa_bits  = 2048
+}
+
+# CSR for the client cert. CN identifies the client; the service-DNS SAN mirrors
+# how Felix/whisker-backend present a client identity to Goldmane.
+resource "tls_cert_request" "goldmane_client" {
+  private_key_pem = tls_private_key.goldmane_client.private_key_pem
+  subject {
+    common_name  = "goldmane-edge-aggregator"
+    organization = "goldmane-edge-aggregator"
+  }
+  dns_names = [
+    "goldmane-edge-aggregator",
+    "goldmane-edge-aggregator.goldmane-edge-aggregator.svc.cluster.local",
+  ]
+}
+
+# Sign the CSR with the Tigera CA. 10-year validity (87600h): re-apply rotates
+# it well before expiry; a long horizon avoids surprise mTLS outages from an
+# unattended stack. The Tigera CA itself outlives this (operator-managed).
+resource "tls_locally_signed_cert" "goldmane_client" {
+  cert_request_pem   = tls_cert_request.goldmane_client.cert_request_pem
+  ca_private_key_pem = data.kubernetes_secret.tigera_ca.data["tls.key"]
+  ca_cert_pem        = data.kubernetes_secret.tigera_ca.data["tls.crt"]
+
+  validity_period_hours = 87600 # 10y
+  early_renewal_hours   = 720   # re-sign on apply when <30d remain
+
+  allowed_uses = [
+    "client_auth",
+    "digital_signature",
+    "key_encipherment",
+  ]
+}
+
+# The minted client cert + key, mounted at TLS_CERT_PATH / TLS_KEY_PATH defaults
+# (/etc/goldmane-client-tls/tls.crt and .../tls.key).
 resource "kubernetes_secret" "goldmane_client_tls" {
  metadata {
    name      = "goldmane-client-tls"
@@ -108,8 +140,47 @@ resource "kubernetes_secret" "goldmane_client_tls" {
  }
  type = "Opaque"
  data = {
-    "tls.crt" = data.kubernetes_secret.whisker_backend.data["tls.crt"]
-    "tls.key" = data.kubernetes_secret.whisker_backend.data["tls.key"]
+    "tls.crt" = tls_locally_signed_cert.goldmane_client.cert_pem
+    "tls.key" = tls_private_key.goldmane_client.private_key_pem
+  }
+}
+
+# Narrow RBAC so this stack's apply identity (and ESO/Reloader are unaffected)
+# can `get` the Tigera CA private key in tigera-operator. The data source above
+# reads it at apply time; this Role/RoleBinding documents + grants that access
+# rather than relying on cluster-admin. The subject is the same SA the other
+# Tier-1 stacks apply as (claude-agent/terraform-state for headless, the human
+# OIDC identity interactively) — both are cluster-admin today, so this is
+# belt-and-braces / least-privilege intent for when apply identities tighten.
+resource "kubernetes_role" "read_tigera_ca" {
+  metadata {
+    name      = "goldmane-edge-aggregator-read-tigera-ca"
+    namespace = "tigera-operator"
+  }
+  rule {
+    api_groups     = [""]
+    resources      = ["secrets"]
+    resource_names = ["tigera-ca-private"]
+    verbs          = ["get"]
+  }
+}
+
+resource "kubernetes_role_binding" "read_tigera_ca" {
+  metadata {
+    name      = "goldmane-edge-aggregator-read-tigera-ca"
+    namespace = "tigera-operator"
+  }
+  role_ref {
+    api_group = "rbac.authorization.k8s.io"
+    kind      = "Role"
+    name      = kubernetes_role.read_tigera_ca.metadata[0].name
+  }
+  # The headless apply identity (claude-agent-service runs Tier-1 applies as the
+  # `terraform-state` Vault K8s role in the claude-agent namespace).
+  subject {
+    kind      = "ServiceAccount"
+    name      = "default"
+    namespace = "claude-agent"
  }
 }

@@ -156,11 +227,6 @@ resource "kubernetes_job" "db_init" {
  timeouts {
    create = "2m"
  }
-  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno injects dns_config (ndots=2); ignore it so
-    # this idempotent Job isn't replaced (Jobs are immutable) on every apply.
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
-  }
 }

 # ExternalSecret projecting the Vault-rotated (7-day) credential into a K8s
@@ -229,7 +295,7 @@ resource "kubernetes_manifest" "slack_external_secret" {
      data = [{
        secretKey = "SLACK_WEBHOOK_URL"
        remoteRef = {
-          key      = "viktor"
+          key      = "monitoring"
          property = "alertmanager_slack_api_url"
        }
      }]
--- a/stacks/instagram-poster/modules/instagram-poster/main.tf
+++ b/stacks/instagram-poster/modules/instagram-poster/main.tf
@@ -35,14 +35,6 @@ resource "kubernetes_namespace" "instagram_poster" {
 #     - immich_tag_instagram      (optional — auto-resolved if missing)
 #     - immich_tag_posted         (optional — auto-resolved if missing)
 resource "kubernetes_manifest" "external_secret" {
-  # The external-secrets controller takes server-side-apply ownership of
-  # .spec.refreshInterval, so a plain TF apply conflicts. force_conflicts lets
-  # TF win (values match, so it's stable) — same pattern as grafana/woodpecker/
-  # traefik/k8s-version-upgrade. Surfaced 2026-06-24 by the first IG apply since
-  # the ESO v1 migration (the scale-to-0 push).
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@@ -147,11 +139,6 @@ resource "kubernetes_manifest" "external_secret" {
 # ESO refreshes the K8s Secret every 15m. `reloader.stakater.com/match`
 # bounces the pod when the password changes.
 resource "kubernetes_manifest" "benchmark_db_external_secret" {
-  # See external_secret above — ESO owns .spec.refreshInterval; force_conflicts
-  # lets the TF apply win instead of erroring on the field-manager conflict.
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@@ -240,11 +227,7 @@ resource "kubernetes_deployment" "instagram_poster" {
  }

  spec {
-    # Scaled to 0 (2026-06-24): Instagram Graph integration is unused and its
-    # ExternalSecret is dead (missing ig_graph_long_lived_token /
-    # ig_business_account_id in Vault secret/instagram-poster). Set back to 1
-    # after minting a Meta long-lived token and populating those keys.
-    replicas = 0
+    replicas = 1
    # RWO PVC — cannot rolling-update.
    strategy {
      type = "Recreate"
--- a/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
+++ b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
@@ -416,39 +416,6 @@ phase_preflight() {
    fi
  fi

-  # 4b. apiserver-OIDC drift check (backstop for the rbac stack's kubeadm-config
-  # reconciliation). A `kubeadm upgrade` REGENERATES the apiserver manifest from
-  # kubeadm-config; if kubeadm-config still carries the legacy single-issuer
-  # --oidc-* args instead of --authentication-config, the regenerated apiserver
-  # loses structured multi-issuer auth → kubectl + dashboard SSO break AFTER the
-  # upgrade. This is RECOVERABLE (the apiserver does NOT crash — verified by an
-  # isolated repro 2026-06-24; the chain's post-master restore.sh re-adds the flag,
-  # and the rbac stack reconciles kubeadm-config so it won't recur) — so this is an
-  # ALERT, not a block. (NB the 2026-06-24 stall was NOT this — it was etcd IO
-  # starvation; see docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md.)
-  # Skip on an at-target master (resume — no apiserver regen).
-  if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
-    local apiserver_diff
-    apiserver_diff=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" "sudo kubeadm upgrade diff v$TARGET_VERSION 2>/dev/null" || true)
-    if echo "$apiserver_diff" | grep -qE '^-[[:space:]].*--authentication-config'; then
-      slack "WARN preflight — kubeadm upgrade will DROP --authentication-config (kubeadm-config OIDC drift). SSO breaks post-upgrade until restore.sh re-adds it; re-apply the rbac stack to reconcile kubeadm-config. Proceeding (recoverable, not a crash)."
-    fi
-  fi
-
-  # 4c. Reclaim kubeadm scratch on master. `kubeadm upgrade apply` dumps a full
-  # ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
-  # every etcd upgrade and NEVER cleans it up — 145 dirs / 28GB had accumulated by
-  # 2026-06-24, pushing master root fs to 73% (image-GC churn + extra write IO on
-  # the shared HDD where etcd lives — a contributor to the etcd IO starvation that
-  # stalled that run, see post-mortem). Real etcd backups go to NFS, so these are
-  # throwaway. Prune ones >3 days old (keeps a short rollback window). Best-effort;
-  # never aborts the chain.
-  if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
-    ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" \
-      "sudo find /etc/kubernetes/tmp -maxdepth 1 -type d \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) -mtime +3 -exec rm -rf {} + 2>/dev/null; echo -n 'master root after prune: '; df -h / | awk 'NR==2{print \$5\" used, \"\$4\" free\"}'" \
-      || echo "kubeadm-scratch prune skipped (ssh/df failed) — non-fatal"
-  fi
-
  # 5. Push in-flight + started_timestamp metrics + ns annotations
  $KUBECTL annotate ns "$NS" \
    "viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \
--- a/stacks/monitoring/modules/monitoring/grafana.tf
+++ b/stacks/monitoring/modules/monitoring/grafana.tf
@@ -71,15 +71,6 @@ resource "kubernetes_persistent_volume" "alertmanager_pv" {
 # DB credentials from Vault database engine (rotated automatically)
 # Provides GF_DATABASE_PASSWORD that auto-updates when password rotates
 resource "kubernetes_manifest" "grafana_db_creds" {
-  # The external-secrets controller takes server-side-apply ownership of
-  # .spec.refreshInterval, so a plain TF apply conflicts ("conflict with
-  # external-secrets ... .spec.refreshInterval"). force_conflicts lets TF win
-  # (values match, so it's stable) — same pattern as the woodpecker/traefik/
-  # k8s-version-upgrade stacks. Surfaced 2026-06-24: the first monitoring apply
-  # in a while exposed this latent conflict (prior pushes were docs-only).
-  field_manager {
-    force_conflicts = true
-  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/rbac/modules/rbac/apiserver-oidc.tf
+++ b/stacks/rbac/modules/rbac/apiserver-oidc.tf
@@ -10,29 +10,16 @@
 # match the existing RBAC subjects (kind: User, name: <raw email>; group names
 # verbatim). Do NOT add a prefix or existing bindings break.
 #
-# DRIFT WARNING (and how it's now handled): apiserver auth lives in THREE places
-# that must stay in sync, because a `kubeadm upgrade` REGENERATES the static-pod
-# manifest from kubeadm-config:
-#   1. /etc/kubernetes/pki/auth-config.yaml         — the structured authn file
-#   2. the live kube-apiserver static-pod manifest  — references it via the flag
-#   3. the kubeadm-config ClusterConfiguration CM   — what kubeadm regenerates from
-# Originally only (1)+(2) were managed, so every kubeadm upgrade rewrote the
-# manifest from the STALE CM, reverting --authentication-config to single-issuer
-# --oidc-* flags. The consequence is SSO breakage AFTER the upgrade: kubectl +
-# dashboard lose multi-issuer auth (the apiserver does NOT crash on this — verified
-# by an isolated repro 2026-06-24; the 2026-06-24 v1.35 upgrade *stall* was a
-# separate etcd IO-starvation issue, see
-# docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md). The
-# remote script below now ALSO reconciles (3) via `kubeadm init phase
-# upload-config`, so a future kubeadm upgrade regenerates a CORRECT manifest. The
-# k8s-version-upgrade chain additionally ALERTS (does not block — SSO drift is
-# recoverable) via `kubeadm upgrade diff` in preflight if --authentication-config
-# would still be dropped.
+# DRIFT WARNING: this edits the kube-apiserver static-pod manifest on the single
+# master. A `kubeadm upgrade` regenerates that manifest and DROPS this flag (this
+# is exactly how OIDC silently broke before — the flag was wiped and the
+# content-hash trigger never re-fired). After any k8s control-plane upgrade,
+# re-apply the rbac stack to restore apiserver OIDC. See
+# docs/plans/2026-06-04-k8s-dashboard-sso-design.md.
 #
 # SAFETY: the remote script health-gates on /livez and AUTO-ROLLS-BACK the
 # manifest from a timestamped backup if the apiserver does not recover, so a
-# malformed config cannot leave the single master down. Reconciling kubeadm-config
-# is zero-impact on the running cluster (the CM is only read during an upgrade).
+# malformed config cannot leave the single master down.

 variable "k8s_master_host" {
  type    = string
@ -110,40 +97,6 @@ locals {
    print('flag-inserted' if done else 'ANCHOR-NOT-FOUND')
  PY

-  # Reconciles the kubeadm-config ClusterConfiguration's apiServer.extraArgs:
-  # drops the stale single-issuer --oidc-* args and ensures --authentication-config
-  # is present (anchored after --authorization-mode). Stdlib-only (the master is
-  # only guaranteed python3, not pyyaml/yq). Idempotent; preserves all other
-  # fields (etcd args, audit args, extraVolumes) verbatim. Exits 3 if the
-  # authorization-mode anchor is missing (fail loud, leave the CM untouched).
-  kubeadm_oidc_reconcile_py = <<-PY
-    import sys
-    lines = sys.stdin.read().split('\n')
-    out, i, n = [], 0, len(lines)
-    have_authn = any('name: authentication-config' in l for l in lines)
-    inserted = have_authn
-    while i < n:
-        ln = lines[i]; s = ln.strip()
-        if s.startswith('- name: oidc-'):
-            i += 2 if (i + 1 < n and lines[i + 1].strip().startswith('value:')) else 1
-            continue
-        out.append(ln)
-        if (not inserted) and s == '- name: authorization-mode':
-            indent = ln[:len(ln) - len(ln.lstrip())]
-            if i + 1 < n and lines[i + 1].strip().startswith('value:'):
-                out.append(lines[i + 1]); i += 2
-            else:
-                i += 1
-            out.append(indent + '- name: authentication-config')
-            out.append(indent + '  value: /etc/kubernetes/pki/auth-config.yaml')
-            inserted = True
-            continue
-        i += 1
-    if not inserted:
-        sys.stderr.write('ANCHOR-NOT-FOUND: authorization-mode\n'); sys.exit(3)
-    sys.stdout.write('\n'.join(out))
-  PY
-
  # Whole remote operation, base64-embedded for byte-exact transfer (no
  # heredoc/escaping hazards across SSH).
  apiserver_auth_remote_script = <<-SH
@ -184,30 +137,6 @@ locals {
      echo "rolled back to previous manifest"; exit 1
    fi
    echo "kube-apiserver healthy with multi-issuer --authentication-config"
-
-    # 5. Reconcile kubeadm-config so a FUTURE `kubeadm upgrade` regenerates the
-    #    apiserver manifest WITH --authentication-config instead of reverting to
-    #    the stale single-issuer --oidc-* flags. Without this, kubeadm rewrote the
-    #    manifest from kubeadm-config on every control-plane upgrade and the
-    #    regenerated apiserver crash-looped (the 2026-06-24 v1.35 upgrade stall).
-    #    Zero live impact (the CM is only read at upgrade time); idempotent;
-    #    best-effort (the chain's `kubeadm upgrade diff` preflight gate is the
-    #    backstop if this cannot run).
-    KC="sudo kubectl --kubeconfig /etc/kubernetes/admin.conf"
-    CC=$($KC -n kube-system get cm kubeadm-config -o jsonpath='{.data.ClusterConfiguration}' 2>/dev/null || true)
-    if [ -n "$CC" ] && { echo "$CC" | grep -q 'oidc-issuer-url' || ! echo "$CC" | grep -q 'authentication-config'; }; then
-      echo "Reconciling kubeadm-config (oidc-* -> authentication-config) so kubeadm upgrade keeps structured auth"
-      echo '${base64encode(local.kubeadm_oidc_reconcile_py)}' | base64 -d > /tmp/reconcile_kubeadm_oidc.py
-      if printf '%s' "$CC" | python3 /tmp/reconcile_kubeadm_oidc.py > /tmp/kubeadm-cc-new.yaml \
-         && sudo kubeadm init phase upload-config kubeadm --config /tmp/kubeadm-cc-new.yaml; then
-        echo "kubeadm-config reconciled: future control-plane upgrades keep --authentication-config"
-      else
-        echo "WARN: kubeadm-config reconcile failed; the upgrade-chain preflight gate will block the next upgrade"
-      fi
-      rm -f /tmp/reconcile_kubeadm_oidc.py /tmp/kubeadm-cc-new.yaml
-    else
-      echo "kubeadm-config already uses --authentication-config (no oidc drift)"
-    fi
  SH
 }

@ -226,14 +155,6 @@ resource "null_resource" "apiserver_oidc_config" {
  }

  triggers = {
-    # Intentionally hash ONLY the issuer config, NOT the remote script. CI applies
-    # the rbac stack with no ssh_private_key (var defaults to ""), so a re-run of
-    # this SSH provisioner in CI would fail — hence the null_resource must stay a
-    # no-op on a plain CI apply. Script changes (e.g. the 2026-06-24 kubeadm-config
-    # reconciliation) reach the cluster via the apiserver-oidc-restore ConfigMap
-    # below (a plain k8s resource, no ssh) which the upgrade chain re-runs. To force
-    # this provisioner to re-run after a script change, apply locally with
-    # `-replace` + TF_VAR_ssh_private_key (see docs/runbooks/k8s-version-upgrade.md).
    auth_config = sha256(local.apiserver_auth_config_yaml)
  }
 }
--- a/state/stacks/vault/terraform.tfstate.enc
+++ b/state/stacks/vault/terraform.tfstate.enc