k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC)

Digging into "why did the apiserver crash" disproved the earlier OIDC explanation. An isolated v1.35.6 apiserver repro with authentik reachable initialises OIDC cleanly (oidc.go:313, no error) and runs fine, so the --authentication-config -> --oidc-* revert is NOT what crashed it.

etcd's surviving crash-window log shows the real cause: 1180 "apply request took too long" warnings in 16 min, with individual applies up to 4.3s (healthy <100ms) right as kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the shared sdc HDD (beads code-oflt).

A big contributor, and the reason master root fs sat at 73%: kubeadm dumps a full ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before every etcd upgrade and never cleans it up. 145 dirs / 28GB had accumulated, driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live (73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate.

Also corrected the OIDC handling: the kubeadm-config drift is real but only breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the chain's restore.sh + the kubeadm-config reconciliation); it does not crash the apiserver. So the preflight check is now an ALERT, not a block (it was added on the wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected.

Per Viktor: reclaim the disk and automate so the manual cleanup never recurs; the durable IO fix remains code-oflt (etcd off the shared HDD).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade
2026-06-25 15:23:15 +00:00 · 2026-06-25 14:16:04 +00:00 · 2026-06-24 22:03:15 +00:00 · 2026-06-24 20:59:39 +00:00 · 2026-06-24 20:59:39 +00:00 · 2026-06-24 20:49:53 +00:00
12 changed files with 3128 additions and 2301 deletions
--- a/.claude/skills/home-assistant/SKILL.md
+++ b/.claude/skills/home-assistant/SKILL.md
@@ -11,8 +11,8 @@ description: |
  There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
  Always use Home Assistant for smart home control.
 author: Claude Code
-version: 2.0.0
+version: 2.1.0
-date: 2026-02-07
+date: 2026-06-24
 ---
 # Home Assistant Control
@@ -395,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
 ## ha-london Knowledge Map
 ### Overview
- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
+- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
 - **Location**: London, UK
- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
+- **Platform**: Raspberry Pi 4, HA OS
- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
+- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
- **Config path**: `/config/` (requires `sudo` for file access)
+- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
 - **Config path**: `/config/`
 - **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
 - **Zone**: London (home)
 ### Dashboards (redesigned 2026-06-24)
 **Glossary** (HA terms — keep distinct):
 - **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
 - **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
 - **Card** = a widget inside a view.
 - **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
  - **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
  - **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
 - **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.)
 - Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
 ### Key Systems
 #### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@@ -424,10 +437,15 @@ Named plugs with power/energy tracking:
 - PM1.0/2.5/4.0/10 particulate sensors
 - VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
-#### 3. Cowboy E-Bike
+#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
- `sensor.bike_state_of_charge`: Battery %
+Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
- `sensor.bike_total_distance`: Total km
+- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
- `sensor.bike_total_co2_saved`: CO2 saved (grams)
+- `sensor.classic_performance_remaining_range`: Range km
 - `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
 - `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
 - Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
 - **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
 - Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
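
The `unknown`-while-asleep gotcha above can be checked over the REST API. The following is a minimal sketch, assuming `HA_TOKEN` holds the token printed by `homelab ha token --instance london`; the helper names (`ha_get_state_json`, `ha_extract_state`, `cowboy_reading_is_live`) and the `HA_URL` variable are hypothetical and not part of this skill. The `sed` extraction is deliberately rough; a JSON parser would be more robust.

```shell
#!/bin/sh
# Sketch only: HA_URL and all helper names below are illustrative assumptions.
HA_URL="${HA_URL:-https://ha-london.viktorbarzin.me}"

# Fetch the raw state JSON for one entity via the Home Assistant REST API.
# HA_TOKEN is assumed to be set by the caller.
ha_get_state_json() {
  curl -fsS -H "Authorization: Bearer ${HA_TOKEN}" \
    "${HA_URL}/api/states/$1"
}

# Pull the "state" value out of the JSON on stdin (rough sed match, first hit only).
ha_extract_state() {
  sed -n 's/.*"state"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Return 0 for a live reading, 1 when the value suggests the bike is asleep.
cowboy_reading_is_live() {
  case "$1" in
    unknown|unavailable|"") return 1 ;;
    *) return 0 ;;
  esac
}
```

Typical use would be `soc="$(ha_get_state_json sensor.classic_performance_remaining_battery | ha_extract_state)"` followed by `cowboy_reading_is_live "$soc" || echo "bike asleep, no live SoC"`.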
 #### 4. Uptime Monitoring (UptimeRobot)
 - `sensor.blog`: blog uptime
@@ -446,12 +464,17 @@ Named plugs with power/energy tracking:
 - Scripts: `script.start_netflix`, `script.start_stremio`
 - Scene: `scene.night` (turns off Livia + Michelle plugs)
-### Custom Components
+### Custom Components (HACS integrations)
- **cowboy**: Cowboy e-bike integration (HACS)
+- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
+- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
 ### HACS frontend cards (plugins)
 - **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
 ### Integrations
-ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
+ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
 - **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
 - **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
 ### AI / Voice Assistants
 - 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@@ -466,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL
 - Anca arrival/departure notifications
 - Night scene: turns off Livia + Michelle
-### Docker Setup
+### Platform (HAOS — ignore any legacy `docker run` snippet)
-```bash
+ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker).
-docker run -d --name homeassistant --privileged \
-  -e TZ=Europe/London \
-  -v /home/pi/docker/homeAssistant:/config \
-  -v /run/dbus:/run/dbus:ro \
-  --network=host --restart=unless-stopped \
-  homeassistant/home-assistant:2025.9
-```
 ### SSH Access
 ```bash
--- a/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
+++ b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
@@ -0,0 +1,97 @@
 # Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
 > Filename kept for inbound links. The originally-suspected cause (kubeadm-config
 > OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
 > drift was a real *separate* latent bug fixed in the same change.
 **Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
 the master control-plane phase for the first time — preflight passed, etcd
 snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the
 kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
 static-pod-hash window across all internal retries, then auto-rolled-back to
 v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
 the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
 No data loss; no user-facing outage (the master carries control-plane taints, so
 no workloads were displaced).
 **Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
 first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
 static pods, i.e. the first time the upgrade pushes real write-IO at etcd.
 ## Root cause — etcd IO starvation on the shared HDD
 The new kube-apiserver could not establish/keep a working connection to etcd
 during the upgrade because **etcd was IO-starved**. etcd's surviving container log
 from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows:
 - **1,180** `apply request took too long` warnings in 16 minutes;
 - individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms),
  clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
  to bring the new apiserver up.
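
The warning count above can be reproduced from a saved copy of the etcd container log. This is a minimal sketch: the function name `count_slow_applies` is hypothetical, and the log path is passed in by the caller rather than assumed.

```shell
#!/bin/sh
# Sketch: count etcd slow-apply warnings in a container log file.
# count_slow_applies is an illustrative helper, not part of the upgrade chain.
count_slow_applies() {
  # grep -c prints 0 and exits 1 when nothing matches; keep the 0, drop the status.
  grep -c 'apply request took too long' "$1" || true
}
```

For scale, 1,180 warnings over a 16-minute window is about 74 slow applies per minute.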
 A reproduced 1.35.6 apiserver with no etcd dies with
 `F instance.go:233 Error creating leases: error creating storage factory: context
 deadline exceeded` — the same failure mode a multi-second etcd produces. etcd
 lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
 shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
 that spindle:
 1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
 2. kubeadm dumping a full **~400MB etcd DB backup** to
   `/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
   etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
   cleans them up), pushing master root fs to **73%**, above the 70% kubelet
   image-GC threshold, so image GC churned during the drain too;
 3. master-drain pod evictions.
 ### Correction — it was NOT the OIDC flag swap
 `kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
 `--authentication-config` (structured multi-issuer OIDC) back to legacy
 single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
 was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
 those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
 (`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test
 etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
 the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
 were also ruled out.
 ## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
 apiserver auth is configured in three places that must agree:
 (1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
 + `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest
 (`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM —
 which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
 the manifest from (3), so it would have reverted structured auth → **dashboard +
 kubectl SSO break after a successful upgrade** (recoverable: the chain's
 post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.
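
The drift described above is detectable from the unified diff that `kubeadm upgrade diff` prints. The following is a minimal sketch of such a check, not the chain's actual preflight code; the function name `diff_drops_auth_config` is hypothetical. It assumes the flag sits on an indented manifest line, as it does in a static-pod YAML args list.

```shell
#!/bin/sh
# Sketch: read a unified diff on stdin and report whether --authentication-config
# is dropped. Returns 0 (drift) when the flag appears on a removed line and on
# no added line. Illustrative helper, not the real preflight check.
diff_drops_auth_config() {
  diff_text="$(cat)"
  # '^-[^-]' and '^\+[^+]' skip the '---' / '+++' file header lines.
  removed="$(printf '%s\n' "$diff_text" | grep -E '^-[^-]' \
    | grep -c -- '--authentication-config' || true)"
  added="$(printf '%s\n' "$diff_text" | grep -E '^\+[^+]' \
    | grep -c -- '--authentication-config' || true)"
  [ "$removed" -gt 0 ] && [ "$added" -eq 0 ]
}
```

A preflight step could then run `kubeadm upgrade diff v1.35.6 | diff_drops_auth_config && echo "WARN: upgrade would drop --authentication-config"` and carry on, since the drift is recoverable.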
 ## Resolution
 1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%.
 2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps.
 3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).
 ## Prevention (landed in this change)
 | Gap | Fix |
 |-----|-----|
 | kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
 | kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
 | etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
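
The prune in the first row can be sketched as follows. This is an illustration of the behaviour described (age-based, best-effort, never aborts), not the actual `upgrade-step.sh` code; the function name `prune_kubeadm_scratch` and its arguments are hypothetical.

```shell
#!/bin/sh
# Sketch of a best-effort prune, NOT the actual upgrade-step.sh code.
# prune_kubeadm_scratch [DIR] [DAYS]: delete kubeadm scratch entries in DIR
# whose mtime is older than DAYS days (defaults: /etc/kubernetes/tmp, 3).
prune_kubeadm_scratch() {
  scratch_dir="${1:-/etc/kubernetes/tmp}"
  max_age_days="${2:-3}"
  [ -d "$scratch_dir" ] || return 0
  # -mmin keeps the cutoff at exactly DAYS*24h; -mtime +N rounds to whole days.
  find "$scratch_dir" -mindepth 1 -maxdepth 1 \
    \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) \
    -mmin "+$((max_age_days * 1440))" \
    -exec rm -rf -- {} + 2>/dev/null || true
  # Best-effort: a failed prune must never abort the preflight.
  return 0
}
```

Matching only the two kubeadm name patterns at depth 1 keeps the prune from touching anything else that may live under the scratch directory.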
 ## Lessons
 - **Capture the failing component's own logs before concluding.** The `kubeadm
  upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
  applies) + an isolated apiserver repro showed the truth (etcd IO). A diff shows
  "what config changes," not "why it crashed."
 - **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
  2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
  backup copy + drain) onto that spindle. code-oflt is the real fix.
 - **Tools that leave per-operation scratch must be reaped.** kubeadm's
  `/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
  GC'd; 28GB had silently accumulated.
 - **Out-of-band control-plane edits must be written back to kubeadm-config** — else
  `kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).
--- a/docs/runbooks/k8s-version-upgrade.md
+++ b/docs/runbooks/k8s-version-upgrade.md
@@ -41,6 +41,8 @@ Job 0 — preflight       (pinned: k8s-node1)
  ├── halt-on-alert (kured-style ignore-list)
  ├── 24h-quiet baseline (no Ready transitions <24h ago)
  ├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume)
  ├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block)
  ├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups)
  ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
  ├── Trigger backup-etcd Job, wait, verify snapshot byte count
  ├── SSH master: containerd skew fix (if master < workers)
@@ -222,22 +224,34 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names
 ## Common Operations
-### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19)
+### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24)
 `kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
-and drops the `--authentication-config` flag**, silently disabling apiserver
+from kubeadm-config**. apiserver auth uses a structured multi-issuer
-OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get
+`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to
-401). This used to require a manual re-apply after **every** control-plane bump.
+still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade
 reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does
 NOT crash on this — verified by isolated repro; it's recoverable via the restore
 script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue —
 etcd IO starvation**, not this drift; post-mortem:
 `docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`.
-**Now automated:** the `rbac` stack publishes its OIDC restore script to the
+**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now
-`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's
+**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting
-`phase_master` re-runs it on master immediately after `kubeadm upgrade apply`
+`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of
-(while tigera-operator is still quiesced, so the flag-add apiserver restart can't
+its remote script. So kubeadm regenerates a **correct** manifest and the apiserver
-crashloop the operator). It's idempotent, health-gates `/livez` with
+upgrades with a pure image bump — `kubeadm upgrade diff <target>` shows only the
-auto-rollback, and is **non-fatal** — a failure only lags SSO until the next rbac
+image change. Zero live impact (the CM is read only during an upgrade).
-apply (the version upgrade itself already succeeded). So a chain-driven
+
-control-plane bump no longer breaks SSO. The master phase self-skips when master
+**Backstops:**
-is already at target, so this only runs when master was actually upgraded.
+- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does
  NOT block — the drift only breaks SSO, which is recoverable) if
  `--authentication-config` would still be dropped.
 - The `rbac` stack still publishes its restore script to the
  `kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on
  master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with
  auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also*
  re-reconciles kubeadm-config. Self-skips when master is already at target.
 **Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the
 chain logged `WARN: --authentication-config absent after re-apply`:
--- a/stacks/goldmane-edge-aggregator/main.tf
+++ b/stacks/goldmane-edge-aggregator/main.tf
@@ -0,0 +1,488 @@
 # =============================================================================
 # goldmane-edge-aggregator — durable who-talks-to-whom audit trail (ADR-0014 / #58)
 # =============================================================================
 # A small Go service that streams Calico Goldmane's gRPC Flows API (mTLS) and
 # upserts the unique service-to-service edge set into Postgres, plus a daily
 # Slack digest CronJob of first-seen edges. Code lives in the standalone
 # `goldmane-edge-aggregator` repo; the authoritative deploy spec is its
 # DEPLOY.md. This stack is the infra side of that spec.
 #
 # Goldmane runs as `Service goldmane:7443` (gRPC/mTLS) in calico-system, enabled
 # via the operator CR in stacks/calico/main.tf. The durable Loki path is NOT
 # the operator CRs — this service IS the durable trail.
 #
 # Structure mirrors stacks/claude-memory (the canonical Tier-1 pattern): a
 # per-service namespace, a CNPG Postgres DB + role + Vault 7-day rotation +
 # ExternalSecret -> DATABASE_URL, the Reloader annotation, and the
 # Terragrunt-generated backend.tf/providers.tf/tiers.tf layout. The novel bit is
 # the Goldmane mTLS client cert: reused from the operator-minted, Tigera-CA-signed
 # `whisker-backend-key-pair` rather than minted via hashicorp/tls (see section 2).
 #
 # IMAGE: ghcr.io/viktorbarzin/goldmane-edge-aggregator is PRIVATE. Onboarding
 # MUST add the "goldmane-edge-aggregator" namespace to the ghcr-credentials
 # Kyverno allowlist (stacks/kyverno/modules/kyverno/ghcr-credentials.tf,
 # local.ghcr_private_namespaces) so the Kyverno-synced `ghcr-credentials` secret
 # is cloned into this namespace — otherwise the pulls 401. The imagePullSecrets
 # reference below assumes that entry exists.
 # =============================================================================
 variable "postgresql_host" { type = string }
 # Plan-time root creds for the idempotent DB-init Job (mirrors claude-memory).
 data "vault_kv_secret_v2" "secrets" {
  mount = "secret"
  name  = "goldmane-edge-aggregator"
 }
 # -----------------------------------------------------------------------------
 # 1. Namespace
 # -----------------------------------------------------------------------------
 resource "kubernetes_namespace" "goldmane_edge_aggregator" {
  metadata {
    name = "goldmane-edge-aggregator"
    labels = {
      name = "goldmane-edge-aggregator"
      # Tier 4-aux: a small off-path consumer service, like claude-memory.
      tier               = local.tiers.aux
      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
  }
 }
 # -----------------------------------------------------------------------------
 # 2. Goldmane mTLS client certificate (reused from the operator's Tigera-CA-signed key pair)
 # -----------------------------------------------------------------------------
 # The aggregator dials goldmane:7443 over mutual TLS. Goldmane requires mutual
 # TLS on :7443 and verifies that the client cert chains to the Tigera CA (the
 # same CA that issues Goldmane's serving cert). It does NOT authorize by client
 # identity, so ANY Tigera-CA-signed cert is accepted.
 # Rather than copy the Tigera CA PRIVATE KEY into TF
 # state to mint our own (a needless CA-key exposure; the hashicorp/tls provider
 # is also incompatible with this repo's global generate-providers/lockfile
 # pattern), we REUSE the operator-minted, Tigera-CA-signed client cert
 # `whisker-backend-key-pair` (calico-system). We never touch the CA key.
 # Trade-off: if the operator rotates that cert, re-apply to re-sync (hardening
 # follow-up: mint an own-identity cert in-namespace if Whisker is ever removed).
 data "kubernetes_secret" "whisker_backend" {
  metadata {
    name      = "whisker-backend-key-pair"
    namespace = "calico-system"
  }
 }
 # The CA bundle that verifies Goldmane's serving cert. It lives ONLY in
 # calico-system (verified: ConfigMap `tigera-ca-bundle`, 2 keys present —
 # `ca-bundle.crt` AND `tigera-ca-bundle.crt`, both the trusted bundle). We read
 # it and recreate it as a ConfigMap in this namespace so the pod can mount it
 # (a ConfigMap cannot be cross-namespace-mounted).
 data "kubernetes_config_map" "tigera_ca_bundle" {
  metadata {
    name      = "tigera-ca-bundle"
    namespace = "calico-system"
  }
 }
 resource "kubernetes_config_map" "tigera_ca_bundle" {
  metadata {
    name      = "tigera-ca-bundle"
    namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
  }
  # Copy the upstream bundle verbatim. We mount the `tigera-ca-bundle.crt` key
  # at /etc/tigera-ca/tigera-ca-bundle.crt so the service's default
  # CA_CERT_PATH (/etc/tigera-ca/tigera-ca-bundle.crt) resolves with no override.
  data = data.kubernetes_config_map.tigera_ca_bundle.data
 }
 # Client cert + key for mTLS to goldmane:7443, mounted at TLS_CERT_PATH /
 # TLS_KEY_PATH defaults (/etc/goldmane-client-tls/tls.crt and .../tls.key).
 # Sourced verbatim from the operator's whisker-backend client key-pair (read
 # above) — already Tigera-CA-signed, which is all Goldmane verifies. No CA key
 # is touched and no cross-namespace CA RBAC is needed.
 resource "kubernetes_secret" "goldmane_client_tls" {
  metadata {
    name      = "goldmane-client-tls"
    namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
  }
  type = "Opaque"
  data = {
    "tls.crt" = data.kubernetes_secret.whisker_backend.data["tls.crt"]
    "tls.key" = data.kubernetes_secret.whisker_backend.data["tls.key"]
  }
 }
 # -----------------------------------------------------------------------------
 # 3. Postgres: DB + role `goldmane_edges`, Vault 7-day rotation, DATABASE_URL
 # -----------------------------------------------------------------------------
 # Idempotent create of the role + DB using the CNPG root creds from Vault
 # (dbaas_root_password), exactly mirroring claude-memory's db_init Job. The
 # service creates the `edge` table itself at startup (migrations/0001_edge.sql),
 # so no migration Job is needed.
 resource "kubernetes_job" "db_init" {
  metadata {
    name      = "goldmane-edges-db-init"
    namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
  }
  spec {
    template {
      metadata {}
      spec {
        container {
          name  = "db-init"
          image = "postgres:16-alpine"
          command = [
            "sh", "-c",
            <<-EOT
              set -e
              # -d postgres: psql defaults the database name to the username;
              # the root user has no root-named database, so be explicit.
              PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_roles WHERE rolname='goldmane_edges'" | grep -q 1 || \
                PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE ROLE goldmane_edges WITH LOGIN PASSWORD '${data.vault_kv_secret_v2.secrets.data["db_password"]}'"
              PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_database WHERE datname='goldmane_edges'" | grep -q 1 || \
                PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE DATABASE goldmane_edges OWNER goldmane_edges"
              PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "GRANT ALL PRIVILEGES ON DATABASE goldmane_edges TO goldmane_edges"
              echo "Database init complete"
            EOT
          ]
        }
        restart_policy = "Never"
      }
    }
    backoff_limit = 3
  }
  wait_for_completion = true
  timeouts {
    create = "2m"
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: Kyverno injects dns_config (ndots=2); ignore it so
    # this idempotent Job isn't replaced (Jobs are immutable) on every apply.
    ignore_changes = [spec[0].template[0].spec[0].dns_config]
  }
 }
 # ExternalSecret projecting the Vault-rotated (7-day) credential into a K8s
 # Secret as DATABASE_URL. The Vault DB static role `pg-goldmane-edges` and its
 # place in the CNPG connection allowlist are added in stacks/vault/main.tf
 # (see this stack's terragrunt.hcl note). remoteRef key: static-creds/pg-goldmane-edges.
 resource "kubernetes_manifest" "db_external_secret" {
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "goldmane-edges-db-creds"
      namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
    }
    spec = {
      refreshInterval = "15m"
      secretStoreRef = {
        name = "vault-database"
        kind = "ClusterSecretStore"
      }
      target = {
        name = "goldmane-edges-db-creds"
        template = {
          data = {
            DATABASE_URL = "postgresql://goldmane_edges:{{ .password }}@${var.postgresql_host}:5432/goldmane_edges"
          }
        }
      }
      data = [{
        secretKey = "password"
        remoteRef = {
          key      = "static-creds/pg-goldmane-edges"
          property = "password"
        }
      }]
    }
  }
  depends_on = [kubernetes_namespace.goldmane_edge_aggregator]
 }
 # -----------------------------------------------------------------------------
 # 4. Slack webhook (reuse the alert-digest incoming webhook)
 # -----------------------------------------------------------------------------
 # The monitoring alert-digest CronJob posts with the Slack incoming webhook at
 # Vault secret/monitoring -> key `alertmanager_slack_api_url`
 # (stacks/monitoring/modules/monitoring/alert_digest.tf). Project that same URL
 # into this namespace as SLACK_WEBHOOK_URL via an ExternalSecret (no new
 # webhook). The digest CronJob defaults to #security.
 resource "kubernetes_manifest" "slack_external_secret" {
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "goldmane-edges-slack"
      namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
    }
    spec = {
      refreshInterval = "1h"
      secretStoreRef = {
        name = "vault-kv"
        kind = "ClusterSecretStore"
      }
      target = {
        name = "goldmane-edges-slack"
      }
      data = [{
        secretKey = "SLACK_WEBHOOK_URL"
        remoteRef = {
          key      = "viktor"
          property = "alertmanager_slack_api_url"
        }
      }]
    }
  }
  depends_on = [kubernetes_namespace.goldmane_edge_aggregator]
 }
 # -----------------------------------------------------------------------------
 # 5. aggregate — Deployment (long-running gRPC stream -> Postgres upserts)
 # -----------------------------------------------------------------------------
 resource "kubernetes_deployment" "aggregate" {
  depends_on = [
    kubernetes_job.db_init,
    kubernetes_manifest.db_external_secret,
  ]
  metadata {
    name      = "goldmane-edge-aggregator"
    namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
    labels = {
      app  = "goldmane-edge-aggregator"
      tier = local.tiers.aux
    }
    annotations = {
      # Credential is env-injected and read only at startup; the 7-day rotation
      # must bounce the pod or it keeps the stale password and silently fails
      # DB auth (infra CLAUDE.md Reloader rule).
      "secret.reloader.stakater.com/reload" = "goldmane-edges-db-creds"
    }
  }
  spec {
    # 1 replica: the edge set is a global upsert keyed on (src_ns, dst_ns,
    # action); a second replica only doubles writes for no benefit (Goldmane
    # streams per-flow). Stateless (no PVC) so RollingUpdate is fine.
    replicas = 1
    selector {
      match_labels = {
        app = "goldmane-edge-aggregator"
      }
    }
    template {
      metadata {
        labels = {
          app = "goldmane-edge-aggregator"
        }
      }
      spec {
        # PRIVATE ghcr image — cloned into this namespace by the Kyverno
        # sync-ghcr-credentials allowlist policy (add this ns to that list).
        image_pull_secrets {
          name = "ghcr-credentials"
        }
        container {
          name = "aggregate"
          # CI (GHA -> ghcr) overwrites this to :<sha8> via `kubectl set image`;
          # the image tag is in ignore_changes below so the SHA sticks across
          # `terragrunt apply` (fleet image-pin convention). Placeholder :latest
          # until the deploy pipeline runs.
          image = "ghcr.io/viktorbarzin/goldmane-edge-aggregator:latest"
          args  = ["aggregate"]
          # Goldmane mTLS. GOLDMANE_HOST default host sans port =>
          # ServerName "goldmane.calico-system.svc.cluster.local", which is a SAN
          # on the live Goldmane serving cert (verified 2026-06-24:
          # DNS:goldmane{,.calico-system{,.svc{,.cluster.local}}}). So no
          # GOLDMANE_SERVER_NAME override and no GOLDMANE_TLS_INSECURE needed.
          env {
            name  = "GOLDMANE_HOST"
            value = "goldmane.calico-system.svc.cluster.local:7443"
          }
          # TLS_CERT_PATH / TLS_KEY_PATH / CA_CERT_PATH are left at their image
          # defaults (/etc/goldmane-client-tls/tls.{crt,key} and
          # /etc/tigera-ca/tigera-ca-bundle.crt) — the mounts below match them.
          env {
            name = "DATABASE_URL"
            value_from {
              secret_key_ref {
                name = "goldmane-edges-db-creds"
                key  = "DATABASE_URL"
              }
            }
          }
          volume_mount {
            name       = "goldmane-client-tls"
            mount_path = "/etc/goldmane-client-tls"
            read_only  = true
          }
          volume_mount {
            name       = "tigera-ca"
            mount_path = "/etc/tigera-ca"
            read_only  = true
          }
          resources {
            # Idles low: a single gRPC stream + periodic upserts. requests=limits
            # per the repo memory rule; no CPU limit (CFS throttling). Right-size
            # later with krr.
            requests = {
              cpu    = "10m"
              memory = "64Mi"
            }
            limits = {
              memory = "64Mi"
            }
          }
        }
        volume {
          name = "goldmane-client-tls"
          secret {
            secret_name = kubernetes_secret.goldmane_client_tls.metadata[0].name
          }
        }
        volume {
          name = "tigera-ca"
          config_map {
            name = kubernetes_config_map.tigera_ca_bundle.metadata[0].name
          }
        }
      }
    }
  }
  lifecycle {
    ignore_changes = [
      # CI pipeline owns the image tag (kubectl set image from GHA/Woodpecker).
      spec[0].template[0].spec[0].container[0].image,
      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
      metadata[0].annotations["keel.sh/policy"],
      metadata[0].annotations["keel.sh/trigger"],
      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
      metadata[0].annotations["keel.sh/match-tag"],
      metadata[0].annotations["kubernetes.io/change-cause"],
      metadata[0].annotations["deployment.kubernetes.io/revision"],
      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
    ]
  }
 }
 # -----------------------------------------------------------------------------
 # 6. digest — daily CronJob (first-seen edges -> Slack)
 # -----------------------------------------------------------------------------
 resource "kubernetes_cron_job_v1" "digest" {
  depends_on = [
    kubernetes_job.db_init,
    kubernetes_manifest.db_external_secret,
    kubernetes_manifest.slack_external_secret,
  ]
  metadata {
    name      = "goldmane-edges-digest"
    namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
    labels = {
      app  = "goldmane-edge-aggregator"
      tier = local.tiers.aux
    }
  }
  spec {
    # Daily 08:00 Europe/London — aligns with the alert-digest cadence.
    schedule                      = "0 8 * * *"
    timezone                      = "Europe/London"
    concurrency_policy            = "Forbid"
    successful_jobs_history_limit = 3
    failed_jobs_history_limit     = 3
    starting_deadline_seconds     = 600
    job_template {
      metadata {
        labels = {
          app = "goldmane-edge-aggregator"
        }
        annotations = {
          # 7-day DB rotation: bounce the Job pod's stale env (Reloader rule).
          "secret.reloader.stakater.com/reload" = "goldmane-edges-db-creds"
        }
      }
      spec {
        backoff_limit              = 2
        active_deadline_seconds    = 300
        ttl_seconds_after_finished = 86400
        template {
          metadata {
            labels = {
              app = "goldmane-edge-aggregator"
            }
          }
          spec {
            restart_policy = "OnFailure"
            image_pull_secrets {
              name = "ghcr-credentials"
            }
            container {
              name = "digest"
              # CronJobs track :latest + imagePullPolicy: Always (fleet
              # convention) so the daily run picks up the current image.
              image             = "ghcr.io/viktorbarzin/goldmane-edge-aggregator:latest"
              image_pull_policy = "Always"
              args              = ["digest"]
              env {
                name = "DATABASE_URL"
                value_from {
                  secret_key_ref {
                    name = "goldmane-edges-db-creds"
                    key  = "DATABASE_URL"
                  }
                }
              }
              env {
                name = "SLACK_WEBHOOK_URL"
                value_from {
                  secret_key_ref {
                    name = "goldmane-edges-slack"
                    key  = "SLACK_WEBHOOK_URL"
                  }
                }
              }
              env {
                name  = "SLACK_CHANNEL"
                value = "#security"
              }
              resources {
                requests = {
                  cpu    = "10m"
                  memory = "64Mi"
                }
                limits = {
                  memory = "64Mi"
                }
              }
            }
          }
        }
      }
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1 (CronJob path): Kyverno mutates dns_config with ndots=2.
    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
  }
 }
 # -----------------------------------------------------------------------------
 # 7. Egress (default-deny consideration)
 # -----------------------------------------------------------------------------
 # Goldmane's own NetworkPolicy already allows INGRESS on 7443 from anywhere, so
 # nothing is needed on the Goldmane side. No egress policy is declared here:
 # this namespace is default-allow egress today. IF/WHEN it is brought under the
 # wave-1 default-deny egress enforcement (per-namespace allowlists), add
 # (Global)NetworkPolicy egress rules permitting:
 #   - goldmane.calico-system.svc.cluster.local:7443 (the flow stream)
 #   - pg-cluster-rw.dbaas.svc.cluster.local:5432    (Postgres)
 #   - hooks.slack.com:443                            (digest -> Slack, internet)
 #   - kube-dns / CoreDNS :53                         (DNS, every namespace)
--- a/stacks/goldmane-edge-aggregator/terragrunt.hcl
+++ b/stacks/goldmane-edge-aggregator/terragrunt.hcl
@@ -0,0 +1,24 @@
 include "root" {
  path = find_in_parent_folders()
 }
 # Tier-1 stack (PG state backend). The root terragrunt.hcl generates backend.tf
 # (pg backend, schema_name = "goldmane-edge-aggregator"), providers.tf,
 # cloudflare_provider.tf and tiers.tf automatically — do NOT hand-write those.
 # This stack adds the hashicorp/tls provider via a local versions.tf (merged
 # into the generated required_providers).
 dependency "platform" {
  config_path  = "../platform"
  skip_outputs = true
 }
 dependency "vault" {
  config_path  = "../vault"
  skip_outputs = true
 }
 # The Vault DB static role pg-goldmane-edges (7-day rotation) and the CNPG
 # connection allowlist entry live in the vault stack (stacks/vault/main.tf).
 # The vault dependency above orders this stack after it so the ExternalSecret
 # can materialize the rotated credential on first apply.
--- a/stacks/instagram-poster/modules/instagram-poster/main.tf
+++ b/stacks/instagram-poster/modules/instagram-poster/main.tf
@@ -35,6 +35,14 @@ resource "kubernetes_namespace" "instagram_poster" {
 #     - immich_tag_instagram      (optional — auto-resolved if missing)
 #     - immich_tag_posted         (optional — auto-resolved if missing)
 resource "kubernetes_manifest" "external_secret" {
  # The external-secrets controller takes server-side-apply ownership of
  # .spec.refreshInterval, so a plain TF apply conflicts. force_conflicts lets
  # TF win (values match, so it's stable) — same pattern as grafana/woodpecker/
  # traefik/k8s-version-upgrade. Surfaced 2026-06-24 by the first IG apply since
  # the ESO v1 migration (the scale-to-0 push).
  field_manager {
    force_conflicts = true
  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@@ -139,6 +147,11 @@ resource "kubernetes_manifest" "external_secret" {
 # ESO refreshes the K8s Secret every 15m. `reloader.stakater.com/match`
 # bounces the pod when the password changes.
 resource "kubernetes_manifest" "benchmark_db_external_secret" {
  # See external_secret above — ESO owns .spec.refreshInterval; force_conflicts
  # lets the TF apply win instead of erroring on the field-manager conflict.
  field_manager {
    force_conflicts = true
  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@@ -227,7 +240,11 @@ resource "kubernetes_deployment" "instagram_poster" {
  }
  spec {
-    replicas = 1
+    # Scaled to 0 (2026-06-24): Instagram Graph integration is unused and its
+    # ExternalSecret is dead (missing ig_graph_long_lived_token /
+    # ig_business_account_id in Vault secret/instagram-poster). Set back to 1
+    # after minting a Meta long-lived token and populating those keys.
+    replicas = 0
    # RWO PVC — cannot rolling-update.
    strategy {
      type = "Recreate"
--- a/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
+++ b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
@@ -416,6 +416,39 @@ phase_preflight() {
    fi
  fi
  # 4b. apiserver-OIDC drift check (backstop for the rbac stack's kubeadm-config
  # reconciliation). A `kubeadm upgrade` REGENERATES the apiserver manifest from
  # kubeadm-config; if kubeadm-config still carries the legacy single-issuer
  # --oidc-* args instead of --authentication-config, the regenerated apiserver
  # loses structured multi-issuer auth → kubectl + dashboard SSO break AFTER the
  # upgrade. This is RECOVERABLE (the apiserver does NOT crash — verified by an
  # isolated repro 2026-06-24; the chain's post-master restore.sh re-adds the flag,
  # and the rbac stack reconciles kubeadm-config so it won't recur) — so this is an
  # ALERT, not a block. (NB the 2026-06-24 stall was NOT this — it was etcd IO
  # starvation; see docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md.)
  # Skip on an at-target master (resume — no apiserver regen).
  if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
    local apiserver_diff
    apiserver_diff=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" "sudo kubeadm upgrade diff v$TARGET_VERSION 2>/dev/null" || true)
    if echo "$apiserver_diff" | grep -qE '^-[[:space:]].*--authentication-config'; then
      slack "WARN preflight — kubeadm upgrade will DROP --authentication-config (kubeadm-config OIDC drift). SSO breaks post-upgrade until restore.sh re-adds it; re-apply the rbac stack to reconcile kubeadm-config. Proceeding (recoverable, not a crash)."
    fi
  fi
  # 4c. Reclaim kubeadm scratch on master. `kubeadm upgrade apply` dumps a full
  # ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
  # every etcd upgrade and NEVER cleans it up — 145 dirs / 28GB had accumulated by
  # 2026-06-24, pushing master root fs to 73% (image-GC churn + extra write IO on
  # the shared HDD where etcd lives — a contributor to the etcd IO starvation that
  # stalled that run, see post-mortem). Real etcd backups go to NFS, so these are
  # throwaway. Prune ones >3 days old (keeps a short rollback window). Best-effort;
  # never aborts the chain.
  if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
    ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" \
      "sudo find /etc/kubernetes/tmp -maxdepth 1 -type d \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) -mtime +3 -exec rm -rf {} + 2>/dev/null; echo -n 'master root after prune: '; df -h / | awk 'NR==2{print \$5\" used, \"\$4\" free\"}'" \
      || echo "kubeadm-scratch prune skipped (ssh/df failed) — non-fatal"
  fi
  # 5. Push in-flight + started_timestamp metrics + ns annotations
  $KUBECTL annotate ns "$NS" \
    "viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \
--- a/stacks/kyverno/modules/kyverno/ghcr-credentials.tf
+++ b/stacks/kyverno/modules/kyverno/ghcr-credentials.tf
@@ -31,6 +31,9 @@ locals {
    # "no local builds"). ghcr.io/viktorbarzin/k8s-portal:latest is PRIVATE
    # (infra repo default); the deployment references the cloned secret.
    "k8s-portal",
    # goldmane-edge-aggregator: PRIVATE ghcr image pulled by the aggregate
    # Deployment + digest CronJob (ADR-0014, infra#58).
    "goldmane-edge-aggregator",
  ]
 }
--- a/stacks/monitoring/modules/monitoring/grafana.tf
+++ b/stacks/monitoring/modules/monitoring/grafana.tf
@@ -71,6 +71,15 @@ resource "kubernetes_persistent_volume" "alertmanager_pv" {
 # DB credentials from Vault database engine (rotated automatically)
 # Provides GF_DATABASE_PASSWORD that auto-updates when password rotates
 resource "kubernetes_manifest" "grafana_db_creds" {
  # The external-secrets controller takes server-side-apply ownership of
  # .spec.refreshInterval, so a plain TF apply conflicts ("conflict with
  # external-secrets ... .spec.refreshInterval"). force_conflicts lets TF win
  # (values match, so it's stable) — same pattern as the woodpecker/traefik/
  # k8s-version-upgrade stacks. Surfaced 2026-06-24: the first monitoring apply
  # in a while exposed this latent conflict (prior pushes were docs-only).
  field_manager {
    force_conflicts = true
  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/rbac/modules/rbac/apiserver-oidc.tf
+++ b/stacks/rbac/modules/rbac/apiserver-oidc.tf
@@ -10,16 +10,29 @@
 # match the existing RBAC subjects (kind: User, name: <raw email>; group names
 # verbatim). Do NOT add a prefix or existing bindings break.
 #
-# DRIFT WARNING: this edits the kube-apiserver static-pod manifest on the single
-# master. A `kubeadm upgrade` regenerates that manifest and DROPS this flag (this
-# is exactly how OIDC silently broke before — the flag was wiped and the
-# content-hash trigger never re-fired). After any k8s control-plane upgrade,
-# re-apply the rbac stack to restore apiserver OIDC. See
-# docs/plans/2026-06-04-k8s-dashboard-sso-design.md.
+# DRIFT WARNING (and how it's now handled): apiserver auth lives in THREE places
+# that must stay in sync, because a `kubeadm upgrade` REGENERATES the static-pod
+# manifest from kubeadm-config:
+#   1. /etc/kubernetes/pki/auth-config.yaml         — the structured authn file
+#   2. the live kube-apiserver static-pod manifest  — references it via the flag
+#   3. the kubeadm-config ClusterConfiguration CM   — what kubeadm regenerates from
+# Originally only (1)+(2) were managed, so every kubeadm upgrade rewrote the
+# manifest from the STALE CM, reverting --authentication-config to single-issuer
+# --oidc-* flags. The consequence is SSO breakage AFTER the upgrade: kubectl +
+# dashboard lose multi-issuer auth (the apiserver does NOT crash on this — verified
+# by an isolated repro 2026-06-24; the 2026-06-24 v1.35 upgrade *stall* was a
+# separate etcd IO-starvation issue, see
+# docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md). The
+# remote script below now ALSO reconciles (3) via `kubeadm init phase
+# upload-config`, so a future kubeadm upgrade regenerates a CORRECT manifest. The
+# k8s-version-upgrade chain additionally ALERTS (does not block — SSO drift is
+# recoverable) via `kubeadm upgrade diff` in preflight if --authentication-config
+# would still be dropped.
 # SAFETY: the remote script health-gates on /livez and AUTO-ROLLS-BACK the
 # manifest from a timestamped backup if the apiserver does not recover, so a
-# malformed config cannot leave the single master down.
+# malformed config cannot leave the single master down. Reconciling kubeadm-config
+# is zero-impact on the running cluster (the CM is only read during an upgrade).
 variable "k8s_master_host" {
  type    = string
@@ -97,6 +110,40 @@ locals {
    print('flag-inserted' if done else 'ANCHOR-NOT-FOUND')
  PY
  # Reconciles the kubeadm-config ClusterConfiguration's apiServer.extraArgs:
  # drops the stale single-issuer --oidc-* args and ensures --authentication-config
  # is present (anchored after --authorization-mode). Stdlib-only (the master is
  # only guaranteed python3, not pyyaml/yq). Idempotent; preserves all other
  # fields (etcd args, audit args, extraVolumes) verbatim. Exits 3 if the
  # authorization-mode anchor is missing (fail loud, leave the CM untouched).
  kubeadm_oidc_reconcile_py = <<-PY
    import sys
    lines = sys.stdin.read().split('\n')
    out, i, n = [], 0, len(lines)
    have_authn = any('name: authentication-config' in l for l in lines)
    inserted = have_authn
    while i < n:
        ln = lines[i]; s = ln.strip()
        if s.startswith('- name: oidc-'):
            i += 2 if (i + 1 < n and lines[i + 1].strip().startswith('value:')) else 1
            continue
        out.append(ln)
        if (not inserted) and s == '- name: authorization-mode':
            indent = ln[:len(ln) - len(ln.lstrip())]
            if i + 1 < n and lines[i + 1].strip().startswith('value:'):
                out.append(lines[i + 1]); i += 2
            else:
                i += 1
            out.append(indent + '- name: authentication-config')
            out.append(indent + '  value: /etc/kubernetes/pki/auth-config.yaml')
            inserted = True
            continue
        i += 1
    if not inserted:
        sys.stderr.write('ANCHOR-NOT-FOUND: authorization-mode\n'); sys.exit(3)
    sys.stdout.write('\n'.join(out))
  PY
  # Whole remote operation, base64-embedded for byte-exact transfer (no
  # heredoc/escaping hazards across SSH).
  apiserver_auth_remote_script = <<-SH
@@ -137,6 +184,30 @@ locals {
      echo "rolled back to previous manifest"; exit 1
    fi
    echo "kube-apiserver healthy with multi-issuer --authentication-config"
    # 5. Reconcile kubeadm-config so a FUTURE `kubeadm upgrade` regenerates the
    #    apiserver manifest WITH --authentication-config instead of reverting to
    #    the stale single-issuer --oidc-* flags. Without this, kubeadm rewrote the
    #    manifest from kubeadm-config on every control-plane upgrade and kubectl +
    #    dashboard SSO broke after the upgrade (the apiserver does NOT crash on
    #    this; the 2026-06-24 v1.35 upgrade stall was etcd IO starvation).
    #    Zero live impact (the CM is only read at upgrade time); idempotent;
    #    best-effort (the chain's `kubeadm upgrade diff` preflight alert is the
    #    backstop if this cannot run).
    KC="sudo kubectl --kubeconfig /etc/kubernetes/admin.conf"
    CC=$($KC -n kube-system get cm kubeadm-config -o jsonpath='{.data.ClusterConfiguration}' 2>/dev/null || true)
    if [ -n "$CC" ] && { echo "$CC" | grep -q 'oidc-issuer-url' || ! echo "$CC" | grep -q 'authentication-config'; }; then
      echo "Reconciling kubeadm-config (oidc-* -> authentication-config) so kubeadm upgrade keeps structured auth"
      echo '${base64encode(local.kubeadm_oidc_reconcile_py)}' | base64 -d > /tmp/reconcile_kubeadm_oidc.py
      if printf '%s' "$CC" | python3 /tmp/reconcile_kubeadm_oidc.py > /tmp/kubeadm-cc-new.yaml \
         && sudo kubeadm init phase upload-config kubeadm --config /tmp/kubeadm-cc-new.yaml; then
        echo "kubeadm-config reconciled: future control-plane upgrades keep --authentication-config"
      else
        echo "WARN: kubeadm-config reconcile failed; the upgrade-chain preflight will alert (not block) before the next upgrade"
      fi
      rm -f /tmp/reconcile_kubeadm_oidc.py /tmp/kubeadm-cc-new.yaml
    else
      echo "kubeadm-config already uses --authentication-config (no oidc drift)"
    fi
  SH
 }
@@ -155,6 +226,14 @@ resource "null_resource" "apiserver_oidc_config" {
  }
  triggers = {
    # Intentionally hash ONLY the issuer config, NOT the remote script. CI applies
    # the rbac stack with no ssh_private_key (var defaults to ""), so a re-run of
    # this SSH provisioner in CI would fail — hence the null_resource must stay a
    # no-op on a plain CI apply. Script changes (e.g. the 2026-06-24 kubeadm-config
    # reconciliation) reach the cluster via the apiserver-oidc-restore ConfigMap
    # below (a plain k8s resource, no ssh) which the upgrade chain re-runs. To force
    # this provisioner to re-run after a script change, apply locally with
    # `-replace` + TF_VAR_ssh_private_key (see docs/runbooks/k8s-version-upgrade.md).
    auth_config = sha256(local.apiserver_auth_config_yaml)
  }
 }
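The `kubeadm_oidc_reconcile_py` heredoc above is a line-oriented rewrite of `apiServer.extraArgs`, and it is easier to check when wrapped as a function. The sketch below keeps the same loop but makes two labeled changes for testability: it takes and returns a string instead of using stdin/stdout, and it raises `ValueError` where the original exits with status 3. The `SAMPLE` fragment is invented for illustration (the issuer URL, client id and audit path are hypothetical), and the function name `reconcile` is not from the source.

```python
AUTH_CONFIG_PATH = "/etc/kubernetes/pki/auth-config.yaml"


def reconcile(cluster_config: str) -> str:
    """Drop `oidc-*` extraArgs and ensure `authentication-config` is present.

    Same line-based logic as the heredoc in this diff; raises ValueError
    (instead of sys.exit(3)) when the authorization-mode anchor is missing.
    """
    lines = cluster_config.split("\n")
    out, i, n = [], 0, len(lines)
    # If the flag already exists, never insert a second copy.
    inserted = any("name: authentication-config" in line for line in lines)
    while i < n:
        ln = lines[i]
        s = ln.strip()
        if s.startswith("- name: oidc-"):
            # Skip the name line and its value line, if one follows.
            has_value = i + 1 < n and lines[i + 1].strip().startswith("value:")
            i += 2 if has_value else 1
            continue
        out.append(ln)
        if not inserted and s == "- name: authorization-mode":
            indent = ln[: len(ln) - len(ln.lstrip())]
            if i + 1 < n and lines[i + 1].strip().startswith("value:"):
                out.append(lines[i + 1])
                i += 2
            else:
                i += 1
            out.append(indent + "- name: authentication-config")
            out.append(indent + "  value: " + AUTH_CONFIG_PATH)
            inserted = True
            continue
        i += 1
    if not inserted:
        raise ValueError("ANCHOR-NOT-FOUND: authorization-mode")
    return "\n".join(out)


# Hypothetical ClusterConfiguration fragment, for illustration only.
SAMPLE = """\
apiServer:
  extraArgs:
  - name: authorization-mode
    value: Node,RBAC
  - name: oidc-issuer-url
    value: https://issuer.example.invalid/
  - name: oidc-client-id
    value: kubernetes
  - name: audit-log-path
    value: /var/log/kubernetes/audit.log
kind: ClusterConfiguration
"""

if __name__ == "__main__":
    print(reconcile(SAMPLE))
```

Running the function a second time on its own output should return the input unchanged, which is the idempotence the comment above the heredoc claims.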
--- a/stacks/vault/main.tf
+++ b/stacks/vault/main.tf
@@ -674,6 +674,7 @@ resource "vault_database_secret_backend_connection" "postgresql" {
    "pg-recruiter-responder", "pg-tripit",
    "pg-nextcloud-todos",
    "pg-technitium",
    "pg-goldmane-edges",
  ]
  postgresql {
@@ -891,6 +892,17 @@ resource "vault_database_secret_backend_static_role" "pg_technitium" {
  rotation_period = 604800
 }
 # goldmane-edge-aggregator (ADR-0014 / infra #58) — 7-day rotation for the
 # goldmane_edges CNPG role. Consumed by stacks/goldmane-edge-aggregator via a
 # vault-database ExternalSecret -> DATABASE_URL (remoteRef static-creds/pg-goldmane-edges).
 resource "vault_database_secret_backend_static_role" "pg_goldmane_edges" {
  backend         = vault_mount.database.path
  db_name         = vault_database_secret_backend_connection.postgresql.name
  name            = "pg-goldmane-edges"
  username        = "goldmane_edges"
  rotation_period = 604800
 }
 # =============================================================================
 # Kubernetes Secrets Engine — Dynamic K8s Credentials
 # =============================================================================
--- a/state/stacks/vault/terraform.tfstate.enc
+++ b/state/stacks/vault/terraform.tfstate.enc
Author	SHA1	Message	Date
Viktor Barzin	9c68d147e0	k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC) Some checks failed ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline failed Details Digging into "why did the apiserver crash" disproved the earlier OIDC explanation. An isolated v1.35.6 apiserver repro with authentik reachable initialises OIDC cleanly (oidc.go:313, no error) and runs fine — so the --authentication-config -> --oidc-* revert is NOT what crashed it. etcd's surviving crash-window log is the real cause: 1180 "apply request took too long" warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the shared sdc HDD (beads code-oflt). A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before every etcd upgrade and never cleans it up — 145 dirs / 28GB had accumulated, driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live (73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate. Also corrected the OIDC handling: the kubeadm-config drift is real but only breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the chain's restore.sh + the kubeadm-config reconciliation) — it does not crash the apiserver. So the preflight check is now an ALERT, not a block (was added on the wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected. Per Viktor: reclaim the disk and automate so the manual cleanup never recurs; the durable IO fix remains code-oflt (etcd off the shared HDD). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 15:23:15 +00:00
Viktor Barzin	60a1cb9a25	k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade All checks were successful ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details Last night's autonomous 1.34->1.35 run reached the master control-plane phase for the first time (preflight passed, etcd snapshot taken, etcd upgraded), then the kube-apiserver upgrade to v1.35.6 crash-looped and kubeadm auto-rolled-back to 1.34.9. The cluster stayed healthy but the master was left cordoned and the chain wedged on in_flight. Root cause: kubeadm upgrade regenerates the apiserver static-pod manifest from the kubeadm-config ConfigMap. apiserver auth was switched on 2026-06-19 to a structured multi-issuer --authentication-config (kubectl + dashboard SSO), but kubeadm-config still carried the legacy single-issuer --oidc-* extraArgs, so the regenerated manifest reverted structured auth and the new apiserver crash-looped. Proven via `kubeadm upgrade diff`. The existing post-upgrade OIDC restore step never ran because the upgrade itself never succeeded. Fix: - rbac/apiserver-oidc.tf: the remote script now also reconciles kubeadm-config (kubeadm init phase upload-config: drop --oidc-*, add --authentication-config) so a future kubeadm upgrade regenerates a correct manifest. Delivered to the cluster via the apiserver-oidc-restore ConfigMap the chain re-runs (CI needs no ssh key); trigger deliberately not script-hashed since CI cannot ssh. - k8s-version-upgrade/upgrade-step.sh: new preflight gate runs `kubeadm upgrade diff` and BLOCKS+alerts (never drains the master) if --authentication-config would still be dropped. - Post-mortem + runbook updated. The live kubeadm-config was reconciled directly on the master and verified (`kubeadm upgrade diff` now shows only the control-plane image bump), so tonight's run can complete the 1.34->1.35 upgrade. 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 14:16:04 +00:00
Viktor Barzin	c6bba1da6e	home-assistant skill: refresh ha-london map (HAOS 2026.5.2, Cowboy revived, Overview redesign) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked to redesign the ha-london dashboards and fix the broken integrations (the Cowboy one). The skill's ha-london knowledge map had drifted badly from reality, so this brings it current: it claimed HA 2025.9.1 on a docker-run container (it's HAOS 2026.5.2, managed); listed the now-dead jdejaegh Cowboy integration with sensor.bike_* entities (revived via elsbrock/cowboy-ha v1.2.0 -> entities are sensor.classic_performance_*); and didn't flag that met/metoffice/roomba/hildebrandglow are user-disabled (not broken) or that Tapo P100 is failing. Also documents the redesigned Overview (Home+More sections, Mushroom+mini-graph-card), the dashboard/view/card glossary, the parked-bike 'unknown battery' gotcha, and that london is API-only from the Sofia devvm (no SSH). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 22:03:15 +00:00
Viktor Barzin	b858561bd0	Merge remote-tracking branch 'origin/master' Some checks failed ci/woodpecker/push/default Pipeline failed Details	2026-06-24 20:59:39 +00:00
Viktor Barzin	a7704f46a6	deploy goldmane-edge-aggregator: durable who-talks-to-whom edge trail (#58 , ADR-0014) Infra side of ADR-0014: an mTLS gRPC consumer of Calico Goldmane's Flows API that records the namespace-pair edge-set in CNPG and posts a daily new-edge digest to #security. Adds the goldmane-edge-aggregator stack, the pg-goldmane-edges Vault rotation role (Tier-0 vault state updated here), and the namespace in the ghcr-credentials allowlist. Cert: REUSES the operator-minted, Tigera-CA-signed whisker-backend client cert (Goldmane verifies only the CA chain, not identity) instead of minting from the Tigera CA private key. This avoids putting the CA key in TF state AND the hashicorp/tls provider, which is incompatible with this repo's global generate-providers/lockfile pattern (it broke every stack's lockfile). Verified live: aggregator streaming flows, 174 edges in Postgres across 50x54 namespaces, db+slack ExternalSecrets synced, digest dry-run formats correctly, private image pulls via the Kyverno-synced ghcr-credentials. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:59:39 +00:00
Viktor Barzin	aa510e3600	instagram-poster: force_conflicts on ESO manifests (fix apply) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The ESO v1 migration (2026-06-22) made the external-secrets controller own .spec.refreshInterval via server-side apply, so terraform apply of the two ExternalSecret manifests fails with a field-manager conflict (Woodpecker #348), which blocked the replicas=0 scale-down from landing. Add force_conflicts=true to both, matching the grafana/woodpecker/traefik fix applied to other stacks the same day. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:49:53 +00:00
Viktor Barzin	53834deb24	instagram-poster: scale to 0 (unused, dead ExternalSecret) Some checks failed ci/woodpecker/push/default Pipeline failed Details Viktor confirmed the Instagram Graph poster isn't used. Its ExternalSecret has been dead on missing Vault keys (ig_graph_long_lived_token, ig_business_account_id), so the deployment sat at 0/1 firing DeploymentReplicasMismatch. Setting replicas=0 stops the alert and makes the scale-down durable (a bare kubectl scale reverts on the next stack apply). Re-set to 1 after minting a Meta long-lived token + populating the Vault keys. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:45:30 +00:00
Viktor Barzin	8dd9a3978d	Merge remote-tracking branch 'forgejo/master' into wizard/homelab-vault All checks were successful ci/woodpecker/push/default Pipeline was successful Details	2026-06-24 12:25:52 +00:00
Viktor Barzin	65b2df1222	fix(monitoring): force_conflicts on grafana_db_creds ExternalSecret The external-secrets controller owns .spec.refreshInterval via SSA, so a plain terraform apply of the monitoring stack conflicts. Latent until 2026-06-24 (the homelab-vault loki-rules change was the first monitoring apply in a while and surfaced it). force_conflicts lets TF win — same pattern as woodpecker/traefik/ k8s-version-upgrade stacks.	2026-06-24 12:25:36 +00:00