k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC)

Digging into "why did the apiserver crash" disproved the earlier OIDC explanation. An isolated v1.35.6 apiserver repro with authentik reachable initialises OIDC cleanly (oidc.go:313, no error) and runs fine, so the --authentication-config -> --oidc-* revert is NOT what crashed it.

etcd's surviving crash-window log is the real cause: 1180 "apply request took too long" warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the shared sdc HDD (beads code-oflt).

A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before every etcd upgrade and never cleans it up. 145 dirs / 28GB had accumulated, driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live (73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate.

Also corrected the OIDC handling: the kubeadm-config drift is real but only breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the chain's restore.sh + the kubeadm-config reconciliation); it does not crash the apiserver. So the preflight check is now an ALERT, not a block (was added on the wrong hypothesis).

Post-mortem, runbook, and apiserver-oidc.tf header corrected. Per Viktor: reclaim the disk and automate so the manual cleanup never recurs; the durable IO fix remains code-oflt (etcd off the shared HDD).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
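
Note: the slow-apply figure quoted in this message can be re-counted from a saved copy of the etcd container log. The following is a minimal sketch, not the procedure used during the investigation. It assumes the log has already been copied to a local file; the function name `count_slow_applies` is hypothetical, and only the message string comes from the log quoted above.

```shell
# Count etcd slow-apply warnings in a saved log file.
# Usage: count_slow_applies <logfile>   (function name is illustrative)
count_slow_applies() {
  # -c prints the number of matching lines, -F matches the literal string.
  # grep exits 1 when the count is 0, so "|| true" keeps the function
  # usable in scripts that run under "set -e".
  grep -cF 'apply request took too long' "$1" || true
}
```

For example, `count_slow_applies ./etcd-0.log` (path hypothetical) prints a single integer. Comparing that number across a quiet window and an upgrade window gives a quick read on whether etcd was struggling before the apiserver came up.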
k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade
2026-06-25 15:23:15 +00:00 · 2026-06-25 14:16:04 +00:00 · 2026-06-24 22:03:15 +00:00 · 2026-06-24 20:59:39 +00:00 · 2026-06-24 20:59:39 +00:00 · 2026-06-24 20:49:53 +00:00
9 changed files with 2626 additions and 2392 deletions
--- a/.claude/skills/home-assistant/SKILL.md
+++ b/.claude/skills/home-assistant/SKILL.md
@@ -11,8 +11,8 @@ description: |
  There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
  Always use Home Assistant for smart home control.
 author: Claude Code
-version: 2.0.0
-date: 2026-02-07
+version: 2.1.0
+date: 2026-06-24
 ---

 # Home Assistant Control
@@ -395,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
 ## ha-london Knowledge Map

 ### Overview
-- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
+- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
 - **Location**: London, UK
-- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
-- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
-- **Config path**: `/config/` (requires `sudo` for file access)
+- **Platform**: Raspberry Pi 4, HA OS
+- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
+- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
+- **Config path**: `/config/`
 - **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
 - **Zone**: London (home)

+### Dashboards (redesigned 2026-06-24)
+**Glossary** (HA terms — keep distinct):
+- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
+- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
+- **Card** = a widget inside a view.
+
+- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
+  - **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
+  - **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
+- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.)
+- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
+
 ### Key Systems

 #### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@@ -424,10 +437,15 @@ Named plugs with power/energy tracking:
 - PM1.0/2.5/4.0/10 particulate sensors
 - VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors

-#### 3. Cowboy E-Bike
-- `sensor.bike_state_of_charge`: Battery %
-- `sensor.bike_total_distance`: Total km
-- `sensor.bike_total_co2_saved`: CO2 saved (grams)
+#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
+Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
+- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
+- `sensor.classic_performance_remaining_range`: Range km
+- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
+- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
+- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
+- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
+- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).

 #### 4. Uptime Monitoring (UptimeRobot)
 - `sensor.blog`: blog uptime
@@ -446,12 +464,17 @@ Named plugs with power/energy tracking:
 - Scripts: `script.start_netflix`, `script.start_stremio`
 - Scene: `scene.night` (turns off Livia + Michelle plugs)

-### Custom Components
-- **cowboy**: Cowboy e-bike integration (HACS)
-- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
+### Custom Components (HACS integrations)
+- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
+- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
+
+### HACS frontend cards (plugins)
+- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.

 ### Integrations
-ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
+ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
+- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
+- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.

 ### AI / Voice Assistants
 - 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@@ -466,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL
 - Anca arrival/departure notifications
 - Night scene: turns off Livia + Michelle

-### Docker Setup
-```bash
-docker run -d --name homeassistant --privileged \
-  -e TZ=Europe/London \
-  -v /home/pi/docker/homeAssistant:/config \
-  -v /run/dbus:/run/dbus:ro \
-  --network=host --restart=unless-stopped \
-  homeassistant/home-assistant:2025.9
-```
+### Platform (HAOS — ignore any legacy `docker run` snippet)
+ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker).

 ### SSH Access
 ```bash
--- a/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
+++ b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
@@ -0,0 +1,97 @@
+# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
+
+> Filename kept for inbound links. The originally-suspected cause (kubeadm-config
+> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
+> drift was a real *separate* latent bug fixed in the same change.
+
+**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
+the master control-plane phase for the first time — preflight passed, etcd
+snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the
+kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
+static-pod-hash window across all internal retries, then auto-rolled-back to
+v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
+the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
+No data loss; no user-facing outage (the master carries control-plane taints, so
+no workloads were displaced).
+
+**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
+first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
+static pods, i.e. the first time the upgrade pushes real write-IO at etcd.
+
+## Root cause — etcd IO starvation on the shared HDD
+
+The new kube-apiserver could not establish/keep a working connection to etcd
+during the upgrade because **etcd was IO-starved**. etcd's surviving container log
+from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows:
+
+- **1,180** `apply request took too long` warnings in 16 minutes;
+- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms),
+  clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
+  to bring the new apiserver up.
+
+An isolated 1.35.6 apiserver run against a dead etcd dies with
+`F instance.go:233 Error creating leases: error creating storage factory: context
+deadline exceeded`, the failure mode an etcd with multi-second applies produces. etcd
+lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
+shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
+that spindle:
+
+1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
+2. kubeadm dumping a full **~400MB etcd DB backup** to
+   `/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
+   etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
+   cleans them up), pushing master root fs to **73%**, above the 70% kubelet
+   image-GC threshold, so image GC churned during the drain too;
+3. master-drain pod evictions.
+
+### Correction — it was NOT the OIDC flag swap
+
+`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
+`--authentication-config` (structured multi-issuer OIDC) back to legacy
+single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
+was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
+those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
+(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test
+etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
+the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
+were also ruled out.
+
+## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
+
+apiserver auth is configured in three places that must agree:
+(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
+and `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest
+(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM —
+which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
+the manifest from (3), so it would have reverted structured auth → **dashboard +
+kubectl SSO break after a successful upgrade** (recoverable: the chain's
+post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.
+
+## Resolution
+
+1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%.
+2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps.
+3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).
+
+## Prevention (landed in this change)
+
+| Gap | Fix |
+|-----|-----|
+| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
+| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
+| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
+
+## Lessons
+
+- **Capture the failing component's own logs before concluding.** The `kubeadm
+  upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
+  applies) + an isolated apiserver repro showed the truth (etcd IO). A diff answers
+  "what config changes," not "why it crashed."
+- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
+  2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
+  backup copy + drain) onto that spindle. code-oflt is the real fix.
+- **Tools that leave per-operation scratch must be reaped.** kubeadm's
+  `/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
+  GC'd; 28GB had silently accumulated.
+- **Out-of-band control-plane edits must be written back to kubeadm-config** — else
+  `kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).
--- a/docs/runbooks/k8s-version-upgrade.md
+++ b/docs/runbooks/k8s-version-upgrade.md
@@ -41,6 +41,8 @@ Job 0 — preflight       (pinned: k8s-node1)
  ├── halt-on-alert (kured-style ignore-list)
  ├── 24h-quiet baseline (no Ready transitions <24h ago)
  ├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume)
+  ├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block)
+  ├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups)
  ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
  ├── Trigger backup-etcd Job, wait, verify snapshot byte count
  ├── SSH master: containerd skew fix (if master < workers)
@@ -222,22 +224,34 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names

 ## Common Operations

-### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19)
+### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24)

 `kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
-and drops the `--authentication-config` flag**, silently disabling apiserver
-OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get
-401). This used to require a manual re-apply after **every** control-plane bump.
+from kubeadm-config**. apiserver auth uses a structured multi-issuer
+`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config previously
+carried the legacy single-issuer `--oidc-*` extraArgs, so every upgrade
+reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does
+NOT crash on this — verified by isolated repro; it's recoverable via the restore
+script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue —
+etcd IO starvation**, not this drift; post-mortem:
+`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`.

-**Now automated:** the `rbac` stack publishes its OIDC restore script to the
-`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's
-`phase_master` re-runs it on master immediately after `kubeadm upgrade apply`
-(while tigera-operator is still quiesced, so the flag-add apiserver restart can't
-crashloop the operator). It's idempotent, health-gates `/livez` with
-auto-rollback, and is **non-fatal** — a failure only lags SSO until the next rbac
-apply (the version upgrade itself already succeeded). So a chain-driven
-control-plane bump no longer breaks SSO. The master phase self-skips when master
-is already at target, so this only runs when master was actually upgraded.
+**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now
+**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting
+`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of
+its remote script. So kubeadm regenerates a **correct** manifest and the apiserver
+upgrades with a pure image bump — `kubeadm upgrade diff <target>` shows only the
+image change. Zero live impact (the CM is read only during an upgrade).
+
+**Backstops:**
+- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does
+  NOT block — the drift only breaks SSO, which is recoverable) if
+  `--authentication-config` would still be dropped.
+- The `rbac` stack still publishes its restore script to the
+  `kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on
+  master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with
+  auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also*
+  re-reconciles kubeadm-config. Self-skips when master is already at target.

 **Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the
 chain logged `WARN: --authentication-config absent after re-apply`:
--- a/stacks/goldmane-edge-aggregator/main.tf
+++ b/stacks/goldmane-edge-aggregator/main.tf
@@ -57,16 +57,19 @@ resource "kubernetes_namespace" "goldmane_edge_aggregator" {
 # -----------------------------------------------------------------------------
 # The aggregator dials goldmane:7443 over mutual TLS. We mint a client cert
 # signed by the Tigera CA (the same CA that issues Goldmane's serving cert), so
-# Goldmane trusts the client and the client trusts Goldmane's server cert via
-# the published CA bundle.
-#
-# The Tigera CA private key lives in the `tigera-ca-private` Secret in
-# tigera-operator (Opaque; verified keys: tls.crt + tls.key). The stack's apply
-# identity needs RBAC get on that secret — see the Role/RoleBinding below.
-data "kubernetes_secret" "tigera_ca" {
+# Goldmane requires mutual TLS on :7443 and verifies the client cert chains to
+# the Tigera CA — it does NOT authorize by client identity, so ANY Tigera-CA-
+# signed cert is accepted. Rather than copy the Tigera CA PRIVATE KEY into TF
+# state to mint our own (a needless CA-key exposure; the hashicorp/tls provider
+# is also incompatible with this repo's global generate-providers/lockfile
+# pattern), we REUSE the operator-minted, Tigera-CA-signed client cert
+# `whisker-backend-key-pair` (calico-system). We never touch the CA key.
+# Trade-off: if the operator rotates that cert, re-apply to re-sync (hardening
+# follow-up: mint an own-identity cert in-namespace if Whisker is ever removed).
+data "kubernetes_secret" "whisker_backend" {
  metadata {
-    name      = "tigera-ca-private"
-    namespace = "tigera-operator"
+    name      = "whisker-backend-key-pair"
+    namespace = "calico-system"
  }
 }

@@ -93,46 +96,11 @@ resource "kubernetes_config_map" "tigera_ca_bundle" {
  data = data.kubernetes_config_map.tigera_ca_bundle.data
 }

-# Client private key.
-resource "tls_private_key" "goldmane_client" {
-  algorithm = "RSA"
-  rsa_bits  = 2048
-}
-
-# CSR for the client cert. CN identifies the client; the service-DNS SAN mirrors
-# how Felix/whisker-backend present a client identity to Goldmane.
-resource "tls_cert_request" "goldmane_client" {
-  private_key_pem = tls_private_key.goldmane_client.private_key_pem
-  subject {
-    common_name  = "goldmane-edge-aggregator"
-    organization = "goldmane-edge-aggregator"
-  }
-  dns_names = [
-    "goldmane-edge-aggregator",
-    "goldmane-edge-aggregator.goldmane-edge-aggregator.svc.cluster.local",
-  ]
-}
-
-# Sign the CSR with the Tigera CA. 10-year validity (87600h): re-apply rotates
-# it well before expiry; a long horizon avoids surprise mTLS outages from an
-# unattended stack. The Tigera CA itself outlives this (operator-managed).
-resource "tls_locally_signed_cert" "goldmane_client" {
-  cert_request_pem   = tls_cert_request.goldmane_client.cert_request_pem
-  ca_private_key_pem = data.kubernetes_secret.tigera_ca.data["tls.key"]
-  ca_cert_pem        = data.kubernetes_secret.tigera_ca.data["tls.crt"]
-
-  validity_period_hours = 87600 # 10y
-  early_renewal_hours   = 720   # re-sign on apply when <30d remain
-
-  allowed_uses = [
-    "client_auth",
-    "digital_signature",
-    "key_encipherment",
-  ]
-}
-
-# The minted client cert + key, mounted at TLS_CERT_PATH / TLS_KEY_PATH defaults
-# (/etc/goldmane-client-tls/tls.crt and .../tls.key).
+# Client cert + key for mTLS to goldmane:7443, mounted at TLS_CERT_PATH /
+# TLS_KEY_PATH defaults (/etc/goldmane-client-tls/tls.crt and .../tls.key).
+# Sourced verbatim from the operator's whisker-backend client key-pair (read
+# above) — already Tigera-CA-signed, which is all Goldmane verifies. No CA key
+# is touched and no cross-namespace CA RBAC is needed.
 resource "kubernetes_secret" "goldmane_client_tls" {
  metadata {
    name      = "goldmane-client-tls"
@@ -140,47 +108,8 @@ resource "kubernetes_secret" "goldmane_client_tls" {
  }
  type = "Opaque"
  data = {
-    "tls.crt" = tls_locally_signed_cert.goldmane_client.cert_pem
-    "tls.key" = tls_private_key.goldmane_client.private_key_pem
-  }
-}
-
-# Narrow RBAC so this stack's apply identity (and ESO/Reloader are unaffected)
-# can `get` the Tigera CA private key in tigera-operator. The data source above
-# reads it at apply time; this Role/RoleBinding documents + grants that access
-# rather than relying on cluster-admin. The subject is the same SA the other
-# Tier-1 stacks apply as (claude-agent/terraform-state for headless, the human
-# OIDC identity interactively) — both are cluster-admin today, so this is
-# belt-and-braces / least-privilege intent for when apply identities tighten.
-resource "kubernetes_role" "read_tigera_ca" {
-  metadata {
-    name      = "goldmane-edge-aggregator-read-tigera-ca"
-    namespace = "tigera-operator"
-  }
-  rule {
-    api_groups     = [""]
-    resources      = ["secrets"]
-    resource_names = ["tigera-ca-private"]
-    verbs          = ["get"]
-  }
-}
-
-resource "kubernetes_role_binding" "read_tigera_ca" {
-  metadata {
-    name      = "goldmane-edge-aggregator-read-tigera-ca"
-    namespace = "tigera-operator"
-  }
-  role_ref {
-    api_group = "rbac.authorization.k8s.io"
-    kind      = "Role"
-    name      = kubernetes_role.read_tigera_ca.metadata[0].name
-  }
-  # The headless apply identity (claude-agent-service runs Tier-1 applies as the
-  # `terraform-state` Vault K8s role in the claude-agent namespace).
-  subject {
-    kind      = "ServiceAccount"
-    name      = "default"
-    namespace = "claude-agent"
+    "tls.crt" = data.kubernetes_secret.whisker_backend.data["tls.crt"]
+    "tls.key" = data.kubernetes_secret.whisker_backend.data["tls.key"]
  }
 }

@@ -227,6 +156,11 @@ resource "kubernetes_job" "db_init" {
  timeouts {
    create = "2m"
  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: Kyverno injects dns_config (ndots=2); ignore it so
+    # this idempotent Job isn't replaced (Jobs are immutable) on every apply.
+    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+  }
 }

 # ExternalSecret projecting the Vault-rotated (7-day) credential into a K8s
@@ -295,7 +229,7 @@ resource "kubernetes_manifest" "slack_external_secret" {
      data = [{
        secretKey = "SLACK_WEBHOOK_URL"
        remoteRef = {
-          key      = "monitoring"
+          key      = "viktor"
          property = "alertmanager_slack_api_url"
        }
      }]
--- a/stacks/instagram-poster/modules/instagram-poster/main.tf
+++ b/stacks/instagram-poster/modules/instagram-poster/main.tf
@@ -35,6 +35,14 @@ resource "kubernetes_namespace" "instagram_poster" {
 #     - immich_tag_instagram      (optional — auto-resolved if missing)
 #     - immich_tag_posted         (optional — auto-resolved if missing)
 resource "kubernetes_manifest" "external_secret" {
+  # The external-secrets controller takes server-side-apply ownership of
+  # .spec.refreshInterval, so a plain TF apply conflicts. force_conflicts lets
+  # TF win (values match, so it's stable) — same pattern as grafana/woodpecker/
+  # traefik/k8s-version-upgrade. Surfaced 2026-06-24 by the first IG apply since
+  # the ESO v1 migration (the scale-to-0 push).
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@@ -139,6 +147,11 @@ resource "kubernetes_manifest" "external_secret" {
 # ESO refreshes the K8s Secret every 15m. `reloader.stakater.com/match`
 # bounces the pod when the password changes.
 resource "kubernetes_manifest" "benchmark_db_external_secret" {
+  # See external_secret above — ESO owns .spec.refreshInterval; force_conflicts
+  # lets the TF apply win instead of erroring on the field-manager conflict.
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@@ -227,7 +240,11 @@ resource "kubernetes_deployment" "instagram_poster" {
  }

  spec {
-    replicas = 1
+    # Scaled to 0 (2026-06-24): Instagram Graph integration is unused and its
+    # ExternalSecret is dead (missing ig_graph_long_lived_token /
+    # ig_business_account_id in Vault secret/instagram-poster). Set back to 1
+    # after minting a Meta long-lived token and populating those keys.
+    replicas = 0
    # RWO PVC — cannot rolling-update.
    strategy {
      type = "Recreate"
--- a/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
+++ b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
@@ -416,6 +416,39 @@ phase_preflight() {
    fi
  fi

+  # 4b. apiserver-OIDC drift check (backstop for the rbac stack's kubeadm-config
+  # reconciliation). A `kubeadm upgrade` REGENERATES the apiserver manifest from
+  # kubeadm-config; if kubeadm-config still carries the legacy single-issuer
+  # --oidc-* args instead of --authentication-config, the regenerated apiserver
+  # loses structured multi-issuer auth → kubectl + dashboard SSO break AFTER the
+  # upgrade. This is RECOVERABLE (the apiserver does NOT crash — verified by an
+  # isolated repro 2026-06-24; the chain's post-master restore.sh re-adds the flag,
+  # and the rbac stack reconciles kubeadm-config so it won't recur) — so this is an
+  # ALERT, not a block. (NB the 2026-06-24 stall was NOT this — it was etcd IO
+  # starvation; see docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md.)
+  # Skip on an at-target master (resume — no apiserver regen).
+  if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
+    local apiserver_diff
+    apiserver_diff=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" "sudo kubeadm upgrade diff v$TARGET_VERSION 2>/dev/null" || true)
+    if echo "$apiserver_diff" | grep -qE '^-[[:space:]].*--authentication-config'; then
+      slack "WARN preflight — kubeadm upgrade will DROP --authentication-config (kubeadm-config OIDC drift). SSO breaks post-upgrade until restore.sh re-adds it; re-apply the rbac stack to reconcile kubeadm-config. Proceeding (recoverable, not a crash)."
+    fi
+  fi
+
+  # 4c. Reclaim kubeadm scratch on master. `kubeadm upgrade apply` dumps a full
+  # ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
+  # every etcd upgrade and NEVER cleans it up — 145 dirs / 28GB had accumulated by
+  # 2026-06-24, pushing master root fs to 73% (image-GC churn + extra write IO on
+  # the shared HDD where etcd lives — a contributor to the etcd IO starvation that
+  # stalled that run, see post-mortem). Real etcd backups go to NFS, so these are
+  # throwaway. Prune ones >3 days old (keeps a short rollback window). Best-effort;
+  # never aborts the chain.
+  if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
+    ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" \
+      "sudo find /etc/kubernetes/tmp -maxdepth 1 -type d \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) -mtime +3 -exec rm -rf {} + 2>/dev/null; echo -n 'master root after prune: '; df -h / | awk 'NR==2{print \$5\" used, \"\$4\" free\"}'" \
+      || echo "kubeadm-scratch prune skipped (ssh/df failed) — non-fatal"
+  fi
+
  # 5. Push in-flight + started_timestamp metrics + ns annotations
  $KUBECTL annotate ns "$NS" \
    "viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \
--- a/stacks/monitoring/modules/monitoring/grafana.tf
+++ b/stacks/monitoring/modules/monitoring/grafana.tf
@@ -71,6 +71,15 @@ resource "kubernetes_persistent_volume" "alertmanager_pv" {
 # DB credentials from Vault database engine (rotated automatically)
 # Provides GF_DATABASE_PASSWORD that auto-updates when password rotates
 resource "kubernetes_manifest" "grafana_db_creds" {
+  # The external-secrets controller takes server-side-apply ownership of
+  # .spec.refreshInterval, so a plain TF apply conflicts ("conflict with
+  # external-secrets ... .spec.refreshInterval"). force_conflicts lets TF win
+  # (values match, so it's stable) — same pattern as the woodpecker/traefik/
+  # k8s-version-upgrade stacks. Surfaced 2026-06-24: the first monitoring apply
+  # in a while exposed this latent conflict (prior pushes were docs-only).
+  field_manager {
+    force_conflicts = true
+  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/rbac/modules/rbac/apiserver-oidc.tf
+++ b/stacks/rbac/modules/rbac/apiserver-oidc.tf
@@ -10,16 +10,29 @@
 # match the existing RBAC subjects (kind: User, name: <raw email>; group names
 # verbatim). Do NOT add a prefix or existing bindings break.
 #
-# DRIFT WARNING: this edits the kube-apiserver static-pod manifest on the single
-# master. A `kubeadm upgrade` regenerates that manifest and DROPS this flag (this
-# is exactly how OIDC silently broke before — the flag was wiped and the
-# content-hash trigger never re-fired). After any k8s control-plane upgrade,
-# re-apply the rbac stack to restore apiserver OIDC. See
-# docs/plans/2026-06-04-k8s-dashboard-sso-design.md.
+# DRIFT WARNING (and how it's now handled): apiserver auth lives in THREE places
+# that must stay in sync, because a `kubeadm upgrade` REGENERATES the static-pod
+# manifest from kubeadm-config:
+#   1. /etc/kubernetes/pki/auth-config.yaml         — the structured authn file
+#   2. the live kube-apiserver static-pod manifest  — references it via the flag
+#   3. the kubeadm-config ClusterConfiguration CM   — what kubeadm regenerates from
+# Originally only (1)+(2) were managed, so every kubeadm upgrade rewrote the
+# manifest from the STALE CM, reverting --authentication-config to single-issuer
+# --oidc-* flags. The consequence is SSO breakage AFTER the upgrade: kubectl +
+# dashboard lose multi-issuer auth (the apiserver does NOT crash on this — verified
+# by an isolated repro 2026-06-24; the 2026-06-24 v1.35 upgrade *stall* was a
+# separate etcd IO-starvation issue, see
+# docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md). The
+# remote script below now ALSO reconciles (3) via `kubeadm init phase
+# upload-config`, so a future kubeadm upgrade regenerates a CORRECT manifest. The
+# k8s-version-upgrade chain additionally ALERTS (does not block — SSO drift is
+# recoverable) via `kubeadm upgrade diff` in preflight if --authentication-config
+# would still be dropped.
 #
 # SAFETY: the remote script health-gates on /livez and AUTO-ROLLS-BACK the
 # manifest from a timestamped backup if the apiserver does not recover, so a
-# malformed config cannot leave the single master down.
+# malformed config cannot leave the single master down. Reconciling kubeadm-config
+# is zero-impact on the running cluster (the CM is only read during an upgrade).

 variable "k8s_master_host" {
  type    = string
@@ -97,6 +110,40 @@ locals {
    print('flag-inserted' if done else 'ANCHOR-NOT-FOUND')
  PY

+  # Reconciles the kubeadm-config ClusterConfiguration's apiServer.extraArgs:
+  # drops the stale single-issuer --oidc-* args and ensures --authentication-config
+  # is present (anchored after --authorization-mode). Stdlib-only (the master is
+  # only guaranteed python3, not pyyaml/yq). Idempotent; preserves all other
+  # fields (etcd args, audit args, extraVolumes) verbatim. Exits 3 if the
+  # authorization-mode anchor is missing (fail loud, leave the CM untouched).
+  kubeadm_oidc_reconcile_py = <<-PY
+    import sys
+    lines = sys.stdin.read().split('\n')
+    out, i, n = [], 0, len(lines)
+    have_authn = any('name: authentication-config' in l for l in lines)
+    inserted = have_authn
+    while i < n:
+        ln = lines[i]; s = ln.strip()
+        if s.startswith('- name: oidc-'):
+            i += 2 if (i + 1 < n and lines[i + 1].strip().startswith('value:')) else 1
+            continue
+        out.append(ln)
+        if (not inserted) and s == '- name: authorization-mode':
+            indent = ln[:len(ln) - len(ln.lstrip())]
+            if i + 1 < n and lines[i + 1].strip().startswith('value:'):
+                out.append(lines[i + 1]); i += 2
+            else:
+                i += 1
+            out.append(indent + '- name: authentication-config')
+            out.append(indent + '  value: /etc/kubernetes/pki/auth-config.yaml')
+            inserted = True
+            continue
+        i += 1
+    if not inserted:
+        sys.stderr.write('ANCHOR-NOT-FOUND: authorization-mode\n'); sys.exit(3)
+    sys.stdout.write('\n'.join(out))
+  PY
+
  # Whole remote operation, base64-embedded for byte-exact transfer (no
  # heredoc/escaping hazards across SSH).
  apiserver_auth_remote_script = <<-SH
@@ -137,6 +184,30 @@ locals {
      echo "rolled back to previous manifest"; exit 1
    fi
    echo "kube-apiserver healthy with multi-issuer --authentication-config"
+
+    # 5. Reconcile kubeadm-config so a FUTURE `kubeadm upgrade` regenerates the
+    #    apiserver manifest WITH --authentication-config instead of reverting to
+    #    the stale single-issuer --oidc-* flags. Without this, kubeadm rewrites the
+    #    manifest from kubeadm-config on every control-plane upgrade and kubectl +
+    #    dashboard SSO breaks afterwards (the apiserver itself does not crash).
+    #    Zero live impact (the CM is only read at upgrade time); idempotent;
+    #    best-effort (the chain's `kubeadm upgrade diff` preflight alert is the
+    #    backstop if this cannot run).
+    KC="sudo kubectl --kubeconfig /etc/kubernetes/admin.conf"
+    CC=$($KC -n kube-system get cm kubeadm-config -o jsonpath='{.data.ClusterConfiguration}' 2>/dev/null || true)
+    if [ -n "$CC" ] && { echo "$CC" | grep -q 'oidc-issuer-url' || ! echo "$CC" | grep -q 'authentication-config'; }; then
+      echo "Reconciling kubeadm-config (oidc-* -> authentication-config) so kubeadm upgrade keeps structured auth"
+      echo '${base64encode(local.kubeadm_oidc_reconcile_py)}' | base64 -d > /tmp/reconcile_kubeadm_oidc.py
+      if printf '%s' "$CC" | python3 /tmp/reconcile_kubeadm_oidc.py > /tmp/kubeadm-cc-new.yaml \
+         && sudo kubeadm init phase upload-config kubeadm --config /tmp/kubeadm-cc-new.yaml; then
+        echo "kubeadm-config reconciled: future control-plane upgrades keep --authentication-config"
+      else
+        echo "WARN: kubeadm-config reconcile failed; the upgrade-chain preflight will alert on the next upgrade"
+      fi
+      rm -f /tmp/reconcile_kubeadm_oidc.py /tmp/kubeadm-cc-new.yaml
+    else
+      echo "kubeadm-config already uses --authentication-config (no oidc drift)"
+    fi
  SH
 }

@@ -155,6 +226,14 @@ resource "null_resource" "apiserver_oidc_config" {
  }

  triggers = {
+    # Intentionally hash ONLY the issuer config, NOT the remote script. CI applies
+    # the rbac stack with no ssh_private_key (var defaults to ""), so a re-run of
+    # this SSH provisioner in CI would fail — hence the null_resource must stay a
+    # no-op on a plain CI apply. Script changes (e.g. the 2026-06-24 kubeadm-config
+    # reconciliation) reach the cluster via the apiserver-oidc-restore ConfigMap
+    # below (a plain k8s resource, no ssh) which the upgrade chain re-runs. To force
+    # this provisioner to re-run after a script change, apply locally with
+    # `-replace` + TF_VAR_ssh_private_key (see docs/runbooks/k8s-version-upgrade.md).
    auth_config = sha256(local.apiserver_auth_config_yaml)
  }
 }
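
The reconcile step in apiserver-oidc.tf can be tried offline before it touches a master. The following is a minimal sketch, not the deployed script: it wraps the same line-based logic in a function named `reconcile` (a hypothetical name) and raises `ValueError` instead of exiting with status 3. The `SAMPLE` ClusterConfiguration fragment, including its issuer URL and audit path, is invented for illustration and is not taken from the cluster.

```python
def reconcile(cluster_config: str) -> str:
    """Drop oidc-* extraArgs; add authentication-config after authorization-mode."""
    lines = cluster_config.split("\n")
    out, i, n = [], 0, len(lines)
    # If the structured-auth arg already exists, only the oidc-* args are removed.
    inserted = any("name: authentication-config" in l for l in lines)
    while i < n:
        ln = lines[i]
        s = ln.strip()
        if s.startswith("- name: oidc-"):
            # Skip the arg line and its value line, if one follows.
            i += 2 if (i + 1 < n and lines[i + 1].strip().startswith("value:")) else 1
            continue
        out.append(ln)
        if not inserted and s == "- name: authorization-mode":
            indent = ln[: len(ln) - len(ln.lstrip())]
            if i + 1 < n and lines[i + 1].strip().startswith("value:"):
                out.append(lines[i + 1])
                i += 2
            else:
                i += 1
            out.append(indent + "- name: authentication-config")
            out.append(indent + "  value: /etc/kubernetes/pki/auth-config.yaml")
            inserted = True
            continue
        i += 1
    if not inserted:
        # The deployed script writes to stderr and exits 3 here (assumption: an
        # exception is more convenient for a local test).
        raise ValueError("ANCHOR-NOT-FOUND: authorization-mode")
    return "\n".join(out)


# Hypothetical input, shaped like a kubeadm ClusterConfiguration with stale flags.
SAMPLE = """\
apiServer:
  extraArgs:
  - name: authorization-mode
    value: Node,RBAC
  - name: oidc-issuer-url
    value: https://idp.example.invalid/
  - name: oidc-client-id
    value: kubernetes
  - name: audit-log-path
    value: /var/log/kubernetes/audit.log
kind: ClusterConfiguration"""

result = reconcile(SAMPLE)
print(result)
```

Running `reconcile` a second time on its own output should return the text unchanged, which is the idempotence the header comment relies on. Because the logic matches on literal `- name:` lines rather than parsing YAML, it only holds for the list-style `extraArgs` layout shown in the sample.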
--- a/state/stacks/vault/terraform.tfstate.enc
+++ b/state/stacks/vault/terraform.tfstate.enc
Author	SHA1	Message	Date
Viktor Barzin	9c68d147e0	k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC) Digging into "why did the apiserver crash" disproved the earlier OIDC explanation. An isolated v1.35.6 apiserver repro with authentik reachable initialises OIDC cleanly (oidc.go:313, no error) and runs fine, so the --authentication-config -> --oidc-* revert is NOT what crashed it. etcd's surviving crash-window log is the real cause: 1180 "apply request took too long" warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the shared sdc HDD (beads code-oflt). A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before every etcd upgrade and never cleans it up; 145 dirs / 28GB had accumulated, driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live (73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate. Also corrected the OIDC handling: the kubeadm-config drift is real but only breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the chain's restore.sh + the kubeadm-config reconciliation); it does not crash the apiserver. So the preflight check is now an ALERT, not a block (was added on the wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected. Per Viktor: reclaim the disk and automate so the manual cleanup never recurs; the durable IO fix remains code-oflt (etcd off the shared HDD). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 15:23:15 +00:00
Viktor Barzin	60a1cb9a25	k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade Last night's autonomous 1.34->1.35 run reached the master control-plane phase for the first time (preflight passed, etcd snapshot taken, etcd upgraded), then the kube-apiserver upgrade to v1.35.6 crash-looped and kubeadm auto-rolled-back to 1.34.9. The cluster stayed healthy but the master was left cordoned and the chain wedged on in_flight. Root cause: kubeadm upgrade regenerates the apiserver static-pod manifest from the kubeadm-config ConfigMap. apiserver auth was switched on 2026-06-19 to a structured multi-issuer --authentication-config (kubectl + dashboard SSO), but kubeadm-config still carried the legacy single-issuer --oidc-* extraArgs, so the regenerated manifest reverted structured auth and the new apiserver crash-looped. Proven via `kubeadm upgrade diff`. The existing post-upgrade OIDC restore step never ran because the upgrade itself never succeeded. Fix: - rbac/apiserver-oidc.tf: the remote script now also reconciles kubeadm-config (kubeadm init phase upload-config: drop --oidc-*, add --authentication-config) so a future kubeadm upgrade regenerates a correct manifest. Delivered to the cluster via the apiserver-oidc-restore ConfigMap the chain re-runs (CI needs no ssh key); trigger deliberately not script-hashed since CI cannot ssh. - k8s-version-upgrade/upgrade-step.sh: new preflight gate runs `kubeadm upgrade diff` and BLOCKS+alerts (never drains the master) if --authentication-config would still be dropped. - Post-mortem + runbook updated. The live kubeadm-config was reconciled directly on the master and verified (`kubeadm upgrade diff` now shows only the control-plane image bump), so tonight's run can complete the 1.34->1.35 upgrade. 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 14:16:04 +00:00
Viktor Barzin	c6bba1da6e	home-assistant skill: refresh ha-london map (HAOS 2026.5.2, Cowboy revived, Overview redesign) Viktor asked to redesign the ha-london dashboards and fix the broken integrations (the Cowboy one). The skill's ha-london knowledge map had drifted badly from reality, so this brings it current: it claimed HA 2025.9.1 on a docker-run container (it's HAOS 2026.5.2, managed); listed the now-dead jdejaegh Cowboy integration with sensor.bike_* entities (revived via elsbrock/cowboy-ha v1.2.0 -> entities are sensor.classic_performance_*); and didn't flag that met/metoffice/roomba/hildebrandglow are user-disabled (not broken) or that Tapo P100 is failing. Also documents the redesigned Overview (Home+More sections, Mushroom+mini-graph-card), the dashboard/view/card glossary, the parked-bike 'unknown battery' gotcha, and that london is API-only from the Sofia devvm (no SSH). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 22:03:15 +00:00
Viktor Barzin	b858561bd0	Merge remote-tracking branch 'origin/master'	2026-06-24 20:59:39 +00:00
Viktor Barzin	a7704f46a6	deploy goldmane-edge-aggregator: durable who-talks-to-whom edge trail (#58 , ADR-0014) Infra side of ADR-0014: an mTLS gRPC consumer of Calico Goldmane's Flows API that records the namespace-pair edge-set in CNPG and posts a daily new-edge digest to #security. Adds the goldmane-edge-aggregator stack, the pg-goldmane-edges Vault rotation role (Tier-0 vault state updated here), and the namespace in the ghcr-credentials allowlist. Cert: REUSES the operator-minted, Tigera-CA-signed whisker-backend client cert (Goldmane verifies only the CA chain, not identity) instead of minting from the Tigera CA private key. This avoids putting the CA key in TF state AND the hashicorp/tls provider, which is incompatible with this repo's global generate-providers/lockfile pattern (it broke every stack's lockfile). Verified live: aggregator streaming flows, 174 edges in Postgres across 50x54 namespaces, db+slack ExternalSecrets synced, digest dry-run formats correctly, private image pulls via the Kyverno-synced ghcr-credentials. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:59:39 +00:00
Viktor Barzin	aa510e3600	instagram-poster: force_conflicts on ESO manifests (fix apply) The ESO v1 migration (2026-06-22) made the external-secrets controller own .spec.refreshInterval via server-side apply, so terraform apply of the two ExternalSecret manifests fails with a field-manager conflict (Woodpecker #348), which blocked the replicas=0 scale-down from landing. Add force_conflicts=true to both, matching the grafana/woodpecker/traefik fix applied to other stacks the same day. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:49:53 +00:00
Viktor Barzin	53834deb24	instagram-poster: scale to 0 (unused, dead ExternalSecret) Viktor confirmed the Instagram Graph poster isn't used. Its ExternalSecret has been dead on missing Vault keys (ig_graph_long_lived_token, ig_business_account_id), so the deployment sat at 0/1 firing DeploymentReplicasMismatch. Setting replicas=0 stops the alert and makes the scale-down durable (a bare kubectl scale reverts on the next stack apply). Re-set to 1 after minting a Meta long-lived token + populating the Vault keys. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:45:30 +00:00
Viktor Barzin	8dd9a3978d	Merge remote-tracking branch 'forgejo/master' into wizard/homelab-vault	2026-06-24 12:25:52 +00:00
Viktor Barzin	65b2df1222	fix(monitoring): force_conflicts on grafana_db_creds ExternalSecret The external-secrets controller owns .spec.refreshInterval via SSA, so a plain terraform apply of the monitoring stack conflicts. Latent until 2026-06-24 (the homelab-vault loki-rules change was the first monitoring apply in a while and surfaced it). force_conflicts lets TF win, the same pattern as the woodpecker/traefik/k8s-version-upgrade stacks.	2026-06-24 12:25:36 +00:00