goldmane-trail: polish follow-ups #57/#59/#61/#62/#63 + digest→#alerts

Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a subagent workflow (distinct stacks in parallel, docs last):

- #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me, auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081 (the operator's whisker NP default-denies ingress). calico stack.
- #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts (prometheus_chart_values.tpl) + cluster-health check #48.
- #59 service-identity labels on the multi-Service namespaces (monitoring's 5 TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they update in-place.
- #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog, security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md. #62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the edge table (feeds code-8ywc; enforce-flips out of scope).

Also fixes the digest's Slack target: the #security override 404s with channel_not_found because the shared alertmanager_slack_api_url webhook's app isn't a member of #security (this likely also breaks alertmanager's slack-security receiver; flagged in the runbook). Routed to #alerts (the webhook's working channel) until the app is invited; verified a real digest run posts cleanly (360 edges).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
state(dbaas): update encrypted state
2026-06-25 17:49:25 +00:00 · 2026-06-25 17:31:03 +00:00 · 2026-06-25 15:23:15 +00:00 · 2026-06-25 14:16:04 +00:00 · 2026-06-24 22:03:15 +00:00 · 2026-06-24 20:59:39 +00:00
29 changed files with 5416 additions and 3960 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@ -243,7 +243,8 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se
 - **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.)
 - **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
 - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
+- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
 - **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (the shared webhook's Slack app isn't in `#security` → 404 channel_not_found; flip `SLACK_CHANNEL` back once invited — see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
 - **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
 ## Storage & Backup Architecture
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@ -13,6 +13,8 @@
 | authentik | Identity provider (SSO) | authentik |
 | cloudflared | Cloudflare tunnel | cloudflared |
 | authelia | Auth middleware (may be merged into ebooks or removed) | platform |
 | goldmane | Calico 3.30 OSS flow aggregator (`goldmane.calico-system.svc:7443`, gRPC/mTLS). Stamps identity (ns/pod/workload/labels + allow-deny) on every flow from Felix into a ~60-min in-memory ring buffer — no etcd/API writes. East-west "who-talks-to-whom" source (ADR-0014). Enabled via operator CR (`kubectl_manifest.goldmane`). | calico |
 | whisker | Calico 3.30 OSS live flow-observability UI (`whisker.calico-system.svc:8081`) at `whisker.viktorbarzin.me` (Authentik-gated, `auth=required` — no own login; additive NP ORs Traefik past the operator default-deny). ~60-min live view of Goldmane flows, NOT history. Enabled via operator CR (`kubectl_manifest.whisker`). | calico |
 | monitoring | Prometheus/Grafana/Loki stack | monitoring |
 ## Storage & Security (Tier: cluster)
@ -37,6 +39,7 @@
 ## Active Use
 | Service | Description | Stack |
 |---------|-------------|-------|
 | goldmane-edge-aggregator | Durable who-talks-to-whom audit trail (ADR-0014 / #58). Go service: `aggregate` Deployment streams Goldmane's gRPC `Flows.Stream` (mTLS) and upserts the low-cardinality namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`) into CNPG DB `goldmane_edges`; `goldmane-edges-digest` CronJob posts first-seen edges daily to `#alerts` (the shared webhook's Slack app isn't in `#security`; see runbook). mTLS client cert REUSES the operator's `whisker-backend-key-pair` (re-apply if rotated). Tier-4-aux. Image `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (private). Runbook: [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md). | goldmane-edge-aggregator |
 | mailserver | Email (docker-mailserver) | mailserver |
 | shadowsocks | Proxy | shadowsocks |
 | webhook_handler | Webhook processing | webhook_handler |
@ -161,3 +164,4 @@ procedures) are documented in `infra/docs/runbooks/`:
 | pfSense + Unbound DNS | [pfsense-unbound.md](../../docs/runbooks/pfsense-unbound.md) |
 | Mailserver PROXY-protocol / HAProxy | [mailserver-pfsense-haproxy.md](../../docs/runbooks/mailserver-pfsense-haproxy.md) |
 | Technitium apply flow | [technitium-apply.md](../../docs/runbooks/technitium-apply.md) |
 | Goldmane flow trail (east-west who-talks-to-whom) | [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md) |
--- a/.claude/skills/home-assistant/SKILL.md
+++ b/.claude/skills/home-assistant/SKILL.md
@ -11,8 +11,8 @@ description: |
  There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
  Always use Home Assistant for smart home control.
 author: Claude Code
-version: 2.0.0
+version: 2.1.0
-date: 2026-02-07
+date: 2026-06-24
 ---
 # Home Assistant Control
@ -395,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
 ## ha-london Knowledge Map
 ### Overview
- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
+- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
 - **Location**: London, UK
- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
+- **Platform**: Raspberry Pi 4, HA OS
- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
+- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
- **Config path**: `/config/` (requires `sudo` for file access)
+- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
 - **Config path**: `/config/`
 - **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
 - **Zone**: London (home)
 ### Dashboards (redesigned 2026-06-24)
 **Glossary** (HA terms — keep distinct):
 - **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
 - **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
 - **Card** = a widget inside a view.
 - **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
  - **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
  - **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
 - **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.)
 - Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
 ### Key Systems
 #### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@ -424,10 +437,15 @@ Named plugs with power/energy tracking:
 - PM1.0/2.5/4.0/10 particulate sensors
 - VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
-#### 3. Cowboy E-Bike
+#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
- `sensor.bike_state_of_charge`: Battery %
+Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
- `sensor.bike_total_distance`: Total km
+- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
- `sensor.bike_total_co2_saved`: CO2 saved (grams)
+- `sensor.classic_performance_remaining_range`: Range km
 - `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
 - `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
 - Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
 - **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
 - Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
 #### 4. Uptime Monitoring (UptimeRobot)
 - `sensor.blog`: blog uptime
@ -446,12 +464,17 @@ Named plugs with power/energy tracking:
 - Scripts: `script.start_netflix`, `script.start_stremio`
 - Scene: `scene.night` (turns off Livia + Michelle plugs)
-### Custom Components
+### Custom Components (HACS integrations)
- **cowboy**: Cowboy e-bike integration (HACS)
+- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
+- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
 ### HACS frontend cards (plugins)
 - **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
 ### Integrations
-ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
+ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
 - **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
 - **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
 ### AI / Voice Assistants
 - 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@ -466,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL
 - Anca arrival/departure notifications
 - Night scene: turns off Livia + Michelle
-### Docker Setup
+### Platform (HAOS — ignore any legacy `docker run` snippet)
-```bash
+ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker).
 docker run -d --name homeassistant --privileged \
  -e TZ=Europe/London \
  -v /home/pi/docker/homeAssistant:/config \
  -v /run/dbus:/run/dbus:ro \
  --network=host --restart=unless-stopped \
  homeassistant/home-assistant:2025.9
 ```
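The restart step described above can be sketched as a plain REST call. This is a minimal sketch, assuming a long-lived access token (for example the output of `homelab ha token --instance london`) and Home Assistant's bearer-token REST authentication; `build_restart_request` is a hypothetical helper name, not something in the repo, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_restart_request(base_url, token):
    """Build (but do not send) the POST that asks Home Assistant to restart.

    base_url: e.g. "https://ha-london.viktorbarzin.me" (from the skill doc).
    token:    a long-lived access token; how it is obtained is an assumption.
    """
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/services/homeassistant/restart",
        data=json.dumps({}).encode("utf-8"),  # service calls take a JSON body
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_restart_request("https://ha-london.viktorbarzin.me", "REDACTED")
    print(req.get_method(), req.full_url)
    # Sending is left to the caller: urllib.request.urlopen(req, timeout=30)
```

Keeping construction separate from sending makes it easy to inspect the request first, which matters because the restart drops automations for a minute or two.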
 ### SSH Access
 ```bash
--- a/CONTEXT.md
+++ b/CONTEXT.md
@ -125,7 +125,7 @@ How a **Service** is named in flow/audit data — its **namespace** is the prima
 _Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
 **Goldmane / Whisker**:
-Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). Durable history requires emitting Goldmane flows to **Loki**; the in-memory buffer alone is not an audit trail.
+Calico 3.30's OSS flow-observability pair: **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail; durable history comes from the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges`, with a daily `goldmane-edges-digest` post to `#alerts`. As-built: `docs/runbooks/goldmane-flow-trail.md`.
 _Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
 ### Storage
--- a/docs/adr/0014-service-identity-and-east-west-observability.md
+++ b/docs/adr/0014-service-identity-and-east-west-observability.md
@ -27,3 +27,9 @@ As the Service count grows we want an audit-grade record of which Service talks
 - **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
 - **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
 - **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
 ## As-built (2026-06-25)
 Implemented across infra issues #57–#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (the shared webhook can't reach `#security` — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
 Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.
--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@ -321,6 +321,17 @@ Detects the inverse of the K-series alerts: a service that **must work WITHOUT A
 - **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
 - **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)
 #### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
 Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both alerts route via the default `slack-warning` receiver → **`#alerts`**.
 | Alert | Expr (abridged) | For | Severity |
 |---|---|---|---|
 | `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning |
 | `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning |
 The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`).
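The three outcomes that check #48 distinguishes (healthy, unavailable, deployment missing) can be illustrated as follows. This is a hedged Python restatement, not the shell function `check_goldmane_aggregator` itself; `aggregator_status` is a hypothetical name, and the input is assumed to be the JSON that `kubectl get deployment -o json` prints, or `None` when the object does not exist.

```python
import json


def aggregator_status(deployment_json):
    """Classify a Deployment by its `Available` condition.

    Returns "missing" when no object was found, "ok" when the Available
    condition is "True", and "unavailable" otherwise (including when the
    condition is absent from status.conditions).
    """
    if deployment_json is None:
        return "missing"
    obj = json.loads(deployment_json)
    for cond in obj.get("status", {}).get("conditions", []):
        if cond.get("type") == "Available":
            return "ok" if cond.get("status") == "True" else "unavailable"
    return "unavailable"


if __name__ == "__main__":
    sample = json.dumps(
        {"status": {"conditions": [{"type": "Available", "status": "True"}]}}
    )
    print(aggregator_status(sample))
```

The explicit "missing" branch covers the gap noted above: a `< 1` Prometheus expression cannot fire when the metric series no longer exists.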
 #### Backup Alerts
 - **PostgreSQLBackupStale**: >36h since last backup
 - **MySQLBackupStale**: >36h since last backup
--- a/docs/architecture/security.md
+++ b/docs/architecture/security.md
@ -364,6 +364,67 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
 - Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
 - Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).
 #### Deriving the per-namespace egress allowlist from the edge trail (Wave 1 W1.7)
 The durable **east-west flow trail** (below) is now the preferred data source for
 the *internal* (namespace-to-namespace) half of each Wave-1 egress allowlist —
 faster and identity-stamped vs the original iptables-`LOG`→journald→Loki path
 (ADR-0014: "Enforcement gains a better data source"). The unique observed
 namespace pairs live in CNPG DB `goldmane_edges`, table `edge`. To derive the
 namespaces a source is observed talking to (the `allow` set that seeds its
 NetworkPolicy):
 ```sql
 SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow' ORDER BY dst_ns;
 ```
 The full SQL recipe (whole-cluster matrix, deny sanity-checks, the ≥7-day
 observation caveat) is in
 [runbooks/goldmane-flow-trail.md → Deriving the Wave-1 egress allowlist](../runbooks/goldmane-flow-trail.md#deriving-the-wave-1-egress-allowlist-from-the-edge-table-infra-62).
 **External / public-internet egress is NOT in this table** (empty-namespace flows
 are dropped) — for those destinations keep using the Calico flow-log observation
 (the W1.6 snapshot, `wave1-egress-observation-2026-05-22.md`). This feeds the
 existing observe-then-enforce effort (beads `code-8ywc`); **enforce-flips remain
 out of scope** of the trail — it is observe-and-derive only.
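To try the derivation query without touching the cluster, the following sketch runs it against an in-memory SQLite table. Assumptions: SQLite stands in for the CNPG Postgres DB `goldmane_edges`, the primary key on `(src_ns, dst_ns, action)` is inferred from the "unique namespace pairs" wording rather than confirmed, and the sample rows are invented for illustration.

```python
import sqlite3

# Stand-in schema: only the columns named in the docs are modelled.
SCHEMA = """
CREATE TABLE edge (
    src_ns     TEXT NOT NULL,
    dst_ns     TEXT NOT NULL,
    action     TEXT NOT NULL,
    first_seen TEXT NOT NULL,
    last_seen  TEXT NOT NULL,
    flow_count INTEGER NOT NULL,
    PRIMARY KEY (src_ns, dst_ns, action)  -- assumed uniqueness key
)
"""

# Hypothetical observations, not real cluster data.
SAMPLE_EDGES = [
    ("recruiter-responder", "dbaas", "allow", "2026-06-20", "2026-06-25", 120),
    ("recruiter-responder", "monitoring", "allow", "2026-06-21", "2026-06-25", 14),
    ("recruiter-responder", "vault", "deny", "2026-06-22", "2026-06-22", 1),
    ("authentik", "dbaas", "allow", "2026-06-20", "2026-06-25", 900),
]


def demo_connection():
    """Create the in-memory table and load the sample edges."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO edge VALUES (?, ?, ?, ?, ?, ?)", SAMPLE_EDGES)
    return conn


def allowed_destinations(conn, src_ns):
    """Namespaces that src_ns was observed reaching with action='allow'."""
    rows = conn.execute(
        "SELECT DISTINCT dst_ns FROM edge "
        "WHERE src_ns = ? AND action = 'allow' ORDER BY dst_ns",
        (src_ns,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(allowed_destinations(demo_connection(), "recruiter-responder"))
```

The `deny` row is excluded by design: only observed `allow` edges seed the allowlist, and external destinations never appear because they are not stored in this table.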
 ### East-west flow observability (Goldmane / Whisker + edge trail) (ADR-0014)
 The "who-talks-to-whom" data plane that succeeds raw iptables-`LOG` lines (which
 carried no identity). **Service identity = the workload's namespace** (primary),
 refined by a `service-identity` label in the few multi-Service namespaces
 (`monitoring`, `kube-system`, `dbaas`). End-to-end trail, three layers:
 1. **Calico Goldmane + Whisker** (`calico-system`) — Goldmane aggregates
   identity-stamped flows (ns/pod/workload/labels + allow-deny + policy-trace)
   streamed from Felix over gRPC into a **~60-min in-memory ring buffer** (no
   etcd/API writes — the etcd-cost constraint that drove the design). **Whisker**
   is its live web UI at `whisker.viktorbarzin.me` (Authentik-gated,
   `auth = "required"` — Whisker has no own login; an additive NetworkPolicy ORs
   Traefik past the operator's default-deny `whisker` NP). The ring buffer is
   **not** a trail (lost on Goldmane restart). Enabled via operator CRs in
   `stacks/calico/main.tf`; reversible toggle (Goldmane is OSS tech-preview).
 2. **`goldmane-edge-aggregator`** (`stacks/goldmane-edge-aggregator`) — streams
   Goldmane's gRPC `Flows.Stream` over **mTLS** and upserts the low-cardinality
   namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,
   flow_count)`) into CNPG DB `goldmane_edges`. Self-edges and empty-namespace
   (public-internet) flows are dropped — in-cluster relationships only. The mTLS
   client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`**
   (Goldmane verifies CA-chain only, not identity) rather than copying the CA
   private key into TF state — **re-apply the stack if the operator rotates that
   Secret**.
 3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to
   **`#alerts`** (reuses the alert-digest webhook; a `#security` override 404s —
   that webhook's Slack app isn't a member of `#security`; see runbook).
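Layers 2 and 3 above boil down to an upsert plus a "first seen since" query. The sketch below is illustrative only: the real aggregator is a Go service writing to Postgres, whereas this uses Python with in-memory SQLite (upsert needs SQLite 3.24 or newer), and the function names, the uniqueness key, and the ISO-8601 timestamp encoding are all assumptions.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS edge (
    src_ns     TEXT NOT NULL,
    dst_ns     TEXT NOT NULL,
    action     TEXT NOT NULL,
    first_seen TEXT NOT NULL,
    last_seen  TEXT NOT NULL,
    flow_count INTEGER NOT NULL,
    PRIMARY KEY (src_ns, dst_ns, action)  -- assumed uniqueness key
)
"""

# New pair: insert with count 1. Known pair: keep first_seen, bump the rest.
UPSERT = """
INSERT INTO edge (src_ns, dst_ns, action, first_seen, last_seen, flow_count)
VALUES (?, ?, ?, ?, ?, 1)
ON CONFLICT (src_ns, dst_ns, action) DO UPDATE SET
    last_seen  = excluded.last_seen,
    flow_count = edge.flow_count + 1
"""


def _stamp(ts):
    """Encode a timezone-aware datetime as a sortable UTC ISO-8601 string."""
    return ts.astimezone(timezone.utc).isoformat(timespec="seconds")


def open_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def record_flow(conn, src_ns, dst_ns, action, seen_at):
    """Upsert one observed flow; return False when the flow is dropped."""
    # Drop empty-namespace (public-internet) flows and self-edges.
    if not src_ns or not dst_ns or src_ns == dst_ns:
        return False
    stamp = _stamp(seen_at)
    conn.execute(UPSERT, (src_ns, dst_ns, action, stamp, stamp))
    return True


def first_seen_since(conn, cutoff):
    """Edges whose first observation is at or after cutoff (digest input)."""
    rows = conn.execute(
        "SELECT src_ns, dst_ns, action FROM edge WHERE first_seen >= ? "
        "ORDER BY src_ns, dst_ns, action",
        (_stamp(cutoff),),
    ).fetchall()
    return [tuple(row) for row in rows]
```

Because `first_seen` is never overwritten on conflict, a daily digest only needs the rows whose `first_seen` falls inside the last window; repeated traffic on a known pair updates `last_seen` and `flow_count` without producing a new digest entry.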
 The trail is **attribution-grade, not cryptographic** (reconstructs events in a
 trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model
 limit; east-west stays plaintext, no mTLS between app pods). Health is covered by
 the **`AggregatorDown`** + **`DigestFailing`** alerts and cluster-health check #48
 (see monitoring.md). Full as-built, query recipes, and troubleshooting:
 [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md). Decision:
 [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md); glossary
 `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
 ### TLS & HTTP/3
 **Traefik** handles TLS termination:
--- a/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
+++ b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
@ -0,0 +1,97 @@
 # Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
 > Filename kept for inbound links. The originally-suspected cause (kubeadm-config
 > OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
 > drift was a real *separate* latent bug fixed in the same change.
 **Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
 the master control-plane phase for the first time — preflight passed, etcd
 snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the
 kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
 static-pod-hash window across all internal retries, then auto-rolled-back to
 v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
 the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
 No data loss; no user-facing outage (the master carries control-plane taints, so
 no workloads were displaced).
 **Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
 first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
 static pods, i.e. the first time the upgrade pushes real write-IO at etcd.
 ## Root cause — etcd IO starvation on the shared HDD
 The new kube-apiserver could not establish/keep a working connection to etcd
 during the upgrade because **etcd was IO-starved**. etcd's surviving container log
 from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows:
 - **1,180** `apply request took too long` warnings in 16 minutes;
 - individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms),
  clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
  to bring the new apiserver up.
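Counts like the ones above can be reproduced from a saved copy of the container log. This is a sketch under stated assumptions: each line is a CRI-prefixed record whose payload is etcd's JSON log entry, slow applies carry `"msg": "apply request took too long"` and a Go-style `"took"` duration, and `slow_applies` and `parse_go_duration` are hypothetical helper names.

```python
import json
import re

_UNITS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
          "s": 1.0, "m": 60.0, "h": 3600.0}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")


def parse_go_duration(text):
    """Convert a Go-style duration such as '4.3s' or '1m2.5s' to seconds."""
    parts = _PART.findall(text)
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(num) * _UNITS[unit] for num, unit in parts)


def slow_applies(lines, threshold_s=1.0):
    """Return (total warning count, durations >= threshold_s, worst first)."""
    count = 0
    worst = []
    for line in lines:
        start = line.find("{")  # skip the CRI "timestamp stream flag" prefix
        if start < 0:
            continue
        try:
            entry = json.loads(line[start:])
        except ValueError:
            continue  # not a JSON payload; ignore
        if not isinstance(entry, dict):
            continue
        if entry.get("msg") != "apply request took too long":
            continue
        count += 1
        try:
            took = parse_go_duration(str(entry.get("took", "")))
        except ValueError:
            continue  # counted, but duration not parseable
        if took >= threshold_s:
            worst.append(took)
    return count, sorted(worst, reverse=True)


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        total, worst = slow_applies(handle)
    print(f"{total} slow applies; worst: {worst[:5]}")
```

Reading the component's own log this way is what separated the etcd latency signal from the OIDC hypothesis in this incident.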
 A reproduced 1.35.6 apiserver with no etcd dies with
 `F instance.go:233 Error creating leases: error creating storage factory: context
 deadline exceeded`, the same failure mode that an etcd with multi-second apply latency produces. etcd
 lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
 shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
 that spindle:
 1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
 2. kubeadm dumping a full **~400MB etcd DB backup** to
   `/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
   etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
   cleans them up), pushing master root fs to **73%**, above the 70% kubelet
   image-GC threshold, so image GC churned during the drain too;
 3. master-drain pod evictions.
 ### Correction — it was NOT the OIDC flag swap
 `kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
 `--authentication-config` (structured multi-issuer OIDC) back to legacy
 single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
 was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
 those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
 (`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test
 etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
 the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
 were also ruled out.
 ## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
 apiserver auth is configured in three places that must agree:
 (1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
 + `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest
 (`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM —
 which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
 the manifest from (3), so it would have reverted structured auth → **dashboard +
 kubectl SSO break after a successful upgrade** (recoverable: the chain's
 post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.
 ## Resolution
 1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%.
 2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps.
 3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).
 ## Prevention (landed in this change)
 | Gap | Fix |
 |-----|-----|
 | kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
 | kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
 | etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
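 The pruning row above can be sketched as follows. This is a minimal illustration, not the actual `upgrade-step.sh` code: the function name `prune_kubeadm_scratch` and its optional directory argument are assumptions, and "older than 3 days" is approximated with `find -mmin`.
 ```shell
 # Hypothetical helper: delete kubeadm's leftover scratch older than 3 days.
 # Best-effort by design: errors are swallowed so a preflight never aborts on it.
 prune_kubeadm_scratch() {
     local dir="${1:-/etc/kubernetes/tmp}"
     local max_age_min=$((3 * 24 * 60))
     # -maxdepth 1 matches only the top-level scratch entries themselves.
     find "$dir" -mindepth 1 -maxdepth 1 \
         \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) \
         -mmin "+${max_age_min}" -exec rm -rf -- {} + 2>/dev/null || true
 }
 ```
 Called with no argument it targets `/etc/kubernetes/tmp`; passing a directory makes it easy to try against a scratch copy first.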
 ## Lessons
 - **Capture the failing component's own logs before concluding.** The `kubeadm
  upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
  applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is
  "what config changes," not "why it crashed."
 - **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
  2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
  backup copy + drain) onto that spindle. code-oflt is the real fix.
 - **Tools that leave per-operation scratch must be reaped.** kubeadm's
  `/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
  GC'd; 28GB had silently accumulated.
 - **Out-of-band control-plane edits must be written back to kubeadm-config** — else
  `kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).
--- a/docs/runbooks/goldmane-flow-trail.md
+++ b/docs/runbooks/goldmane-flow-trail.md
@ -0,0 +1,301 @@
 # Goldmane Flow Trail — east-west "who-talks-to-whom" observability
 > As-built runbook for the Calico Goldmane + Whisker flow plane and the
 > `goldmane-edge-aggregator` durable audit trail. Design + rationale:
 > [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
 > Glossary: `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
 > Implements infra issues #57 (Whisker ingress), #58 (aggregator), #61
 > (monitoring), #62 (egress allowlist queries), #63 (these docs).
 ## What the trail is
 Three layers turn raw east-west traffic into a queryable, durable record of
 which Service talks to which. **Service identity = the workload's namespace**
 (primary), refined by a `service-identity` label in the few multi-Service
 namespaces (`monitoring`, `kube-system`, `dbaas`) — see ADR-0014.
 | Layer | Component | Lifetime | Where it lives |
 |---|---|---|---|
 | **Live map** | Calico **Goldmane** + **Whisker** | ~60-min in-memory ring buffer (lost on Goldmane restart) | `calico-system`; Whisker UI at `whisker.viktorbarzin.me` |
 | **Durable trail** | `goldmane-edge-aggregator` (`aggregate` mode) | persistent | CNPG Postgres DB `goldmane_edges`, table `edge` |
 | **Notification** | `goldmane-edges-digest` CronJob (`digest` mode) | daily | Slack `#alerts` |
 **Goldmane** aggregates identity-stamped flows (namespace / pod / workload /
 labels + allow-deny + policy-trace) streamed from Felix (the existing
 `calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer —
 **nothing is written to etcd or the K8s API** (the etcd-cost constraint that
 drove the whole design). **Whisker** is its live web UI. Because the ring
 buffer is *not* a trail (a Goldmane restart loses the window), the
 `goldmane-edge-aggregator` consumes Goldmane's gRPC `Flows.Stream` API over
 mTLS and upserts the unique **namespace-pair edge set** into Postgres; a daily
 CronJob posts first-seen edges to Slack.
 The edge set is deliberately **low-cardinality** — one row per
 `(src_ns, dst_ns, action)`, *not* per-pod or per-port — so the table stays
 small no matter how much traffic flows.
 ## Where the data lives
 ### Whisker UI — live, ~60 min
 - `https://whisker.viktorbarzin.me` (Authentik-gated — Whisker ships no own
  login; `auth = "required"`). Shows the live flow stream + a service graph for
  roughly the last hour. Use it for "what is talking right now"; it is **not**
  history.
 - In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
  (HTTP), both in `calico-system`.
 ### CNPG `goldmane_edges` — durable
 - Postgres DB `goldmane_edges` on the CNPG cluster
  (`pg-cluster-rw.dbaas.svc.cluster.local:5432`). One table:
  ```
  edge(src_ns text, dst_ns text, action text,
       first_seen timestamptz, last_seen timestamptz, flow_count bigint,
       PRIMARY KEY (src_ns, dst_ns, action))
  ```
  - `action` ∈ `allow` / `deny` / `pass` / `unspecified` (normalised Goldmane
    action).
  - **Self-edges (`src_ns == dst_ns`) and empty-namespace flows** (host-endpoint
    / public-internet) are **dropped** — the trail is about in-cluster service
    relationships only. (Egress to the public internet is therefore NOT in this
    table; it lives in the Wave-1 Calico flow-log path — see security.md.)
  - A **"new edge"** = a row whose `first_seen` falls inside the digest window.
  - Role `goldmane_edges` (Vault-rotated, 7-day) owns the DB. The `edge` table
    is created idempotently by the aggregator at startup (canonical DDL also in
    the repo at `migrations/0001_edge.sql`).
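 The drop rule in the list above can be sketched as a small predicate. This is an illustration only: the real aggregator is a Go service, and `keep_edge` is a hypothetical name, not a function from its code.
 ```shell
 # Hypothetical predicate: succeed (exit 0) only for edges the trail records.
 # Usage: keep_edge <src_ns> <dst_ns>
 keep_edge() {
     local src_ns="$1" dst_ns="$2"
     # Empty namespace = host-endpoint or public-internet flow: dropped.
     [ -n "$src_ns" ] && [ -n "$dst_ns" ] || return 1
     # Self-edge (traffic within one namespace): dropped.
     [ "$src_ns" != "$dst_ns" ]
 }
 ```
 For example, `keep_edge monitoring dbaas` succeeds, while `keep_edge dbaas dbaas` and `keep_edge "" dbaas` both fail.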
 ### Slack `#alerts` — daily digest
 > **Channel note (2026-06-25):** posts to **`#alerts`**, not `#security`. The shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of `#security`, so a channel override there returns HTTP `404 channel_not_found` (this almost certainly also breaks alertmanager's `slack-security` receiver — verify separately). To route the digest (and security alerts) to `#security`: invite that webhook's Slack app to `#security`, then set `SLACK_CHANNEL=#security` in `stacks/goldmane-edge-aggregator` and re-apply.
 - CronJob `goldmane-edges-digest` (08:00 Europe/London) posts edges first seen
  in the last 24h. Quiet when there are none. Reuses the existing alert-digest
  Slack incoming webhook (Vault `secret/viktor` → `alertmanager_slack_api_url`)
  — no new webhook was created.
 ## How to enable / disable
 ### Goldmane + Whisker (the flow plane)
 Operator CRs in **`stacks/calico/main.tf`** — NOT the Helm `goldmane`/`whisker`
 flags (those stay `false`; the operator's own `installation`/`apiServer` are
 operator-managed via the `goldmanes`/`whiskers.operator.tigera.io` CRDs):
 - `kubectl_manifest.goldmane` (kind `Goldmane`) — creating it makes the operator
  re-render `calico-node` with the `FELIX_FLOWLOGSGOLDMANESERVER` env (the
  operator auto-wires Felix — **do NOT patch FelixConfiguration**), triggering a
  supervised `calico-node` DaemonSet roll. Yields `Deployment` + `Service
  goldmane:7443`.
 - `kubectl_manifest.whisker` (kind `Whisker`, `depends_on` goldmane;
  `notifications = Disabled`). Yields `Deployment` + `Service whisker:8081`.
 **To disable:** delete those two CRs and re-apply `stacks/calico`. Reversible
 toggle (Goldmane is tech-preview in OSS Calico 3.30 — the main standing risk per
 ADR-0014).
 ### Whisker public ingress (infra #57)
 Also in `stacks/calico/main.tf`:
 - `module "ingress_whisker"` (`ingress_factory`, `auth = "required"`,
  `dns_type = "proxied"`) → `whisker.viktorbarzin.me`.
 - `kubernetes_network_policy_v1.whisker_allow_traefik` — **required alongside the
  ingress**: the operator's own `whisker` NetworkPolicy (owned by the Whisker CR)
  is `policyTypes: [Ingress]` with no rules = default-deny ingress to the pod.
  This additive NP ORs in an allow for `namespaceSelector
  kubernetes.io/metadata.name=traefik` on TCP 8081. Without it Traefik 502s.
 ### The aggregator + digest (the durable trail) — `stacks/goldmane-edge-aggregator`
 A Tier-1 stack (PG state) mirroring the claude-memory pattern. Apply it with
 `scripts/tg apply` from `stacks/goldmane-edge-aggregator/`. It provisions: the namespace,
 the mTLS client material, the Postgres DB-init Job, the `DATABASE_URL`
 ExternalSecret (Vault static role `pg-goldmane-edges`), the Slack ExternalSecret,
 the `aggregate` Deployment, and the `digest` CronJob. **To disable the trail
 without touching the flow plane:** scale `deployment/goldmane-edge-aggregator` to
 0 (transient) or remove the stack (permanent) — Goldmane/Whisker keep running.
 Image: `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (PRIVATE) — the
 `goldmane-edge-aggregator` namespace must be in the `ghcr-credentials` Kyverno
 allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`,
 `local.ghcr_private_namespaces`) or image pulls fail with HTTP 401. Code repo:
 `~/code/goldmane-edge-aggregator` (see its `README.md` + `DEPLOY.md`).
 ## mTLS cert — the REUSE decision (cert-reuse gotcha)
 The aggregator dials `goldmane:7443` over **mutual TLS**. Goldmane requires the
 client cert to chain to the **Tigera CA**, but it does **NOT authorize by client
 identity** — any Tigera-CA-signed cert is accepted.
 Rather than copy the Tigera CA **private key** into Terraform state to mint our
 own cert (a needless CA-key exposure; the `hashicorp/tls` provider also clashes
 with this repo's global generate-providers/lockfile pattern), the stack
 **REUSES the operator-minted, Tigera-CA-signed `whisker-backend-key-pair`
 Secret** (`calico-system`), copying its `tls.crt`/`tls.key` into the
 `goldmane-client-tls` Secret in the aggregator namespace. The CA *bundle* that
 verifies Goldmane's serving cert (`tigera-ca-bundle` ConfigMap, key
 `tigera-ca-bundle.crt`) is likewise copied verbatim (a ConfigMap can't be
 cross-namespace-mounted).
 > **GOTCHA — if the operator rotates `whisker-backend-key-pair`, re-apply
 > `stacks/goldmane-edge-aggregator`** to re-sync the copied cert. Symptom of a
 > stale copy: the `aggregate` pod logs TLS handshake / `Flows.Stream` failures
 > and no `last_seen` updates land in the `edge` table. Hardening follow-up
 > (noted in the stack): mint an own-identity cert in-namespace if Whisker is ever
 > removed (which would delete the reused source Secret).
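 A stale copy can be detected by comparing the two Secrets directly. A minimal sketch, assuming `kubectl` access to both namespaces; `cert_copy_in_sync` is a hypothetical helper, while the Secret names, namespaces and the `tls.crt` key are the ones named above:
 ```shell
 # Hypothetical helper: succeed (exit 0) if the aggregator's copied client cert
 # still matches the operator-minted source cert.
 cert_copy_in_sync() {
     local src copy
     # jsonpath needs the dot in the key name escaped: tls\.crt
     src=$(kubectl get secret whisker-backend-key-pair -n calico-system \
         -o jsonpath='{.data.tls\.crt}') || return 2
     copy=$(kubectl get secret goldmane-client-tls -n goldmane-edge-aggregator \
         -o jsonpath='{.data.tls\.crt}') || return 2
     # Both values are base64 strings of the same PEM when in sync.
     [ -n "$src" ] && [ "$src" = "$copy" ]
 }
 ```
 Possible usage: `cert_copy_in_sync || echo "re-apply stacks/goldmane-edge-aggregator"`. Only the certificate is compared; the private key is never printed.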
 The Deployment leaves `GOLDMANE_HOST=goldmane.calico-system.svc.cluster.local:7443`
 and the default cert/CA paths; the default ServerName (host sans port) is a SAN
 on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` /
 `GOLDMANE_TLS_INSECURE` override is needed.
 ## How to query who-talks-to-whom
 `psql` into the DB (creds: Vault static role `static-creds/pg-goldmane-edges`, or
 exec a CNPG pod). All queries are against the single `edge` table.
 ```sql
 -- Everything talking to a namespace (inbound), most-active first
 SELECT src_ns, action, flow_count, first_seen, last_seen
 FROM edge WHERE dst_ns = '<ns>' ORDER BY flow_count DESC;
 -- Everything a namespace talks TO (outbound)
 SELECT dst_ns, action, flow_count, first_seen, last_seen
 FROM edge WHERE src_ns = '<ns>' ORDER BY last_seen DESC;
 -- New edges in the last 24h (what the digest reports)
 SELECT src_ns, dst_ns, action, flow_count, first_seen
 FROM edge WHERE first_seen > now() - interval '24 hours'
 ORDER BY first_seen DESC;
 -- Any DENIED edges (policy is dropping this pair)
 SELECT src_ns, dst_ns, flow_count, last_seen
 FROM edge WHERE action = 'deny' ORDER BY last_seen DESC;
 -- Full edge set as a graph adjacency list
 SELECT src_ns, dst_ns, action, flow_count FROM edge ORDER BY src_ns, dst_ns;
 ```
 For the **live** (sub-hour) view including pod/port detail, use the Whisker UI —
 the `edge` table intentionally aggregates that away.
 ## Deriving the Wave-1 egress allowlist from the edge table (infra #62)
 The durable edge set is a faster, identity-stamped data source for the existing
 **observe-then-enforce** egress effort (beads `code-8ywc`; snapshot
 `docs/architecture/wave1-egress-observation-2026-05-22.md`) than the original
 iptables-`LOG` → journald → Loki path (ADR-0014 consequence: "Enforcement gains
 a better data source"). It replaces the *internal* (namespace-to-namespace) leg
 of the allowlist; **external/public-internet egress is NOT in this table** (empty
 dst namespace, dropped) — for those destinations keep using the Calico flow-log
 path described in security.md.
 **Per-namespace internal egress allowlist** — the set of in-cluster namespaces a
 given source is *observed* talking to with `action='allow'`:
 ```sql
 -- Internal egress allowlist for one namespace (feeds its NetworkPolicy)
 SELECT DISTINCT dst_ns
 FROM edge
 WHERE src_ns = '<ns>' AND action = 'allow'
 ORDER BY dst_ns;
 ```
 ```sql
 -- Full internal egress matrix for all namespaces at once
 SELECT src_ns, array_agg(DISTINCT dst_ns ORDER BY dst_ns) AS allowed_dst_ns
 FROM edge
 WHERE action = 'allow'
 GROUP BY src_ns
 ORDER BY src_ns;
 ```
 ```sql
 -- Sanity: namespaces with a DENY edge already (policy is biting; investigate
 -- before tightening further)
 SELECT DISTINCT src_ns, dst_ns FROM edge WHERE action = 'deny';
 ```
 **How this feeds enforcement (scope):** the derived `dst_ns` set is the
 *internal* half of a namespace's egress allowlist — it tells you which
 in-cluster namespaces to permit before flipping that namespace to default-deny.
 The universal baseline (kube-dns :53, often dbaas :3306/:5432, redis :6379) and
 the external destinations still come from the Wave-1 observation snapshot.
 **Enforce-flips remain OUT OF SCOPE** here — this is observe-and-derive only;
 the phased per-namespace default-deny rollout (starting `recruiter-responder`)
 is tracked under `code-8ywc`. Cross-links:
 [security.md → NetworkPolicy Default-Deny Egress](../architecture/security.md#networkpolicy-default-deny-egress-wave-1--observe-then-enforce-tier-34),
 [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md),
 [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
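 To make the "internal half" concrete, the sketch below renders a NetworkPolicy manifest from the `dst_ns` list. It only prints YAML and applies nothing. The helper name `render_internal_egress_np` and the policy name `allow-observed-internal-egress` are hypothetical, and the output is deliberately incomplete: it carries no kube-dns baseline and no external destinations, so it is a review aid, not an enforce-flip.
 ```shell
 # Hypothetical helper: read destination namespaces on stdin (one per line,
 # e.g. the output of the DISTINCT dst_ns query) and print an egress-only
 # NetworkPolicy for the source namespace given as $1. Prints YAML only.
 render_internal_egress_np() {
     local src_ns="$1" dst_ns
     printf 'apiVersion: networking.k8s.io/v1\n'
     printf 'kind: NetworkPolicy\n'
     printf 'metadata:\n'
     printf '  name: allow-observed-internal-egress\n'
     printf '  namespace: %s\n' "$src_ns"
     printf 'spec:\n'
     printf '  podSelector: {}\n'
     printf '  policyTypes: [Egress]\n'
     printf '  egress:\n'
     while IFS= read -r dst_ns; do
         [ -n "$dst_ns" ] || continue
         printf '    - to:\n'
         printf '        - namespaceSelector:\n'
         printf '            matchLabels:\n'
         printf '              kubernetes.io/metadata.name: %s\n' "$dst_ns"
     done
 }
 ```
 Possible usage: `printf 'dbaas\nmonitoring\n' | render_internal_egress_np <ns>`, then diff the result against the Wave-1 observation snapshot before anything is merged into a real policy.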
 > **Caveat (same as the Wave-1 snapshot):** an edge only exists if it was
 > *observed*. A weekly CronJob or a 7-day Vault rotation may not have fired yet —
 > collect ≥7 days of edges before treating a namespace's `allow` set as
 > complete. The `first_seen` column tells you how long an edge has been known;
 > the digest surfaces brand-new ones daily.
 ## Monitoring & health (infra #61)
 The aggregator pod has **no `/metrics` endpoint** — health is inferred from
 kube-state-metrics. Three complementary signals (memory ids 6598, 6599;
 see also [monitoring.md → Security Alerts](../architecture/monitoring.md#security-alerts-wave-1--planned-beads-code-8ywc)):
 | Signal | What | Where |
 |---|---|---|
 | **`AggregatorDown`** | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` for 15m → warning | Prometheus alert group `Network Observability (Goldmane)` in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`; routes `slack-warning` → `#alerts` |
 | **`DigestFailing`** | `kube_job_status_failed{...job_name=~"goldmane-edges-digest.*"} > 0` within 24h, for 30m → warning | same alert group → `#alerts` |
 | **cluster-health #48** | `check_goldmane_aggregator` reads the Deployment's `Available` condition (missing or not-Available → FAIL) | `scripts/cluster_healthcheck.sh` (human / `--quiet` / `--json` modes; emits `goldmane_aggregator`) |
 The two alert layers are deliberately complementary: `AggregatorDown` →
 **no new edges land** in the DB; `DigestFailing` → **edges still land but nobody
 is told**. A freshness probe (#61b) was intentionally skipped — `AggregatorDown`
 is the agreed floor.
 ## Troubleshooting
 **Whisker UI 502 / unreachable.** The additive
 `kubernetes_network_policy_v1.whisker_allow_traefik` is missing or the
 operator's default-deny `whisker` NP regenerated — re-apply `stacks/calico`. A
 brand-new ingress host is also invisible to LAN split-horizon until the hourly
 `technitium-ingress-dns-sync` runs (memory #5349); test meanwhile with
 `curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
 (expect a 302 to Authentik — the gate working).
 **No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
 pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).
 Common causes, in order:
 1. **Stale mTLS cert** — the operator rotated `whisker-backend-key-pair`; re-apply
   `stacks/goldmane-edge-aggregator` (see cert-reuse gotcha above). Symptom: TLS
   handshake / `Flows.Stream` errors.
 2. **Stale DB password** — the 7-day Vault rotation bounced the credential but
   the pod kept the old one. The Deployment carries
   `secret.reloader.stakater.com/reload: goldmane-edges-db-creds`; if it's not
   restarting on rotation, verify the Reloader annotation and the ExternalSecret.
 3. **Goldmane restarted** — the in-memory window was lost (expected); the stream
   reconnects automatically and resumes upserting. No data loss in the DB
   (only the sub-hour live window in Whisker is gone).
 **Digest never posts / `DigestFailing` firing.** Inspect the most recent
 `goldmane-edges-digest-*` Job (`kubectl get jobs -n goldmane-edge-aggregator`;
 `kubectl logs -n goldmane-edge-aggregator job/<name>`). The CronJob's
 `ttl_seconds_after_finished=86400` GCs finished Jobs (and their pods) after a day,
 so check soon after a failed run. With `SLACK_WEBHOOK_URL`
 empty the binary forces a dry-run (no post) — verify the `goldmane-edges-slack`
 ExternalSecret resolved. A dry run / smoke test: run the image with `args:
 ["digest"]` + `DRY_RUN=1` to print the message instead of POSTing.
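 Finding the newest digest Job by hand is easy to get wrong when several runs are still retained. A minimal sketch, assuming the CronJob's Jobs are named `goldmane-edges-digest-<suffix>` as described above; `latest_digest_job` is a hypothetical helper name:
 ```shell
 # Hypothetical helper: print the name of the most recently created digest Job.
 latest_digest_job() {
     local ns="goldmane-edge-aggregator"
     # -o name prints "job.batch/<name>"; keep only digest Jobs, newest last.
     kubectl get jobs -n "$ns" --sort-by=.metadata.creationTimestamp -o name \
         | grep '/goldmane-edges-digest' \
         | tail -n 1 \
         | cut -d/ -f2
 }
 ```
 Possible usage: `kubectl logs -n goldmane-edge-aggregator "job/$(latest_digest_job)"`.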
 > Known state (2026-06-25): the digest CronJob's first Job **failed** and it has
 > never successfully posted (`lastSuccessfulTime` empty) — the digest leg is the
 > live gap; `DigestFailing` is catching it. Edges still land in the DB via the
 > `aggregate` Deployment; only the Slack digest notification is affected.
 > Investigation/fix belongs to the aggregator slice (#58/#60), not monitoring.
 **No edges at all in the table.** Confirm Goldmane is enabled
 (`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the
 `FELIX_FLOWLOGSGOLDMANESERVER` env; confirm the `goldmane-edges-db-init` Job
 completed; confirm the aggregator pod is `Running` and not `ImagePullBackOff`
 (ghcr allowlist).
 ## Related
 - [ADR-0014 — Service identity & east-west observability](../adr/0014-service-identity-and-east-west-observability.md)
 - [security.md — NetworkPolicy Default-Deny Egress + east-west flow observability](../architecture/security.md)
 - [monitoring.md — east-west flow observability + alerts](../architecture/monitoring.md)
 - [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md)
 - `CONTEXT.md` glossary — **Service identity**, **Goldmane / Whisker**
 - Code: `~/code/goldmane-edge-aggregator` (`README.md`, `DEPLOY.md`); stacks
  `stacks/goldmane-edge-aggregator`, `stacks/calico`
--- a/docs/runbooks/k8s-version-upgrade.md
+++ b/docs/runbooks/k8s-version-upgrade.md
@ -41,6 +41,8 @@ Job 0 — preflight       (pinned: k8s-node1)
  ├── halt-on-alert (kured-style ignore-list)
  ├── 24h-quiet baseline (no Ready transitions <24h ago)
  ├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume)
  ├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block)
  ├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups)
  ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
  ├── Trigger backup-etcd Job, wait, verify snapshot byte count
  ├── SSH master: containerd skew fix (if master < workers)
@ -222,22 +224,34 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names
 ## Common Operations
-### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19)
+### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24)
 `kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
-and drops the `--authentication-config` flag**, silently disabling apiserver
+from kubeadm-config**. apiserver auth uses a structured multi-issuer
-OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get
+`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to
-401). This used to require a manual re-apply after **every** control-plane bump.
+still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade
 reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does
 NOT crash on this — verified by isolated repro; it's recoverable via the restore
 script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue —
 etcd IO starvation**, not this drift; post-mortem:
 `docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`.
-**Now automated:** the `rbac` stack publishes its OIDC restore script to the
+**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now
-`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's
+**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting
-`phase_master` re-runs it on master immediately after `kubeadm upgrade apply`
+`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of
-(while tigera-operator is still quiesced, so the flag-add apiserver restart can't
+its remote script. So kubeadm regenerates a **correct** manifest and the apiserver
-crashloop the operator). It's idempotent, health-gates `/livez` with
+upgrades with a pure image bump — `kubeadm upgrade diff <target>` shows only the
-auto-rollback, and is **non-fatal** — a failure only lags SSO until the next rbac
+image change. Zero live impact (the CM is read only during an upgrade).
-apply (the version upgrade itself already succeeded). So a chain-driven
+
-control-plane bump no longer breaks SSO. The master phase self-skips when master
+**Backstops:**
-is already at target, so this only runs when master was actually upgraded.
+- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does
  NOT block — the drift only breaks SSO, which is recoverable) if
  `--authentication-config` would still be dropped.
 - The `rbac` stack still publishes its restore script to the
  `kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on
  master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with
  auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also*
  re-reconciles kubeadm-config. Self-skips when master is already at target.
 **Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the
 chain logged `WARN: --authentication-config absent after re-apply`:
--- a/scripts/cluster_healthcheck.sh
+++ b/scripts/cluster_healthcheck.sh
@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
 [[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
 KUBECTL=""
 JSON_RESULTS=()
-TOTAL_CHECKS=47
+TOTAL_CHECKS=48
 # Parallel execution settings. Each check function is self-contained — it
 # only reads cluster state and mutates the in-memory counters / JSON_RESULTS
@ -3156,6 +3156,44 @@ PYEOF
    esac
 }
 # --- 48. Goldmane edge-aggregator availability ---
 #
 # The goldmane-edge-aggregator Deployment (ADR-0014 / infra #58) streams Calico
 # Goldmane flows into the goldmane_edges CNPG DB — the durable who-talks-to-whom
 # trail. The pod has NO /metrics endpoint, so its liveness can't be scraped;
 # this check reads the Deployment's Available condition directly so the trail
 # silently dying surfaces in the health board (mirrors the AggregatorDown
 # Prometheus alert). Missing Deployment / not-Available -> FAIL.
 check_goldmane_aggregator() {
    section 48 "Goldmane Edge-Aggregator"
    local ns="goldmane-edge-aggregator" dep="goldmane-edge-aggregator"
    local avail desired ready
    # Existence probe first; an absent Deployment is a hard fail (the trail isn't deployed).
    if ! $KUBECTL get deploy "$dep" -n "$ns" >/dev/null 2>&1; then
        [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
        fail "Deployment $ns/$dep not found — who-talks-to-whom edge trail is not running"
        json_add "goldmane_aggregator" "FAIL" "deployment missing"
        return 0
    fi
    avail=$($KUBECTL get deploy "$dep" -n "$ns" \
        -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)
    ready=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.status.readyReplicas}' 2>/dev/null)
    desired=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.spec.replicas}' 2>/dev/null)
    ready=${ready:-0}
    desired=${desired:-0}
    if [[ "$avail" == "True" ]]; then
        pass "Edge-aggregator Available ($ready/$desired ready)"
        json_add "goldmane_aggregator" "PASS" "${ready}/${desired} ready"
    else
        [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
        fail "Edge-aggregator NOT Available ($ready/$desired ready) — edge trail has stopped recording"
        json_add "goldmane_aggregator" "FAIL" "${ready}/${desired} ready; Available=${avail:-unknown}"
    fi
 }
 # --- Summary ---
 print_summary() {
    if [[ "$JSON" == true ]]; then
@ -3224,7 +3262,7 @@ main() {
        check_monitoring_prom_am check_monitoring_vault check_monitoring_css
        check_external_replicas check_external_divergence check_pve_thermals
        check_pve_load check_external_traefik_5xx check_ha_status_dashboard
-        check_immich_search check_csi_ghost_drift
+        check_immich_search check_csi_ghost_drift check_goldmane_aggregator
    )
    # Auto-fix mutates cluster state inside individual checks — keep that
--- a/stacks/calico/main.tf
+++ b/stacks/calico/main.tf
@ -212,3 +212,65 @@ resource "kubectl_manifest" "whisker" {
    spec       = { notifications = "Disabled" }
  })
 }
 # ---------------------------------------------------------------------------
 # Gated public ingress for the Whisker UI (infra #57 / ADR-0014).
 #
 # whisker.viktorbarzin.me -> whisker:8081, Authentik-gated (auth="required":
 # Whisker ships NO own login — it's an admin observability UI, so Authentik
 # forward-auth is the only gate between strangers and the flow view). The
 # operator replicated `tls-secret` into calico-system already.
 #
 # TWO coupled pieces are required because the operator's own `whisker`
 # NetworkPolicy (owned by the Whisker CR above) sets policyTypes:[Ingress]
 # with NO ingress rules => default-deny on ingress to the whisker pod. The
 # additive NP below ORs in a Traefik allow (k8s NetworkPolicies are additive
 # across policies selecting the same pod), so we never edit the operator NP.
 module "ingress_whisker" {
  source          = "../../modules/kubernetes/ingress_factory"
  dns_type        = "proxied"
  namespace       = "calico-system"
  name            = "whisker"
  service_name    = "whisker"
  port            = 8081
  auth            = "required"
  tls_secret_name = "tls-secret"
  extra_annotations = {
    "gethomepage.dev/enabled"     = "true"
    "gethomepage.dev/name"        = "Whisker"
    "gethomepage.dev/description" = "Calico flow observability (who-talks-to-whom)"
    "gethomepage.dev/icon"        = "calico.png"
    "gethomepage.dev/group"       = "Infrastructure"
  }
 }
 # Additive NetworkPolicy: permit Traefik -> whisker:8081. ORs with the
 # operator's default-deny `whisker` NP (selecting the same pod) so Traefik
 # can reach the UI without touching the operator-owned policy.
 resource "kubernetes_network_policy_v1" "whisker_allow_traefik" {
  metadata {
    name      = "whisker-allow-traefik"
    namespace = "calico-system"
  }
  spec {
    pod_selector {
      match_labels = {
        "app.kubernetes.io/name" = "whisker"
      }
    }
    policy_types = ["Ingress"]
    ingress {
      from {
        namespace_selector {
          match_labels = {
            "kubernetes.io/metadata.name" = "traefik"
          }
        }
      }
      ports {
        port     = "8081"
        protocol = "TCP"
      }
    }
  }
 }
--- a/stacks/dbaas/modules/dbaas/main.tf
+++ b/stacks/dbaas/modules/dbaas/main.tf
@ -745,7 +745,10 @@ resource "kubernetes_deployment" "phpmyadmin" {
    labels = {
      "app" = "phpmyadmin"
      tier  = var.tier
-
+      # ADR-0014 service identity: dbaas is a multi-Service namespace, so the
      # namespace alone can't attribute Goldmane flows. Value = the fronting
      # Service name (kubernetes_service.phpmyadmin is named "pma").
      "service-identity" = "pma"
    }
    annotations = {
      "reloader.stakater.com/search" = "true"
@ -762,6 +765,10 @@ resource "kubernetes_deployment" "phpmyadmin" {
      metadata {
        labels = {
          "app" = "phpmyadmin"
          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
          # disambiguating identity must live on the pod template (not just
          # the Deployment metadata above). Not in selector → no replace.
          "service-identity" = "pma"
        }
      }
      spec {
@ -812,8 +819,19 @@ resource "kubernetes_deployment" "phpmyadmin" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+    ignore_changes = [
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
      # This Deployment is Keel-enrolled (keel.sh/policy=patch). Ignore the
      # attributes Keel/Kyverno mutate at runtime so `terragrunt apply` (incl.
      # the daily drift plan) doesn't fight them or revert the live image —
      # canonical KEEL/KYVERNO lifecycle guard, matches linkwarden/chrome-service.
      metadata[0].annotations["keel.sh/policy"],
      metadata[0].annotations["keel.sh/trigger"],
      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
      metadata[0].annotations["keel.sh/match-tag"],
      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
    ]
  }
 }
@ -1499,6 +1517,10 @@ resource "kubernetes_deployment" "pgadmin" {
    }
    labels = {
      tier = var.tier
      # ADR-0014 service identity: dbaas is a multi-Service namespace, so the
      # namespace alone can't attribute Goldmane flows. Value = the fronting
      # Service name (kubernetes_service.pgadmin is named "pgadmin").
      "service-identity" = "pgadmin"
    }
  }
  spec {
@ -1514,6 +1536,10 @@ resource "kubernetes_deployment" "pgadmin" {
      metadata {
        labels = {
          app = "pgadmin"
          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
          # disambiguating identity must live on the pod template (not just
          # the Deployment metadata above). Not in selector → no replace.
          "service-identity" = "pgadmin"
        }
      }
      spec {
@ -1568,8 +1594,20 @@ resource "kubernetes_deployment" "pgadmin" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+    ignore_changes = [
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
      # This Deployment is Keel-enrolled (keel.sh/policy=patch) and Keel has
      # bumped the live image (dpage/pgadmin4:9.16). Ignore the Keel/Kyverno
      # runtime-mutated attributes so `terragrunt apply` (incl. the daily drift
      # plan) doesn't revert the image to bare `dpage/pgadmin4` or strip Keel's
      # annotations — canonical guard, matches linkwarden/chrome-service.
      metadata[0].annotations["keel.sh/policy"],
      metadata[0].annotations["keel.sh/trigger"],
      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
      metadata[0].annotations["keel.sh/match-tag"],
      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
    ]
  }
 }
 resource "kubernetes_service" "pgadmin" {
--- a/stacks/goldmane-edge-aggregator/main.tf
+++ b/stacks/goldmane-edge-aggregator/main.tf
@ -0,0 +1,496 @@
 # =============================================================================
 # goldmane-edge-aggregator — durable who-talks-to-whom audit trail (ADR-0014 / #58)
 # =============================================================================
 # A small Go service that streams Calico Goldmane's gRPC Flows API (mTLS) and
 # upserts the unique service-to-service edge set into Postgres, plus a daily
 # Slack digest CronJob of first-seen edges. Code lives in the standalone
 # `goldmane-edge-aggregator` repo; the authoritative deploy spec is its
 # DEPLOY.md. This stack is the infra side of that spec.
 #
 # Goldmane runs as `Service goldmane:7443` (gRPC/mTLS) in calico-system, enabled
 # via the operator CR in stacks/calico/main.tf. The durable Loki path is NOT
 # the operator CRs — this service IS the durable trail.
 #
 # Structure mirrors stacks/claude-memory (the canonical Tier-1 pattern): a
 # per-service namespace, a CNPG Postgres DB + role + Vault 7-day rotation +
 # ExternalSecret -> DATABASE_URL, the Reloader annotation, and the
 # Terragrunt-generated backend.tf/providers.tf/tiers.tf layout. The novel bit is
 # the Goldmane mTLS client cert: it is reused from the operator-minted,
 # Tigera-CA-signed `whisker-backend-key-pair` rather than minted here (section 2).
 #
 # IMAGE: ghcr.io/viktorbarzin/goldmane-edge-aggregator is PRIVATE. Onboarding
 # MUST add the "goldmane-edge-aggregator" namespace to the ghcr-credentials
 # Kyverno allowlist (stacks/kyverno/modules/kyverno/ghcr-credentials.tf,
 # local.ghcr_private_namespaces) so the Kyverno-synced `ghcr-credentials` secret
 # is cloned into this namespace — otherwise the pulls 401. The imagePullSecrets
 # reference below assumes that entry exists.
 # =============================================================================
 variable "postgresql_host" { type = string }
 # Plan-time root creds for the idempotent DB-init Job (mirrors claude-memory).
 data "vault_kv_secret_v2" "secrets" {
  mount = "secret"
  name  = "goldmane-edge-aggregator"
 }
 # -----------------------------------------------------------------------------
 # 1. Namespace
 # -----------------------------------------------------------------------------
 resource "kubernetes_namespace" "goldmane_edge_aggregator" {
  metadata {
    name = "goldmane-edge-aggregator"
    labels = {
      name = "goldmane-edge-aggregator"
      # Tier 4-aux: a small off-path consumer service, like claude-memory.
      tier               = local.tiers.aux
      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
  }
 }
 # -----------------------------------------------------------------------------
 # 2. Goldmane mTLS client certificate (reused Tigera-CA-signed key pair)
 # -----------------------------------------------------------------------------
 # The aggregator dials goldmane:7443 over mutual TLS. Goldmane requires mTLS on
 # :7443 and verifies only that the client cert chains to the Tigera CA (the same
 # CA that issues Goldmane's serving cert); it does NOT authorize by client
 # identity, so ANY Tigera-CA-signed cert is accepted. Rather than copy the
 # Tigera CA PRIVATE KEY into TF state to mint our own (a needless CA-key
 # exposure; the hashicorp/tls provider is also incompatible with this repo's
 # global generate-providers/lockfile pattern), we REUSE the operator-minted,
 # Tigera-CA-signed client cert `whisker-backend-key-pair` (calico-system). We
 # never touch the CA key.
 # Trade-off: if the operator rotates that cert, re-apply to re-sync (hardening
 # follow-up: mint an own-identity cert in-namespace if Whisker is ever removed).
 data "kubernetes_secret" "whisker_backend" {
  metadata {
    name      = "whisker-backend-key-pair"
    namespace = "calico-system"
  }
 }
 # The CA bundle that verifies Goldmane's serving cert. It lives ONLY in
 # calico-system (verified: ConfigMap `tigera-ca-bundle`, 2 keys present —
 # `ca-bundle.crt` AND `tigera-ca-bundle.crt`, both the trusted bundle). We read
 # it and recreate it as a ConfigMap in this namespace so the pod can mount it
 # (a ConfigMap cannot be cross-namespace-mounted).
 data "kubernetes_config_map" "tigera_ca_bundle" {
  metadata {
    name      = "tigera-ca-bundle"
    namespace = "calico-system"
  }
 }
 resource "kubernetes_config_map" "tigera_ca_bundle" {
  metadata {
    name      = "tigera-ca-bundle"
    namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
  }
  # Copy the upstream bundle verbatim. We mount the `tigera-ca-bundle.crt` key
  # at /etc/tigera-ca/tigera-ca-bundle.crt so the service's default
  # CA_CERT_PATH (/etc/tigera-ca/tigera-ca-bundle.crt) resolves with no override.
  data = data.kubernetes_config_map.tigera_ca_bundle.data
 }
 # Client cert + key for mTLS to goldmane:7443, mounted at TLS_CERT_PATH /
 # TLS_KEY_PATH defaults (/etc/goldmane-client-tls/tls.crt and .../tls.key).
 # Sourced verbatim from the operator's whisker-backend client key-pair (read
 # above) — already Tigera-CA-signed, which is all Goldmane verifies. No CA key
 # is touched and no cross-namespace CA RBAC is needed.
 resource "kubernetes_secret" "goldmane_client_tls" {
  metadata {
    name      = "goldmane-client-tls"
    namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
  }
  type = "Opaque"
  data = {
    "tls.crt" = data.kubernetes_secret.whisker_backend.data["tls.crt"]
    "tls.key" = data.kubernetes_secret.whisker_backend.data["tls.key"]
  }
 }
 # -----------------------------------------------------------------------------
 # 3. Postgres: DB + role `goldmane_edges`, Vault 7-day rotation, DATABASE_URL
 # -----------------------------------------------------------------------------
 # Idempotent create of the role + DB using the CNPG root creds from Vault
 # (dbaas_root_password), exactly mirroring claude-memory's db_init Job. The
 # service creates the `edge` table itself at startup (migrations/0001_edge.sql),
 # so no migration Job is needed.
 resource "kubernetes_job" "db_init" {
  metadata {
    name      = "goldmane-edges-db-init"
    namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
  }
  spec {
    template {
      metadata {}
      spec {
        container {
          name  = "db-init"
          image = "postgres:16-alpine"
          command = [
            "sh", "-c",
            <<-EOT
              set -e
              # -d postgres: psql defaults the database name to the username;
              # the root user has no root-named database, so be explicit.
              PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_roles WHERE rolname='goldmane_edges'" | grep -q 1 || \
                PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE ROLE goldmane_edges WITH LOGIN PASSWORD '${data.vault_kv_secret_v2.secrets.data["db_password"]}'"
              PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_database WHERE datname='goldmane_edges'" | grep -q 1 || \
                PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE DATABASE goldmane_edges OWNER goldmane_edges"
              PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "GRANT ALL PRIVILEGES ON DATABASE goldmane_edges TO goldmane_edges"
              echo "Database init complete"
            EOT
          ]
        }
        restart_policy = "Never"
      }
    }
    backoff_limit = 3
  }
  wait_for_completion = true
  timeouts {
    create = "2m"
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: Kyverno injects dns_config (ndots=2); ignore it so
    # this idempotent Job isn't replaced (Jobs are immutable) on every apply.
    ignore_changes = [spec[0].template[0].spec[0].dns_config]
  }
 }
 # ExternalSecret projecting the Vault-rotated (7-day) credential into a K8s
 # Secret as DATABASE_URL. The Vault DB static role `pg-goldmane-edges` and its
 # place in the CNPG connection allowlist are added in stacks/vault/main.tf
 # (see this stack's terragrunt.hcl note). remoteRef key: static-creds/pg-goldmane-edges.
 resource "kubernetes_manifest" "db_external_secret" {
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "goldmane-edges-db-creds"
      namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
    }
    spec = {
      refreshInterval = "15m"
      secretStoreRef = {
        name = "vault-database"
        kind = "ClusterSecretStore"
      }
      target = {
        name = "goldmane-edges-db-creds"
        template = {
          data = {
            DATABASE_URL = "postgresql://goldmane_edges:{{ .password }}@${var.postgresql_host}:5432/goldmane_edges"
          }
        }
      }
      data = [{
        secretKey = "password"
        remoteRef = {
          key      = "static-creds/pg-goldmane-edges"
          property = "password"
        }
      }]
    }
  }
  depends_on = [kubernetes_namespace.goldmane_edge_aggregator]
 }
 # -----------------------------------------------------------------------------
 # 4. Slack webhook (reuse the alert-digest incoming webhook)
 # -----------------------------------------------------------------------------
 # The monitoring alert-digest CronJob posts with the Slack incoming webhook at
 # Vault secret/monitoring -> key `alertmanager_slack_api_url`
 # (stacks/monitoring/modules/monitoring/alert_digest.tf). Project that same URL
 # into this namespace as SLACK_WEBHOOK_URL via an ExternalSecret (no new
 # webhook). The digest defaults to #security, but the CronJob in section 6
 # overrides SLACK_CHANNEL to #alerts (see the comment on that env var).
 resource "kubernetes_manifest" "slack_external_secret" {
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "goldmane-edges-slack"
      namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
    }
    spec = {
      refreshInterval = "1h"
      secretStoreRef = {
        name = "vault-kv"
        kind = "ClusterSecretStore"
      }
      target = {
        name = "goldmane-edges-slack"
      }
      data = [{
        secretKey = "SLACK_WEBHOOK_URL"
        remoteRef = {
          key      = "viktor"
          property = "alertmanager_slack_api_url"
        }
      }]
    }
  }
  depends_on = [kubernetes_namespace.goldmane_edge_aggregator]
 }
 # -----------------------------------------------------------------------------
 # 5. aggregate — Deployment (long-running gRPC stream -> Postgres upserts)
 # -----------------------------------------------------------------------------
 resource "kubernetes_deployment" "aggregate" {
  depends_on = [
    kubernetes_job.db_init,
    kubernetes_manifest.db_external_secret,
  ]
  metadata {
    name      = "goldmane-edge-aggregator"
    namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
    labels = {
      app  = "goldmane-edge-aggregator"
      tier = local.tiers.aux
    }
    annotations = {
      # Credential is env-injected and read only at startup; the 7-day rotation
      # must bounce the pod or it keeps the stale password and silently fails
      # DB auth (infra CLAUDE.md Reloader rule).
      "secret.reloader.stakater.com/reload" = "goldmane-edges-db-creds"
    }
  }
  spec {
    # 1 replica: the edge set is a global upsert keyed on (src_ns, dst_ns,
    # action); a second replica only doubles writes for no benefit (Goldmane
    # streams per-flow). Stateless (no PVC) so RollingUpdate is fine.
    replicas = 1
    selector {
      match_labels = {
        app = "goldmane-edge-aggregator"
      }
    }
    template {
      metadata {
        labels = {
          app = "goldmane-edge-aggregator"
        }
      }
      spec {
        # PRIVATE ghcr image: the `ghcr-credentials` pull secret is cloned into
        # this namespace by the Kyverno sync-ghcr-credentials allowlist policy
        # (this ns must be on that list).
        image_pull_secrets {
          name = "ghcr-credentials"
        }
        container {
          name = "aggregate"
          # CI (GHA -> ghcr) overwrites this to :<sha8> via `kubectl set image`;
          # the image tag is in ignore_changes below so the SHA sticks across
          # `terragrunt apply` (fleet image-pin convention). Placeholder :latest
          # until the deploy pipeline runs.
          image = "ghcr.io/viktorbarzin/goldmane-edge-aggregator:latest"
          args  = ["aggregate"]
          # Goldmane mTLS. By default the TLS ServerName is the GOLDMANE_HOST
          # host with the port stripped, i.e.
          # "goldmane.calico-system.svc.cluster.local", which is a SAN on the
          # live Goldmane serving cert (verified 2026-06-24:
          # DNS:goldmane{,.calico-system{,.svc{,.cluster.local}}}). So no
          # GOLDMANE_SERVER_NAME override and no GOLDMANE_TLS_INSECURE is needed.
          env {
            name  = "GOLDMANE_HOST"
            value = "goldmane.calico-system.svc.cluster.local:7443"
          }
          # TLS_CERT_PATH / TLS_KEY_PATH / CA_CERT_PATH are left at their image
          # defaults (/etc/goldmane-client-tls/tls.{crt,key} and
          # /etc/tigera-ca/tigera-ca-bundle.crt) — the mounts below match them.
          env {
            name = "DATABASE_URL"
            value_from {
              secret_key_ref {
                name = "goldmane-edges-db-creds"
                key  = "DATABASE_URL"
              }
            }
          }
          volume_mount {
            name       = "goldmane-client-tls"
            mount_path = "/etc/goldmane-client-tls"
            read_only  = true
          }
          volume_mount {
            name       = "tigera-ca"
            mount_path = "/etc/tigera-ca"
            read_only  = true
          }
          resources {
            # Idles low: a single gRPC stream + periodic upserts. Memory
            # requests=limits per the repo memory rule; no CPU limit (avoids CFS
            # throttling). Right-size later with krr.
            requests = {
              cpu    = "10m"
              memory = "64Mi"
            }
            limits = {
              memory = "64Mi"
            }
          }
        }
        volume {
          name = "goldmane-client-tls"
          secret {
            secret_name = kubernetes_secret.goldmane_client_tls.metadata[0].name
          }
        }
        volume {
          name = "tigera-ca"
          config_map {
            name = kubernetes_config_map.tigera_ca_bundle.metadata[0].name
          }
        }
      }
    }
  }
  lifecycle {
    ignore_changes = [
      # CI pipeline owns the image tag (kubectl set image from GHA/Woodpecker).
      spec[0].template[0].spec[0].container[0].image,
      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
      metadata[0].annotations["keel.sh/policy"],
      metadata[0].annotations["keel.sh/trigger"],
      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
      metadata[0].annotations["keel.sh/match-tag"],
      metadata[0].annotations["kubernetes.io/change-cause"],
      metadata[0].annotations["deployment.kubernetes.io/revision"],
      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
    ]
  }
 }
 # -----------------------------------------------------------------------------
 # 6. digest — daily CronJob (first-seen edges -> Slack)
 # -----------------------------------------------------------------------------
 resource "kubernetes_cron_job_v1" "digest" {
  depends_on = [
    kubernetes_job.db_init,
    kubernetes_manifest.db_external_secret,
    kubernetes_manifest.slack_external_secret,
  ]
  metadata {
    name      = "goldmane-edges-digest"
    namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
    labels = {
      app  = "goldmane-edge-aggregator"
      tier = local.tiers.aux
    }
  }
  spec {
    # Daily 08:00 Europe/London — aligns with the alert-digest cadence.
    schedule                      = "0 8 * * *"
    timezone                      = "Europe/London"
    concurrency_policy            = "Forbid"
    successful_jobs_history_limit = 3
    failed_jobs_history_limit     = 3
    starting_deadline_seconds     = 600
    job_template {
      metadata {
        labels = {
          app = "goldmane-edge-aggregator"
        }
        annotations = {
          # 7-day DB rotation: bounce the Job pod's stale env (Reloader rule).
          "secret.reloader.stakater.com/reload" = "goldmane-edges-db-creds"
        }
      }
      spec {
        backoff_limit              = 2
        active_deadline_seconds    = 300
        ttl_seconds_after_finished = 86400
        template {
          metadata {
            labels = {
              app = "goldmane-edge-aggregator"
            }
          }
          spec {
            restart_policy = "OnFailure"
            image_pull_secrets {
              name = "ghcr-credentials"
            }
            container {
              name = "digest"
              # CronJobs track :latest + imagePullPolicy: Always (fleet
              # convention) so the daily run picks up the current image.
              image             = "ghcr.io/viktorbarzin/goldmane-edge-aggregator:latest"
              image_pull_policy = "Always"
              args              = ["digest"]
              env {
                name = "DATABASE_URL"
                value_from {
                  secret_key_ref {
                    name = "goldmane-edges-db-creds"
                    key  = "DATABASE_URL"
                  }
                }
              }
              env {
                name = "SLACK_WEBHOOK_URL"
                value_from {
                  secret_key_ref {
                    name = "goldmane-edges-slack"
                    key  = "SLACK_WEBHOOK_URL"
                  }
                }
              }
              env {
                name = "SLACK_CHANNEL"
                # The shared alertmanager_slack_api_url incoming webhook's Slack
                # app is NOT a member of #security, so overriding the channel to
                # it returns HTTP 404 channel_not_found (verified 2026-06-25).
                # alertmanager's own slack-security receiver shares this webhook
                # and almost certainly hits the same wall. Post to #alerts (the
                # webhook's working channel, same as alert-digest) until the app
                # is invited to #security, then flip this back. See
                # docs/runbooks/goldmane-flow-trail.md.
                value = "#alerts"
              }
              resources {
                requests = {
                  cpu    = "10m"
                  memory = "64Mi"
                }
                limits = {
                  memory = "64Mi"
                }
              }
            }
          }
        }
      }
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1 (CronJob path): Kyverno mutates dns_config with ndots=2.
    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
  }
 }
 # -----------------------------------------------------------------------------
 # 7. Egress (default-deny consideration)
 # -----------------------------------------------------------------------------
 # Goldmane's own NetworkPolicy already allows INGRESS on 7443 from anywhere, so
 # nothing is needed on the Goldmane side. No egress policy is declared here:
 # this namespace is default-allow egress today. IF/WHEN it is brought under the
 # wave-1 default-deny egress enforcement (per-namespace allowlists), add
 # (Global)NetworkPolicy egress rules permitting:
 #   - goldmane.calico-system.svc.cluster.local:7443 (the flow stream)
 #   - pg-cluster-rw.dbaas.svc.cluster.local:5432    (Postgres)
 #   - hooks.slack.com:443                            (digest -> Slack, internet)
 #   - kube-dns / CoreDNS :53                         (DNS, every namespace)
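 #
 # A minimal sketch of the allowlist above as a vanilla NetworkPolicy, kept
 # commented out because nothing is enforced today. Assumptions (NOT taken from
 # this repo): the resource label "egress_allowlist"; the automatic namespace
 # label kubernetes.io/metadata.name; kube-dns living in kube-system; Service
 # ports equal to pod ports; and 0.0.0.0/0:443 standing in for hooks.slack.com,
 # since a plain NetworkPolicy cannot match FQDNs. Wave-1 enforcement may well
 # use a different policy kind.
 #
 # resource "kubernetes_network_policy" "egress_allowlist" {
 #   metadata {
 #     name      = "egress-allowlist"
 #     namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
 #   }
 #   spec {
 #     pod_selector {} # empty selector = every pod in the namespace
 #     policy_types = ["Egress"]
 #
 #     egress { # Goldmane flow stream
 #       to {
 #         namespace_selector {
 #           match_labels = { "kubernetes.io/metadata.name" = "calico-system" }
 #         }
 #       }
 #       ports {
 #         port     = "7443"
 #         protocol = "TCP"
 #       }
 #     }
 #
 #     egress { # Postgres
 #       to {
 #         namespace_selector {
 #           match_labels = { "kubernetes.io/metadata.name" = "dbaas" }
 #         }
 #       }
 #       ports {
 #         port     = "5432"
 #         protocol = "TCP"
 #       }
 #     }
 #
 #     egress { # DNS (assumed to be served from kube-system)
 #       to {
 #         namespace_selector {
 #           match_labels = { "kubernetes.io/metadata.name" = "kube-system" }
 #         }
 #       }
 #       ports {
 #         port     = "53"
 #         protocol = "UDP"
 #       }
 #       ports {
 #         port     = "53"
 #         protocol = "TCP"
 #       }
 #     }
 #
 #     egress { # digest -> Slack; broad stand-in for hooks.slack.com
 #       to {
 #         ip_block {
 #           cidr = "0.0.0.0/0"
 #         }
 #       }
 #       ports {
 #         port     = "443"
 #         protocol = "TCP"
 #       }
 #     }
 #   }
 # }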
--- a/stacks/goldmane-edge-aggregator/terragrunt.hcl
+++ b/stacks/goldmane-edge-aggregator/terragrunt.hcl
@ -0,0 +1,24 @@
 include "root" {
  path = find_in_parent_folders()
 }
 # Tier-1 stack (PG state backend). The root terragrunt.hcl generates backend.tf
 # (pg backend, schema_name = "goldmane-edge-aggregator"), providers.tf,
 # cloudflare_provider.tf and tiers.tf automatically — do NOT hand-write those.
 # This stack adds the hashicorp/tls provider via a local versions.tf (merged
 # into the generated required_providers).
 dependency "platform" {
  config_path  = "../platform"
  skip_outputs = true
 }
 dependency "vault" {
  config_path  = "../vault"
  skip_outputs = true
 }
 # The Vault DB static role pg-goldmane-edges (7-day rotation) and the CNPG
 # connection allowlist entry live in the vault stack (stacks/vault/main.tf).
 # The vault dependency above orders this stack after it so the ExternalSecret
 # can materialize the rotated credential on first apply.
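 #
 # For orientation, a hedged sketch of what that static role could look like in
 # the vault stack. The resource label, the "database" mount path and the
 # connection name "postgresql" are assumptions for illustration only; the role
 # name, the DB user and the 7-day period are the only values taken from this
 # stack. Check stacks/vault/main.tf for the real definition.
 #
 # resource "vault_database_secret_backend_static_role" "pg_goldmane_edges" {
 #   backend         = "database"   # assumed mount path
 #   name            = "pg-goldmane-edges"
 #   db_name         = "postgresql" # assumed connection name
 #   username        = "goldmane_edges"
 #   rotation_period = 604800       # 7 days, in seconds
 # }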
--- a/stacks/instagram-poster/modules/instagram-poster/main.tf
+++ b/stacks/instagram-poster/modules/instagram-poster/main.tf
@ -35,6 +35,14 @@ resource "kubernetes_namespace" "instagram_poster" {
 #     - immich_tag_instagram      (optional — auto-resolved if missing)
 #     - immich_tag_posted         (optional — auto-resolved if missing)
 resource "kubernetes_manifest" "external_secret" {
  # The external-secrets controller takes server-side-apply ownership of
  # .spec.refreshInterval, so a plain TF apply conflicts. force_conflicts lets
  # TF win (values match, so it's stable) — same pattern as grafana/woodpecker/
  # traefik/k8s-version-upgrade. Surfaced 2026-06-24 by the first IG apply since
  # the ESO v1 migration (the scale-to-0 push).
  field_manager {
    force_conflicts = true
  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -139,6 +147,11 @@ resource "kubernetes_manifest" "external_secret" {
 # ESO refreshes the K8s Secret every 15m. `reloader.stakater.com/match`
 # bounces the pod when the password changes.
 resource "kubernetes_manifest" "benchmark_db_external_secret" {
  # See external_secret above — ESO owns .spec.refreshInterval; force_conflicts
  # lets the TF apply win instead of erroring on the field-manager conflict.
  field_manager {
    force_conflicts = true
  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
@ -227,7 +240,11 @@ resource "kubernetes_deployment" "instagram_poster" {
  }
  spec {
-    replicas = 1
+    # Scaled to 0 (2026-06-24): Instagram Graph integration is unused and its
    # ExternalSecret is dead (missing ig_graph_long_lived_token /
    # ig_business_account_id in Vault secret/instagram-poster). Set back to 1
    # after minting a Meta long-lived token and populating those keys.
    replicas = 0
    # RWO PVC — cannot rolling-update.
    strategy {
      type = "Recreate"
--- a/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
+++ b/stacks/k8s-version-upgrade/scripts/upgrade-step.sh
@ -416,6 +416,39 @@ phase_preflight() {
    fi
  fi
  # 4b. apiserver-OIDC drift check (backstop for the rbac stack's kubeadm-config
  # reconciliation). A `kubeadm upgrade` REGENERATES the apiserver manifest from
  # kubeadm-config; if kubeadm-config still carries the legacy single-issuer
  # --oidc-* args instead of --authentication-config, the regenerated apiserver
  # loses structured multi-issuer auth → kubectl + dashboard SSO break AFTER the
  # upgrade. This is RECOVERABLE (the apiserver does NOT crash — verified by an
  # isolated repro 2026-06-24; the chain's post-master restore.sh re-adds the flag,
  # and the rbac stack reconciles kubeadm-config so it won't recur) — so this is an
  # ALERT, not a block. (NB the 2026-06-24 stall was NOT this — it was etcd IO
  # starvation; see docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md.)
  # Skip on an at-target master (resume — no apiserver regen).
  if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
    local apiserver_diff
    apiserver_diff=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" "sudo kubeadm upgrade diff v$TARGET_VERSION 2>/dev/null" || true)
    if echo "$apiserver_diff" | grep -qE '^-[[:space:]].*--authentication-config'; then
      slack "WARN preflight — kubeadm upgrade will DROP --authentication-config (kubeadm-config OIDC drift). SSO breaks post-upgrade until restore.sh re-adds it; re-apply the rbac stack to reconcile kubeadm-config. Proceeding (recoverable, not a crash)."
    fi
  fi
  # 4c. Reclaim kubeadm scratch on master. `kubeadm upgrade apply` dumps a full
  # ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
  # every etcd upgrade and NEVER cleans it up — 145 dirs / 28GB had accumulated by
  # 2026-06-24, pushing master root fs to 73% (image-GC churn + extra write IO on
  # the shared HDD where etcd lives — a contributor to the etcd IO starvation that
  # stalled that run, see post-mortem). Real etcd backups go to NFS, so these are
  # throwaway. Prune ones >3 days old (keeps a short rollback window). Best-effort;
  # never aborts the chain.
  if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
    ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" \
      "sudo find /etc/kubernetes/tmp -maxdepth 1 -type d \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) -mtime +3 -exec rm -rf {} + 2>/dev/null; echo -n 'master root after prune: '; df -h / | awk 'NR==2{print \$5\" used, \"\$4\" free\"}'" \
      || echo "kubeadm-scratch prune skipped (ssh/df failed) — non-fatal"
  fi
  # 5. Push in-flight + started_timestamp metrics + ns annotations
  $KUBECTL annotate ns "$NS" \
    "viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \
--- a/stacks/kyverno/modules/kyverno/ghcr-credentials.tf
+++ b/stacks/kyverno/modules/kyverno/ghcr-credentials.tf
@ -31,6 +31,9 @@ locals {
    # "no local builds"). ghcr.io/viktorbarzin/k8s-portal:latest is PRIVATE
    # (infra repo default); the deployment references the cloned secret.
    "k8s-portal",
    # goldmane-edge-aggregator: PRIVATE ghcr image pulled by the aggregate
    # Deployment + digest CronJob (ADR-0014, infra#58).
    "goldmane-edge-aggregator",
  ]
 }
--- a/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf
+++ b/stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf
@ -130,6 +130,11 @@ resource "kubernetes_deployment" "blackbox_exporter" {
    labels = {
      app  = "blackbox-exporter"
      tier = var.tier
      # ADR-0014 service identity: monitoring is a multi-Service namespace, so
      # the namespace alone can't attribute Goldmane flows. Value = the
      # fronting Service name (kubernetes_service.blackbox_exporter is named
      # "blackbox-exporter").
      "service-identity" = "blackbox-exporter"
    }
    annotations = {
      "reloader.stakater.com/search" = "true"
@ -146,6 +151,10 @@ resource "kubernetes_deployment" "blackbox_exporter" {
      metadata {
        labels = {
          app = "blackbox-exporter"
          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
          # disambiguating identity must live on the pod template (not just
          # the Deployment metadata above). Not in selector → no replace.
          "service-identity" = "blackbox-exporter"
        }
      }
      spec {
--- a/stacks/monitoring/modules/monitoring/goflow2.tf
+++ b/stacks/monitoring/modules/monitoring/goflow2.tf
@ -5,6 +5,11 @@ resource "kubernetes_deployment" "goflow2" {
    labels = {
      app  = "goflow2"
      tier = var.tier
      # ADR-0014 service identity: monitoring is a multi-Service namespace, so
      # the namespace alone can't attribute Goldmane flows. Value = the
      # fronting Service name (kubernetes_service.goflow2 — the metrics svc; the
      # goflow2-netflow NodePort is the same pod by another name).
      "service-identity" = "goflow2"
    }
  }
  spec {
@ -18,6 +23,10 @@ resource "kubernetes_deployment" "goflow2" {
      metadata {
        labels = {
          app = "goflow2"
          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
          # disambiguating identity must live on the pod template (not just
          # the Deployment metadata above). Not in selector → no replace.
          "service-identity" = "goflow2"
        }
      }
      spec {
--- a/stacks/monitoring/modules/monitoring/grafana.tf
+++ b/stacks/monitoring/modules/monitoring/grafana.tf
@ -71,6 +71,15 @@ resource "kubernetes_persistent_volume" "alertmanager_pv" {
 # DB credentials from Vault database engine (rotated automatically)
 # Provides GF_DATABASE_PASSWORD that auto-updates when password rotates
 resource "kubernetes_manifest" "grafana_db_creds" {
  # The external-secrets controller takes server-side-apply ownership of
  # .spec.refreshInterval, so a plain TF apply conflicts ("conflict with
  # external-secrets ... .spec.refreshInterval"). force_conflicts lets TF win
  # (values match, so it's stable) — same pattern as the woodpecker/traefik/
  # k8s-version-upgrade stacks. Surfaced 2026-06-24: the first monitoring apply
  # in a while exposed this latent conflict (prior pushes were docs-only).
  field_manager {
    force_conflicts = true
  }
  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
--- a/stacks/monitoring/modules/monitoring/idrac.tf
+++ b/stacks/monitoring/modules/monitoring/idrac.tf
@ -47,6 +47,10 @@ resource "kubernetes_deployment" "idrac-redfish" {
    labels = {
      app  = "idrac-redfish-exporter"
      tier = var.tier
      # ADR-0014 service identity: monitoring is a multi-Service namespace, so
      # the namespace alone can't attribute Goldmane flows. Value = the
      # fronting Service name (kubernetes_service.idrac-redfish-exporter).
      "service-identity" = "idrac-redfish-exporter"
    }
    annotations = {
      "reloader.stakater.com/search" = "true"
@ -63,6 +67,10 @@ resource "kubernetes_deployment" "idrac-redfish" {
      metadata {
        labels = {
          app = "idrac-redfish-exporter"
          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
          # disambiguating identity must live on the pod template (not just
          # the Deployment metadata above). Not in selector → no replace.
          "service-identity" = "idrac-redfish-exporter"
        }
      }
      spec {
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@ -1450,6 +1450,49 @@ serverFiles:
                Remediation: right-size top reservers via Goldilocks (immich-server,
                frigate, prometheus, pg-cluster, paperless) or bump VM RAM on
                k8s-node2/k8s-node3 from 32GB → 48GB to match node1.
      # Goldmane edge-aggregator (ADR-0014 / infra #58, #61): the durable
      # who-talks-to-whom trail. The aggregator pod has NO /metrics endpoint,
      # so its health is inferred from kube-state-metrics signals — the trail
      # must not silently die. Two failure modes are covered:
      #   - the aggregate Deployment stops consuming Goldmane's flow stream
      #     (AggregatorDown) → no new edges ever land in the goldmane_edges DB
      #   - the daily digest CronJob can't post new edges to Slack
      #     (DigestFailing) → edges still land but nobody is told.
      # A freshness probe (max(last_seen) staleness) is intentionally NOT here:
      # AggregatorDown is the agreed floor and needs no extra moving parts.
      - name: Network Observability (Goldmane)
        rules:
          # Deployment has <1 available replica for 15m. kube-state-metrics
          # keeps `kube_deployment_status_replicas_available` (metric-keep list
          # in serverFiles below). The 15m window rides out a normal rollout /
          # node drain without paging; a genuinely-dead aggregator means the
          # edge trail has stopped recording and stays down.
          - alert: AggregatorDown
            expr: |
              kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1
              and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "goldmane-edge-aggregator has no available replica — the who-talks-to-whom edge trail has stopped recording"
              description: "The aggregator Deployment streams Calico Goldmane flows into the goldmane_edges CNPG DB. With 0 replicas, no new namespace-pair edges are captured. Run `kubectl -n goldmane-edge-aggregator describe deploy goldmane-edge-aggregator` and check that the goldmane svc (calico-system) is reachable."
          # The goldmane-edges-digest CronJob has a failed Job that started in
          # the last 24h. Mirrors the generic JobFailed shape but scoped to the
          # digest so it routes here. `for: 30m` rides out the apply/scrape
          # transient; the digest runs daily so a real failure won't self-heal
          # until the next run — surface it same-day rather than waiting 24h.
          - alert: DigestFailing
            expr: |
              kube_job_status_failed{namespace="goldmane-edge-aggregator", job_name=~"goldmane-edges-digest.*"} > 0
              and on(namespace, job_name)
              (time() - kube_job_status_start_time{namespace="goldmane-edge-aggregator", job_name=~"goldmane-edges-digest.*"}) < 86400
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: "goldmane-edges-digest CronJob failing — new edges captured but not posted to Slack"
              description: "The daily edge digest Job {{ $labels.job_name }} failed. Edges may still be landing in the goldmane_edges DB but no one is being notified of new namespace-pairs. `kubectl -n goldmane-edge-aggregator logs job/{{ $labels.job_name }}`."
      - name: Infrastructure Health
        rules:
          - alert: HomeAssistantDown
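As a rough illustration of what the two Goldmane rules above evaluate, the following Python sketch mirrors their boolean logic on plain numbers. It is not how Prometheus evaluates PromQL: there is no `for:` hold and no label matching here, and the function and argument names are hypothetical, chosen only for this example.

```python
STARTUP_GRACE_S = 900    # mirrors the `> 900` guard on Prometheus' own uptime
DIGEST_WINDOW_S = 86400  # mirrors the `< 86400` window on the Job start time


def aggregator_down(available_replicas: int, prometheus_uptime_s: float) -> bool:
    """True when the AggregatorDown expression would return a sample.

    The uptime guard suppresses the condition while Prometheus itself has
    been running for less than 15 minutes.
    """
    return available_replicas < 1 and prometheus_uptime_s > STARTUP_GRACE_S


def digest_failing(failed_pods: int, job_start_ts: float, now: float) -> bool:
    """True when the DigestFailing expression would return a sample.

    Only Jobs that started within the last 24h count, so an old failed Job
    does not keep the condition true indefinitely.
    """
    return failed_pods > 0 and (now - job_start_ts) < DIGEST_WINDOW_S


if __name__ == "__main__":
    print(aggregator_down(available_replicas=0, prometheus_uptime_s=3600))
    print(digest_failing(failed_pods=1, job_start_ts=0.0, now=7200.0))
```

In the real rules, the condition must additionally hold for the whole `for:` duration (15m and 30m respectively) before the alert fires.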
--- a/stacks/monitoring/modules/monitoring/pve_exporter.tf
+++ b/stacks/monitoring/modules/monitoring/pve_exporter.tf
@ -22,6 +22,10 @@ resource "kubernetes_deployment" "pve_exporter" {
    namespace = kubernetes_namespace.monitoring.metadata[0].name
    labels = {
      tier = var.tier
      # ADR-0014 service identity: monitoring is a multi-Service namespace, so
      # the namespace alone can't attribute Goldmane flows. Value = the
      # fronting Service name (kubernetes_service.proxmox-exporter).
      "service-identity" = "proxmox-exporter"
    }
  }
@ -37,6 +41,10 @@ resource "kubernetes_deployment" "pve_exporter" {
      metadata {
        labels = {
          app = "proxmox-exporter"
          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
          # disambiguating identity must live on the pod template (not just
          # the Deployment metadata above). Not in selector → no replace.
          "service-identity" = "proxmox-exporter"
        }
      }
--- a/stacks/monitoring/modules/monitoring/snmp_exporter.tf
+++ b/stacks/monitoring/modules/monitoring/snmp_exporter.tf
@ -31,6 +31,10 @@ resource "kubernetes_deployment" "snmp-exporter" {
    labels = {
      app  = "snmp-exporter"
      tier = var.tier
      # ADR-0014 service identity: monitoring is a multi-Service namespace, so
      # the namespace alone can't attribute Goldmane flows. Value = the
      # fronting Service name (kubernetes_service.snmp-exporter).
      "service-identity" = "snmp-exporter"
    }
    annotations = {
      "reloader.stakater.com/search" = "true"
@ -47,6 +51,10 @@ resource "kubernetes_deployment" "snmp-exporter" {
      metadata {
        labels = {
          app = "snmp-exporter"
          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
          # disambiguating identity must live on the pod template (not just
          # the Deployment metadata above). Not in selector → no replace.
          "service-identity" = "snmp-exporter"
        }
      }
      spec {
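The label comments above explain why the namespace alone cannot attribute flows in a multi-Service namespace. A minimal sketch of that attribution follows, assuming a flow record shaped as a dict with `source_namespace` and `source_labels` keys. That shape is hypothetical and is not the Goldmane Flows API schema; only the `service-identity` label key comes from the diff.

```python
from collections import Counter

IDENTITY_LABEL = "service-identity"  # label key added in the hunks above


def attribute(flow: dict) -> str:
    """Return 'namespace/identity' when the pod carries the label, else the namespace."""
    namespace = flow["source_namespace"]
    identity = flow.get("source_labels", {}).get(IDENTITY_LABEL)
    return f"{namespace}/{identity}" if identity else namespace


def count_sources(flows: list) -> Counter:
    """Count flows per attributed source."""
    return Counter(attribute(flow) for flow in flows)


if __name__ == "__main__":
    # Hypothetical sample flows for illustration only.
    flows = [
        {"source_namespace": "monitoring", "source_labels": {"service-identity": "snmp-exporter"}},
        {"source_namespace": "monitoring", "source_labels": {"service-identity": "proxmox-exporter"}},
        {"source_namespace": "monitoring", "source_labels": {}},
    ]
    print(count_sources(flows))
```

Without the pod-template label, all three sample flows would collapse into the single source `monitoring`.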
--- a/stacks/rbac/modules/rbac/apiserver-oidc.tf
+++ b/stacks/rbac/modules/rbac/apiserver-oidc.tf
@ -10,16 +10,29 @@
 # match the existing RBAC subjects (kind: User, name: <raw email>; group names
 # verbatim). Do NOT add a prefix or existing bindings break.
 #
-# DRIFT WARNING: this edits the kube-apiserver static-pod manifest on the single
-# master. A `kubeadm upgrade` regenerates that manifest and DROPS this flag (this
-# is exactly how OIDC silently broke before — the flag was wiped and the
-# content-hash trigger never re-fired). After any k8s control-plane upgrade,
-# re-apply the rbac stack to restore apiserver OIDC. See
-# docs/plans/2026-06-04-k8s-dashboard-sso-design.md.
+# DRIFT WARNING (and how it's now handled): apiserver auth lives in THREE places
+# that must stay in sync, because a `kubeadm upgrade` REGENERATES the static-pod
+# manifest from kubeadm-config:
+#   1. /etc/kubernetes/pki/auth-config.yaml         — the structured authn file
+#   2. the live kube-apiserver static-pod manifest  — references it via the flag
+#   3. the kubeadm-config ClusterConfiguration CM   — what kubeadm regenerates from
 # Originally only (1)+(2) were managed, so every kubeadm upgrade rewrote the
 # manifest from the STALE CM, reverting --authentication-config to single-issuer
 # --oidc-* flags. The consequence is SSO breakage AFTER the upgrade: kubectl +
 # dashboard lose multi-issuer auth (the apiserver does NOT crash on this — verified
 # by an isolated repro 2026-06-24; the 2026-06-24 v1.35 upgrade *stall* was a
 # separate etcd IO-starvation issue, see
 # docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md). The
 # remote script below now ALSO reconciles (3) via `kubeadm init phase
 # upload-config`, so a future kubeadm upgrade regenerates a CORRECT manifest. The
 # k8s-version-upgrade chain additionally ALERTS (does not block — SSO drift is
 # recoverable) via `kubeadm upgrade diff` in preflight if --authentication-config
 # would still be dropped.
 #
 # SAFETY: the remote script health-gates on /livez and AUTO-ROLLS-BACK the
 # manifest from a timestamped backup if the apiserver does not recover, so a
-# malformed config cannot leave the single master down.
+# malformed config cannot leave the single master down. Reconciling kubeadm-config
 # is zero-impact on the running cluster (the CM is only read during an upgrade).
 variable "k8s_master_host" {
  type    = string
@ -97,6 +110,40 @@ locals {
    print('flag-inserted' if done else 'ANCHOR-NOT-FOUND')
  PY
  # Reconciles the kubeadm-config ClusterConfiguration's apiServer.extraArgs:
  # drops the stale single-issuer --oidc-* args and ensures --authentication-config
  # is present (anchored after --authorization-mode). Stdlib-only (the master is
  # only guaranteed python3, not pyyaml/yq). Idempotent; preserves all other
  # fields (etcd args, audit args, extraVolumes) verbatim. Exits 3 if the
  # authorization-mode anchor is missing (fail loud, leave the CM untouched).
  kubeadm_oidc_reconcile_py = <<-PY
    import sys
    lines = sys.stdin.read().split('\n')
    out, i, n = [], 0, len(lines)
    have_authn = any('name: authentication-config' in l for l in lines)
    inserted = have_authn
    while i < n:
        ln = lines[i]; s = ln.strip()
        if s.startswith('- name: oidc-'):
            i += 2 if (i + 1 < n and lines[i + 1].strip().startswith('value:')) else 1
            continue
        out.append(ln)
        if (not inserted) and s == '- name: authorization-mode':
            indent = ln[:len(ln) - len(ln.lstrip())]
            if i + 1 < n and lines[i + 1].strip().startswith('value:'):
                out.append(lines[i + 1]); i += 2
            else:
                i += 1
            out.append(indent + '- name: authentication-config')
            out.append(indent + '  value: /etc/kubernetes/pki/auth-config.yaml')
            inserted = True
            continue
        i += 1
    if not inserted:
        sys.stderr.write('ANCHOR-NOT-FOUND: authorization-mode\n'); sys.exit(3)
    sys.stdout.write('\n'.join(out))
  PY
  # Whole remote operation, base64-embedded for byte-exact transfer (no
  # heredoc/escaping hazards across SSH).
  apiserver_auth_remote_script = <<-SH
@ -137,6 +184,30 @@ locals {
      echo "rolled back to previous manifest"; exit 1
    fi
    echo "kube-apiserver healthy with multi-issuer --authentication-config"
    # 5. Reconcile kubeadm-config so a FUTURE `kubeadm upgrade` regenerates the
    #    apiserver manifest WITH --authentication-config instead of reverting to
    #    the stale single-issuer --oidc-* flags. Without this, kubeadm rewrote the
    #    manifest from kubeadm-config on every control-plane upgrade and kubectl +
    #    dashboard SSO broke afterwards (the apiserver itself does not crash on
    #    this; the 2026-06-24 v1.35 upgrade stall was a separate etcd IO issue).
    #    Zero live impact (the CM is only read at upgrade time); idempotent;
    #    best-effort (the chain's `kubeadm upgrade diff` preflight alert is the
    #    backstop if this cannot run).
    KC="sudo kubectl --kubeconfig /etc/kubernetes/admin.conf"
    CC=$($KC -n kube-system get cm kubeadm-config -o jsonpath='{.data.ClusterConfiguration}' 2>/dev/null || true)
    if [ -n "$CC" ] && { echo "$CC" | grep -q 'oidc-issuer-url' || ! echo "$CC" | grep -q 'authentication-config'; }; then
      echo "Reconciling kubeadm-config (oidc-* -> authentication-config) so kubeadm upgrade keeps structured auth"
      echo '${base64encode(local.kubeadm_oidc_reconcile_py)}' | base64 -d > /tmp/reconcile_kubeadm_oidc.py
      if printf '%s' "$CC" | python3 /tmp/reconcile_kubeadm_oidc.py > /tmp/kubeadm-cc-new.yaml \
         && sudo kubeadm init phase upload-config kubeadm --config /tmp/kubeadm-cc-new.yaml; then
        echo "kubeadm-config reconciled: future control-plane upgrades keep --authentication-config"
      else
        echo "WARN: kubeadm-config reconcile failed; the upgrade-chain preflight check will alert before the next upgrade"
      fi
      rm -f /tmp/reconcile_kubeadm_oidc.py /tmp/kubeadm-cc-new.yaml
    else
      echo "kubeadm-config already uses --authentication-config (no oidc drift)"
    fi
  SH
 }
@ -155,6 +226,14 @@ resource "null_resource" "apiserver_oidc_config" {
  }
  triggers = {
    # Intentionally hash ONLY the issuer config, NOT the remote script. CI applies
    # the rbac stack with no ssh_private_key (var defaults to ""), so a re-run of
    # this SSH provisioner in CI would fail — hence the null_resource must stay a
    # no-op on a plain CI apply. Script changes (e.g. the 2026-06-24 kubeadm-config
    # reconciliation) reach the cluster via the apiserver-oidc-restore ConfigMap
    # below (a plain k8s resource, no ssh) which the upgrade chain re-runs. To force
    # this provisioner to re-run after a script change, apply locally with
    # `-replace` + TF_VAR_ssh_private_key (see docs/runbooks/k8s-version-upgrade.md).
    auth_config = sha256(local.apiserver_auth_config_yaml)
  }
 }
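To make the kubeadm-config reconciliation above easier to follow, here is a sketch that restates the embedded script's line-based filter as a function and runs it on a small sample. The sample `extraArgs` values (issuer URL, client ID) are invented for illustration, and the function raises `ValueError` where the embedded script exits with status 3. Treat it as an approximation of the logic, not the deployed script.

```python
AUTHN_PATH = "/etc/kubernetes/pki/auth-config.yaml"


def reconcile(cluster_config: str) -> str:
    """Drop `- name: oidc-*` args and ensure authentication-config is present."""
    lines = cluster_config.split("\n")
    # If the arg already exists, nothing needs inserting (idempotence).
    inserted = any("name: authentication-config" in l for l in lines)
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        stripped = line.strip()
        has_value = i + 1 < len(lines) and lines[i + 1].strip().startswith("value:")
        if stripped.startswith("- name: oidc-"):
            i += 2 if has_value else 1  # skip the arg and its value line
            continue
        out.append(line)
        i += 1
        if stripped == "- name: authorization-mode" and not inserted:
            indent = line[: len(line) - len(line.lstrip())]
            if has_value:
                out.append(lines[i])  # keep the anchor's own value line
                i += 1
            out.append(f"{indent}- name: authentication-config")
            out.append(f"{indent}  value: {AUTHN_PATH}")
            inserted = True
    if not inserted:
        raise ValueError("ANCHOR-NOT-FOUND: authorization-mode")
    return "\n".join(out)


SAMPLE = """apiServer:
  extraArgs:
  - name: authorization-mode
    value: Node,RBAC
  - name: oidc-issuer-url
    value: https://issuer.example/
  - name: oidc-client-id
    value: kubernetes"""

if __name__ == "__main__":
    print(reconcile(SAMPLE))
```

Running the function a second time on its own output leaves the text unchanged, which is the idempotence property the comments in the diff rely on.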
--- a/stacks/vault/main.tf
+++ b/stacks/vault/main.tf
@ -674,6 +674,7 @@ resource "vault_database_secret_backend_connection" "postgresql" {
    "pg-recruiter-responder", "pg-tripit",
    "pg-nextcloud-todos",
    "pg-technitium",
    "pg-goldmane-edges",
  ]
  postgresql {
@ -891,6 +892,17 @@ resource "vault_database_secret_backend_static_role" "pg_technitium" {
  rotation_period = 604800
 }
 # goldmane-edge-aggregator (ADR-0014 / infra #58) — 7-day rotation for the
 # goldmane_edges CNPG role. Consumed by stacks/goldmane-edge-aggregator via a
 # vault-database ExternalSecret -> DATABASE_URL (remoteRef static-creds/pg-goldmane-edges).
 resource "vault_database_secret_backend_static_role" "pg_goldmane_edges" {
  backend         = vault_mount.database.path
  db_name         = vault_database_secret_backend_connection.postgresql.name
  name            = "pg-goldmane-edges"
  username        = "goldmane_edges"
  rotation_period = 604800
 }
 # =============================================================================
 # Kubernetes Secrets Engine — Dynamic K8s Credentials
 # =============================================================================
--- a/state/stacks/dbaas/terraform.tfstate.enc
+++ b/state/stacks/dbaas/terraform.tfstate.enc
--- a/state/stacks/vault/terraform.tfstate.enc
+++ b/state/stacks/vault/terraform.tfstate.enc
Author	SHA1	Message	Date
Viktor Barzin	6c5288998f	goldmane-trail: polish follow-ups #57/#59/#61/#62/#63 + digest→#alerts Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a subagent workflow (distinct stacks in parallel, docs last): - #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me, auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081 (the operator's whisker NP default-denies ingress). calico stack. - #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts (prometheus_chart_values.tpl) + cluster-health check #48. - #59 service-identity labels on the multi-Service namespaces (monitoring's 5 TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they update in-place. - #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog, security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md. #62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the edge table (feeds code-8ywc; enforce-flips out of scope). Also fixes the digest's Slack target: #security override 404s channel_not_found because the shared alertmanager_slack_api_url webhook's app isn't a member of #security (this likely also breaks alertmanager's slack-security receiver — flagged in the runbook). Routed to #alerts (the webhook's working channel) until the app is invited; verified a real digest run posts cleanly (360 edges). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 17:49:25 +00:00
Viktor Barzin	306cdd4cb3	state(dbaas): update encrypted state	2026-06-25 17:31:03 +00:00
Viktor Barzin	9c68d147e0	k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC) Digging into "why did the apiserver crash" disproved the earlier OIDC explanation. An isolated v1.35.6 apiserver repro with authentik reachable initialises OIDC cleanly (oidc.go:313, no error) and runs fine — so the --authentication-config -> --oidc-* revert is NOT what crashed it. etcd's surviving crash-window log is the real cause: 1180 "apply request took too long" warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the shared sdc HDD (beads code-oflt). A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before every etcd upgrade and never cleans it up — 145 dirs / 28GB had accumulated, driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live (73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate. Also corrected the OIDC handling: the kubeadm-config drift is real but only breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the chain's restore.sh + the kubeadm-config reconciliation) — it does not crash the apiserver. So the preflight check is now an ALERT, not a block (was added on the wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected. Per Viktor: reclaim the disk and automate so the manual cleanup never recurs; the durable IO fix remains code-oflt (etcd off the shared HDD). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 15:23:15 +00:00
Viktor Barzin	60a1cb9a25	k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade Last night's autonomous 1.34->1.35 run reached the master control-plane phase for the first time (preflight passed, etcd snapshot taken, etcd upgraded), then the kube-apiserver upgrade to v1.35.6 crash-looped and kubeadm auto-rolled-back to 1.34.9. The cluster stayed healthy but the master was left cordoned and the chain wedged on in_flight. Root cause: kubeadm upgrade regenerates the apiserver static-pod manifest from the kubeadm-config ConfigMap. apiserver auth was switched on 2026-06-19 to a structured multi-issuer --authentication-config (kubectl + dashboard SSO), but kubeadm-config still carried the legacy single-issuer --oidc-* extraArgs, so the regenerated manifest reverted structured auth and the new apiserver crash-looped. Proven via `kubeadm upgrade diff`. The existing post-upgrade OIDC restore step never ran because the upgrade itself never succeeded. Fix: - rbac/apiserver-oidc.tf: the remote script now also reconciles kubeadm-config (kubeadm init phase upload-config: drop --oidc-*, add --authentication-config) so a future kubeadm upgrade regenerates a correct manifest. Delivered to the cluster via the apiserver-oidc-restore ConfigMap the chain re-runs (CI needs no ssh key); trigger deliberately not script-hashed since CI cannot ssh. - k8s-version-upgrade/upgrade-step.sh: new preflight gate runs `kubeadm upgrade diff` and BLOCKS+alerts (never drains the master) if --authentication-config would still be dropped. - Post-mortem + runbook updated. The live kubeadm-config was reconciled directly on the master and verified (`kubeadm upgrade diff` now shows only the control-plane image bump), so tonight's run can complete the 1.34->1.35 upgrade.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 14:16:04 +00:00
Viktor Barzin	c6bba1da6e	home-assistant skill: refresh ha-london map (HAOS 2026.5.2, Cowboy revived, Overview redesign) Viktor asked to redesign the ha-london dashboards and fix the broken integrations (the Cowboy one). The skill's ha-london knowledge map had drifted badly from reality, so this brings it current: it claimed HA 2025.9.1 on a docker-run container (it's HAOS 2026.5.2, managed); listed the now-dead jdejaegh Cowboy integration with sensor.bike_* entities (revived via elsbrock/cowboy-ha v1.2.0 -> entities are sensor.classic_performance_*); and didn't flag that met/metoffice/roomba/hildebrandglow are user-disabled (not broken) or that Tapo P100 is failing. Also documents the redesigned Overview (Home+More sections, Mushroom+mini-graph-card), the dashboard/view/card glossary, the parked-bike 'unknown battery' gotcha, and that london is API-only from the Sofia devvm (no SSH). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 22:03:15 +00:00
Viktor Barzin	b858561bd0	Merge remote-tracking branch 'origin/master'	2026-06-24 20:59:39 +00:00
Viktor Barzin	a7704f46a6	deploy goldmane-edge-aggregator: durable who-talks-to-whom edge trail (#58, ADR-0014) Infra side of ADR-0014: an mTLS gRPC consumer of Calico Goldmane's Flows API that records the namespace-pair edge-set in CNPG and posts a daily new-edge digest to #security. Adds the goldmane-edge-aggregator stack, the pg-goldmane-edges Vault rotation role (Tier-0 vault state updated here), and the namespace in the ghcr-credentials allowlist. Cert: REUSES the operator-minted, Tigera-CA-signed whisker-backend client cert (Goldmane verifies only the CA chain, not identity) instead of minting from the Tigera CA private key. This avoids putting the CA key in TF state AND the hashicorp/tls provider, which is incompatible with this repo's global generate-providers/lockfile pattern (it broke every stack's lockfile). Verified live: aggregator streaming flows, 174 edges in Postgres across 50x54 namespaces, db+slack ExternalSecrets synced, digest dry-run formats correctly, private image pulls via the Kyverno-synced ghcr-credentials. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:59:39 +00:00
Viktor Barzin	aa510e3600	instagram-poster: force_conflicts on ESO manifests (fix apply) The ESO v1 migration (2026-06-22) made the external-secrets controller own .spec.refreshInterval via server-side apply, so terraform apply of the two ExternalSecret manifests fails with a field-manager conflict (Woodpecker #348), which blocked the replicas=0 scale-down from landing. Add force_conflicts=true to both, matching the grafana/woodpecker/traefik fix applied to other stacks the same day. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:49:53 +00:00
Viktor Barzin	53834deb24	instagram-poster: scale to 0 (unused, dead ExternalSecret) Viktor confirmed the Instagram Graph poster isn't used. Its ExternalSecret has been dead on missing Vault keys (ig_graph_long_lived_token, ig_business_account_id), so the deployment sat at 0/1 firing DeploymentReplicasMismatch. Setting replicas=0 stops the alert and makes the scale-down durable (a bare kubectl scale reverts on the next stack apply). Re-set to 1 after minting a Meta long-lived token + populating the Vault keys. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 20:45:30 +00:00
Viktor Barzin	8dd9a3978d	Merge remote-tracking branch 'forgejo/master' into wizard/homelab-vault	2026-06-24 12:25:52 +00:00
Viktor Barzin	65b2df1222	fix(monitoring): force_conflicts on grafana_db_creds ExternalSecret The external-secrets controller owns .spec.refreshInterval via SSA, so a plain terraform apply of the monitoring stack conflicts. Latent until 2026-06-24 (the homelab-vault loki-rules change was the first monitoring apply in a while and surfaced it). force_conflicts lets TF win — same pattern as woodpecker/traefik/k8s-version-upgrade stacks.	2026-06-24 12:25:36 +00:00