Compare commits

...

11 commits

Author SHA1 Message Date
Viktor Barzin
6c5288998f goldmane-trail: polish follow-ups #57/#59/#61/#62/#63 + digest→#alerts
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a
subagent workflow (distinct stacks in parallel, docs last):

- #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me,
  auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081
  (the operator's whisker NP default-denies ingress). calico stack.
- #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts
  (prometheus_chart_values.tpl) + cluster-health check #48.
- #59 service-identity labels on the multi-Service namespaces (monitoring's 5
  TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they
  update in-place.
- #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog,
  security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md.
  #62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the
  edge table (feeds code-8ywc; enforce-flips out of scope).

Also fixes the digest's Slack target: #security override 404s channel_not_found
because the shared alertmanager_slack_api_url webhook's app isn't a member of
#security (this likely also breaks alertmanager's slack-security receiver — flagged
in the runbook). Routed to #alerts (the webhook's working channel) until the app
is invited; verified a real digest run posts cleanly (360 edges).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 17:49:25 +00:00
Viktor Barzin
306cdd4cb3 state(dbaas): update encrypted state 2026-06-25 17:31:03 +00:00
Viktor Barzin
9c68d147e0 k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC)
Some checks failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline failed
Digging into "why did the apiserver crash" disproved the earlier OIDC
explanation. An isolated v1.35.6 apiserver repro with authentik reachable
initialises OIDC cleanly (oidc.go:313, no error) and runs fine — so the
--authentication-config -> --oidc-* revert is NOT what crashed it. etcd's
surviving crash-window log is the real cause: 1180 "apply request took too long"
warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as
kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the
shared sdc HDD (beads code-oflt).
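
A minimal sketch of how the warning count quoted above could be reproduced from a saved etcd log. The function name and the log path in the usage line are hypothetical; only the message string is taken from the commit message.

```bash
# Hypothetical helper: count etcd slow-apply warnings in a saved log file.
# $1 = path to the log; the message text is the one quoted in the commit.
count_slow_applies() {
  # grep -c prints the number of matching lines. It exits non-zero when the
  # count is 0, so "|| true" keeps callers alive under "set -e".
  grep -c -F -- 'apply request took too long' "$1" || true
}
```

Usage, with an assumed file name: `count_slow_applies /tmp/etcd-crash-window.log`.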

A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full
~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
every etcd upgrade and never cleans it up — 145 dirs / 28GB had accumulated,
driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live
(73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate.
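
The preflight prune described above could look roughly like the following. This is a sketch, not the script that was committed: the function name, its arguments, and the reliance on GNU `find` are assumptions; only the directory layout and the 3-day threshold come from the commit message.

```bash
# Hypothetical prune: delete kubeadm etcd-backup directories older than N days.
# $1 = directory to scan (default /etc/kubernetes/tmp), $2 = age in days (default 3).
prune_kubeadm_backups() {
  dir=${1:-/etc/kubernetes/tmp}
  days=${2:-3}
  [ -d "$dir" ] || return 0
  # -mtime +N matches entries whose age, truncated to whole days, exceeds N.
  # -maxdepth 1 limits the match to the backup directories themselves.
  find "$dir" -mindepth 1 -maxdepth 1 -type d \
    -name 'kubeadm-backup-etcd-*' -mtime +"$days" \
    -exec rm -rf -- {} +
}
```

Restricting the match to the `kubeadm-backup-etcd-*` name pattern keeps any other content of the directory untouched.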

Also corrected the OIDC handling: the kubeadm-config drift is real but only
breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the
chain's restore.sh + the kubeadm-config reconciliation) — it does not crash the
apiserver. So the preflight check is now an ALERT, not a block (was added on the
wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected.

Per Viktor: reclaim the disk and automate so the manual cleanup never recurs;
the durable IO fix remains code-oflt (etcd off the shared HDD).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 15:23:15 +00:00
Viktor Barzin
60a1cb9a25 k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Last night's autonomous 1.34->1.35 run reached the master control-plane phase
for the first time (preflight passed, etcd snapshot taken, etcd upgraded), then
the kube-apiserver upgrade to v1.35.6 crash-looped and kubeadm auto-rolled-back
to 1.34.9. The cluster stayed healthy but the master was left cordoned and the
chain wedged on in_flight.

Root cause: kubeadm upgrade regenerates the apiserver static-pod manifest from
the kubeadm-config ConfigMap. apiserver auth was switched on 2026-06-19 to a
structured multi-issuer --authentication-config (kubectl + dashboard SSO), but
kubeadm-config still carried the legacy single-issuer --oidc-* extraArgs, so the
regenerated manifest reverted structured auth and the new apiserver crash-looped.
Proven via `kubeadm upgrade diff`. The existing post-upgrade OIDC restore step
never ran because the upgrade itself never succeeded.

Fix:
- rbac/apiserver-oidc.tf: the remote script now also reconciles kubeadm-config
  (kubeadm init phase upload-config: drop --oidc-*, add --authentication-config)
  so a future kubeadm upgrade regenerates a correct manifest. Delivered to the
  cluster via the apiserver-oidc-restore ConfigMap the chain re-runs (CI needs no
  ssh key); trigger deliberately not script-hashed since CI cannot ssh.
- k8s-version-upgrade/upgrade-step.sh: new preflight gate runs `kubeadm upgrade
  diff` and BLOCKS+alerts (never drains the master) if --authentication-config
  would still be dropped.
- Post-mortem + runbook updated.
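
The detection half of that preflight gate can be sketched as a filter over `kubeadm upgrade diff` output. The function name is hypothetical, and the sketch assumes the diff is a unified diff in which a dropped flag appears on a `-` line and is not re-added on a `+` line; the real gate in `upgrade-step.sh` may differ.

```bash
# Hypothetical check: succeed (exit 0) when the diff on stdin removes
# --authentication-config without adding it back.
drops_authentication_config() {
  diff_text=$(cat)
  # grep -c exits non-zero on a zero count, hence "|| true".
  removed=$(printf '%s\n' "$diff_text" | grep -c -e '^-.*--authentication-config' || true)
  added=$(printf '%s\n' "$diff_text" | grep -c -e '^+.*--authentication-config' || true)
  [ "$removed" -gt 0 ] && [ "$added" -eq 0 ]
}
```

Usage, assuming the target version is passed explicitly: `kubeadm upgrade diff v1.35.6 | drops_authentication_config && echo "structured auth would be dropped"`.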

The live kubeadm-config was reconciled directly on the master and verified
(`kubeadm upgrade diff` now shows only the control-plane image bump), so tonight's
run can complete the 1.34->1.35 upgrade.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 14:16:04 +00:00
Viktor Barzin
c6bba1da6e home-assistant skill: refresh ha-london map (HAOS 2026.5.2, Cowboy revived, Overview redesign)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to redesign the ha-london dashboards and fix the broken integrations (the Cowboy one). The skill's ha-london knowledge map had drifted badly from reality, so this brings it current: it claimed HA 2025.9.1 on a docker-run container (it's HAOS 2026.5.2, managed); listed the now-dead jdejaegh Cowboy integration with sensor.bike_* entities (revived via elsbrock/cowboy-ha v1.2.0 -> entities are sensor.classic_performance_*); and didn't flag that met/metoffice/roomba/hildebrandglow are user-disabled (not broken) or that Tapo P100 is failing. Also documents the redesigned Overview (Home+More sections, Mushroom+mini-graph-card), the dashboard/view/card glossary, the parked-bike 'unknown battery' gotcha, and that london is API-only from the Sofia devvm (no SSH).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 22:03:15 +00:00
Viktor Barzin
b858561bd0 Merge remote-tracking branch 'origin/master'
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-24 20:59:39 +00:00
Viktor Barzin
a7704f46a6 deploy goldmane-edge-aggregator: durable who-talks-to-whom edge trail (#58, ADR-0014)
Infra side of ADR-0014: an mTLS gRPC consumer of Calico Goldmane's Flows API
that records the namespace-pair edge-set in CNPG and posts a daily new-edge
digest to #security. Adds the goldmane-edge-aggregator stack, the
pg-goldmane-edges Vault rotation role (Tier-0 vault state updated here), and the
namespace in the ghcr-credentials allowlist.

Cert: REUSES the operator-minted, Tigera-CA-signed whisker-backend client cert
(Goldmane verifies only the CA chain, not identity) instead of minting from the
Tigera CA private key. This avoids putting the CA key in TF state AND the
hashicorp/tls provider, which is incompatible with this repo's global
generate-providers/lockfile pattern (it broke every stack's lockfile).

Verified live: aggregator streaming flows, 174 edges in Postgres across 50x54
namespaces, db+slack ExternalSecrets synced, digest dry-run formats correctly,
private image pulls via the Kyverno-synced ghcr-credentials.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:59:39 +00:00
Viktor Barzin
aa510e3600 instagram-poster: force_conflicts on ESO manifests (fix apply)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The ESO v1 migration (2026-06-22) made the external-secrets controller own
.spec.refreshInterval via server-side apply, so terraform apply of the two
ExternalSecret manifests fails with a field-manager conflict (Woodpecker #348),
which blocked the replicas=0 scale-down from landing. Add force_conflicts=true
to both, matching the grafana/woodpecker/traefik fix applied to other stacks
the same day.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:49:53 +00:00
Viktor Barzin
53834deb24 instagram-poster: scale to 0 (unused, dead ExternalSecret)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor confirmed the Instagram Graph poster isn't used. Its ExternalSecret
has been dead on missing Vault keys (ig_graph_long_lived_token,
ig_business_account_id), so the deployment sat at 0/1 firing
DeploymentReplicasMismatch. Setting replicas=0 stops the alert and makes the
scale-down durable (a bare kubectl scale reverts on the next stack apply).
Re-set to 1 after minting a Meta long-lived token + populating the Vault keys.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:45:30 +00:00
Viktor Barzin
8dd9a3978d Merge remote-tracking branch 'forgejo/master' into wizard/homelab-vault
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-24 12:25:52 +00:00
Viktor Barzin
65b2df1222 fix(monitoring): force_conflicts on grafana_db_creds ExternalSecret
The external-secrets controller owns .spec.refreshInterval via SSA, so a plain
terraform apply of the monitoring stack conflicts. Latent until 2026-06-24 (the
homelab-vault loki-rules change was the first monitoring apply in a while and
surfaced it). force_conflicts lets TF win — same pattern as woodpecker/traefik/
k8s-version-upgrade stacks.
2026-06-24 12:25:36 +00:00
29 changed files with 5416 additions and 3960 deletions

View file

@@ -243,7 +243,8 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se
- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.)
- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
- **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. - **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (the shared webhook's Slack app isn't in `#security` → 404 channel_not_found; flip `SLACK_CHANNEL` back once invited — see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
- **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
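
The per-namespace allowlist query quoted in the NetworkPolicy bullet can be wrapped so the namespace is validated before it is interpolated into SQL. This is a sketch under assumptions: the function name is hypothetical, and the `ORDER BY` clause is an addition for stable output, not part of the original query.

```bash
# Hypothetical helper: print the internal-allowlist query for one source namespace.
# $1 = namespace name. Only lowercase alphanumerics and "-" are accepted, with
# no leading or trailing "-" and at most 63 characters (RFC 1123 label), which
# keeps quotes and semicolons out of the generated SQL.
allowlist_query() {
  case "$1" in
    ''|*[!a-z0-9-]*|-*|*-)
      echo "invalid namespace: $1" >&2
      return 1
      ;;
  esac
  [ "${#1}" -le 63 ] || { echo "namespace too long: $1" >&2; return 1; }
  printf "SELECT DISTINCT dst_ns FROM edge WHERE src_ns='%s' AND action='allow' ORDER BY dst_ns;\n" "$1"
}
```

Usage, assuming a connection string in a hypothetical `DATABASE_URL` variable that points at `goldmane_edges`: `allowlist_query recruiter-responder | psql "$DATABASE_URL" -At`.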
## Storage & Backup Architecture

View file

@@ -13,6 +13,8 @@
| authentik | Identity provider (SSO) | authentik |
| cloudflared | Cloudflare tunnel | cloudflared |
| authelia | Auth middleware (may be merged into ebooks or removed) | platform |
| goldmane | Calico 3.30 OSS flow aggregator (`goldmane.calico-system.svc:7443`, gRPC/mTLS). Stamps identity (ns/pod/workload/labels + allow-deny) on every flow from Felix into a ~60-min in-memory ring buffer — no etcd/API writes. East-west "who-talks-to-whom" source (ADR-0014). Enabled via operator CR (`kubectl_manifest.goldmane`). | calico |
| whisker | Calico 3.30 OSS live flow-observability UI (`whisker.calico-system.svc:8081`) at `whisker.viktorbarzin.me` (Authentik-gated, `auth=required` — no own login; additive NP ORs Traefik past the operator default-deny). ~60-min live view of Goldmane flows, NOT history. Enabled via operator CR (`kubectl_manifest.whisker`). | calico |
| monitoring | Prometheus/Grafana/Loki stack | monitoring |
## Storage & Security (Tier: cluster)
@@ -37,6 +39,7 @@
## Active Use
| Service | Description | Stack |
|---------|-------------|-------|
| goldmane-edge-aggregator | Durable who-talks-to-whom audit trail (ADR-0014 / #58). Go service: `aggregate` Deployment streams Goldmane's gRPC `Flows.Stream` (mTLS) and upserts the low-cardinality namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`) into CNPG DB `goldmane_edges`; `goldmane-edges-digest` CronJob posts first-seen edges daily to `#security`. mTLS client cert REUSES the operator's `whisker-backend-key-pair` (re-apply if rotated). Tier-4-aux. Image `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (private). Runbook: [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md). | goldmane-edge-aggregator |
| mailserver | Email (docker-mailserver) | mailserver |
| shadowsocks | Proxy | shadowsocks |
| webhook_handler | Webhook processing | webhook_handler |
@@ -161,3 +164,4 @@ procedures) are documented in `infra/docs/runbooks/`:
| pfSense + Unbound DNS | [pfsense-unbound.md](../../docs/runbooks/pfsense-unbound.md) |
| Mailserver PROXY-protocol / HAProxy | [mailserver-pfsense-haproxy.md](../../docs/runbooks/mailserver-pfsense-haproxy.md) |
| Technitium apply flow | [technitium-apply.md](../../docs/runbooks/technitium-apply.md) |
| Goldmane flow trail (east-west who-talks-to-whom) | [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md) |

View file

@@ -11,8 +11,8 @@ description: |
There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
Always use Home Assistant for smart home control.
author: Claude Code
version: 2.0.0 version: 2.1.0
date: 2026-02-07 date: 2026-06-24
---
# Home Assistant Control
@@ -395,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
## ha-london Knowledge Map
### Overview
- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi) - **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
- **Location**: London, UK
- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone) - **Platform**: Raspberry Pi 4, HA OS
- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access) - **Access from the Sofia devvm**: london is **remote**; `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
- **Config path**: `/config/` (requires `sudo` for file access) - **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
- **Config path**: `/config/`
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
- **Zone**: London (home)
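
A minimal sketch of the API-only access path for ha-london, under stated assumptions: the wrapper name is hypothetical, and `homelab ha token --instance london` is assumed to print a bearer token on stdout. The base URL is the one given in the list above.

```bash
# Hypothetical wrapper around the ha-london REST API.
# $1 = path below /api/, for example "config".
ha_london_get() {
  # Assumption: the homelab CLI prints the token and nothing else.
  token=$(homelab ha token --instance london) || return 1
  # CURL can be overridden for dry runs; it defaults to the real curl.
  ${CURL:-curl} -fsS \
    -H "Authorization: Bearer ${token}" \
    "https://ha-london.viktorbarzin.me/api/$1"
}
```

Usage: `ha_london_get config` fetches the instance configuration as JSON. The `-f` flag makes curl exit non-zero on HTTP errors, so an expired token surfaces as a failure rather than as an error page.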
### Dashboards (redesigned 2026-06-24)
**Glossary** (HA terms — keep distinct):
- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
- **Card** = a widget inside a view.
- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
- **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
- **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed` → `detailed` path typo fixed 2026-06-24.)
- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
### Key Systems
#### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@@ -424,10 +437,15 @@ Named plugs with power/energy tracking:
- PM1.0/2.5/4.0/10 particulate sensors
- VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
#### 3. Cowboy E-Bike #### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
- `sensor.bike_state_of_charge`: Battery % Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
- `sensor.bike_total_distance`: Total km - `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
- `sensor.bike_total_co2_saved`: CO2 saved (grams) - `sensor.classic_performance_remaining_range`: Range km
- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
#### 4. Uptime Monitoring (UptimeRobot)
- `sensor.blog`: blog uptime
@@ -446,12 +464,17 @@ Named plugs with power/energy tracking:
- Scripts: `script.start_netflix`, `script.start_stremio`
- Scene: `scene.night` (turns off Livia + Michelle plugs)
### Custom Components ### Custom Components (HACS integrations)
- **cowboy**: Cowboy e-bike integration (HACS) - **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS) - **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
### HACS frontend cards (plugins)
- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
### Integrations
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
### AI / Voice Assistants
- 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@@ -466,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL
- Anca arrival/departure notifications
- Night scene: turns off Livia + Michelle
### Docker Setup ### Platform (HAOS — ignore any legacy `docker run` snippet)
```bash ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~12 min and resets `sensor.uptime` (use that as the "back up" marker).
docker run -d --name homeassistant --privileged \
-e TZ=Europe/London \
-v /home/pi/docker/homeAssistant:/config \
-v /run/dbus:/run/dbus:ro \
--network=host --restart=unless-stopped \
homeassistant/home-assistant:2025.9
```
### SSH Access
```bash ```bash

View file

@@ -125,7 +125,7 @@ How a **Service** is named in flow/audit data — its **namespace** is the prima
_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
**Goldmane / Whisker**:
Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). Durable history requires emitting Goldmane flows to **Loki**; the in-memory buffer alone is not an audit trail. Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#security` digest. As-built: `docs/runbooks/goldmane-flow-trail.md`.
_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
### Storage

View file

@@ -27,3 +27,9 @@ As the Service count grows we want an audit-grade record of which Service talks
- **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
- **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
- **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
## As-built (2026-06-25)
Implemented across infra issues #57–#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (the shared webhook can't reach `#security` — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.

View file

@@ -321,6 +321,17 @@ Detects the inverse of the K-series alerts: a service that **must work WITHOUT A
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
- **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)
#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both alerts route via the default `slack-warning` receiver → **`#alerts`**.
| Alert | Expr (abridged) | For | Severity |
|---|---|---|---|
| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning |
| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning |
The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`).
#### Backup Alerts
- **PostgreSQLBackupStale**: >36h since last backup
- **MySQLBackupStale**: >36h since last backup


@ -364,6 +364,67 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
- Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
- Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).
#### Deriving the per-namespace egress allowlist from the edge trail (Wave 1 W1.7)
The durable **east-west flow trail** (below) is now the preferred data source for
the *internal* (namespace-to-namespace) half of each Wave-1 egress allowlist —
faster and identity-stamped vs the original iptables-`LOG`→journald→Loki path
(ADR-0014: "Enforcement gains a better data source"). The unique observed
namespace pairs live in CNPG DB `goldmane_edges`, table `edge`. To derive the
namespaces a source is observed talking to (the `allow` set that seeds its
NetworkPolicy):
```sql
SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow' ORDER BY dst_ns;
```
The full SQL recipe (whole-cluster matrix, deny sanity-checks, the ≥7-day
observation caveat) is in
[runbooks/goldmane-flow-trail.md → Deriving the Wave-1 egress allowlist](../runbooks/goldmane-flow-trail.md#deriving-the-wave-1-egress-allowlist-from-the-edge-table-infra-62).
**External / public-internet egress is NOT in this table** (empty-namespace flows
are dropped) — for those destinations keep using the Calico flow-log observation
(the W1.6 snapshot, `wave1-egress-observation-2026-05-22.md`). This feeds the
existing observe-then-enforce effort (beads `code-8ywc`); **enforce-flips remain
out of scope** of the trail — it is observe-and-derive only.
### East-west flow observability (Goldmane / Whisker + edge trail) (ADR-0014)
The "who-talks-to-whom" data plane that succeeds raw iptables-`LOG` lines (which
carried no identity). **Service identity = the workload's namespace** (primary),
refined by a `service-identity` label in the few multi-Service namespaces
(`monitoring`, `kube-system`, `dbaas`). End-to-end trail, three layers:
1. **Calico Goldmane + Whisker** (`calico-system`) — Goldmane aggregates
identity-stamped flows (ns/pod/workload/labels + allow-deny + policy-trace)
streamed from Felix over gRPC into a **~60-min in-memory ring buffer** (no
etcd/API writes — the etcd-cost constraint that drove the design). **Whisker**
is its live web UI at `whisker.viktorbarzin.me` (Authentik-gated,
`auth = "required"` — Whisker has no own login; an additive NetworkPolicy ORs
Traefik past the operator's default-deny `whisker` NP). The ring buffer is
**not** a trail (lost on Goldmane restart). Enabled via operator CRs in
`stacks/calico/main.tf`; reversible toggle (Goldmane is OSS tech-preview).
2. **`goldmane-edge-aggregator`** (`stacks/goldmane-edge-aggregator`) — streams
Goldmane's gRPC `Flows.Stream` over **mTLS** and upserts the low-cardinality
namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,
flow_count)`) into CNPG DB `goldmane_edges`. Self-edges and empty-namespace
(public-internet) flows are dropped — in-cluster relationships only. The mTLS
client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`**
(Goldmane verifies CA-chain only, not identity) rather than copying the CA
private key into TF state — **re-apply the stack if the operator rotates that
Secret**.
3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to
**`#alerts`** (reuses the alert-digest webhook; a `#security` override 404s —
that webhook's Slack app isn't a member of `#security`; see runbook).
The trail is **attribution-grade, not cryptographic** (reconstructs events in a
trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model
limit; east-west stays plaintext, no mTLS between app pods). Health is covered by
the **`AggregatorDown`** + **`DigestFailing`** alerts and cluster-health check #48
(see monitoring.md). Full as-built, query recipes, and troubleshooting:
[runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md). Decision:
[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md); glossary
`CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
### TLS & HTTP/3
**Traefik** handles TLS termination:


@ -0,0 +1,97 @@
# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
> Filename kept for inbound links. The originally-suspected cause (kubeadm-config
> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
> drift was a real *separate* latent bug fixed in the same change.
**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
the master control-plane phase for the first time — preflight passed, etcd
snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the
kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
static-pod-hash window across all internal retries, then auto-rolled-back to
v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
No data loss; no user-facing outage (the master carries control-plane taints, so
no workloads were displaced).
**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
static pods, i.e. the first time the upgrade pushes real write-IO at etcd.
## Root cause — etcd IO starvation on the shared HDD
The new kube-apiserver could not establish/keep a working connection to etcd
during the upgrade because **etcd was IO-starved**. etcd's surviving container log
from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows:
- **1,180** `apply request took too long` warnings in 16 minutes;
- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms),
clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
to bring the new apiserver up.
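Counts like these can be reproduced from a saved copy of the container log. The following is a minimal sketch, not the tooling used during the incident: it assumes each line is a CRI-prefixed JSON record whose `msg` is `apply request took too long` and whose `took` field is a simple duration such as `4.3s` or `250ms`. The helper names `took_seconds` and `slow_applies` are hypothetical.

```python
import json
import re
from typing import Optional

SLOW_MSG = "apply request took too long"
# Assumed duration shapes: "<number>s" or "<number>ms" only.
_DURATION = re.compile(r"^([0-9]+(?:\.[0-9]+)?)(ms|s)$")


def took_seconds(took: str) -> Optional[float]:
    """Convert a duration such as '4.3s' or '250ms' to seconds (None if unparsed)."""
    match = _DURATION.match(took)
    if not match:
        return None
    value, unit = float(match.group(1)), match.group(2)
    return value / 1000.0 if unit == "ms" else value


def slow_applies(lines):
    """Return the duration in seconds of every slow-apply warning found."""
    durations = []
    for line in lines:
        start = line.find("{")  # skip the assumed CRI prefix '<ts> stderr F '
        if start < 0:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if record.get("msg") != SLOW_MSG:
            continue
        seconds = took_seconds(str(record.get("took", "")))
        if seconds is not None:
            durations.append(seconds)
    return durations
```

`len(slow_applies(lines))` gives the warning count and `max(...)` the worst apply; durations in any other shape (for example with a minutes component) are skipped rather than guessed at.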
A reproduced 1.35.6 apiserver with no etcd dies with
`F instance.go:233 Error creating leases: error creating storage factory: context
deadline exceeded` — the same failure mode a multi-second etcd produces. etcd
lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
that spindle:
1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
2. kubeadm dumping a full **~400MB etcd DB backup** to
`/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
cleans them up), pushing master root fs to **73%**, above the 70% kubelet
image-GC threshold, so image GC churned during the drain too;
3. master-drain pod evictions.
### Correction — it was NOT the OIDC flag swap
`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
`--authentication-config` (structured multi-issuer OIDC) back to legacy
single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test
etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
were also ruled out.
## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
apiserver auth is configured in three places that must agree:
(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
+ `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest
(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM —
which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
the manifest from (3), so it would have reverted structured auth → **dashboard +
kubectl SSO break after a successful upgrade** (recoverable: the chain's
post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.
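The agreement rule between the three places can be stated as a small check. This is an illustrative sketch only: `has_oidc_drift` is a hypothetical helper, and it takes the kubeadm-config apiserver extra arguments as a plain list of flag names (without leading dashes) instead of parsing the real `ClusterConfiguration`.

```python
def has_oidc_drift(extra_args):
    """True if a manifest regenerated from these args would revert structured auth.

    extra_args: flag names taken from kubeadm-config's apiServer extraArgs,
    without leading dashes (an assumed, simplified input shape).
    """
    names = set(extra_args)
    legacy = any(name.startswith("oidc-") for name in names)
    structured = "authentication-config" in names
    # Drift = legacy single-issuer flags still present, or the structured flag missing.
    return legacy or not structured
```

A result of `True` corresponds to the state found on 2026-06-24, where an upgrade would have regenerated the manifest with the legacy flags.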
## Resolution
1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%.
2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps.
3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).
## Prevention (landed in this change)
| Gap | Fix |
|-----|-----|
| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
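The pruning rule in the first row is simple enough to sketch. The real fix lives in `upgrade-step.sh`; the Python below is only an illustration of the same rule (match the two name patterns, delete anything older than 3 days, never raise), and `prune_scratch` is a hypothetical name.

```python
import shutil
import time
from pathlib import Path

# Name patterns taken from the table above.
PATTERNS = ("kubeadm-backup-*", "kubeadm-upgraded-manifests*")


def prune_scratch(root, max_age_days=3, now=None):
    """Delete matching entries under root older than max_age_days; return their names."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for pattern in PATTERNS:
        for entry in Path(root).glob(pattern):
            try:
                if entry.stat().st_mtime >= cutoff:
                    continue  # recent enough, keep it
                if entry.is_dir():
                    shutil.rmtree(entry)
                else:
                    entry.unlink()
                removed.append(entry.name)
            except OSError:
                continue  # best-effort: a failed delete must not abort the caller
    return sorted(removed)
```

On the master this would be called as `prune_scratch("/etc/kubernetes/tmp")`. Age is judged by the entry's own mtime, which is an assumption about how "older than 3 days" is measured.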
## Lessons
- **Capture the failing component's own logs before concluding.** The `kubeadm
upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is
"what config changes," not "why it crashed."
- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
backup copy + drain) onto that spindle. code-oflt is the real fix.
- **Tools that leave per-operation scratch must be reaped.** kubeadm's
`/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
GC'd; 28GB had silently accumulated.
- **Out-of-band control-plane edits must be written back to kubeadm-config** — else
`kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).


@ -0,0 +1,301 @@
# Goldmane Flow Trail — east-west "who-talks-to-whom" observability
> As-built runbook for the Calico Goldmane + Whisker flow plane and the
> `goldmane-edge-aggregator` durable audit trail. Design + rationale:
> [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
> Glossary: `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
> Implements infra issues #57 (Whisker ingress), #58 (aggregator), #61
> (monitoring), #62 (egress allowlist queries), #63 (these docs).
## What the trail is
Three layers turn raw east-west traffic into a queryable, durable record of
which Service talks to which. **Service identity = the workload's namespace**
(primary), refined by a `service-identity` label in the few multi-Service
namespaces (`monitoring`, `kube-system`, `dbaas`) — see ADR-0014.
| Layer | Component | Lifetime | Where it lives |
|---|---|---|---|
| **Live map** | Calico **Goldmane** + **Whisker** | ~60-min in-memory ring buffer (lost on Goldmane restart) | `calico-system`; Whisker UI at `whisker.viktorbarzin.me` |
| **Durable trail** | `goldmane-edge-aggregator` (`aggregate` mode) | persistent | CNPG Postgres DB `goldmane_edges`, table `edge` |
| **Notification** | `goldmane-edges-digest` CronJob (`digest` mode) | daily | Slack `#alerts` |
**Goldmane** aggregates identity-stamped flows (namespace / pod / workload /
labels + allow-deny + policy-trace) streamed from Felix (the existing
`calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer —
**nothing is written to etcd or the K8s API** (the etcd-cost constraint that
drove the whole design). **Whisker** is its live web UI. Because the ring
buffer is *not* a trail (a Goldmane restart loses the window), the
`goldmane-edge-aggregator` consumes Goldmane's gRPC `Flows.Stream` API over
mTLS and upserts the unique **namespace-pair edge set** into Postgres; a daily
CronJob posts first-seen edges to Slack.
The edge set is deliberately **low-cardinality** — one row per
`(src_ns, dst_ns, action)`, *not* per-pod or per-port — so the table stays
small no matter how much traffic flows.
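To make the low-cardinality upsert concrete, here is a minimal sketch using an in-memory SQLite database in place of the CNPG Postgres DB. It is not the aggregator's code: the `record_flow` helper, the text timestamps, and the choice to increment `flow_count` by one per observed flow are all assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS edge (
    src_ns TEXT, dst_ns TEXT, action TEXT,
    first_seen TEXT, last_seen TEXT, flow_count INTEGER,
    PRIMARY KEY (src_ns, dst_ns, action)
)"""

# One row per (src_ns, dst_ns, action): a repeat sighting only bumps
# last_seen and flow_count, and first_seen is never overwritten.
UPSERT = """
INSERT INTO edge (src_ns, dst_ns, action, first_seen, last_seen, flow_count)
VALUES (?, ?, ?, ?, ?, 1)
ON CONFLICT (src_ns, dst_ns, action) DO UPDATE SET
    last_seen = excluded.last_seen,
    flow_count = flow_count + 1
"""


def record_flow(conn, src_ns, dst_ns, action, seen_at):
    """Upsert one flow into the edge set; return False if the flow is dropped."""
    if not src_ns or not dst_ns or src_ns == dst_ns:
        return False  # empty-namespace flows and self-edges are not recorded
    conn.execute(UPSERT, (src_ns, dst_ns, action, seen_at, seen_at))
    return True
```

However many flows a namespace pair produces, the table keeps a single row per pair and action, which is what keeps it small.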
## Where the data lives
### Whisker UI — live, ~60 min
- `https://whisker.viktorbarzin.me` (Authentik-gated — Whisker ships no own
login; `auth = "required"`). Shows the live flow stream + a service graph for
roughly the last hour. Use it for "what is talking right now"; it is **not**
history.
- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
(HTTP), both in `calico-system`.
### CNPG `goldmane_edges` — durable
- Postgres DB `goldmane_edges` on the CNPG cluster
(`pg-cluster-rw.dbaas.svc.cluster.local:5432`). One table:
```
edge(src_ns text, dst_ns text, action text,
first_seen timestamptz, last_seen timestamptz, flow_count bigint,
PRIMARY KEY (src_ns, dst_ns, action))
```
- `action` is one of `allow` / `deny` / `pass` / `unspecified` (normalised Goldmane
action).
- **Self-edges (`src_ns == dst_ns`) and empty-namespace flows** (host-endpoint
/ public-internet) are **dropped** — the trail is about in-cluster service
relationships only. (Egress to the public internet is therefore NOT in this
table; it lives in the Wave-1 Calico flow-log path — see security.md.)
- A **"new edge"** = a row whose `first_seen` falls inside the digest window.
- Role `goldmane_edges` (Vault-rotated, 7-day) owns the DB. The `edge` table
is created idempotently by the aggregator at startup (canonical DDL also in
the repo at `migrations/0001_edge.sql`).
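The "new edge" rule above is what the daily digest selects on. A minimal sketch of that selection follows, assuming edges are already loaded as dicts with timezone-aware `first_seen` values; the function names and the message wording are hypothetical and do not reflect the digest's actual output format.

```python
from datetime import timedelta


def new_edges(edges, now, window=timedelta(hours=24)):
    """Return edges whose first_seen falls inside the window, newest first."""
    cutoff = now - window
    fresh = [e for e in edges if e["first_seen"] > cutoff]
    return sorted(fresh, key=lambda e: e["first_seen"], reverse=True)


def digest_text(edges, now):
    """Build the digest message, or None when there is nothing new to report."""
    fresh = new_edges(edges, now)
    if not fresh:
        return None  # quiet day: post nothing
    lines = [f"{e['src_ns']} -> {e['dst_ns']} ({e['action']})" for e in fresh]
    return f"{len(fresh)} new edge(s) in the last 24h:\n" + "\n".join(lines)
```

Returning `None` rather than an empty message models a digest that stays silent when no edge was first seen inside the window.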
### Slack `#alerts` — daily digest
> **Channel note (2026-06-25):** posts to **`#alerts`**, not `#security`. The shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of `#security`, so a channel override there returns HTTP `404 channel_not_found` (this almost certainly also breaks alertmanager's `slack-security` receiver — verify separately). To route the digest (and security alerts) to `#security`: invite that webhook's Slack app to `#security`, then set `SLACK_CHANNEL=#security` in `stacks/goldmane-edge-aggregator` and re-apply.
- CronJob `goldmane-edges-digest` (08:00 Europe/London) posts edges first seen
in the last 24h. Quiet when there are none. Reuses the existing alert-digest
Slack incoming webhook (Vault `secret/viktor` → `alertmanager_slack_api_url`);
no new webhook was created.
## How to enable / disable
### Goldmane + Whisker (the flow plane)
Operator CRs in **`stacks/calico/main.tf`** — NOT the Helm `goldmane`/`whisker`
flags (those stay `false`; the operator's own `installation`/`apiServer` are
operator-managed via the `goldmanes`/`whiskers.operator.tigera.io` CRDs):
- `kubectl_manifest.goldmane` (kind `Goldmane`) — creating it makes the operator
re-render `calico-node` with the `FELIX_FLOWLOGSGOLDMANESERVER` env (the
operator auto-wires Felix — **do NOT patch FelixConfiguration**), triggering a
supervised `calico-node` DaemonSet roll. Yields `Deployment` + `Service
goldmane:7443`.
- `kubectl_manifest.whisker` (kind `Whisker`, `depends_on` goldmane;
`notifications = Disabled`). Yields `Deployment` + `Service whisker:8081`.
**To disable:** delete those two CRs and re-apply `stacks/calico`. Reversible
toggle (Goldmane is tech-preview in OSS Calico 3.30 — the main standing risk per
ADR-0014).
### Whisker public ingress (infra #57)
Also in `stacks/calico/main.tf`:
- `module "ingress_whisker"` (`ingress_factory`, `auth = "required"`,
`dns_type = "proxied"`) → `whisker.viktorbarzin.me`.
- `kubernetes_network_policy_v1.whisker_allow_traefik` — **required alongside the
ingress**: the operator's own `whisker` NetworkPolicy (owned by the Whisker CR)
is `policyTypes: [Ingress]` with no rules = default-deny ingress to the pod.
This additive NP ORs in an allow for `namespaceSelector
kubernetes.io/metadata.name=traefik` on TCP 8081. Without it Traefik 502s.
### The aggregator + digest (the durable trail) — `stacks/goldmane-edge-aggregator`
A Tier-1 stack (PG state) mirroring the claude-memory pattern. Apply with
`scripts/tg apply` from `stacks/goldmane-edge-aggregator/`. It provisions: the namespace,
the mTLS client material, the Postgres DB-init Job, the `DATABASE_URL`
ExternalSecret (Vault static role `pg-goldmane-edges`), the Slack ExternalSecret,
the `aggregate` Deployment, and the `digest` CronJob. **To disable the trail
without touching the flow plane:** scale `deployment/goldmane-edge-aggregator` to
0 (transient) or remove the stack (permanent) — Goldmane/Whisker keep running.
Image: `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (PRIVATE). The
`goldmane-edge-aggregator` namespace must be in the `ghcr-credentials` Kyverno
allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`,
`local.ghcr_private_namespaces`) or image pulls fail with 401. Code repo:
`~/code/goldmane-edge-aggregator` (see its `README.md` + `DEPLOY.md`).
## mTLS cert — the REUSE decision (cert-reuse gotcha)
The aggregator dials `goldmane:7443` over **mutual TLS**. Goldmane requires the
client cert to chain to the **Tigera CA**, but it does **NOT authorize by client
identity** — any Tigera-CA-signed cert is accepted.
Rather than copy the Tigera CA **private key** into Terraform state to mint our
own cert (a needless CA-key exposure; the `hashicorp/tls` provider also clashes
with this repo's global generate-providers/lockfile pattern), the stack
**REUSES the operator-minted, Tigera-CA-signed `whisker-backend-key-pair`
Secret** (`calico-system`), copying its `tls.crt`/`tls.key` into the
`goldmane-client-tls` Secret in the aggregator namespace. The CA *bundle* that
verifies Goldmane's serving cert (`tigera-ca-bundle` ConfigMap, key
`tigera-ca-bundle.crt`) is likewise copied verbatim (a ConfigMap can't be
cross-namespace-mounted).
> **GOTCHA — if the operator rotates `whisker-backend-key-pair`, re-apply
> `stacks/goldmane-edge-aggregator`** to re-sync the copied cert. Symptom of a
> stale copy: the `aggregate` pod logs TLS handshake / `Flows.Stream` failures
> and no `last_seen` updates land in the `edge` table. Hardening follow-up
> (noted in the stack): mint an own-identity cert in-namespace if Whisker is ever
> removed (which would delete the reused source Secret).
The Deployment leaves `GOLDMANE_HOST=goldmane.calico-system.svc.cluster.local:7443`
and the default cert/CA paths; the default ServerName (host sans port) is a SAN
on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` /
`GOLDMANE_TLS_INSECURE` override is needed.
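One way to confirm a stale copy before re-applying is to compare the certificate in the operator's Secret with the copied one. The sketch below is an assumption-level illustration, not part of the stack: it expects both Secrets already parsed from JSON (for example the output of `kubectl get secret <name> -o json`), and the helper names are hypothetical.

```python
import base64
import hashlib


def cert_fingerprint(secret):
    """SHA-256 hex digest of the tls.crt held in a Secret's base64 `data` map."""
    pem = base64.b64decode(secret["data"]["tls.crt"])
    return hashlib.sha256(pem).hexdigest()


def copy_is_stale(source_secret, copied_secret):
    """True when the copied client cert no longer matches the operator's Secret."""
    return cert_fingerprint(source_secret) != cert_fingerprint(copied_secret)
```

Here `source_secret` would be `whisker-backend-key-pair` from `calico-system` and `copied_secret` would be `goldmane-client-tls` from the aggregator namespace. The digest is taken over the PEM bytes as stored, so it detects any difference in the stored value, not just a change of key.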
## How to query who-talks-to-whom
`psql` into the DB (creds: Vault static role `static-creds/pg-goldmane-edges`, or
exec a CNPG pod). All queries are against the single `edge` table.
```sql
-- Everything talking to a namespace (inbound), most-active first
SELECT src_ns, action, flow_count, first_seen, last_seen
FROM edge WHERE dst_ns = '<ns>' ORDER BY flow_count DESC;
-- Everything a namespace talks TO (outbound)
SELECT dst_ns, action, flow_count, first_seen, last_seen
FROM edge WHERE src_ns = '<ns>' ORDER BY last_seen DESC;
-- New edges in the last 24h (what the digest reports)
SELECT src_ns, dst_ns, action, flow_count, first_seen
FROM edge WHERE first_seen > now() - interval '24 hours'
ORDER BY first_seen DESC;
-- Any DENIED edges (policy is dropping this pair)
SELECT src_ns, dst_ns, flow_count, last_seen
FROM edge WHERE action = 'deny' ORDER BY last_seen DESC;
-- Full edge set as a graph adjacency list
SELECT src_ns, dst_ns, action, flow_count FROM edge ORDER BY src_ns, dst_ns;
```
For the **live** (sub-hour) view including pod/port detail, use the Whisker UI —
the `edge` table intentionally aggregates that away.
## Deriving the Wave-1 egress allowlist from the edge table (infra #62)
The durable edge set is a faster, identity-stamped data source for the existing
**observe-then-enforce** egress effort (beads `code-8ywc`; snapshot
`docs/architecture/wave1-egress-observation-2026-05-22.md`) than the original
iptables-`LOG` → journald → Loki path (ADR-0014 consequence: "Enforcement gains
a better data source"). It replaces the *internal* (namespace-to-namespace) leg
of the allowlist; **external/public-internet egress is NOT in this table** (empty
dst namespace, dropped) — for those destinations keep using the Calico flow-log
path described in security.md.
**Per-namespace internal egress allowlist** — the set of in-cluster namespaces a
given source is *observed* talking to with `action='allow'`:
```sql
-- Internal egress allowlist for one namespace (feeds its NetworkPolicy)
SELECT DISTINCT dst_ns
FROM edge
WHERE src_ns = '<ns>' AND action = 'allow'
ORDER BY dst_ns;
```
```sql
-- Full internal egress matrix for all namespaces at once
SELECT src_ns, array_agg(DISTINCT dst_ns ORDER BY dst_ns) AS allowed_dst_ns
FROM edge
WHERE action = 'allow'
GROUP BY src_ns
ORDER BY src_ns;
```
```sql
-- Sanity: namespaces with a DENY edge already (policy is biting; investigate
-- before tightening further)
SELECT DISTINCT src_ns, dst_ns FROM edge WHERE action = 'deny';
```
**How this feeds enforcement (scope):** the derived `dst_ns` set is the
*internal* half of a namespace's egress allowlist — it tells you which
in-cluster namespaces to permit before flipping that namespace to default-deny.
The universal baseline (kube-dns :53, often dbaas :3306/:5432, redis :6379) and
the external destinations still come from the Wave-1 observation snapshot.
**Enforce-flips remain OUT OF SCOPE** here — this is observe-and-derive only;
the phased per-namespace default-deny rollout (starting `recruiter-responder`)
is tracked under `code-8ywc`. Cross-links:
[security.md → NetworkPolicy Default-Deny Egress](../architecture/security.md#networkpolicy-default-deny-egress-wave-1--observe-then-enforce-tier-34),
[wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md),
[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
> **Caveat (same as the Wave-1 snapshot):** an edge only exists if it was
> *observed*. A weekly CronJob or a 7-day Vault rotation may not have fired yet —
> collect ≥7 days of edges before treating a namespace's `allow` set as
> complete. The `first_seen` column tells you how long an edge has been known;
> the digest surfaces brand-new ones daily.
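
For orientation, the derived `dst_ns` set maps onto a NetworkPolicy roughly as sketched here. This is a hedged illustration only: enforce-flips are out of scope for the trail, the policy name `allow-observed-internal-egress` and the helper `egress_policy` are hypothetical, and the result deliberately omits the DNS/database baseline and all external destinations.

```python
def egress_policy(namespace, allowed_dst_ns):
    """Build a NetworkPolicy manifest (as a dict) allowing egress to the given namespaces."""
    peers = [
        # kubernetes.io/metadata.name is the label Kubernetes sets on every namespace.
        {"namespaceSelector": {"matchLabels": {"kubernetes.io/metadata.name": ns}}}
        for ns in sorted(set(allowed_dst_ns))
    ]
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-observed-internal-egress", "namespace": namespace},
        "spec": {
            "podSelector": {},  # every pod in the namespace
            "policyTypes": ["Egress"],
            # An empty egress list with policyTypes [Egress] denies all egress.
            "egress": [{"to": peers}] if peers else [],
        },
    }
```

Feeding it the rows from the per-namespace query gives a reviewable starting point; serialising the dict to YAML or JSON and applying it is a separate, deliberate step.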
## Monitoring & health (infra #61)
The aggregator pod has **no `/metrics` endpoint** — health is inferred from
kube-state-metrics. Three complementary signals (memory ids 6598, 6599;
see also [monitoring.md → Security Alerts](../architecture/monitoring.md#security-alerts-wave-1--planned-beads-code-8ywc)):
| Signal | What | Where |
|---|---|---|
| **`AggregatorDown`** | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` for 15m → warning | Prometheus alert group `Network Observability (Goldmane)` in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`; routes `slack-warning` → `#alerts` |
| **`DigestFailing`** | `kube_job_status_failed{...job_name=~"goldmane-edges-digest.*"} > 0` within 24h, for 30m → warning | same alert group → `#alerts` |
| **cluster-health #48** | `check_goldmane_aggregator` reads the Deployment's `Available` condition (missing or not-Available → FAIL) | `scripts/cluster_healthcheck.sh` (human / `--quiet` / `--json` modes; emits `goldmane_aggregator`) |
The two alert layers are deliberately complementary: `AggregatorDown` →
**no new edges land** in the DB; `DigestFailing` → **edges still land but nobody
is told**. A freshness probe (#61b) was intentionally skipped; `AggregatorDown`
is the agreed floor.
## Troubleshooting
**Whisker UI 502 / unreachable.** The additive
`kubernetes_network_policy_v1.whisker_allow_traefik` is missing or the
operator's default-deny `whisker` NP regenerated — re-apply `stacks/calico`. A
brand-new ingress host is also invisible to LAN split-horizon until the hourly
`technitium-ingress-dns-sync` runs (memory #5349); test meanwhile with
`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
(expect a 302 to Authentik — the gate working).
**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).
Common causes, in order:
1. **Stale mTLS cert** — the operator rotated `whisker-backend-key-pair`; re-apply
`stacks/goldmane-edge-aggregator` (see cert-reuse gotcha above). Symptom: TLS
handshake / `Flows.Stream` errors.
2. **Stale DB password** — the 7-day Vault rotation bounced the credential but
the pod kept the old one. The Deployment carries
`secret.reloader.stakater.com/reload: goldmane-edges-db-creds`; if it's not
restarting on rotation, verify the Reloader annotation and the ExternalSecret.
3. **Goldmane restarted** — the in-memory window was lost (expected); the stream
reconnects automatically and resumes upserting. No data loss in the DB
(only the sub-hour live window in Whisker is gone).
**Digest never posts / `DigestFailing` firing.** Inspect the most recent
`goldmane-edges-digest-*` Job (`kubectl get jobs -n goldmane-edge-aggregator`;
`kubectl logs job/<name>`). The CronJob's `ttl_seconds_after_finished=86400` GCs
pods after a day, so check soon after a failed run. With `SLACK_WEBHOOK_URL`
empty the binary forces a dry-run (no post) — verify the `goldmane-edges-slack`
ExternalSecret resolved. A dry run / smoke test: run the image with `args:
["digest"]` + `DRY_RUN=1` to print the message instead of POSTing.
> Known state (2026-06-25): the digest CronJob's first Job **failed** and it has
> never successfully posted (`lastSuccessfulTime` empty) — the digest leg is the
> live gap; `DigestFailing` is catching it. Edges still land in the DB via the
> `aggregate` Deployment; only the `#security` notification is affected.
> Investigation/fix belongs to the aggregator slice (#58/#60), not monitoring.
**No edges at all in the table.** Confirm Goldmane is enabled
(`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the
`FELIX_FLOWLOGSGOLDMANESERVER` env; confirm the `goldmane-edges-db-init` Job
completed; confirm the aggregator pod is `Running` and not `ImagePullBackOff`
(ghcr allowlist).
## Related
- [ADR-0014 — Service identity & east-west observability](../adr/0014-service-identity-and-east-west-observability.md)
- [security.md — NetworkPolicy Default-Deny Egress + east-west flow observability](../architecture/security.md)
- [monitoring.md — east-west flow observability + alerts](../architecture/monitoring.md)
- [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md)
- `CONTEXT.md` glossary — **Service identity**, **Goldmane / Whisker**
- Code: `~/code/goldmane-edge-aggregator` (`README.md`, `DEPLOY.md`); stacks
`stacks/goldmane-edge-aggregator`, `stacks/calico`


@ -41,6 +41,8 @@ Job 0 — preflight (pinned: k8s-node1)
├── halt-on-alert (kured-style ignore-list)
├── 24h-quiet baseline (no Ready transitions <24h ago)
├── kubeadm upgrade plan matches target (skipped when master already at target: partial-resume)
├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block)
├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups)
├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
├── Trigger backup-etcd Job, wait, verify snapshot byte count
├── SSH master: containerd skew fix (if master < workers)
@ -222,22 +224,34 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names
 ## Common Operations
-### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19)
+### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24)
 `kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
-and drops the `--authentication-config` flag**, silently disabling apiserver
-OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break: tokens get
-401). This used to require a manual re-apply after **every** control-plane bump.
-**Now automated:** the `rbac` stack publishes its OIDC restore script to the
-`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's
-`phase_master` re-runs it on master immediately after `kubeadm upgrade apply`
-(while tigera-operator is still quiesced, so the flag-add apiserver restart can't
-crashloop the operator). It's idempotent, health-gates `/livez` with
-auto-rollback, and is **non-fatal**: a failure only lags SSO until the next rbac
-apply (the version upgrade itself already succeeded). So a chain-driven
-control-plane bump no longer breaks SSO. The master phase self-skips when master
-is already at target, so this only runs when master was actually upgraded.
+from kubeadm-config**. apiserver auth uses a structured multi-issuer
+`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to
+still carry the legacy single-issuer `--oidc-*` extraArgs, so every upgrade
+reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does
+NOT crash on this, as verified by isolated repro; it's recoverable via the restore
+script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue
+(etcd IO starvation)**, not this drift; post-mortem:
+`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`.
+**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now
+**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting
+`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of
+its remote script. So kubeadm regenerates a **correct** manifest and the apiserver
+upgrades with a pure image bump: `kubeadm upgrade diff <target>` shows only the
+image change. Zero live impact (the CM is read only during an upgrade).
+
+**Backstops:**
+- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does
+NOT block, because the drift only breaks SSO, which is recoverable) if
+`--authentication-config` would still be dropped.
+- The `rbac` stack still publishes its restore script to the
+`kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on
+master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with
+auto-rollback, non-fatal). It is now redundant belt-and-suspenders that *also*
+re-reconciles kubeadm-config. Self-skips when master is already at target.
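
The preflight backstop described above comes down to grepping `kubeadm upgrade diff` output for a removed `--authentication-config` line. A minimal sketch, reusing the pattern from the chain's check 4b; the function name `drops_auth_config` is hypothetical, and the diff text is read from stdin so the logic can be tried without a cluster.

```shell
# Succeeds (exit 0) when the diff on stdin REMOVES a line carrying the flag,
# i.e. a line that starts with "-" followed by whitespace.
drops_auth_config() {
  grep -qE '^-[[:space:]].*--authentication-config'
}

# Example wiring (the config path is a made-up placeholder):
#   sudo kubeadm upgrade diff "v$TARGET_VERSION" 2>/dev/null | drops_auth_config \
#     && echo "WARN: upgrade would drop --authentication-config"
```

Because only removed lines match, a diff that adds or merely carries the flag as context stays quiet.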
**Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the **Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the
chain logged `WARN: --authentication-config absent after re-apply`: chain logged `WARN: --authentication-config absent after re-apply`:


@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
[[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config" [[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
KUBECTL="" KUBECTL=""
JSON_RESULTS=() JSON_RESULTS=()
-TOTAL_CHECKS=47
+TOTAL_CHECKS=48
# Parallel execution settings. Each check function is self-contained — it # Parallel execution settings. Each check function is self-contained — it
# only reads cluster state and mutates the in-memory counters / JSON_RESULTS # only reads cluster state and mutates the in-memory counters / JSON_RESULTS
@ -3156,6 +3156,44 @@ PYEOF
esac esac
} }
# --- 48. Goldmane edge-aggregator availability ---
#
# The goldmane-edge-aggregator Deployment (ADR-0014 / infra #58) streams Calico
# Goldmane flows into the goldmane_edges CNPG DB — the durable who-talks-to-whom
# trail. The pod has NO /metrics endpoint, so its liveness can't be scraped;
# this check reads the Deployment's Available condition directly so the trail
# silently dying surfaces in the health board (mirrors the AggregatorDown
# Prometheus alert). Missing Deployment / not-Available -> FAIL.
check_goldmane_aggregator() {
section 48 "Goldmane Edge-Aggregator"
local ns="goldmane-edge-aggregator" dep="goldmane-edge-aggregator"
local avail desired ready
# One get; absent Deployment is a hard fail (the trail isn't deployed).
if ! $KUBECTL get deploy "$dep" -n "$ns" >/dev/null 2>&1; then
[[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
fail "Deployment $ns/$dep not found — who-talks-to-whom edge trail is not running"
json_add "goldmane_aggregator" "FAIL" "deployment missing"
return 0
fi
avail=$($KUBECTL get deploy "$dep" -n "$ns" \
-o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)
ready=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.status.readyReplicas}' 2>/dev/null)
desired=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.spec.replicas}' 2>/dev/null)
ready=${ready:-0}
desired=${desired:-0}
if [[ "$avail" == "True" ]]; then
pass "Edge-aggregator Available ($ready/$desired ready)"
json_add "goldmane_aggregator" "PASS" "${ready}/${desired} ready"
else
[[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
fail "Edge-aggregator NOT Available ($ready/$desired ready) — edge trail has stopped recording"
json_add "goldmane_aggregator" "FAIL" "${ready}/${desired} ready; Available=${avail:-unknown}"
fi
}
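
The three status reads in the check above could in principle be folded into a single `kubectl get`. This is a sketch only, not the script's actual code: `deploy_status` and `parse_deploy_status` are hypothetical helpers, and `|` is used as the separator because an empty `readyReplicas` would be swallowed by whitespace splitting.

```shell
KUBECTL="${KUBECTL:-kubectl}"

# Prints "<Available status>|<readyReplicas>|<spec.replicas>" for a Deployment.
deploy_status() {
  $KUBECTL get deploy "$2" -n "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}|{.status.readyReplicas}|{.spec.replicas}' 2>/dev/null
}

# Splits the triple and applies the same defaults as the check (0 when unset).
parse_deploy_status() {
  printf '%s\n' "$1" | {
    IFS='|' read -r avail ready desired
    echo "avail=${avail:-unknown} ready=${ready:-0} desired=${desired:-0}"
  }
}
```

For example, `parse_deploy_status "$(deploy_status goldmane-edge-aggregator goldmane-edge-aggregator)"` yields one line that the PASS/FAIL branch could consume.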
# --- Summary --- # --- Summary ---
print_summary() { print_summary() {
if [[ "$JSON" == true ]]; then if [[ "$JSON" == true ]]; then
@ -3224,7 +3262,7 @@ main() {
check_monitoring_prom_am check_monitoring_vault check_monitoring_css check_monitoring_prom_am check_monitoring_vault check_monitoring_css
check_external_replicas check_external_divergence check_pve_thermals check_external_replicas check_external_divergence check_pve_thermals
check_pve_load check_external_traefik_5xx check_ha_status_dashboard check_pve_load check_external_traefik_5xx check_ha_status_dashboard
-check_immich_search check_csi_ghost_drift
+check_immich_search check_csi_ghost_drift check_goldmane_aggregator
) )
# Auto-fix mutates cluster state inside individual checks — keep that # Auto-fix mutates cluster state inside individual checks — keep that


@ -212,3 +212,65 @@ resource "kubectl_manifest" "whisker" {
spec = { notifications = "Disabled" } spec = { notifications = "Disabled" }
}) })
} }
# ---------------------------------------------------------------------------
# Gated public ingress for the Whisker UI (infra #57 / ADR-0014).
#
# whisker.viktorbarzin.me -> whisker:8081, Authentik-gated (auth="required":
# Whisker ships NO login of its own; it's an admin observability UI, so Authentik
# forward-auth is the only gate between strangers and the flow view). The
# operator replicated `tls-secret` into calico-system already.
#
# TWO coupled pieces are required because the operator's own `whisker`
# NetworkPolicy (owned by the Whisker CR above) sets policyTypes:[Ingress]
# with NO ingress rules => default-deny on ingress to the whisker pod. The
# additive NP below ORs in a Traefik allow (k8s NetworkPolicies are additive
# across policies selecting the same pod), so we never edit the operator NP.
module "ingress_whisker" {
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = "calico-system"
name = "whisker"
service_name = "whisker"
port = 8081
auth = "required"
tls_secret_name = "tls-secret"
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Whisker"
"gethomepage.dev/description" = "Calico flow observability (who-talks-to-whom)"
"gethomepage.dev/icon" = "calico.png"
"gethomepage.dev/group" = "Infrastructure"
}
}
# Additive NetworkPolicy: permit Traefik -> whisker:8081. ORs with the
# operator's default-deny `whisker` NP (selecting the same pod) so Traefik
# can reach the UI without touching the operator-owned policy.
resource "kubernetes_network_policy_v1" "whisker_allow_traefik" {
metadata {
name = "whisker-allow-traefik"
namespace = "calico-system"
}
spec {
pod_selector {
match_labels = {
"app.kubernetes.io/name" = "whisker"
}
}
policy_types = ["Ingress"]
ingress {
from {
namespace_selector {
match_labels = {
"kubernetes.io/metadata.name" = "traefik"
}
}
}
ports {
port = "8081"
protocol = "TCP"
}
}
}
}


@ -745,7 +745,10 @@ resource "kubernetes_deployment" "phpmyadmin" {
labels = { labels = {
"app" = "phpmyadmin" "app" = "phpmyadmin"
tier = var.tier tier = var.tier
# ADR-0014 service identity: dbaas is a multi-Service namespace, so the
# namespace alone can't attribute Goldmane flows. Value = the fronting
# Service name (kubernetes_service.phpmyadmin is named "pma").
"service-identity" = "pma"
} }
annotations = { annotations = {
"reloader.stakater.com/search" = "true" "reloader.stakater.com/search" = "true"
@ -762,6 +765,10 @@ resource "kubernetes_deployment" "phpmyadmin" {
metadata { metadata {
labels = { labels = {
"app" = "phpmyadmin" "app" = "phpmyadmin"
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
# disambiguating identity must live on the pod template (not just
# the Deployment metadata above). Not in the selector, so no replace.
"service-identity" = "pma"
} }
} }
spec { spec {
@ -812,8 +819,19 @@ resource "kubernetes_deployment" "phpmyadmin" {
} }
} }
lifecycle { lifecycle {
-# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-ignore_changes = [spec[0].template[0].spec[0].dns_config]
+ignore_changes = [
+spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+# This Deployment is Keel-enrolled (keel.sh/policy=patch). Ignore the
+# attributes Keel/Kyverno mutate at runtime so `terragrunt apply` (incl.
+# the daily drift plan) doesn't fight them or revert the live image.
+# Canonical KEEL/KYVERNO lifecycle guard; matches linkwarden/chrome-service.
+metadata[0].annotations["keel.sh/policy"],
+metadata[0].annotations["keel.sh/trigger"],
+metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+metadata[0].annotations["keel.sh/match-tag"],
+spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
+spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
+]
} }
} }
@ -1499,6 +1517,10 @@ resource "kubernetes_deployment" "pgadmin" {
} }
labels = { labels = {
tier = var.tier tier = var.tier
# ADR-0014 service identity: dbaas is a multi-Service namespace, so the
# namespace alone can't attribute Goldmane flows. Value = the fronting
# Service name (kubernetes_service.pgadmin is named "pgadmin").
"service-identity" = "pgadmin"
} }
} }
spec { spec {
@ -1514,6 +1536,10 @@ resource "kubernetes_deployment" "pgadmin" {
metadata { metadata {
labels = { labels = {
app = "pgadmin" app = "pgadmin"
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
# disambiguating identity must live on the pod template (not just
# the Deployment metadata above). Not in the selector, so no replace.
"service-identity" = "pgadmin"
} }
} }
spec { spec {
@ -1568,8 +1594,20 @@ resource "kubernetes_deployment" "pgadmin" {
} }
} }
lifecycle { lifecycle {
-# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-ignore_changes = [spec[0].template[0].spec[0].dns_config]
+ignore_changes = [
+spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+# This Deployment is Keel-enrolled (keel.sh/policy=patch) and Keel has
+# bumped the live image (dpage/pgadmin4:9.16). Ignore the Keel/Kyverno
+# runtime-mutated attributes so `terragrunt apply` (incl. the daily drift
+# plan) doesn't revert the image to bare `dpage/pgadmin4` or strip Keel's
+# annotations. Canonical guard; matches linkwarden/chrome-service.
+metadata[0].annotations["keel.sh/policy"],
+metadata[0].annotations["keel.sh/trigger"],
+metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+metadata[0].annotations["keel.sh/match-tag"],
+spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
+spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
+]
} }
} }
resource "kubernetes_service" "pgadmin" { resource "kubernetes_service" "pgadmin" {


@ -0,0 +1,496 @@
# =============================================================================
# goldmane-edge-aggregator: durable who-talks-to-whom audit trail (ADR-0014 / #58)
# =============================================================================
# A small Go service that streams Calico Goldmane's gRPC Flows API (mTLS) and
# upserts the unique service-to-service edge set into Postgres, plus a daily
# Slack digest CronJob of first-seen edges. Code lives in the standalone
# `goldmane-edge-aggregator` repo; the authoritative deploy spec is its
# DEPLOY.md. This stack is the infra side of that spec.
#
# Goldmane runs as `Service goldmane:7443` (gRPC/mTLS) in calico-system, enabled
# via the operator CR in stacks/calico/main.tf. The durable Loki path is NOT
# the operator CRs; this service IS the durable trail.
#
# Structure mirrors stacks/claude-memory (the canonical Tier-1 pattern): a
# per-service namespace, a CNPG Postgres DB + role + Vault 7-day rotation +
# ExternalSecret -> DATABASE_URL, the Reloader annotation, and the
# Terragrunt-generated backend.tf/providers.tf/tiers.tf layout. The novel bit is
# the Goldmane mTLS client cert, reused from the operator's Tigera-CA-signed
# key-pair rather than minted here (see section 2).
#
# IMAGE: ghcr.io/viktorbarzin/goldmane-edge-aggregator is PRIVATE. Onboarding
# MUST add the "goldmane-edge-aggregator" namespace to the ghcr-credentials
# Kyverno allowlist (stacks/kyverno/modules/kyverno/ghcr-credentials.tf,
# local.ghcr_private_namespaces) so the Kyverno-synced `ghcr-credentials` secret
# is cloned into this namespace; otherwise the pulls fail with 401. The imagePullSecrets
# reference below assumes that entry exists.
# =============================================================================
variable "postgresql_host" { type = string }
# Plan-time root creds for the idempotent DB-init Job (mirrors claude-memory).
data "vault_kv_secret_v2" "secrets" {
mount = "secret"
name = "goldmane-edge-aggregator"
}
# -----------------------------------------------------------------------------
# 1. Namespace
# -----------------------------------------------------------------------------
resource "kubernetes_namespace" "goldmane_edge_aggregator" {
metadata {
name = "goldmane-edge-aggregator"
labels = {
name = "goldmane-edge-aggregator"
# Tier 4-aux: a small off-path consumer service, like claude-memory.
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# -----------------------------------------------------------------------------
# 2. Goldmane mTLS client certificate (reused from the operator's Tigera-CA-signed key-pair)
# -----------------------------------------------------------------------------
# The aggregator dials goldmane:7443 over mutual TLS. Goldmane requires mutual
# TLS on :7443 and verifies only that the client cert chains to the Tigera CA
# (the same CA that issues Goldmane's serving cert). It does NOT authorize by
# client identity, so ANY Tigera-CA-signed cert is accepted. Rather than copy
# the Tigera CA PRIVATE KEY into TF
# state to mint our own (a needless CA-key exposure; the hashicorp/tls provider
# is also incompatible with this repo's global generate-providers/lockfile
# pattern), we REUSE the operator-minted, Tigera-CA-signed client cert
# `whisker-backend-key-pair` (calico-system). We never touch the CA key.
# Trade-off: if the operator rotates that cert, re-apply to re-sync (hardening
# follow-up: mint an own-identity cert in-namespace if Whisker is ever removed).
data "kubernetes_secret" "whisker_backend" {
metadata {
name = "whisker-backend-key-pair"
namespace = "calico-system"
}
}
# The CA bundle that verifies Goldmane's serving cert. It lives ONLY in
# calico-system (verified: ConfigMap `tigera-ca-bundle`, 2 keys present
# `ca-bundle.crt` AND `tigera-ca-bundle.crt`, both the trusted bundle). We read
# it and recreate it as a ConfigMap in this namespace so the pod can mount it
# (a ConfigMap cannot be cross-namespace-mounted).
data "kubernetes_config_map" "tigera_ca_bundle" {
metadata {
name = "tigera-ca-bundle"
namespace = "calico-system"
}
}
resource "kubernetes_config_map" "tigera_ca_bundle" {
metadata {
name = "tigera-ca-bundle"
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
}
# Copy the upstream bundle verbatim. We mount the `tigera-ca-bundle.crt` key
# at /etc/tigera-ca/tigera-ca-bundle.crt so the service's default
# CA_CERT_PATH (/etc/tigera-ca/tigera-ca-bundle.crt) resolves with no override.
data = data.kubernetes_config_map.tigera_ca_bundle.data
}
# Client cert + key for mTLS to goldmane:7443, mounted at TLS_CERT_PATH /
# TLS_KEY_PATH defaults (/etc/goldmane-client-tls/tls.crt and .../tls.key).
# Sourced verbatim from the operator's whisker-backend client key-pair (read
# above), which is already Tigera-CA-signed; that is all Goldmane verifies. No
# CA key is touched and no cross-namespace CA RBAC is needed.
resource "kubernetes_secret" "goldmane_client_tls" {
metadata {
name = "goldmane-client-tls"
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
}
type = "Opaque"
data = {
"tls.crt" = data.kubernetes_secret.whisker_backend.data["tls.crt"]
"tls.key" = data.kubernetes_secret.whisker_backend.data["tls.key"]
}
}
# -----------------------------------------------------------------------------
# 3. Postgres: DB + role `goldmane_edges`, Vault 7-day rotation, DATABASE_URL
# -----------------------------------------------------------------------------
# Idempotent create of the role + DB using the CNPG root creds from Vault
# (dbaas_root_password), exactly mirroring claude-memory's db_init Job. The
# service creates the `edge` table itself at startup (migrations/0001_edge.sql),
# so no migration Job is needed.
resource "kubernetes_job" "db_init" {
metadata {
name = "goldmane-edges-db-init"
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
}
spec {
template {
metadata {}
spec {
container {
name = "db-init"
image = "postgres:16-alpine"
command = [
"sh", "-c",
<<-EOT
set -e
# -d postgres: psql defaults the database name to the username;
# the root user has no root-named database, so be explicit.
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_roles WHERE rolname='goldmane_edges'" | grep -q 1 || \
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE ROLE goldmane_edges WITH LOGIN PASSWORD '${data.vault_kv_secret_v2.secrets.data["db_password"]}'"
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_database WHERE datname='goldmane_edges'" | grep -q 1 || \
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE DATABASE goldmane_edges OWNER goldmane_edges"
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "GRANT ALL PRIVILEGES ON DATABASE goldmane_edges TO goldmane_edges"
echo "Database init complete"
EOT
]
}
restart_policy = "Never"
}
}
backoff_limit = 3
}
wait_for_completion = true
timeouts {
create = "2m"
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno injects dns_config (ndots=2); ignore it so
# this idempotent Job isn't replaced (Jobs are immutable) on every apply.
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
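
Read in isolation, the Job's create-if-missing pattern looks like the sketch below. It is an illustration under stated assumptions, not the Job's actual script: it presumes `PGHOST`, `PGUSER`, and `PGPASSWORD` are exported once instead of repeated inline, and `PSQL`, `exists_query`, `ensure_role`, and `ensure_database` are hypothetical names.

```shell
PSQL="${PSQL:-psql}"

# Succeeds when the probe query returns a row (output contains "1").
exists_query() {
  $PSQL -d postgres -tc "$1" | grep -q 1
}

# Creates the login role only if pg_roles has no such entry.
ensure_role() {
  exists_query "SELECT 1 FROM pg_roles WHERE rolname='$1'" ||
    $PSQL -d postgres -c "CREATE ROLE $1 WITH LOGIN PASSWORD '$2'"
}

# Creates the database (owned by $2) only if pg_database has no such entry.
ensure_database() {
  exists_query "SELECT 1 FROM pg_database WHERE datname='$1'" ||
    $PSQL -d postgres -c "CREATE DATABASE $1 OWNER $2"
}
```

Calling `ensure_role goldmane_edges "$DB_PASSWORD"` followed by `ensure_database goldmane_edges goldmane_edges` is safe to repeat, which is what lets the Job re-run on every apply without erroring on existing objects.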
# ExternalSecret projecting the Vault-rotated (7-day) credential into a K8s
# Secret as DATABASE_URL. The Vault DB static role `pg-goldmane-edges` and its
# place in the CNPG connection allowlist are added in stacks/vault/main.tf
# (see this stack's terragrunt.hcl note). remoteRef key: static-creds/pg-goldmane-edges.
resource "kubernetes_manifest" "db_external_secret" {
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
metadata = {
name = "goldmane-edges-db-creds"
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
}
spec = {
refreshInterval = "15m"
secretStoreRef = {
name = "vault-database"
kind = "ClusterSecretStore"
}
target = {
name = "goldmane-edges-db-creds"
template = {
data = {
DATABASE_URL = "postgresql://goldmane_edges:{{ .password }}@${var.postgresql_host}:5432/goldmane_edges"
}
}
}
data = [{
secretKey = "password"
remoteRef = {
key = "static-creds/pg-goldmane-edges"
property = "password"
}
}]
}
}
depends_on = [kubernetes_namespace.goldmane_edge_aggregator]
}
# -----------------------------------------------------------------------------
# 4. Slack webhook (reuse the alert-digest incoming webhook)
# -----------------------------------------------------------------------------
# The monitoring alert-digest CronJob posts with the Slack incoming webhook at
# Vault secret/monitoring -> key `alertmanager_slack_api_url`
# (stacks/monitoring/modules/monitoring/alert_digest.tf). Project that same URL
# into this namespace as SLACK_WEBHOOK_URL via an ExternalSecret (no new
# webhook). The digest defaults to #security, but the CronJob in section 6
# overrides SLACK_CHANNEL to #alerts (see the comment there).
resource "kubernetes_manifest" "slack_external_secret" {
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
metadata = {
name = "goldmane-edges-slack"
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
}
spec = {
refreshInterval = "1h"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = {
name = "goldmane-edges-slack"
}
data = [{
secretKey = "SLACK_WEBHOOK_URL"
remoteRef = {
key = "viktor"
property = "alertmanager_slack_api_url"
}
}]
}
}
depends_on = [kubernetes_namespace.goldmane_edge_aggregator]
}
# -----------------------------------------------------------------------------
# 5. aggregate Deployment (long-running gRPC stream -> Postgres upserts)
# -----------------------------------------------------------------------------
resource "kubernetes_deployment" "aggregate" {
depends_on = [
kubernetes_job.db_init,
kubernetes_manifest.db_external_secret,
]
metadata {
name = "goldmane-edge-aggregator"
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
labels = {
app = "goldmane-edge-aggregator"
tier = local.tiers.aux
}
annotations = {
# Credential is env-injected and read only at startup; the 7-day rotation
# must bounce the pod or it keeps the stale password and silently fails
# DB auth (infra CLAUDE.md Reloader rule).
"secret.reloader.stakater.com/reload" = "goldmane-edges-db-creds"
}
}
spec {
# 1 replica: the edge set is a global upsert keyed on (src_ns, dst_ns,
# action); a second replica only doubles writes for no benefit (Goldmane
# streams per-flow). Stateless (no PVC) so RollingUpdate is fine.
replicas = 1
selector {
match_labels = {
app = "goldmane-edge-aggregator"
}
}
template {
metadata {
labels = {
app = "goldmane-edge-aggregator"
}
}
spec {
# PRIVATE ghcr image: the `ghcr-credentials` pull secret is cloned into this
# namespace by the Kyverno sync-ghcr-credentials allowlist policy (add this ns
# to that list).
image_pull_secrets {
name = "ghcr-credentials"
}
container {
name = "aggregate"
# CI (GHA -> ghcr) overwrites this to :<sha8> via `kubectl set image`;
# the image tag is in ignore_changes below so the SHA sticks across
# `terragrunt apply` (fleet image-pin convention). Placeholder :latest
# until the deploy pipeline runs.
image = "ghcr.io/viktorbarzin/goldmane-edge-aggregator:latest"
args = ["aggregate"]
# Goldmane mTLS. By default the GOLDMANE_HOST host (sans port) is used as the
# TLS ServerName, here "goldmane.calico-system.svc.cluster.local", which is a SAN
# on the live Goldmane serving cert (verified 2026-06-24:
# DNS:goldmane{,.calico-system{,.svc{,.cluster.local}}}). So no
# GOLDMANE_SERVER_NAME override and no GOLDMANE_TLS_INSECURE needed.
env {
name = "GOLDMANE_HOST"
value = "goldmane.calico-system.svc.cluster.local:7443"
}
# TLS_CERT_PATH / TLS_KEY_PATH / CA_CERT_PATH are left at their image
# defaults (/etc/goldmane-client-tls/tls.{crt,key} and
# /etc/tigera-ca/tigera-ca-bundle.crt); the mounts below match them.
env {
name = "DATABASE_URL"
value_from {
secret_key_ref {
name = "goldmane-edges-db-creds"
key = "DATABASE_URL"
}
}
}
volume_mount {
name = "goldmane-client-tls"
mount_path = "/etc/goldmane-client-tls"
read_only = true
}
volume_mount {
name = "tigera-ca"
mount_path = "/etc/tigera-ca"
read_only = true
}
resources {
# Idles low: a single gRPC stream + periodic upserts. requests=limits
# per the repo memory rule; no CPU limit (CFS throttling). Right-size
# later with krr.
requests = {
cpu = "10m"
memory = "64Mi"
}
limits = {
memory = "64Mi"
}
}
}
volume {
name = "goldmane-client-tls"
secret {
secret_name = kubernetes_secret.goldmane_client_tls.metadata[0].name
}
}
volume {
name = "tigera-ca"
config_map {
name = kubernetes_config_map.tigera_ca_bundle.metadata[0].name
}
}
}
}
}
lifecycle {
ignore_changes = [
# CI pipeline owns the image tag (kubectl set image from GHA/Woodpecker).
spec[0].template[0].spec[0].container[0].image,
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/match-tag"],
metadata[0].annotations["kubernetes.io/change-cause"],
metadata[0].annotations["deployment.kubernetes.io/revision"],
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
]
}
}
# -----------------------------------------------------------------------------
# 6. digest daily CronJob (first-seen edges -> Slack)
# -----------------------------------------------------------------------------
resource "kubernetes_cron_job_v1" "digest" {
depends_on = [
kubernetes_job.db_init,
kubernetes_manifest.db_external_secret,
kubernetes_manifest.slack_external_secret,
]
metadata {
name = "goldmane-edges-digest"
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
labels = {
app = "goldmane-edge-aggregator"
tier = local.tiers.aux
}
}
spec {
# Daily 08:00 Europe/London; aligns with the alert-digest cadence.
schedule = "0 8 * * *"
timezone = "Europe/London"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 3
starting_deadline_seconds = 600
job_template {
metadata {
labels = {
app = "goldmane-edge-aggregator"
}
annotations = {
# 7-day DB rotation: bounce the Job pod's stale env (Reloader rule).
"secret.reloader.stakater.com/reload" = "goldmane-edges-db-creds"
}
}
spec {
backoff_limit = 2
active_deadline_seconds = 300
ttl_seconds_after_finished = 86400
template {
metadata {
labels = {
app = "goldmane-edge-aggregator"
}
}
spec {
restart_policy = "OnFailure"
image_pull_secrets {
name = "ghcr-credentials"
}
container {
name = "digest"
# CronJobs track :latest + imagePullPolicy: Always (fleet
# convention) so the daily run picks up the current image.
image = "ghcr.io/viktorbarzin/goldmane-edge-aggregator:latest"
image_pull_policy = "Always"
args = ["digest"]
env {
name = "DATABASE_URL"
value_from {
secret_key_ref {
name = "goldmane-edges-db-creds"
key = "DATABASE_URL"
}
}
}
env {
name = "SLACK_WEBHOOK_URL"
value_from {
secret_key_ref {
name = "goldmane-edges-slack"
key = "SLACK_WEBHOOK_URL"
}
}
}
env {
name = "SLACK_CHANNEL"
# The shared alertmanager_slack_api_url incoming webhook's Slack
# app is NOT a member of #security, so overriding the channel to
# it returns HTTP 404 channel_not_found (verified 2026-06-25).
# alertmanager's own slack-security receiver shares this webhook
# and almost certainly hits the same wall. Post to #alerts (the
# webhook's working channel, same as alert-digest) until the app
# is invited to #security, then flip this back. See
# docs/runbooks/goldmane-flow-trail.md.
value = "#alerts"
}
resources {
requests = {
cpu = "10m"
memory = "64Mi"
}
limits = {
memory = "64Mi"
}
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1 (CronJob path): Kyverno mutates dns_config with ndots=2.
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# -----------------------------------------------------------------------------
# 7. Egress (default-deny consideration)
# -----------------------------------------------------------------------------
# Goldmane's own NetworkPolicy already allows INGRESS on 7443 from anywhere, so
# nothing is needed on the Goldmane side. No egress policy is declared here:
# this namespace is default-allow egress today. IF/WHEN it is brought under the
# wave-1 default-deny egress enforcement (per-namespace allowlists), add
# (Global)NetworkPolicy egress rules permitting:
# - goldmane.calico-system.svc.cluster.local:7443 (the flow stream)
# - pg-cluster-rw.dbaas.svc.cluster.local:5432 (Postgres)
# - hooks.slack.com:443 (digest -> Slack, internet)
# - kube-dns / CoreDNS :53 (DNS, every namespace)


@ -0,0 +1,24 @@
include "root" {
path = find_in_parent_folders()
}
# Tier-1 stack (PG state backend). The root terragrunt.hcl generates backend.tf
# (pg backend, schema_name = "goldmane-edge-aggregator"), providers.tf,
# cloudflare_provider.tf and tiers.tf automatically; do NOT hand-write those.
# This stack adds the hashicorp/tls provider via a local versions.tf (merged
# into the generated required_providers).
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}
dependency "vault" {
config_path = "../vault"
skip_outputs = true
}
# The Vault DB static role pg-goldmane-edges (7-day rotation) and the CNPG
# connection allowlist entry live in the vault stack (stacks/vault/main.tf).
# The vault dependency above orders this stack after it so the ExternalSecret
# can materialize the rotated credential on first apply.


@ -35,6 +35,14 @@ resource "kubernetes_namespace" "instagram_poster" {
# - immich_tag_instagram (optional; auto-resolved if missing) # - immich_tag_instagram (optional; auto-resolved if missing)
# - immich_tag_posted (optional; auto-resolved if missing) # - immich_tag_posted (optional; auto-resolved if missing)
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
# The external-secrets controller takes server-side-apply ownership of
# .spec.refreshInterval, so a plain TF apply conflicts. force_conflicts lets
# TF win (values match, so it's stable); same pattern as grafana/woodpecker/
# traefik/k8s-version-upgrade. Surfaced 2026-06-24 by the first IG apply since
# the ESO v1 migration (the scale-to-0 push).
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -139,6 +147,11 @@ resource "kubernetes_manifest" "external_secret" {
# ESO refreshes the K8s Secret every 15m. `reloader.stakater.com/match`
# bounces the pod when the password changes.
resource "kubernetes_manifest" "benchmark_db_external_secret" {
# See external_secret above: ESO owns .spec.refreshInterval; force_conflicts
# lets the TF apply win instead of erroring on the field-manager conflict.
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"
@ -227,7 +240,11 @@ resource "kubernetes_deployment" "instagram_poster" {
}
spec {
-replicas = 1
+# Scaled to 0 (2026-06-24): Instagram Graph integration is unused and its
+# ExternalSecret is dead (missing ig_graph_long_lived_token /
+# ig_business_account_id in Vault secret/instagram-poster). Set back to 1
+# after minting a Meta long-lived token and populating those keys.
+replicas = 0
# RWO PVC cannot rolling-update.
strategy {
type = "Recreate"

@ -416,6 +416,39 @@ phase_preflight() {
fi
fi
# 4b. apiserver-OIDC drift check (backstop for the rbac stack's kubeadm-config
# reconciliation). A `kubeadm upgrade` REGENERATES the apiserver manifest from
# kubeadm-config; if kubeadm-config still carries the legacy single-issuer
# --oidc-* args instead of --authentication-config, the regenerated apiserver
# loses structured multi-issuer auth → kubectl + dashboard SSO break AFTER the
# upgrade. This is RECOVERABLE (the apiserver does NOT crash — verified by an
# isolated repro 2026-06-24; the chain's post-master restore.sh re-adds the flag,
# and the rbac stack reconciles kubeadm-config so it won't recur) — so this is an
# ALERT, not a block. (NB the 2026-06-24 stall was NOT this — it was etcd IO
# starvation; see docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md.)
# Skip on an at-target master (resume — no apiserver regen).
if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
local apiserver_diff
apiserver_diff=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" "sudo kubeadm upgrade diff v$TARGET_VERSION 2>/dev/null" || true)
if echo "$apiserver_diff" | grep -qE '^-[[:space:]].*--authentication-config'; then
slack "WARN preflight — kubeadm upgrade will DROP --authentication-config (kubeadm-config OIDC drift). SSO breaks post-upgrade until restore.sh re-adds it; re-apply the rbac stack to reconcile kubeadm-config. Proceeding (recoverable, not a crash)."
fi
fi
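The drift check above comes down to one pattern match against the `kubeadm upgrade diff` output. A minimal Python sketch of the same test follows. The sample diff text is a hypothetical excerpt (the real output shape may differ), and the function name `drops_authentication_config` is illustrative rather than taken from the repo.

```python
import re

# A removed line in a unified diff starts with "-" followed by the original
# (indented) manifest line. File headers start with "---", so requiring a
# space or tab after the first "-" keeps them from matching.
DROPPED_FLAG = re.compile(r"^-[ \t].*--authentication-config", re.MULTILINE)


def drops_authentication_config(diff_text: str) -> bool:
    """Return True if the diff removes the --authentication-config flag."""
    return DROPPED_FLAG.search(diff_text) is not None


# Hypothetical excerpts of `kubeadm upgrade diff` output (assumed shape).
SAMPLE_DRIFT = """\
--- /etc/kubernetes/manifests/kube-apiserver.yaml
+++ new manifest
-    - --authentication-config=/etc/kubernetes/pki/auth-config.yaml
+    - --oidc-issuer-url=https://idp.example.invalid/
"""

SAMPLE_CLEAN = """\
--- /etc/kubernetes/manifests/kube-apiserver.yaml
+++ new manifest
     - --authentication-config=/etc/kubernetes/pki/auth-config.yaml
"""

if __name__ == "__main__":
    print("drift sample flagged:", drops_authentication_config(SAMPLE_DRIFT))
    print("clean sample flagged:", drops_authentication_config(SAMPLE_CLEAN))
```

The character class is deliberately `[ \t]` rather than `\s`, because `\s` also matches a newline and would let a bare `-` line pair up with the line after it.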
# 4c. Reclaim kubeadm scratch on master. `kubeadm upgrade apply` dumps a full
# ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
# every etcd upgrade and NEVER cleans it up — 145 dirs / 28GB had accumulated by
# 2026-06-24, pushing master root fs to 73% (image-GC churn + extra write IO on
# the shared HDD where etcd lives — a contributor to the etcd IO starvation that
# stalled that run, see post-mortem). Real etcd backups go to NFS, so these are
# throwaway. Prune ones >3 days old (keeps a short rollback window). Best-effort;
# never aborts the chain.
if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" \
"sudo find /etc/kubernetes/tmp -maxdepth 1 -type d \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) -mtime +3 -exec rm -rf {} + 2>/dev/null; echo -n 'master root after prune: '; df -h / | awk 'NR==2{print \$5\" used, \"\$4\" free\"}'" \
|| echo "kubeadm-scratch prune skipped (ssh/df failed) — non-fatal"
fi
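The selection rule of the prune step can be sketched in Python, assuming it should mirror `find -maxdepth 1 -type d -name ... -mtime +3`. One detail worth knowing: `find -mtime +3` discards the fractional day before comparing, so it matches entries that are at least four full days old. The function name and the list-returning (dry-run) behaviour are illustrative assumptions; nothing is deleted here.

```python
import fnmatch
import os
import time
from pathlib import Path

# Name patterns taken from the find command above.
PATTERNS = ("kubeadm-backup-*", "kubeadm-upgraded-manifests*")


def stale_scratch_dirs(root, min_days=3, now=None):
    """List top-level dirs under root that `find -mtime +min_days` would match."""
    now = time.time() if now is None else now
    stale = []
    for entry in sorted(Path(root).iterdir()):
        # `find -type d` does not follow symlinks, so skip them explicitly.
        if entry.is_symlink() or not entry.is_dir():
            continue
        if not any(fnmatch.fnmatch(entry.name, p) for p in PATTERNS):
            continue
        # find counts age in whole 24h periods, dropping the fraction.
        age_days = int((now - entry.stat().st_mtime) // 86400)
        if age_days > min_days:
            stale.append(entry)
    return stale


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        now = time.time()
        for name, age in (("kubeadm-backup-etcd-old", 5.5),
                          ("kubeadm-backup-etcd-new", 1.5),
                          ("unrelated-dir", 9.5)):
            d = Path(tmp, name)
            d.mkdir()
            os.utime(d, (now - age * 86400, now - age * 86400))
        print([p.name for p in stale_scratch_dirs(tmp, now=now)])
```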
# 5. Push in-flight + started_timestamp metrics + ns annotations
$KUBECTL annotate ns "$NS" \
"viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \

@ -31,6 +31,9 @@ locals {
# "no local builds"). ghcr.io/viktorbarzin/k8s-portal:latest is PRIVATE
# (infra repo default); the deployment references the cloned secret.
"k8s-portal",
# goldmane-edge-aggregator: PRIVATE ghcr image pulled by the aggregate
# Deployment + digest CronJob (ADR-0014, infra#58).
"goldmane-edge-aggregator",
]
}

@ -130,6 +130,11 @@ resource "kubernetes_deployment" "blackbox_exporter" {
labels = {
app = "blackbox-exporter"
tier = var.tier
# ADR-0014 service identity: monitoring is a multi-Service namespace, so
# the namespace alone can't attribute Goldmane flows. Value = the
# fronting Service name (kubernetes_service.blackbox_exporter is named
# "blackbox-exporter").
"service-identity" = "blackbox-exporter"
}
annotations = {
"reloader.stakater.com/search" = "true"
@ -146,6 +151,10 @@ resource "kubernetes_deployment" "blackbox_exporter" {
metadata {
labels = {
app = "blackbox-exporter"
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
# disambiguating identity must live on the pod template (not just
# the Deployment metadata above). Not in selector, so no replace.
"service-identity" = "blackbox-exporter"
}
}
spec {

@ -5,6 +5,11 @@ resource "kubernetes_deployment" "goflow2" {
labels = {
app = "goflow2"
tier = var.tier
# ADR-0014 service identity: monitoring is a multi-Service namespace, so
# the namespace alone can't attribute Goldmane flows. Value = the
# fronting Service name (kubernetes_service.goflow2, the metrics svc; the
# goflow2-netflow NodePort is the same pod by another name).
"service-identity" = "goflow2"
}
}
spec {
@ -18,6 +23,10 @@ resource "kubernetes_deployment" "goflow2" {
metadata {
labels = {
app = "goflow2"
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
# disambiguating identity must live on the pod template (not just
# the Deployment metadata above). Not in selector, so no replace.
"service-identity" = "goflow2"
}
}
spec {

@ -71,6 +71,15 @@ resource "kubernetes_persistent_volume" "alertmanager_pv" {
# DB credentials from Vault database engine (rotated automatically)
# Provides GF_DATABASE_PASSWORD that auto-updates when password rotates
resource "kubernetes_manifest" "grafana_db_creds" {
# The external-secrets controller takes server-side-apply ownership of
# .spec.refreshInterval, so a plain TF apply conflicts ("conflict with
# external-secrets ... .spec.refreshInterval"). force_conflicts lets TF win
# (values match, so it's stable); same pattern as the woodpecker/traefik/
# k8s-version-upgrade stacks. Surfaced 2026-06-24: the first monitoring apply
# in a while exposed this latent conflict (prior pushes were docs-only).
field_manager {
force_conflicts = true
}
manifest = {
apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret"

@ -47,6 +47,10 @@ resource "kubernetes_deployment" "idrac-redfish" {
labels = {
app = "idrac-redfish-exporter"
tier = var.tier
# ADR-0014 service identity: monitoring is a multi-Service namespace, so
# the namespace alone can't attribute Goldmane flows. Value = the
# fronting Service name (kubernetes_service.idrac-redfish-exporter).
"service-identity" = "idrac-redfish-exporter"
}
annotations = {
"reloader.stakater.com/search" = "true"
@ -63,6 +67,10 @@ resource "kubernetes_deployment" "idrac-redfish" {
metadata {
labels = {
app = "idrac-redfish-exporter"
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
# disambiguating identity must live on the pod template (not just
# the Deployment metadata above). Not in selector, so no replace.
"service-identity" = "idrac-redfish-exporter"
}
}
spec {

@ -1450,6 +1450,49 @@ serverFiles:
Remediation: right-size top reservers via Goldilocks (immich-server,
frigate, prometheus, pg-cluster, paperless) or bump VM RAM on
k8s-node2/k8s-node3 from 32GB → 48GB to match node1.
# Goldmane edge-aggregator (ADR-0014 / infra #58, #61): the durable
# who-talks-to-whom trail. The aggregator pod has NO /metrics endpoint,
# so its health is inferred from kube-state-metrics signals — the trail
# must not silently die. Two failure modes are covered:
# - the aggregate Deployment stops consuming Goldmane's flow stream
# (AggregatorDown) → no new edges ever land in the goldmane_edges DB
# - the daily digest CronJob can't post new edges to Slack
# (DigestFailing) → edges still land but nobody is told.
# A freshness probe (max(last_seen) staleness) is intentionally NOT here:
# AggregatorDown is the agreed floor and needs no extra moving parts.
- name: Network Observability (Goldmane)
rules:
# Deployment has <1 available replica for 15m. kube-state-metrics
# keeps `kube_deployment_status_replicas_available` (metric-keep list
# in serverFiles below). The 15m window rides out a normal rollout /
# node drain without paging; a genuinely-dead aggregator means the
# edge trail has stopped recording and stays down.
- alert: AggregatorDown
expr: |
kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1
and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
for: 15m
labels:
severity: warning
annotations:
summary: "goldmane-edge-aggregator has no available replica — the who-talks-to-whom edge trail has stopped recording"
description: "The aggregate Deployment streams Calico Goldmane flows into the goldmane_edges CNPG DB. With 0 replicas, no new namespace-pair edges are captured. `kubectl -n goldmane-edge-aggregator describe deploy goldmane-edge-aggregator` + check the goldmane svc (calico-system) is reachable."
# The goldmane-edges-digest CronJob has a failed Job that started in
# the last 24h. Mirrors the generic JobFailed shape but scoped to the
# digest so it routes here. `for: 30m` rides out the apply/scrape
# transient; the digest runs daily so a real failure won't self-heal
# until the next run — surface it same-day rather than waiting 24h.
- alert: DigestFailing
expr: |
kube_job_status_failed{namespace="goldmane-edge-aggregator", job_name=~"goldmane-edges-digest.*"} > 0
and on(namespace, job_name)
(time() - kube_job_status_start_time{namespace="goldmane-edge-aggregator", job_name=~"goldmane-edges-digest.*"}) < 86400
for: 30m
labels:
severity: warning
annotations:
summary: "goldmane-edges-digest CronJob failing: new edges captured but not posted to Slack"
description: "The daily edge digest Job {{ $labels.job_name }} failed. Edges may still be landing in the goldmane_edges DB but no one is being notified of new namespace-pairs. `kubectl -n goldmane-edge-aggregator logs job/{{ $labels.job_name }}`."
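Stripped of PromQL label matching, the two alert expressions above reduce to simple predicates. The sketch below restates them in Python over plain numbers. The function and parameter names are illustrative, and the `for:` hold durations (15m and 30m) are deliberately not modelled.

```python
def aggregator_down(available_replicas: float, prometheus_uptime_s: float) -> bool:
    """AggregatorDown condition: no available replica, and Prometheus itself
    has been up for more than 900s (the `and on()` guard in the rule)."""
    return available_replicas < 1 and prometheus_uptime_s > 900


def digest_failing(failed_pods: float, job_start_ts: float, now: float) -> bool:
    """DigestFailing condition: the Job reports failed pods and it started
    within the last 24h (86400s)."""
    return failed_pods > 0 and (now - job_start_ts) < 86400


if __name__ == "__main__":
    # Prometheus restarted 5 minutes ago: the uptime guard suppresses the alert.
    print(aggregator_down(available_replicas=0, prometheus_uptime_s=300))
    # A digest Job that failed two days ago no longer matches.
    print(digest_failing(failed_pods=1, job_start_ts=0, now=2 * 86400))
```

The uptime guard means a freshly restarted Prometheus cannot fire AggregatorDown until it has at least 15 minutes of its own history, and the 24h window keeps an old failed Job from alerting indefinitely.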
- name: Infrastructure Health
rules:
- alert: HomeAssistantDown

@ -22,6 +22,10 @@ resource "kubernetes_deployment" "pve_exporter" {
namespace = kubernetes_namespace.monitoring.metadata[0].name
labels = {
tier = var.tier
# ADR-0014 service identity: monitoring is a multi-Service namespace, so
# the namespace alone can't attribute Goldmane flows. Value = the
# fronting Service name (kubernetes_service.proxmox-exporter).
"service-identity" = "proxmox-exporter"
}
}
@ -37,6 +41,10 @@ resource "kubernetes_deployment" "pve_exporter" {
metadata {
labels = {
app = "proxmox-exporter"
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
# disambiguating identity must live on the pod template (not just
# the Deployment metadata above). Not in selector, so no replace.
"service-identity" = "proxmox-exporter"
}
}

@ -31,6 +31,10 @@ resource "kubernetes_deployment" "snmp-exporter" {
labels = {
app = "snmp-exporter"
tier = var.tier
# ADR-0014 service identity: monitoring is a multi-Service namespace, so
# the namespace alone can't attribute Goldmane flows. Value = the
# fronting Service name (kubernetes_service.snmp-exporter).
"service-identity" = "snmp-exporter"
}
annotations = {
"reloader.stakater.com/search" = "true"
@ -47,6 +51,10 @@ resource "kubernetes_deployment" "snmp-exporter" {
metadata {
labels = {
app = "snmp-exporter"
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
# disambiguating identity must live on the pod template (not just
# the Deployment metadata above). Not in selector, so no replace.
"service-identity" = "snmp-exporter"
}
}
spec {

@ -10,16 +10,29 @@
# match the existing RBAC subjects (kind: User, name: <raw email>; group names
# verbatim). Do NOT add a prefix or existing bindings break.
#
-# DRIFT WARNING: this edits the kube-apiserver static-pod manifest on the single
-# master. A `kubeadm upgrade` regenerates that manifest and DROPS this flag (this
-# is exactly how OIDC silently broke before: the flag was wiped and the
-# content-hash trigger never re-fired). After any k8s control-plane upgrade,
-# re-apply the rbac stack to restore apiserver OIDC. See
-# docs/plans/2026-06-04-k8s-dashboard-sso-design.md.
+# DRIFT WARNING (and how it's now handled): apiserver auth lives in THREE places
+# that must stay in sync, because a `kubeadm upgrade` REGENERATES the static-pod
+# manifest from kubeadm-config:
+# 1. /etc/kubernetes/pki/auth-config.yaml: the structured authn file
+# 2. the live kube-apiserver static-pod manifest: references it via the flag
+# 3. the kubeadm-config ClusterConfiguration CM: what kubeadm regenerates from
+# Originally only (1)+(2) were managed, so every kubeadm upgrade rewrote the
+# manifest from the STALE CM, reverting --authentication-config to single-issuer
+# --oidc-* flags. The consequence is SSO breakage AFTER the upgrade: kubectl +
+# dashboard lose multi-issuer auth (the apiserver does NOT crash on this; verified
+# by an isolated repro 2026-06-24; the 2026-06-24 v1.35 upgrade *stall* was a
+# separate etcd IO-starvation issue, see
+# docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md). The
+# remote script below now ALSO reconciles (3) via `kubeadm init phase
+# upload-config`, so a future kubeadm upgrade regenerates a CORRECT manifest. The
+# k8s-version-upgrade chain additionally ALERTS (does not block; SSO drift is
+# recoverable) via `kubeadm upgrade diff` in preflight if --authentication-config
+# would still be dropped.
#
# SAFETY: the remote script health-gates on /livez and AUTO-ROLLS-BACK the
# manifest from a timestamped backup if the apiserver does not recover, so a
-# malformed config cannot leave the single master down.
+# malformed config cannot leave the single master down. Reconciling kubeadm-config
+# is zero-impact on the running cluster (the CM is only read during an upgrade).
variable "k8s_master_host" {
type = string
@ -97,6 +110,40 @@ locals {
print('flag-inserted' if done else 'ANCHOR-NOT-FOUND')
PY
# Reconciles the kubeadm-config ClusterConfiguration's apiServer.extraArgs:
# drops the stale single-issuer --oidc-* args and ensures --authentication-config
# is present (anchored after --authorization-mode). Stdlib-only (the master is
# only guaranteed python3, not pyyaml/yq). Idempotent; preserves all other
# fields (etcd args, audit args, extraVolumes) verbatim. Exits 3 if the
# authorization-mode anchor is missing (fail loud, leave the CM untouched).
kubeadm_oidc_reconcile_py = <<-PY
import sys
lines = sys.stdin.read().split('\n')
out, i, n = [], 0, len(lines)
have_authn = any('name: authentication-config' in l for l in lines)
inserted = have_authn
while i < n:
ln = lines[i]; s = ln.strip()
if s.startswith('- name: oidc-'):
i += 2 if (i + 1 < n and lines[i + 1].strip().startswith('value:')) else 1
continue
out.append(ln)
if (not inserted) and s == '- name: authorization-mode':
indent = ln[:len(ln) - len(ln.lstrip())]
if i + 1 < n and lines[i + 1].strip().startswith('value:'):
out.append(lines[i + 1]); i += 2
else:
i += 1
out.append(indent + '- name: authentication-config')
out.append(indent + ' value: /etc/kubernetes/pki/auth-config.yaml')
inserted = True
continue
i += 1
if not inserted:
sys.stderr.write('ANCHOR-NOT-FOUND: authorization-mode\n'); sys.exit(3)
sys.stdout.write('\n'.join(out))
PY
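To see what the reconcile script above does, here is a minimal sketch that wraps the same line-walking logic in a function and runs it on a hypothetical `apiServer.extraArgs` fragment. The sample YAML, the `reconcile` name, the `ValueError` used in place of `sys.exit(3)`, and the two-space indent before `value:` are assumptions for illustration only.

```python
AUTHN_PATH = "/etc/kubernetes/pki/auth-config.yaml"


def reconcile(cluster_config: str) -> str:
    """Drop '- name: oidc-*' args and ensure authentication-config follows
    the authorization-mode arg. Raises ValueError if the anchor is missing."""
    lines = cluster_config.split("\n")
    out, i, n = [], 0, len(lines)
    inserted = any("name: authentication-config" in l for l in lines)
    while i < n:
        ln = lines[i]
        s = ln.strip()
        if s.startswith("- name: oidc-"):
            # Skip the arg and its 'value:' line, if one follows.
            has_value = i + 1 < n and lines[i + 1].strip().startswith("value:")
            i += 2 if has_value else 1
            continue
        out.append(ln)
        if not inserted and s == "- name: authorization-mode":
            indent = ln[: len(ln) - len(ln.lstrip())]
            if i + 1 < n and lines[i + 1].strip().startswith("value:"):
                out.append(lines[i + 1])
                i += 2
            else:
                i += 1
            out.append(indent + "- name: authentication-config")
            out.append(indent + "  value: " + AUTHN_PATH)
            inserted = True
            continue
        i += 1
    if not inserted:
        raise ValueError("ANCHOR-NOT-FOUND: authorization-mode")
    return "\n".join(out)


# Hypothetical ClusterConfiguration fragment carrying the legacy flags.
SAMPLE = """\
apiServer:
  extraArgs:
  - name: authorization-mode
    value: Node,RBAC
  - name: oidc-issuer-url
    value: https://idp.example.invalid/
  - name: oidc-client-id
    value: kubernetes
  - name: audit-log-path
    value: /var/log/kubernetes/audit.log
"""

if __name__ == "__main__":
    once = reconcile(SAMPLE)
    print(once)
    print("idempotent:", reconcile(once) == once)
```

Running the function a second time on its own output leaves the text unchanged, which is the idempotence property the comment above the script relies on.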
# Whole remote operation, base64-embedded for byte-exact transfer (no
# heredoc/escaping hazards across SSH).
apiserver_auth_remote_script = <<-SH
@ -137,6 +184,30 @@ locals {
echo "rolled back to previous manifest"; exit 1
fi
echo "kube-apiserver healthy with multi-issuer --authentication-config"
# 5. Reconcile kubeadm-config so a FUTURE `kubeadm upgrade` regenerates the
# apiserver manifest WITH --authentication-config instead of reverting to
# the stale single-issuer --oidc-* flags. Without this, kubeadm rewrote the
# manifest from kubeadm-config on every control-plane upgrade and the
# regenerated apiserver lost multi-issuer auth (SSO broke; it did NOT crash,
# and the 2026-06-24 v1.35 upgrade stall was a separate etcd IO issue).
# Zero live impact (the CM is only read at upgrade time); idempotent;
# best-effort (the chain's `kubeadm upgrade diff` preflight alert is the
# backstop if this cannot run).
KC="sudo kubectl --kubeconfig /etc/kubernetes/admin.conf"
CC=$($KC -n kube-system get cm kubeadm-config -o jsonpath='{.data.ClusterConfiguration}' 2>/dev/null || true)
if [ -n "$CC" ] && { echo "$CC" | grep -q 'oidc-issuer-url' || ! echo "$CC" | grep -q 'authentication-config'; }; then
echo "Reconciling kubeadm-config (oidc-* -> authentication-config) so kubeadm upgrade keeps structured auth"
echo '${base64encode(local.kubeadm_oidc_reconcile_py)}' | base64 -d > /tmp/reconcile_kubeadm_oidc.py
if printf '%s' "$CC" | python3 /tmp/reconcile_kubeadm_oidc.py > /tmp/kubeadm-cc-new.yaml \
&& sudo kubeadm init phase upload-config kubeadm --config /tmp/kubeadm-cc-new.yaml; then
echo "kubeadm-config reconciled: future control-plane upgrades keep --authentication-config"
else
echo "WARN: kubeadm-config reconcile failed; the upgrade-chain preflight check will alert on the next upgrade"
fi
rm -f /tmp/reconcile_kubeadm_oidc.py /tmp/kubeadm-cc-new.yaml
else
echo "kubeadm-config already uses --authentication-config (no oidc drift)"
fi
SH
}
@ -155,6 +226,14 @@ resource "null_resource" "apiserver_oidc_config" {
}
triggers = {
# Intentionally hash ONLY the issuer config, NOT the remote script. CI applies
# the rbac stack with no ssh_private_key (var defaults to ""), so a re-run of
# this SSH provisioner in CI would fail; hence the null_resource must stay a
# no-op on a plain CI apply. Script changes (e.g. the 2026-06-24 kubeadm-config
# reconciliation) reach the cluster via the apiserver-oidc-restore ConfigMap
# below (a plain k8s resource, no ssh) which the upgrade chain re-runs. To force
# this provisioner to re-run after a script change, apply locally with
# `-replace` + TF_VAR_ssh_private_key (see docs/runbooks/k8s-version-upgrade.md).
auth_config = sha256(local.apiserver_auth_config_yaml)
}
}

@ -674,6 +674,7 @@ resource "vault_database_secret_backend_connection" "postgresql" {
"pg-recruiter-responder", "pg-tripit",
"pg-nextcloud-todos",
"pg-technitium",
"pg-goldmane-edges",
]
postgresql {
@ -891,6 +892,17 @@ resource "vault_database_secret_backend_static_role" "pg_technitium" {
rotation_period = 604800
}
# goldmane-edge-aggregator (ADR-0014 / infra #58): 7-day rotation for the
# goldmane_edges CNPG role. Consumed by stacks/goldmane-edge-aggregator via a
# vault-database ExternalSecret -> DATABASE_URL (remoteRef static-creds/pg-goldmane-edges).
resource "vault_database_secret_backend_static_role" "pg_goldmane_edges" {
backend = vault_mount.database.path
db_name = vault_database_secret_backend_connection.postgresql.name
name = "pg-goldmane-edges"
username = "goldmane_edges"
rotation_period = 604800
}
# =============================================================================
# Kubernetes Secrets Engine - Dynamic K8s Credentials
# =============================================================================

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long