Compare commits: wizard/gol...master (9 commits)
| Author | SHA1 | Date | |
|---|---|---|---|
| | 9c68d147e0 | | |
| | 60a1cb9a25 | | |
| | c6bba1da6e | | |
| | b858561bd0 | | |
| | a7704f46a6 | | |
| | aa510e3600 | | |
| | 53834deb24 | | |
| | 8dd9a3978d | | |
| | 65b2df1222 | | |
12 changed files with 3128 additions and 2301 deletions
|
|
@ -11,8 +11,8 @@ description: |
|
||||||
There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
|
There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
|
||||||
Always use Home Assistant for smart home control.
|
Always use Home Assistant for smart home control.
|
||||||
author: Claude Code
|
author: Claude Code
|
||||||
version: 2.0.0
|
version: 2.1.0
|
||||||
date: 2026-02-07
|
date: 2026-06-24
|
||||||
---
|
---
|
||||||
|
|
||||||
# Home Assistant Control
|
# Home Assistant Control
|
||||||
|
|
@ -395,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
|
||||||
## ha-london Knowledge Map
|
## ha-london Knowledge Map
|
||||||
|
|
||||||
### Overview
|
### Overview
|
||||||
- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
|
- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
|
||||||
- **Location**: London, UK
|
- **Location**: London, UK
|
||||||
- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
|
- **Platform**: Raspberry Pi 4, HA OS
|
||||||
- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
|
- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
|
||||||
- **Config path**: `/config/` (requires `sudo` for file access)
|
- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
|
||||||
|
- **Config path**: `/config/`
|
||||||
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
|
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
|
||||||
- **Zone**: London (home)
|
- **Zone**: London (home)
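
The API-first access path described in the Overview can be sketched in Python. This is a minimal illustration, not the skill's actual tooling: it assumes the token printed by `homelab ha token --instance london` is passed in as a string, and it uses Home Assistant's REST endpoint `GET /api/states/<entity_id>` with a bearer token. The function names are hypothetical.

```python
import json
import urllib.request

# Base URL taken from the notes above; the token is assumed to be whatever
# `homelab ha token --instance london` prints.
HA_LONDON_URL = "https://ha-london.viktorbarzin.me"


def build_state_request(base_url: str, token: str, entity_id: str) -> urllib.request.Request:
    """Build (but do not send) a GET for one entity's state via the HA REST API."""
    url = f"{base_url.rstrip('/')}/api/states/{entity_id}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def fetch_state(base_url: str, token: str, entity_id: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON body (needs network access to HA)."""
    request = build_state_request(base_url, token, entity_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_state(HA_LONDON_URL, token, "sensor.uptime")` would return the entity's JSON document, whose `state` field holds the current value. Splitting the request builder from the network call keeps the URL and header logic checkable without reaching the remote instance.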
|
||||||
|
|
||||||
|
### Dashboards (redesigned 2026-06-24)
|
||||||
|
**Glossary** (HA terms — keep distinct):
|
||||||
|
- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
|
||||||
|
- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
|
||||||
|
- **Card** = a widget inside a view.
|
||||||
|
|
||||||
|
- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
|
||||||
|
- **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
|
||||||
|
- **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
|
||||||
|
- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.)
|
||||||
|
- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
|
||||||
|
|
||||||
### Key Systems
|
### Key Systems
|
||||||
|
|
||||||
#### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
|
#### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
|
||||||
|
|
@ -424,10 +437,15 @@ Named plugs with power/energy tracking:
|
||||||
- PM1.0/2.5/4.0/10 particulate sensors
|
- PM1.0/2.5/4.0/10 particulate sensors
|
||||||
- VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
|
- VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
|
||||||
|
|
||||||
#### 3. Cowboy E-Bike
|
#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
|
||||||
- `sensor.bike_state_of_charge`: Battery %
|
Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
|
||||||
- `sensor.bike_total_distance`: Total km
|
- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
|
||||||
- `sensor.bike_total_co2_saved`: CO2 saved (grams)
|
- `sensor.classic_performance_remaining_range`: Range km
|
||||||
|
- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
|
||||||
|
- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
|
||||||
|
- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
|
||||||
|
- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
|
||||||
|
- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
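
Because the live Cowboy sensors read `unknown` while the bike is asleep, any consumer of `sensor.classic_performance_remaining_battery` has to treat that state as "no value" rather than as a number. A minimal sketch of that handling follows; the helper names are hypothetical, and it also treats Home Assistant's `unavailable` state the same way.

```python
from typing import Optional

# States Home Assistant reports when an entity has no usable value.
NO_VALUE_STATES = {"unknown", "unavailable"}


def parse_numeric_state(state: str) -> Optional[float]:
    """Return the state as a float, or None when the entity has no live value."""
    cleaned = state.strip().lower()
    if cleaned == "" or cleaned in NO_VALUE_STATES:
        return None
    try:
        return float(cleaned)
    except ValueError:
        # Non-numeric text states are also treated as "no value".
        return None


def describe_battery(state: str) -> str:
    """Render a battery reading without crashing on a parked/asleep bike."""
    value = parse_numeric_state(state)
    if value is None:
        return "battery: n/a (bike asleep?)"
    return f"battery: {value:.0f}%"
```

Returning `None` instead of `0` avoids showing a parked bike as having an empty battery.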
|
||||||
|
|
||||||
#### 4. Uptime Monitoring (UptimeRobot)
|
#### 4. Uptime Monitoring (UptimeRobot)
|
||||||
- `sensor.blog`: blog uptime
|
- `sensor.blog`: blog uptime
|
||||||
|
|
@ -446,12 +464,17 @@ Named plugs with power/energy tracking:
|
||||||
- Scripts: `script.start_netflix`, `script.start_stremio`
|
- Scripts: `script.start_netflix`, `script.start_stremio`
|
||||||
- Scene: `scene.night` (turns off Livia + Michelle plugs)
|
- Scene: `scene.night` (turns off Livia + Michelle plugs)
|
||||||
|
|
||||||
### Custom Components
|
### Custom Components (HACS integrations)
|
||||||
- **cowboy**: Cowboy e-bike integration (HACS)
|
- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
|
||||||
- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
|
- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
|
||||||
|
|
||||||
|
### HACS frontend cards (plugins)
|
||||||
|
- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
|
||||||
|
|
||||||
### Integrations
|
### Integrations
|
||||||
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
|
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
|
||||||
|
- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
|
||||||
|
- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
|
||||||
|
|
||||||
### AI / Voice Assistants
|
### AI / Voice Assistants
|
||||||
- 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
|
- 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
|
||||||
|
|
@ -466,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL
|
||||||
- Anca arrival/departure notifications
|
- Anca arrival/departure notifications
|
||||||
- Night scene: turns off Livia + Michelle
|
- Night scene: turns off Livia + Michelle
|
||||||
|
|
||||||
### Docker Setup
|
### Platform (HAOS — ignore any legacy `docker run` snippet)
|
||||||
```bash
|
ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker).
|
||||||
docker run -d --name homeassistant --privileged \
|
|
||||||
-e TZ=Europe/London \
|
|
||||||
-v /home/pi/docker/homeAssistant:/config \
|
|
||||||
-v /run/dbus:/run/dbus:ro \
|
|
||||||
--network=host --restart=unless-stopped \
|
|
||||||
homeassistant/home-assistant:2025.9
|
|
||||||
```
|
|
||||||
|
|
||||||
### SSH Access
|
### SSH Access
|
||||||
```bash
|
```bash
|
||||||
|
|
|
||||||
|
|
@ -0,0 +1,97 @@
|
||||||
|
# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
|
||||||
|
|
||||||
|
> Filename kept for inbound links. The originally-suspected cause (kubeadm-config
|
||||||
|
> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
|
||||||
|
> drift was a real *separate* latent bug fixed in the same change.
|
||||||
|
|
||||||
|
**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
|
||||||
|
the master control-plane phase for the first time — preflight passed, etcd
|
||||||
|
snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the
|
||||||
|
kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
|
||||||
|
static-pod-hash window across all internal retries, then auto-rolled-back to
|
||||||
|
v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
|
||||||
|
the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
|
||||||
|
No data loss; no user-facing outage (the master carries control-plane taints, so
|
||||||
|
no workloads were displaced).
|
||||||
|
|
||||||
|
**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
|
||||||
|
first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
|
||||||
|
static pods, i.e. the first time the upgrade pushes real write-IO at etcd.
|
||||||
|
|
||||||
|
## Root cause — etcd IO starvation on the shared HDD
|
||||||
|
|
||||||
|
The new kube-apiserver could not establish/keep a working connection to etcd
|
||||||
|
during the upgrade because **etcd was IO-starved**. etcd's surviving container log
|
||||||
|
from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows:
|
||||||
|
|
||||||
|
- **1,180** `apply request took too long` warnings in 16 minutes;
|
||||||
|
- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms),
|
||||||
|
clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
|
||||||
|
to bring the new apiserver up.
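
The counts above come from etcd's own log. A sketch of how such a tally could be reproduced is shown here. It assumes etcd's JSON log format (a `msg` field equal to `apply request took too long` and a Go-style duration in `took`), optionally preceded by the container runtime's timestamp prefix on each line. The function names are hypothetical and this is not the tooling used during the incident.

```python
import json
import re
from typing import Iterable, List

SLOW_APPLY_MSG = "apply request took too long"

# Go-style duration units, converted to seconds.
_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")


def parse_go_duration(text: str) -> float:
    """Convert a Go duration string such as '4.3s' or '1m2.5s' to seconds."""
    parts = _DURATION_PART.findall(text)
    if not parts:
        raise ValueError(f"not a duration: {text!r}")
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def slow_applies(lines: Iterable[str]) -> List[float]:
    """Return the 'took' value (in seconds) of every slow-apply warning."""
    durations: List[float] = []
    for line in lines:
        start = line.find("{")  # skip any runtime prefix before the JSON payload
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict) or record.get("msg") != SLOW_APPLY_MSG:
            continue
        try:
            durations.append(parse_go_duration(str(record.get("took", ""))))
        except ValueError:
            continue
    return durations
```

Feeding the log file's lines to `slow_applies` and taking `len(...)` gives the warning count, and `sorted(..., reverse=True)[:4]` gives the worst applies.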
|
||||||
|
|
||||||
|
A reproduced 1.35.6 apiserver with no etcd dies with
|
||||||
|
`F instance.go:233 Error creating leases: error creating storage factory: context
|
||||||
|
deadline exceeded`, the same failure mode that an etcd with multi-second apply latency produces. etcd
|
||||||
|
lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
|
||||||
|
shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
|
||||||
|
that spindle:
|
||||||
|
|
||||||
|
1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
|
||||||
|
2. kubeadm dumping a full **~400MB etcd DB backup** to
|
||||||
|
`/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
|
||||||
|
etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
|
||||||
|
cleans them up), pushing master root fs to **73%**, above the 70% kubelet
|
||||||
|
image-GC threshold, so image GC churned during the drain too;
|
||||||
|
3. master-drain pod evictions.
|
||||||
|
|
||||||
|
### Correction — it was NOT the OIDC flag swap
|
||||||
|
|
||||||
|
`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
|
||||||
|
`--authentication-config` (structured multi-issuer OIDC) back to legacy
|
||||||
|
single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
|
||||||
|
was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
|
||||||
|
those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
|
||||||
|
(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test
|
||||||
|
etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
|
||||||
|
the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
|
||||||
|
were also ruled out.
|
||||||
|
|
||||||
|
## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
|
||||||
|
|
||||||
|
apiserver auth is configured in three places that must agree:
|
||||||
|
(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
|
||||||
|
+ `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest
|
||||||
|
(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM —
|
||||||
|
which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
|
||||||
|
the manifest from (3), so it would have reverted structured auth → **dashboard +
|
||||||
|
kubectl SSO break after a successful upgrade** (recoverable: the chain's
|
||||||
|
post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.
|
||||||
|
|
||||||
|
## Resolution
|
||||||
|
|
||||||
|
1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%.
|
||||||
|
2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps.
|
||||||
|
3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).
|
||||||
|
|
||||||
|
## Prevention (landed in this change)
|
||||||
|
|
||||||
|
| Gap | Fix |
|
||||||
|
|-----|-----|
|
||||||
|
| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
|
||||||
|
| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
|
||||||
|
| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
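
The pruning rule in the first row (remove `kubeadm-backup-*` and `kubeadm-upgraded-manifests*` entries older than 3 days, best-effort, never aborting) can be illustrated as follows. The real implementation lives in `upgrade-step.sh`; this Python version is only a sketch of the same rule, with a hypothetical function name and a dry-run default added for safety.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional

# Glob patterns named in the prevention table above.
SCRATCH_PATTERNS = ("kubeadm-backup-*", "kubeadm-upgraded-manifests*")


def prune_scratch(
    root: Path,
    max_age_days: float = 3.0,
    now: Optional[float] = None,
    dry_run: bool = True,
) -> List[Path]:
    """Select (and, unless dry_run, delete) scratch entries older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    selected: List[Path] = []
    for pattern in SCRATCH_PATTERNS:
        for entry in sorted(root.glob(pattern)):
            try:
                if entry.stat().st_mtime >= cutoff:
                    continue  # recent enough, keep it
                if not dry_run:
                    if entry.is_dir() and not entry.is_symlink():
                        shutil.rmtree(entry)
                    else:
                        entry.unlink()
                selected.append(entry)
            except OSError:
                # Best-effort: a failed delete must never abort the upgrade.
                continue
    return selected
```

With the path from this post-mortem, `prune_scratch(Path("/etc/kubernetes/tmp"))` only lists what would go. Passing `dry_run=False` performs the deletion.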
|
||||||
|
|
||||||
|
## Lessons
|
||||||
|
|
||||||
|
- **Capture the failing component's own logs before concluding.** The `kubeadm
|
||||||
|
upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
|
||||||
|
applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is
|
||||||
|
"what config changes," not "why it crashed."
|
||||||
|
- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
|
||||||
|
2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
|
||||||
|
backup copy + drain) onto that spindle. code-oflt is the real fix.
|
||||||
|
- **Tools that leave per-operation scratch must be reaped.** kubeadm's
|
||||||
|
`/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
|
||||||
|
GC'd; 28GB had silently accumulated.
|
||||||
|
- **Out-of-band control-plane edits must be written back to kubeadm-config** — else
|
||||||
|
`kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).
|
||||||
|
|
@ -41,6 +41,8 @@ Job 0 — preflight (pinned: k8s-node1)
|
||||||
├── halt-on-alert (kured-style ignore-list)
|
├── halt-on-alert (kured-style ignore-list)
|
||||||
├── 24h-quiet baseline (no Ready transitions <24h ago)
|
├── 24h-quiet baseline (no Ready transitions <24h ago)
|
||||||
├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume)
|
├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume)
|
||||||
|
├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block)
|
||||||
|
├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups)
|
||||||
├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
|
├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
|
||||||
├── Trigger backup-etcd Job, wait, verify snapshot byte count
|
├── Trigger backup-etcd Job, wait, verify snapshot byte count
|
||||||
├── SSH master: containerd skew fix (if master < workers)
|
├── SSH master: containerd skew fix (if master < workers)
|
||||||
|
|
@ -222,22 +224,34 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names
|
||||||
|
|
||||||
## Common Operations
|
## Common Operations
|
||||||
|
|
||||||
### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19)
|
### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24)
|
||||||
|
|
||||||
`kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
|
`kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
|
||||||
and drops the `--authentication-config` flag**, silently disabling apiserver
|
from kubeadm-config**. apiserver auth uses a structured multi-issuer
|
||||||
OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get
|
`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to
|
||||||
401). This used to require a manual re-apply after **every** control-plane bump.
|
still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade
|
||||||
|
reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does
|
||||||
|
NOT crash on this — verified by isolated repro; it's recoverable via the restore
|
||||||
|
script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue —
|
||||||
|
etcd IO starvation**, not this drift; post-mortem:
|
||||||
|
`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`.
|
||||||
|
|
||||||
**Now automated:** the `rbac` stack publishes its OIDC restore script to the
|
**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now
|
||||||
`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's
|
**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting
|
||||||
`phase_master` re-runs it on master immediately after `kubeadm upgrade apply`
|
`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of
|
||||||
(while tigera-operator is still quiesced, so the flag-add apiserver restart can't
|
its remote script. So kubeadm regenerates a **correct** manifest and the apiserver
|
||||||
crashloop the operator). It's idempotent, health-gates `/livez` with
|
upgrades with a pure image bump — `kubeadm upgrade diff <target>` shows only the
|
||||||
auto-rollback, and is **non-fatal** — a failure only lags SSO until the next rbac
|
image change. Zero live impact (the CM is read only during an upgrade).
|
||||||
apply (the version upgrade itself already succeeded). So a chain-driven
|
|
||||||
control-plane bump no longer breaks SSO. The master phase self-skips when master
|
**Backstops:**
|
||||||
is already at target, so this only runs when master was actually upgraded.
|
- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does
|
||||||
|
NOT block — the drift only breaks SSO, which is recoverable) if
|
||||||
|
`--authentication-config` would still be dropped.
|
||||||
|
- The `rbac` stack still publishes its restore script to the
|
||||||
|
`kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on
|
||||||
|
master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with
|
||||||
|
auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also*
|
||||||
|
re-reconciles kubeadm-config. Self-skips when master is already at target.
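
The drift check in the first backstop boils down to scanning the `kubeadm upgrade diff` output for a removed `--authentication-config` line that is not re-added. A minimal sketch of that test, assuming the output is a unified diff and using a hypothetical function name (the chain's real check is a shell step):

```python
from typing import Iterable

FLAG = "--authentication-config"


def drops_flag(diff_lines: Iterable[str], flag: str = FLAG) -> bool:
    """True when the diff removes `flag` and no added line restores it."""
    removed = False
    added = False
    for line in diff_lines:
        if line.startswith(("---", "+++")):
            continue  # unified-diff file headers, not content changes
        if line.startswith("-") and flag in line:
            removed = True
        elif line.startswith("+") and flag in line:
            added = True
    return removed and not added
```

A `True` result maps to the Slack WARN described above; it deliberately does not block, because the drift is recoverable.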
|
||||||
|
|
||||||
**Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the
|
**Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the
|
||||||
chain logged `WARN: --authentication-config absent after re-apply`:
|
chain logged `WARN: --authentication-config absent after re-apply`:
|
||||||
|
|
|
||||||
488
stacks/goldmane-edge-aggregator/main.tf
Normal file
|
|
@ -0,0 +1,488 @@
|
||||||
|
# =============================================================================
|
||||||
|
# goldmane-edge-aggregator — durable who-talks-to-whom audit trail (ADR-0014 / #58)
|
||||||
|
# =============================================================================
|
||||||
|
# A small Go service that streams Calico Goldmane's gRPC Flows API (mTLS) and
|
||||||
|
# upserts the unique service-to-service edge set into Postgres, plus a daily
|
||||||
|
# Slack digest CronJob of first-seen edges. Code lives in the standalone
|
||||||
|
# `goldmane-edge-aggregator` repo; the authoritative deploy spec is its
|
||||||
|
# DEPLOY.md. This stack is the infra side of that spec.
|
||||||
|
#
|
||||||
|
# Goldmane runs as `Service goldmane:7443` (gRPC/mTLS) in calico-system, enabled
|
||||||
|
# via the operator CR in stacks/calico/main.tf. The durable Loki path is NOT
|
||||||
|
# the operator CRs — this service IS the durable trail.
|
||||||
|
#
|
||||||
|
# Structure mirrors stacks/claude-memory (the canonical Tier-1 pattern): a
|
||||||
|
# per-service namespace, a CNPG Postgres DB + role + Vault 7-day rotation +
|
||||||
|
# ExternalSecret -> DATABASE_URL, the Reloader annotation, and the
|
||||||
|
# Terragrunt-generated backend.tf/providers.tf/tiers.tf layout. The novel bit is
|
||||||
|
# reusing the operator-minted, Tigera-CA-signed mTLS client cert (see section 2).
|
||||||
|
#
|
||||||
|
# IMAGE: ghcr.io/viktorbarzin/goldmane-edge-aggregator is PRIVATE. Onboarding
|
||||||
|
# MUST add the "goldmane-edge-aggregator" namespace to the ghcr-credentials
|
||||||
|
# Kyverno allowlist (stacks/kyverno/modules/kyverno/ghcr-credentials.tf,
|
||||||
|
# local.ghcr_private_namespaces) so the Kyverno-synced `ghcr-credentials` secret
|
||||||
|
# is cloned into this namespace — otherwise the pulls 401. The imagePullSecrets
|
||||||
|
# reference below assumes that entry exists.
|
||||||
|
# =============================================================================
|
||||||
|
|
||||||
|
variable "postgresql_host" { type = string }
|
||||||
|
|
||||||
|
# Plan-time root creds for the idempotent DB-init Job (mirrors claude-memory).
|
||||||
|
data "vault_kv_secret_v2" "secrets" {
|
||||||
|
mount = "secret"
|
||||||
|
name = "goldmane-edge-aggregator"
|
||||||
|
}
|
||||||
|
|
||||||
|
# -----------------------------------------------------------------------------
|
||||||
|
# 1. Namespace
|
||||||
|
# -----------------------------------------------------------------------------
|
||||||
|
resource "kubernetes_namespace" "goldmane_edge_aggregator" {
|
||||||
|
metadata {
|
||||||
|
name = "goldmane-edge-aggregator"
|
||||||
|
labels = {
|
||||||
|
name = "goldmane-edge-aggregator"
|
||||||
|
# Tier 4-aux: a small off-path consumer service, like claude-memory.
|
||||||
|
tier = local.tiers.aux
|
||||||
|
"keel.sh/enrolled" = "true"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
lifecycle {
|
||||||
|
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
|
||||||
|
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
# -----------------------------------------------------------------------------
|
||||||
|
# 2. Goldmane mTLS client certificate (reused, Tigera-CA-signed)
|
||||||
|
# -----------------------------------------------------------------------------
|
||||||
|
# The aggregator dials goldmane:7443 over mutual TLS. The client cert must be
|
||||||
|
# signed by the Tigera CA (the same CA that issues Goldmane's serving cert):
|
||||||
|
# Goldmane requires mutual TLS on :7443 and verifies the client cert chains to
|
||||||
|
# the Tigera CA — it does NOT authorize by client identity, so ANY Tigera-CA-
|
||||||
|
# signed cert is accepted. Rather than copy the Tigera CA PRIVATE KEY into TF
|
||||||
|
# state to mint our own (a needless CA-key exposure; the hashicorp/tls provider
|
||||||
|
# is also incompatible with this repo's global generate-providers/lockfile
|
||||||
|
# pattern), we REUSE the operator-minted, Tigera-CA-signed client cert
|
||||||
|
# `whisker-backend-key-pair` (calico-system). We never touch the CA key.
|
||||||
|
# Trade-off: if the operator rotates that cert, re-apply to re-sync (hardening
|
||||||
|
# follow-up: mint an own-identity cert in-namespace if Whisker is ever removed).
|
||||||
|
data "kubernetes_secret" "whisker_backend" {
|
||||||
|
metadata {
|
||||||
|
name = "whisker-backend-key-pair"
|
||||||
|
namespace = "calico-system"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
# The CA bundle that verifies Goldmane's serving cert. It lives ONLY in
|
||||||
|
# calico-system (verified: ConfigMap `tigera-ca-bundle`, 2 keys present —
|
||||||
|
# `ca-bundle.crt` AND `tigera-ca-bundle.crt`, both the trusted bundle). We read
|
||||||
|
# it and recreate it as a ConfigMap in this namespace so the pod can mount it
|
||||||
|
# (a ConfigMap cannot be cross-namespace-mounted).
|
||||||
|
data "kubernetes_config_map" "tigera_ca_bundle" {
|
||||||
|
metadata {
|
||||||
|
name = "tigera-ca-bundle"
|
||||||
|
namespace = "calico-system"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
resource "kubernetes_config_map" "tigera_ca_bundle" {
|
||||||
|
metadata {
|
||||||
|
name = "tigera-ca-bundle"
|
||||||
|
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
|
||||||
|
}
|
||||||
|
# Copy the upstream bundle verbatim. We mount the `tigera-ca-bundle.crt` key
|
||||||
|
# at /etc/tigera-ca/tigera-ca-bundle.crt so the service's default
|
||||||
|
# CA_CERT_PATH (/etc/tigera-ca/tigera-ca-bundle.crt) resolves with no override.
|
||||||
|
data = data.kubernetes_config_map.tigera_ca_bundle.data
|
||||||
|
}
|
||||||
|
|
||||||
|
# Client cert + key for mTLS to goldmane:7443, mounted at TLS_CERT_PATH /
|
||||||
|
# TLS_KEY_PATH defaults (/etc/goldmane-client-tls/tls.crt and .../tls.key).
|
||||||
|
# Sourced verbatim from the operator's whisker-backend client key-pair (read
|
||||||
|
# above) — already Tigera-CA-signed, which is all Goldmane verifies. No CA key
|
||||||
|
# is touched and no cross-namespace CA RBAC is needed.
|
||||||
|
resource "kubernetes_secret" "goldmane_client_tls" {
|
||||||
|
metadata {
|
||||||
|
name = "goldmane-client-tls"
|
||||||
|
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
|
||||||
|
}
|
||||||
|
type = "Opaque"
|
||||||
|
data = {
|
||||||
|
"tls.crt" = data.kubernetes_secret.whisker_backend.data["tls.crt"]
|
||||||
|
"tls.key" = data.kubernetes_secret.whisker_backend.data["tls.key"]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
# -----------------------------------------------------------------------------
|
||||||
|
# 3. Postgres: DB + role `goldmane_edges`, Vault 7-day rotation, DATABASE_URL
|
||||||
|
# -----------------------------------------------------------------------------
|
||||||
|
# Idempotent create of the role + DB using the CNPG root creds from Vault
|
||||||
|
# (dbaas_root_password), exactly mirroring claude-memory's db_init Job. The
|
||||||
|
# service creates the `edge` table itself at startup (migrations/0001_edge.sql),
|
||||||
|
# so no migration Job is needed.
|
||||||
|
resource "kubernetes_job" "db_init" {
|
||||||
|
metadata {
|
||||||
|
name = "goldmane-edges-db-init"
|
||||||
|
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
|
||||||
|
}
|
||||||
|
spec {
|
||||||
|
template {
|
||||||
|
metadata {}
|
||||||
|
spec {
|
||||||
|
container {
|
||||||
|
name = "db-init"
|
||||||
|
image = "postgres:16-alpine"
|
||||||
|
command = [
|
||||||
|
"sh", "-c",
|
||||||
|
<<-EOT
|
||||||
|
set -e
|
||||||
|
# -d postgres: psql defaults the database name to the username;
|
||||||
|
# the root user has no root-named database, so be explicit.
|
||||||
|
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_roles WHERE rolname='goldmane_edges'" | grep -q 1 || \
|
||||||
|
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE ROLE goldmane_edges WITH LOGIN PASSWORD '${data.vault_kv_secret_v2.secrets.data["db_password"]}'"
|
||||||
|
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_database WHERE datname='goldmane_edges'" | grep -q 1 || \
|
||||||
|
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE DATABASE goldmane_edges OWNER goldmane_edges"
|
||||||
|
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "GRANT ALL PRIVILEGES ON DATABASE goldmane_edges TO goldmane_edges"
|
||||||
|
echo "Database init complete"
|
||||||
|
EOT
|
||||||
|
]
|
||||||
|
}
|
||||||
|
restart_policy = "Never"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
backoff_limit = 3
|
||||||
|
}
|
||||||
|
wait_for_completion = true
|
||||||
|
timeouts {
|
||||||
|
create = "2m"
|
||||||
|
}
|
||||||
|
lifecycle {
|
||||||
|
# KYVERNO_LIFECYCLE_V1: Kyverno injects dns_config (ndots=2); ignore it so
|
||||||
|
# this idempotent Job isn't replaced (Jobs are immutable) on every apply.
|
||||||
|
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
# ExternalSecret projecting the Vault-rotated (7-day) credential into a K8s
|
||||||
|
# Secret as DATABASE_URL. The Vault DB static role `pg-goldmane-edges` and its
|
||||||
|
# place in the CNPG connection allowlist are added in stacks/vault/main.tf
|
||||||
|
# (see this stack's terragrunt.hcl note). remoteRef key: static-creds/pg-goldmane-edges.
|
||||||
|
resource "kubernetes_manifest" "db_external_secret" {
|
||||||
|
manifest = {
|
||||||
|
apiVersion = "external-secrets.io/v1"
|
||||||
|
kind = "ExternalSecret"
|
||||||
|
metadata = {
|
||||||
|
name = "goldmane-edges-db-creds"
|
||||||
|
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
|
||||||
|
}
|
||||||
|
spec = {
|
||||||
|
refreshInterval = "15m"
|
||||||
|
secretStoreRef = {
|
||||||
|
name = "vault-database"
|
||||||
|
kind = "ClusterSecretStore"
|
||||||
|
}
|
||||||
|
target = {
|
||||||
|
name = "goldmane-edges-db-creds"
|
||||||
|
template = {
|
||||||
|
data = {
|
||||||
|
DATABASE_URL = "postgresql://goldmane_edges:{{ .password }}@${var.postgresql_host}:5432/goldmane_edges"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
data = [{
|
||||||
|
secretKey = "password"
|
||||||
|
remoteRef = {
|
||||||
|
key = "static-creds/pg-goldmane-edges"
|
||||||
|
property = "password"
|
||||||
|
}
|
||||||
|
}]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
depends_on = [kubernetes_namespace.goldmane_edge_aggregator]
|
||||||
|
}
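A minimal sketch of what the `DATABASE_URL` template above produces, and why the rotated password's character set matters. It assumes the consumer parses the URL with standard URL rules; the host is the one named in this stack's egress note, and `build_url` plus the sample password are hypothetical, not part of the stack. The template interpolates `{{ .password }}` verbatim, so a password containing `@`, `/` or `:` would only parse correctly if it were percent-encoded.

```python
from urllib.parse import quote, unquote, urlsplit

HOST = "pg-cluster-rw.dbaas.svc.cluster.local"  # from the egress note in this stack


def build_url(password: str, encode: bool) -> str:
    """Mimic the ExternalSecret template (hypothetical helper, illustration only)."""
    pw = quote(password, safe="") if encode else password
    return f"postgresql://goldmane_edges:{pw}@{HOST}:5432/goldmane_edges"


raw = "p@ss/w:rd"  # made-up password with URL-reserved characters

encoded = urlsplit(build_url(raw, encode=True))
naive = urlsplit(build_url(raw, encode=False))

# Percent-encoded: host, port, database and password all survive parsing.
print(encoded.hostname, encoded.port, encoded.path, unquote(encoded.password))

# Verbatim interpolation: the parser splits at the first "/" and the host is lost.
print(naive.hostname)
```

If the Vault password policy only emits URL-safe characters, the verbatim template is fine as written; the sketch only shows what to check if that policy ever changes.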
|
||||||
|
|
||||||
|
# -----------------------------------------------------------------------------
|
||||||
|
# 4. Slack webhook (reuse the alert-digest incoming webhook)
|
||||||
|
# -----------------------------------------------------------------------------
|
||||||
|
# The monitoring alert-digest CronJob posts with the Slack incoming webhook at
|
||||||
|
# Vault secret/monitoring -> key `alertmanager_slack_api_url`
|
||||||
|
# (stacks/monitoring/modules/monitoring/alert_digest.tf). Project that same URL
|
||||||
|
# into this namespace as SLACK_WEBHOOK_URL via an ExternalSecret (no new
|
||||||
|
# webhook). The digest CronJob defaults to #security.
|
||||||
|
resource "kubernetes_manifest" "slack_external_secret" {
|
||||||
|
manifest = {
|
||||||
|
apiVersion = "external-secrets.io/v1"
|
||||||
|
kind = "ExternalSecret"
|
||||||
|
metadata = {
|
||||||
|
name = "goldmane-edges-slack"
|
||||||
|
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
|
||||||
|
}
|
||||||
|
spec = {
|
||||||
|
refreshInterval = "1h"
|
||||||
|
secretStoreRef = {
|
||||||
|
name = "vault-kv"
|
||||||
|
kind = "ClusterSecretStore"
|
||||||
|
}
|
||||||
|
target = {
|
||||||
|
name = "goldmane-edges-slack"
|
||||||
|
}
|
||||||
|
data = [{
|
||||||
|
secretKey = "SLACK_WEBHOOK_URL"
|
||||||
|
remoteRef = {
|
||||||
|
key = "viktor"
|
||||||
|
property = "alertmanager_slack_api_url"
|
||||||
|
}
|
||||||
|
}]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
depends_on = [kubernetes_namespace.goldmane_edge_aggregator]
|
||||||
|
}
|
||||||
|
|
||||||
|
# -----------------------------------------------------------------------------
|
||||||
|
# 5. aggregate — Deployment (long-running gRPC stream -> Postgres upserts)
|
||||||
|
# -----------------------------------------------------------------------------
|
||||||
|
resource "kubernetes_deployment" "aggregate" {
|
||||||
|
depends_on = [
|
||||||
|
kubernetes_job.db_init,
|
||||||
|
kubernetes_manifest.db_external_secret,
|
||||||
|
]
|
||||||
|
metadata {
|
||||||
|
name = "goldmane-edge-aggregator"
|
||||||
|
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
|
||||||
|
labels = {
|
||||||
|
app = "goldmane-edge-aggregator"
|
||||||
|
tier = local.tiers.aux
|
||||||
|
}
|
||||||
|
annotations = {
|
||||||
|
# Credential is env-injected and read only at startup; the 7-day rotation
|
||||||
|
# must bounce the pod or it keeps the stale password and silently fails
|
||||||
|
# DB auth (infra CLAUDE.md Reloader rule).
|
||||||
|
"secret.reloader.stakater.com/reload" = "goldmane-edges-db-creds"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
spec {
|
||||||
|
# 1 replica: the edge set is a global upsert keyed on (src_ns, dst_ns,
|
||||||
|
# action); a second replica only doubles writes for no benefit (Goldmane
|
||||||
|
# streams per-flow). Stateless (no PVC) so RollingUpdate is fine.
|
||||||
|
replicas = 1
|
||||||
|
selector {
|
||||||
|
match_labels = {
|
||||||
|
app = "goldmane-edge-aggregator"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
template {
|
||||||
|
metadata {
|
||||||
|
labels = {
|
||||||
|
app = "goldmane-edge-aggregator"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
spec {
|
||||||
|
# PRIVATE ghcr image — cloned into this namespace by the Kyverno
|
||||||
|
# sync-ghcr-credentials allowlist policy (add this ns to that list).
|
||||||
|
image_pull_secrets {
|
||||||
|
name = "ghcr-credentials"
|
||||||
|
}
|
||||||
|
container {
|
||||||
|
name = "aggregate"
|
||||||
|
# CI (GHA -> ghcr) overwrites this to :<sha8> via `kubectl set image`;
|
||||||
|
# the image tag is in ignore_changes below so the SHA sticks across
|
||||||
|
# `terragrunt apply` (fleet image-pin convention). Placeholder :latest
|
||||||
|
# until the deploy pipeline runs.
|
||||||
|
image = "ghcr.io/viktorbarzin/goldmane-edge-aggregator:latest"
|
||||||
|
args = ["aggregate"]
|
||||||
|
|
||||||
|
# Goldmane mTLS. With no override, GOLDMANE_HOST minus its port yields
|
||||||
|
# ServerName "goldmane.calico-system.svc.cluster.local", which is a SAN
|
||||||
|
# on the live Goldmane serving cert (verified 2026-06-24:
|
||||||
|
# DNS:goldmane{,.calico-system{,.svc{,.cluster.local}}}). So no
|
||||||
|
# GOLDMANE_SERVER_NAME override and no GOLDMANE_TLS_INSECURE needed.
|
||||||
|
env {
|
||||||
|
name = "GOLDMANE_HOST"
|
||||||
|
value = "goldmane.calico-system.svc.cluster.local:7443"
|
||||||
|
}
|
||||||
|
# TLS_CERT_PATH / TLS_KEY_PATH / CA_CERT_PATH are left at their image
|
||||||
|
# defaults (/etc/goldmane-client-tls/tls.{crt,key} and
|
||||||
|
# /etc/tigera-ca/tigera-ca-bundle.crt) — the mounts below match them.
|
||||||
|
|
||||||
|
env {
|
||||||
|
name = "DATABASE_URL"
|
||||||
|
value_from {
|
||||||
|
secret_key_ref {
|
||||||
|
name = "goldmane-edges-db-creds"
|
||||||
|
key = "DATABASE_URL"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
volume_mount {
|
||||||
|
name = "goldmane-client-tls"
|
||||||
|
mount_path = "/etc/goldmane-client-tls"
|
||||||
|
read_only = true
|
||||||
|
}
|
||||||
|
volume_mount {
|
||||||
|
name = "tigera-ca"
|
||||||
|
mount_path = "/etc/tigera-ca"
|
||||||
|
read_only = true
|
||||||
|
}
|
||||||
|
|
||||||
|
resources {
|
||||||
|
# Idles low: a single gRPC stream + periodic upserts. requests=limits
|
||||||
|
# per the repo memory rule; no CPU limit (CFS throttling). Right-size
|
||||||
|
# later with krr.
|
||||||
|
requests = {
|
||||||
|
cpu = "10m"
|
||||||
|
memory = "64Mi"
|
||||||
|
}
|
||||||
|
limits = {
|
||||||
|
memory = "64Mi"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
volume {
|
||||||
|
name = "goldmane-client-tls"
|
||||||
|
secret {
|
||||||
|
secret_name = kubernetes_secret.goldmane_client_tls.metadata[0].name
|
||||||
|
}
|
||||||
|
}
|
||||||
|
volume {
|
||||||
|
name = "tigera-ca"
|
||||||
|
config_map {
|
||||||
|
name = kubernetes_config_map.tigera_ca_bundle.metadata[0].name
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
lifecycle {
|
||||||
|
ignore_changes = [
|
||||||
|
# CI pipeline owns the image tag (kubectl set image from GHA/Woodpecker).
|
||||||
|
spec[0].template[0].spec[0].container[0].image,
|
||||||
|
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
||||||
|
metadata[0].annotations["keel.sh/policy"],
|
||||||
|
metadata[0].annotations["keel.sh/trigger"],
|
||||||
|
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
|
||||||
|
metadata[0].annotations["keel.sh/match-tag"],
|
||||||
|
metadata[0].annotations["kubernetes.io/change-cause"],
|
||||||
|
metadata[0].annotations["deployment.kubernetes.io/revision"],
|
||||||
|
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
|
||||||
|
]
|
||||||
|
}
|
||||||
|
}
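The single-replica comment above can be illustrated with a small sketch of an upsert keyed on `(src_ns, dst_ns, action)`. Everything here is an assumption for illustration: the real table lives in Postgres and its schema is not shown in this diff, so the sketch uses an in-memory SQLite table with hypothetical `first_seen` / `last_seen` columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE edges (
        src_ns     TEXT NOT NULL,
        dst_ns     TEXT NOT NULL,
        action     TEXT NOT NULL,
        first_seen INTEGER NOT NULL,
        last_seen  INTEGER NOT NULL,
        PRIMARY KEY (src_ns, dst_ns, action)
    )
    """
)


def upsert_edge(src_ns: str, dst_ns: str, action: str, ts: int) -> None:
    """Insert a new edge, or only advance last_seen if the edge already exists."""
    conn.execute(
        """
        INSERT INTO edges (src_ns, dst_ns, action, first_seen, last_seen)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (src_ns, dst_ns, action)
        DO UPDATE SET last_seen = MAX(last_seen, excluded.last_seen)
        """,
        (src_ns, dst_ns, action, ts, ts),
    )


# One replica sees the same flow twice...
upsert_edge("monitoring", "dbaas", "Allow", 100)
upsert_edge("monitoring", "dbaas", "Allow", 200)
# ...and a hypothetical second replica replays the same flows.
upsert_edge("monitoring", "dbaas", "Allow", 100)
upsert_edge("monitoring", "dbaas", "Allow", 200)

rows = conn.execute("SELECT * FROM edges").fetchall()
print(rows)
```

The second replica's writes leave the table unchanged, which is the sense in which an extra replica would only double the write load.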
|
||||||
|
|
||||||
|
# -----------------------------------------------------------------------------
|
||||||
|
# 6. digest — daily CronJob (first-seen edges -> Slack)
|
||||||
|
# -----------------------------------------------------------------------------
|
||||||
|
resource "kubernetes_cron_job_v1" "digest" {
|
||||||
|
depends_on = [
|
||||||
|
kubernetes_job.db_init,
|
||||||
|
kubernetes_manifest.db_external_secret,
|
||||||
|
kubernetes_manifest.slack_external_secret,
|
||||||
|
]
|
||||||
|
metadata {
|
||||||
|
name = "goldmane-edges-digest"
|
||||||
|
namespace = kubernetes_namespace.goldmane_edge_aggregator.metadata[0].name
|
||||||
|
labels = {
|
||||||
|
app = "goldmane-edge-aggregator"
|
||||||
|
tier = local.tiers.aux
|
||||||
|
}
|
||||||
|
}
|
||||||
|
spec {
|
||||||
|
# Daily 08:00 Europe/London — aligns with the alert-digest cadence.
|
||||||
|
schedule = "0 8 * * *"
|
||||||
|
timezone = "Europe/London"
|
||||||
|
concurrency_policy = "Forbid"
|
||||||
|
successful_jobs_history_limit = 3
|
||||||
|
failed_jobs_history_limit = 3
|
||||||
|
starting_deadline_seconds = 600
|
||||||
|
|
||||||
|
job_template {
|
||||||
|
metadata {
|
||||||
|
labels = {
|
||||||
|
app = "goldmane-edge-aggregator"
|
||||||
|
}
|
||||||
|
annotations = {
|
||||||
|
# 7-day DB rotation: bounce the Job pod's stale env (Reloader rule).
|
||||||
|
"secret.reloader.stakater.com/reload" = "goldmane-edges-db-creds"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
spec {
|
||||||
|
backoff_limit = 2
|
||||||
|
active_deadline_seconds = 300
|
||||||
|
ttl_seconds_after_finished = 86400
|
||||||
|
|
||||||
|
template {
|
||||||
|
metadata {
|
||||||
|
labels = {
|
||||||
|
app = "goldmane-edge-aggregator"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
spec {
|
||||||
|
restart_policy = "OnFailure"
|
||||||
|
image_pull_secrets {
|
||||||
|
name = "ghcr-credentials"
|
||||||
|
}
|
||||||
|
container {
|
||||||
|
name = "digest"
|
||||||
|
# CronJobs track :latest + imagePullPolicy: Always (fleet
|
||||||
|
# convention) so the daily run picks up the current image.
|
||||||
|
image = "ghcr.io/viktorbarzin/goldmane-edge-aggregator:latest"
|
||||||
|
image_pull_policy = "Always"
|
||||||
|
args = ["digest"]
|
||||||
|
|
||||||
|
env {
|
||||||
|
name = "DATABASE_URL"
|
||||||
|
value_from {
|
||||||
|
secret_key_ref {
|
||||||
|
name = "goldmane-edges-db-creds"
|
||||||
|
key = "DATABASE_URL"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
env {
|
||||||
|
name = "SLACK_WEBHOOK_URL"
|
||||||
|
value_from {
|
||||||
|
secret_key_ref {
|
||||||
|
name = "goldmane-edges-slack"
|
||||||
|
key = "SLACK_WEBHOOK_URL"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
env {
|
||||||
|
name = "SLACK_CHANNEL"
|
||||||
|
value = "#security"
|
||||||
|
}
|
||||||
|
|
||||||
|
resources {
|
||||||
|
requests = {
|
||||||
|
cpu = "10m"
|
||||||
|
memory = "64Mi"
|
||||||
|
}
|
||||||
|
limits = {
|
||||||
|
memory = "64Mi"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
lifecycle {
|
||||||
|
# KYVERNO_LIFECYCLE_V1 (CronJob path): Kyverno mutates dns_config with ndots=2.
|
||||||
|
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
# -----------------------------------------------------------------------------
|
||||||
|
# 7. Egress (default-deny consideration)
|
||||||
|
# -----------------------------------------------------------------------------
|
||||||
|
# Goldmane's own NetworkPolicy already allows INGRESS on 7443 from anywhere, so
|
||||||
|
# nothing is needed on the Goldmane side. No egress policy is declared here:
|
||||||
|
# this namespace is default-allow egress today. IF/WHEN it is brought under the
|
||||||
|
# wave-1 default-deny egress enforcement (per-namespace allowlists), add
|
||||||
|
# (Global)NetworkPolicy egress rules permitting:
|
||||||
|
# - goldmane.calico-system.svc.cluster.local:7443 (the flow stream)
|
||||||
|
# - pg-cluster-rw.dbaas.svc.cluster.local:5432 (Postgres)
|
||||||
|
# - hooks.slack.com:443 (digest -> Slack, internet)
|
||||||
|
# - kube-dns / CoreDNS :53 (DNS, every namespace)
|
||||||
24
stacks/goldmane-edge-aggregator/terragrunt.hcl
Normal file
|
|
@ -0,0 +1,24 @@
|
||||||
|
include "root" {
|
||||||
|
path = find_in_parent_folders()
|
||||||
|
}
|
||||||
|
|
||||||
|
# Tier-1 stack (PG state backend). The root terragrunt.hcl generates backend.tf
|
||||||
|
# (pg backend, schema_name = "goldmane-edge-aggregator"), providers.tf,
|
||||||
|
# cloudflare_provider.tf and tiers.tf automatically — do NOT hand-write those.
|
||||||
|
# This stack adds the hashicorp/tls provider via a local versions.tf (merged
|
||||||
|
# into the generated required_providers).
|
||||||
|
|
||||||
|
dependency "platform" {
|
||||||
|
config_path = "../platform"
|
||||||
|
skip_outputs = true
|
||||||
|
}
|
||||||
|
|
||||||
|
dependency "vault" {
|
||||||
|
config_path = "../vault"
|
||||||
|
skip_outputs = true
|
||||||
|
}
|
||||||
|
|
||||||
|
# The Vault DB static role pg-goldmane-edges (7-day rotation) and the CNPG
|
||||||
|
# connection allowlist entry live in the vault stack (stacks/vault/main.tf).
|
||||||
|
# The vault dependency above orders this stack after it so the ExternalSecret
|
||||||
|
# can materialize the rotated credential on first apply.
|
||||||
|
|
@ -35,6 +35,14 @@ resource "kubernetes_namespace" "instagram_poster" {
|
||||||
# - immich_tag_instagram (optional — auto-resolved if missing)
|
# - immich_tag_instagram (optional — auto-resolved if missing)
|
||||||
# - immich_tag_posted (optional — auto-resolved if missing)
|
# - immich_tag_posted (optional — auto-resolved if missing)
|
||||||
resource "kubernetes_manifest" "external_secret" {
|
resource "kubernetes_manifest" "external_secret" {
|
||||||
|
# The external-secrets controller takes server-side-apply ownership of
|
||||||
|
# .spec.refreshInterval, so a plain TF apply conflicts. force_conflicts lets
|
||||||
|
# TF win (values match, so it's stable) — same pattern as grafana/woodpecker/
|
||||||
|
# traefik/k8s-version-upgrade. Surfaced 2026-06-24 by the first IG apply since
|
||||||
|
# the ESO v1 migration (the scale-to-0 push).
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -139,6 +147,11 @@ resource "kubernetes_manifest" "external_secret" {
|
||||||
# ESO refreshes the K8s Secret every 15m. `reloader.stakater.com/match`
|
# ESO refreshes the K8s Secret every 15m. `reloader.stakater.com/match`
|
||||||
# bounces the pod when the password changes.
|
# bounces the pod when the password changes.
|
||||||
resource "kubernetes_manifest" "benchmark_db_external_secret" {
|
resource "kubernetes_manifest" "benchmark_db_external_secret" {
|
||||||
|
# See external_secret above — ESO owns .spec.refreshInterval; force_conflicts
|
||||||
|
# lets the TF apply win instead of erroring on the field-manager conflict.
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
@ -227,7 +240,11 @@ resource "kubernetes_deployment" "instagram_poster" {
|
||||||
}
|
}
|
||||||
|
|
||||||
spec {
|
spec {
|
||||||
-replicas = 1
+# Scaled to 0 (2026-06-24): Instagram Graph integration is unused and its
+# ExternalSecret is dead (missing ig_graph_long_lived_token /
+# ig_business_account_id in Vault secret/instagram-poster). Set back to 1
+# after minting a Meta long-lived token and populating those keys.
+replicas = 0
|
||||||
# RWO PVC — cannot rolling-update.
|
# RWO PVC — cannot rolling-update.
|
||||||
strategy {
|
strategy {
|
||||||
type = "Recreate"
|
type = "Recreate"
|
||||||
|
|
|
||||||
|
|
@ -416,6 +416,39 @@ phase_preflight() {
|
||||||
fi
|
fi
|
||||||
fi
|
fi
|
||||||
|
|
||||||
|
# 4b. apiserver-OIDC drift check (backstop for the rbac stack's kubeadm-config
|
||||||
|
# reconciliation). A `kubeadm upgrade` REGENERATES the apiserver manifest from
|
||||||
|
# kubeadm-config; if kubeadm-config still carries the legacy single-issuer
|
||||||
|
# --oidc-* args instead of --authentication-config, the regenerated apiserver
|
||||||
|
# loses structured multi-issuer auth → kubectl + dashboard SSO break AFTER the
|
||||||
|
# upgrade. This is RECOVERABLE (the apiserver does NOT crash — verified by an
|
||||||
|
# isolated repro 2026-06-24; the chain's post-master restore.sh re-adds the flag,
|
||||||
|
# and the rbac stack reconciles kubeadm-config so it won't recur) — so this is an
|
||||||
|
# ALERT, not a block. (NB the 2026-06-24 stall was NOT this — it was etcd IO
|
||||||
|
# starvation; see docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md.)
|
||||||
|
# Skip on an at-target master (resume — no apiserver regen).
|
||||||
|
if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
|
||||||
|
local apiserver_diff
|
||||||
|
apiserver_diff=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" "sudo kubeadm upgrade diff v$TARGET_VERSION 2>/dev/null" || true)
|
||||||
|
if echo "$apiserver_diff" | grep -qE '^-[[:space:]].*--authentication-config'; then
|
||||||
|
slack "WARN preflight — kubeadm upgrade will DROP --authentication-config (kubeadm-config OIDC drift). SSO breaks post-upgrade until restore.sh re-adds it; re-apply the rbac stack to reconcile kubeadm-config. Proceeding (recoverable, not a crash)."
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
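The drift check in step 4b comes down to one pattern match over the `kubeadm upgrade diff` output. A minimal sketch of the same test follows, assuming the output is a unified diff; the sample diffs are hand-written for illustration and are not captured kubeadm output, and `drops_authn_flag` is a hypothetical name.

```python
import re

# Same idea as the grep in step 4b: a removed line ("-" followed by whitespace)
# that mentions --authentication-config.
DROP_RE = re.compile(r"^-[ \t].*--authentication-config", re.MULTILINE)


def drops_authn_flag(diff_text: str) -> bool:
    """Return True if the diff removes a line carrying --authentication-config."""
    return DROP_RE.search(diff_text) is not None


SAMPLE_DRIFT = "\n".join([
    "--- /etc/kubernetes/manifests/kube-apiserver.yaml",
    "+++ /etc/kubernetes/manifests/kube-apiserver.yaml",
    "     - --authorization-mode=Node,RBAC",
    "-    - --authentication-config=/etc/kubernetes/pki/auth-config.yaml",
    "+    - --oidc-issuer-url=https://issuer.example",
])

SAMPLE_CLEAN = "\n".join([
    "--- /etc/kubernetes/manifests/kube-apiserver.yaml",
    "+++ /etc/kubernetes/manifests/kube-apiserver.yaml",
    "     - --authentication-config=/etc/kubernetes/pki/auth-config.yaml",
    "-    image: registry.k8s.io/kube-apiserver:old",
    "+    image: registry.k8s.io/kube-apiserver:new",
])

print(drops_authn_flag(SAMPLE_DRIFT), drops_authn_flag(SAMPLE_CLEAN))
```

Requiring whitespace after the leading `-` keeps the `---` file header from matching, and a flag that appears only on a context line does not raise the alert.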
||||||
|
# 4c. Reclaim kubeadm scratch on master. `kubeadm upgrade apply` dumps a full
|
||||||
|
# ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
|
||||||
|
# every etcd upgrade and NEVER cleans it up — 145 dirs / 28GB had accumulated by
|
||||||
|
# 2026-06-24, pushing master root fs to 73% (image-GC churn + extra write IO on
|
||||||
|
# the shared HDD where etcd lives — a contributor to the etcd IO starvation that
|
||||||
|
# stalled that run, see post-mortem). Real etcd backups go to NFS, so these are
|
||||||
|
# throwaway. Prune ones >3 days old (keeps a short rollback window). Best-effort;
|
||||||
|
# never aborts the chain.
|
||||||
|
if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
|
||||||
|
ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" \
|
||||||
|
"sudo find /etc/kubernetes/tmp -maxdepth 1 -type d \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) -mtime +3 -exec rm -rf {} + 2>/dev/null; echo -n 'master root after prune: '; df -h / | awk 'NR==2{print \$5\" used, \"\$4\" free\"}'" \
|
||||||
|
|| echo "kubeadm-scratch prune skipped (ssh/df failed) — non-fatal"
|
||||||
|
fi
|
||||||
|
|
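The prune rule in step 4c can be sketched as a pure function. This is an illustration of the selection logic only, assuming GNU `find` semantics for `-mtime +3` (the age is truncated to whole days before comparing, so a directory must be at least four full days old to match); `is_prunable` and its parameters are hypothetical names.

```python
from fnmatch import fnmatchcase

# Directory name patterns taken from the find command in step 4c.
PATTERNS = ("kubeadm-backup-*", "kubeadm-upgraded-manifests*")
DAY = 86400


def is_prunable(name: str, age_seconds: float, keep_days: int = 3) -> bool:
    """Mirror `find -name ... -mtime +keep_days` for one directory entry."""
    name_matches = any(fnmatchcase(name, p) for p in PATTERNS)
    whole_days = int(age_seconds // DAY)  # find drops the fractional day
    return name_matches and whole_days > keep_days


print(is_prunable("kubeadm-backup-etcd-2026-06-20", 3.9 * DAY))
print(is_prunable("kubeadm-backup-etcd-2026-06-20", 4.0 * DAY))
print(is_prunable("some-other-dir", 30 * DAY))
```

In practice the "more than 3 days" window in the comment therefore behaves as "4 days or older", which slightly widens the rollback window rather than narrowing it.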
||||||
# 5. Push in-flight + started_timestamp metrics + ns annotations
|
# 5. Push in-flight + started_timestamp metrics + ns annotations
|
||||||
$KUBECTL annotate ns "$NS" \
|
$KUBECTL annotate ns "$NS" \
|
||||||
"viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \
|
"viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \
|
||||||
|
|
|
||||||
|
|
@ -31,6 +31,9 @@ locals {
|
||||||
# "no local builds"). ghcr.io/viktorbarzin/k8s-portal:latest is PRIVATE
|
# "no local builds"). ghcr.io/viktorbarzin/k8s-portal:latest is PRIVATE
|
||||||
# (infra repo default); the deployment references the cloned secret.
|
# (infra repo default); the deployment references the cloned secret.
|
||||||
"k8s-portal",
|
"k8s-portal",
|
||||||
|
# goldmane-edge-aggregator: PRIVATE ghcr image pulled by the aggregate
|
||||||
|
# Deployment + digest CronJob (ADR-0014, infra#58).
|
||||||
|
"goldmane-edge-aggregator",
|
||||||
]
|
]
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -71,6 +71,15 @@ resource "kubernetes_persistent_volume" "alertmanager_pv" {
|
||||||
# DB credentials from Vault database engine (rotated automatically)
|
# DB credentials from Vault database engine (rotated automatically)
|
||||||
# Provides GF_DATABASE_PASSWORD that auto-updates when password rotates
|
# Provides GF_DATABASE_PASSWORD that auto-updates when password rotates
|
||||||
resource "kubernetes_manifest" "grafana_db_creds" {
|
resource "kubernetes_manifest" "grafana_db_creds" {
|
||||||
|
# The external-secrets controller takes server-side-apply ownership of
|
||||||
|
# .spec.refreshInterval, so a plain TF apply conflicts ("conflict with
|
||||||
|
# external-secrets ... .spec.refreshInterval"). force_conflicts lets TF win
|
||||||
|
# (values match, so it's stable) — same pattern as the woodpecker/traefik/
|
||||||
|
# k8s-version-upgrade stacks. Surfaced 2026-06-24: the first monitoring apply
|
||||||
|
# in a while exposed this latent conflict (prior pushes were docs-only).
|
||||||
|
field_manager {
|
||||||
|
force_conflicts = true
|
||||||
|
}
|
||||||
manifest = {
|
manifest = {
|
||||||
apiVersion = "external-secrets.io/v1"
|
apiVersion = "external-secrets.io/v1"
|
||||||
kind = "ExternalSecret"
|
kind = "ExternalSecret"
|
||||||
|
|
|
||||||
|
|
@ -10,16 +10,29 @@
|
||||||
# match the existing RBAC subjects (kind: User, name: <raw email>; group names
|
# match the existing RBAC subjects (kind: User, name: <raw email>; group names
|
||||||
# verbatim). Do NOT add a prefix or existing bindings break.
|
# verbatim). Do NOT add a prefix or existing bindings break.
|
||||||
#
|
#
|
||||||
-# DRIFT WARNING: this edits the kube-apiserver static-pod manifest on the single
-# master. A `kubeadm upgrade` regenerates that manifest and DROPS this flag (this
-# is exactly how OIDC silently broke before — the flag was wiped and the
-# content-hash trigger never re-fired). After any k8s control-plane upgrade,
-# re-apply the rbac stack to restore apiserver OIDC. See
-# docs/plans/2026-06-04-k8s-dashboard-sso-design.md.
+# DRIFT WARNING (and how it's now handled): apiserver auth lives in THREE places
+# that must stay in sync, because a `kubeadm upgrade` REGENERATES the static-pod
+# manifest from kubeadm-config:
+# 1. /etc/kubernetes/pki/auth-config.yaml — the structured authn file
+# 2. the live kube-apiserver static-pod manifest — references it via the flag
+# 3. the kubeadm-config ClusterConfiguration CM — what kubeadm regenerates from
+# Originally only (1)+(2) were managed, so every kubeadm upgrade rewrote the
+# manifest from the STALE CM, reverting --authentication-config to single-issuer
+# --oidc-* flags. The consequence is SSO breakage AFTER the upgrade: kubectl +
+# dashboard lose multi-issuer auth (the apiserver does NOT crash on this — verified
+# by an isolated repro 2026-06-24; the 2026-06-24 v1.35 upgrade *stall* was a
+# separate etcd IO-starvation issue, see
+# docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md). The
+# remote script below now ALSO reconciles (3) via `kubeadm init phase
+# upload-config`, so a future kubeadm upgrade regenerates a CORRECT manifest. The
+# k8s-version-upgrade chain additionally ALERTS (does not block — SSO drift is
+# recoverable) via `kubeadm upgrade diff` in preflight if --authentication-config
+# would still be dropped.
|
||||||
#
|
#
|
||||||
# SAFETY: the remote script health-gates on /livez and AUTO-ROLLS-BACK the
|
# SAFETY: the remote script health-gates on /livez and AUTO-ROLLS-BACK the
|
||||||
# manifest from a timestamped backup if the apiserver does not recover, so a
|
# manifest from a timestamped backup if the apiserver does not recover, so a
|
||||||
-# malformed config cannot leave the single master down.
+# malformed config cannot leave the single master down. Reconciling kubeadm-config
+# is zero-impact on the running cluster (the CM is only read during an upgrade).
|
||||||
|
|
||||||
variable "k8s_master_host" {
|
variable "k8s_master_host" {
|
||||||
type = string
|
type = string
|
||||||
|
|
@ -97,6 +110,40 @@ locals {
|
||||||
print('flag-inserted' if done else 'ANCHOR-NOT-FOUND')
|
print('flag-inserted' if done else 'ANCHOR-NOT-FOUND')
|
||||||
PY
|
PY
|
||||||
|
|
||||||
|
# Reconciles the kubeadm-config ClusterConfiguration's apiServer.extraArgs:
|
||||||
|
# drops the stale single-issuer --oidc-* args and ensures --authentication-config
|
||||||
|
# is present (anchored after --authorization-mode). Stdlib-only (the master is
|
||||||
|
# only guaranteed python3, not pyyaml/yq). Idempotent; preserves all other
|
||||||
|
# fields (etcd args, audit args, extraVolumes) verbatim. Exits 3 if the
|
||||||
|
# authorization-mode anchor is missing (fail loud, leave the CM untouched).
|
||||||
|
kubeadm_oidc_reconcile_py = <<-PY
|
||||||
|
import sys
|
||||||
|
lines = sys.stdin.read().split('\n')
|
||||||
|
out, i, n = [], 0, len(lines)
|
||||||
|
have_authn = any('name: authentication-config' in l for l in lines)
|
||||||
|
inserted = have_authn
|
||||||
|
while i < n:
|
||||||
|
ln = lines[i]; s = ln.strip()
|
||||||
|
if s.startswith('- name: oidc-'):
|
||||||
|
i += 2 if (i + 1 < n and lines[i + 1].strip().startswith('value:')) else 1
|
||||||
|
continue
|
||||||
|
out.append(ln)
|
||||||
|
if (not inserted) and s == '- name: authorization-mode':
|
||||||
|
indent = ln[:len(ln) - len(ln.lstrip())]
|
||||||
|
if i + 1 < n and lines[i + 1].strip().startswith('value:'):
|
||||||
|
out.append(lines[i + 1]); i += 2
|
||||||
|
else:
|
||||||
|
i += 1
|
||||||
|
out.append(indent + '- name: authentication-config')
|
||||||
|
out.append(indent + ' value: /etc/kubernetes/pki/auth-config.yaml')
|
||||||
|
inserted = True
|
||||||
|
continue
|
||||||
|
i += 1
|
||||||
|
if not inserted:
|
||||||
|
sys.stderr.write('ANCHOR-NOT-FOUND: authorization-mode\n'); sys.exit(3)
|
||||||
|
sys.stdout.write('\n'.join(out))
|
||||||
|
PY
|
||||||
|
|
||||||
# Whole remote operation, base64-embedded for byte-exact transfer (no
|
# Whole remote operation, base64-embedded for byte-exact transfer (no
|
||||||
# heredoc/escaping hazards across SSH).
|
# heredoc/escaping hazards across SSH).
|
||||||
apiserver_auth_remote_script = <<-SH
|
apiserver_auth_remote_script = <<-SH
|
||||||
|
|
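A worked example of the `kubeadm_oidc_reconcile_py` logic added in the hunk above may help when reviewing it. The function below is a reconstruction of that script wrapped as `reconcile` (a hypothetical name), with indentation restored by hand because this view strips it; it raises `ValueError` where the original exits with status 3. The sample ClusterConfiguration fragment is made up for illustration.

```python
def reconcile(cluster_config: str) -> str:
    """Drop oidc-* extraArgs and ensure authentication-config follows authorization-mode."""
    lines = cluster_config.split("\n")
    out, i, n = [], 0, len(lines)
    # If the flag is already present, nothing needs to be inserted.
    inserted = any("name: authentication-config" in l for l in lines)
    while i < n:
        ln = lines[i]
        s = ln.strip()
        if s.startswith("- name: oidc-"):
            # Skip the name line and its value line, if there is one.
            has_value = i + 1 < n and lines[i + 1].strip().startswith("value:")
            i += 2 if has_value else 1
            continue
        out.append(ln)
        if not inserted and s == "- name: authorization-mode":
            indent = ln[: len(ln) - len(ln.lstrip())]
            if i + 1 < n and lines[i + 1].strip().startswith("value:"):
                out.append(lines[i + 1])
                i += 2
            else:
                i += 1
            out.append(indent + "- name: authentication-config")
            out.append(indent + "  value: /etc/kubernetes/pki/auth-config.yaml")
            inserted = True
            continue
        i += 1
    if not inserted:
        raise ValueError("ANCHOR-NOT-FOUND: authorization-mode")
    return "\n".join(out)


SAMPLE = "\n".join([
    "apiServer:",
    "  extraArgs:",
    "  - name: authorization-mode",
    "    value: Node,RBAC",
    "  - name: oidc-issuer-url",
    "    value: https://issuer.example",
    "  - name: audit-log-path",
    "    value: /var/log/audit.log",
])

EXPECTED = "\n".join([
    "apiServer:",
    "  extraArgs:",
    "  - name: authorization-mode",
    "    value: Node,RBAC",
    "  - name: authentication-config",
    "    value: /etc/kubernetes/pki/auth-config.yaml",
    "  - name: audit-log-path",
    "    value: /var/log/audit.log",
])

print(reconcile(SAMPLE))
```

Running `reconcile` on its own output returns the input unchanged, which matches the idempotency claim in the comment, and unrelated entries such as the audit argument pass through verbatim.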
@ -137,6 +184,30 @@ locals {
|
||||||
echo "rolled back to previous manifest"; exit 1
|
echo "rolled back to previous manifest"; exit 1
|
||||||
fi
|
fi
|
||||||
echo "kube-apiserver healthy with multi-issuer --authentication-config"
|
echo "kube-apiserver healthy with multi-issuer --authentication-config"
|
||||||
|
|
||||||
|
# 5. Reconcile kubeadm-config so a FUTURE `kubeadm upgrade` regenerates the
|
||||||
|
# apiserver manifest WITH --authentication-config instead of reverting to
|
||||||
|
# the stale single-issuer --oidc-* flags. Without this, kubeadm rewrote the
|
||||||
|
# manifest from kubeadm-config on every control-plane upgrade and the
|
||||||
|
# regenerated apiserver lost multi-issuer auth, breaking kubectl + dashboard SSO.
|
||||||
|
# Zero live impact (the CM is only read at upgrade time); idempotent;
|
||||||
|
# best-effort (the chain's `kubeadm upgrade diff` preflight alert is the
|
||||||
|
# backstop if this cannot run).
|
||||||
|
KC="sudo kubectl --kubeconfig /etc/kubernetes/admin.conf"
|
||||||
|
CC=$($KC -n kube-system get cm kubeadm-config -o jsonpath='{.data.ClusterConfiguration}' 2>/dev/null || true)
|
||||||
|
if [ -n "$CC" ] && { echo "$CC" | grep -q 'oidc-issuer-url' || ! echo "$CC" | grep -q 'authentication-config'; }; then
|
||||||
|
echo "Reconciling kubeadm-config (oidc-* -> authentication-config) so kubeadm upgrade keeps structured auth"
|
||||||
|
echo '${base64encode(local.kubeadm_oidc_reconcile_py)}' | base64 -d > /tmp/reconcile_kubeadm_oidc.py
|
||||||
|
if printf '%s' "$CC" | python3 /tmp/reconcile_kubeadm_oidc.py > /tmp/kubeadm-cc-new.yaml \
|
||||||
|
&& sudo kubeadm init phase upload-config kubeadm --config /tmp/kubeadm-cc-new.yaml; then
|
||||||
|
echo "kubeadm-config reconciled: future control-plane upgrades keep --authentication-config"
|
||||||
|
else
|
||||||
|
echo "WARN: kubeadm-config reconcile failed; the upgrade-chain preflight will alert (not block) if the next upgrade would drop --authentication-config"
|
||||||
|
fi
|
||||||
|
rm -f /tmp/reconcile_kubeadm_oidc.py /tmp/kubeadm-cc-new.yaml
|
||||||
|
else
|
||||||
|
echo "kubeadm-config already uses --authentication-config (no oidc drift)"
|
||||||
|
fi
|
||||||
SH
|
SH
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -155,6 +226,14 @@ resource "null_resource" "apiserver_oidc_config" {
|
||||||
}
|
}
|
||||||
|
|
||||||
triggers = {
|
triggers = {
|
||||||
|
# Intentionally hash ONLY the issuer config, NOT the remote script. CI applies
|
||||||
|
# the rbac stack with no ssh_private_key (var defaults to ""), so a re-run of
|
||||||
|
# this SSH provisioner in CI would fail — hence the null_resource must stay a
|
||||||
|
# no-op on a plain CI apply. Script changes (e.g. the 2026-06-24 kubeadm-config
|
||||||
|
# reconciliation) reach the cluster via the apiserver-oidc-restore ConfigMap
|
||||||
|
# below (a plain k8s resource, no ssh) which the upgrade chain re-runs. To force
|
||||||
|
# this provisioner to re-run after a script change, apply locally with
|
||||||
|
# `-replace` + TF_VAR_ssh_private_key (see docs/runbooks/k8s-version-upgrade.md).
|
||||||
auth_config = sha256(local.apiserver_auth_config_yaml)
|
auth_config = sha256(local.apiserver_auth_config_yaml)
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
@ -674,6 +674,7 @@ resource "vault_database_secret_backend_connection" "postgresql" {
|
||||||
"pg-recruiter-responder", "pg-tripit",
|
"pg-recruiter-responder", "pg-tripit",
|
||||||
"pg-nextcloud-todos",
|
"pg-nextcloud-todos",
|
||||||
"pg-technitium",
|
"pg-technitium",
|
||||||
|
"pg-goldmane-edges",
|
||||||
]
|
]
|
||||||
|
|
||||||
postgresql {
|
postgresql {
|
||||||
|
|
@ -891,6 +892,17 @@ resource "vault_database_secret_backend_static_role" "pg_technitium" {
|
||||||
rotation_period = 604800
|
rotation_period = 604800
|
||||||
}
|
}
|
||||||
|
|
||||||
|
# goldmane-edge-aggregator (ADR-0014 / infra #58) — 7-day rotation for the
|
||||||
|
# goldmane_edges CNPG role. Consumed by stacks/goldmane-edge-aggregator via a
|
||||||
|
# vault-database ExternalSecret -> DATABASE_URL (remoteRef static-creds/pg-goldmane-edges).
|
||||||
|
resource "vault_database_secret_backend_static_role" "pg_goldmane_edges" {
|
||||||
|
backend = vault_mount.database.path
|
||||||
|
db_name = vault_database_secret_backend_connection.postgresql.name
|
||||||
|
name = "pg-goldmane-edges"
|
||||||
|
username = "goldmane_edges"
|
||||||
|
rotation_period = 604800
|
||||||
|
}
|
||||||
|
|
||||||
# =============================================================================
|
# =============================================================================
|
||||||
# Kubernetes Secrets Engine — Dynamic K8s Credentials
|
# Kubernetes Secrets Engine — Dynamic K8s Credentials
|
||||||
# =============================================================================
|
# =============================================================================
|
||||||
|
|
|
||||||
File diff suppressed because one or more lines are too long