Compare commits

21 commits

Author SHA1 Message Date
Viktor Barzin
abb15cd49d devvm: personalize emo's cluster-health skill for ha-sofia
All checks were successful
ci/woodpecker/push/default Pipeline was successful
emo cares about ha-sofia + his Sofia smart-home devices (Tuya, the MPPT
ATS, the Барзини → Статус dashboard), and only about the cluster when it's
breaking those. Rewrite his vendored cluster-health into an ha-sofia-focused,
read-only variant:
- leads with ha-sofia's in-cluster dependency chain (tuya-bridge + the
  cloudflared/Traefik/DNS/TLS reachability path), all checkable read-only;
- fixes the script path to emo's own clone (/home/emo/code) — he can't read
  wizard's tree — and runs it --no-fix (he's cluster read-only);
- loads emo's own HA token (see below) so the ha-sofia checks (26-29, 45)
  actually run for him; documents the host-SSH/Vault checks that skip;
- triages: cluster FAIL/WARN matters only if on his chain; everything else is
  a one-line "admin's area"; escalate via /file-issue since he can't fix.

This snapshot copy is now an emo-specific variant, intentionally diverged
from the canonical 47-check admin skill — README updated to say "do not
re-sync from canonical".

Token: a dedicated long-lived HA token (client_name emo-cluster-health) was
minted on ha-sofia via the admin account and stored emo-readable at
/home/emo/.config/cluster-health/haos_token (600). It carries admin HA scope
(HA only mints tokens for the authenticating account); independently revocable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 16:03:14 +00:00
Viktor Barzin
fc83595f5e devvm: vendor cluster-health into per-user agent-skill snapshot
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Make cluster-health a user-global skill for emo (the lone entry in the
provisioner's SKILL_USERS allowlist), so it's available from any directory
— not only when working inside the infra clone where it already exists as a
project skill (.claude/skills/cluster-health). install_skills() in
t3-provision-users.sh copies the vendored snapshot into ~/.agents/skills/ and
symlinks ~/.claude/skills/, so this is the durable, rebuild-surviving path.

cluster-health is homelab-local (vendored from this repo's own
.claude/skills/), unlike the other snapshot entries which mirror upstream
mattpocock/skills + vercel-labs/skills; README documents its provenance and
the explicit re-sync step so the vendored copy doesn't silently drift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 15:20:19 +00:00
Viktor Barzin
fd33d1a447 monitoring: consolidate all Slack alerting to #alerts, abandon #security
Some checks are pending
ci/woodpecker/push/default Pipeline is running
The dedicated #security Slack channel was unreachable: the shared incoming
webhook (Vault secret/viktor -> alertmanager_slack_api_url) belongs to a
Slack app that isn't a member of #security, so any channel override on it
returns HTTP 404 channel_not_found. The goldmane-edges-digest was silently
failing for that reason.

Per request ("dump the security channel, post in an existing one"), route
everything to #alerts instead:
- alertmanager slack-security receiver -> #alerts (keeps its [SECURITY/<sev>]
  title styling so security-lane alerts still stand out in the shared channel)
- goldmane-edges-digest CronJob SLACK_CHANNEL -> #alerts (comment only; value
  was already switched and applied last change)
- AggregatorDown / DigestFailing alert summaries reworded to say #alerts
- docs swept (security.md, monitoring.md, ADR-0014, goldmane runbook,
  .claude/CLAUDE.md, service-catalog, CONTEXT.md) to drop the
  "invite the app / flip back to #security" caveats and state the
  #security abandonment + #alerts consolidation as the current routing.

Monitoring stack applied (alertmanager rolled, live config verified:
slack-security channel is now #alerts).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 13:29:44 +00:00
Viktor Barzin
196d0db4bd rbac/apiserver-oidc: back up the apiserver manifest OUTSIDE /etc/kubernetes/manifests
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The SSO restore script backed up the live manifest with
`cp "$MANIFEST" "$MANIFEST.bak.$TS"` — i.e. INSIDE /etc/kubernetes/manifests/.
The kubelet treats every file in that dir as a static pod, so the .bak became a
SECOND kube-apiserver static pod. While both copies were identical it was
harmless, but the instant `kubeadm upgrade` changed the real manifest's image to
v1.35.6, the kubelet saw two same-named pods with different specs and flip-flopped
(pod attempt count hit 13) — the new apiserver never stabilised, so kubeadm timed
out on "static Pod hash did not change after 5m" and rolled back. THIS was the
real cause of the 1.34->1.35 upgrade stalling for days (not etcd IO, which was a
downstream symptom of the flip-flopping apiserver hammering etcd).

Fix: write backups to a dedicated dir OUTSIDE the static-pod dir
(/etc/kubernetes/apiserver-oidc-bak/) and read the rollback copy from there. The
stray .bak that planted the landmine on 2026-06-18 was moved out manually
2026-06-26; this prevents the SSO script (and the upgrade chain's restore.sh,
which is the same script) from ever re-creating it.
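
The fix described above can be sketched as a small guard around the copy. This is a minimal illustration in Python, not the actual SSO restore script (which is a shell script); the function name `backup_manifest` and the timestamp format are assumptions made for the example. It also rejects subdirectories of the manifests dir as a conservative choice, without claiming the kubelet would scan them.

```python
import shutil
import time
from pathlib import Path


def backup_manifest(manifest: Path, backup_dir: Path) -> Path:
    """Copy `manifest` into `backup_dir`, refusing to write inside the static-pod dir.

    Hypothetical helper: the name and timestamp format are illustrative only.
    """
    static_pod_dir = manifest.parent.resolve()
    target_dir = backup_dir.resolve()
    # Guard: never place a backup in (or, conservatively, below) the directory
    # the kubelet reads static pods from.
    if target_dir == static_pod_dir or static_pod_dir in target_dir.parents:
        raise ValueError(f"{target_dir} is inside the static-pod dir {static_pod_dir}")
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    dest = target_dir / f"{manifest.name}.bak.{stamp}"
    shutil.copy2(manifest, dest)  # copy2 preserves mode and mtime of the original
    return dest
```

Called with a manifest under `/etc/kubernetes/manifests/` and `/etc/kubernetes/apiserver-oidc-bak/` as the backup dir, the helper writes the copy next to the manifests directory rather than inside it; passing the manifests directory itself raises `ValueError`.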

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 10:29:19 +00:00
Viktor Barzin
5d33327c30 postiz: repoint postgres-backup CronJob at CNPG (was failing on removed host)
Some checks failed
ci/woodpecker/push/default Pipeline failed
The postiz-postgres-backup CronJob still dumped from the chart's bundled
`postiz-postgresql` host with a hardcoded `postiz-password`. That bundled
PostgreSQL was removed when postiz migrated to the shared CNPG cluster, so
the host no longer resolves (NXDOMAIN) and every nightly run failed —
firing BackupCronJobFailed, and leaving the postiz DB with no logical dump
in the offsite pipeline.

Connect via the app's own DATABASE_URL (from the postiz-secrets Secret,
postgresql://postiz:…@pg-cluster-rw.dbaas.svc.cluster.local/postiz) instead
of a hardcoded host/user/password, so the backup tracks the live DB and
credentials. Verified with a one-off test job: psql + pg_dump 16.4 connect
to CNPG 16.9 and produce a 180K custom-format dump.
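
As a rough illustration of the change above, the sketch below builds a `pg_dump` invocation from a `DATABASE_URL` value instead of hardcoded host and credentials. It is written in Python for testability and is not the CronJob's actual command; the helper names `pg_dump_command` and `redacted` are hypothetical. It relies on `pg_dump` accepting a connection URI through `--dbname`.

```python
from urllib.parse import urlsplit


def pg_dump_command(database_url: str, out_file: str) -> list[str]:
    """Build a custom-format pg_dump command from a connection URI (hypothetical helper)."""
    parts = urlsplit(database_url)
    if parts.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"not a PostgreSQL URL: scheme {parts.scheme!r}")
    if not parts.hostname or not parts.path.strip("/"):
        raise ValueError("DATABASE_URL must name a host and a database")
    # The URI is passed through unchanged, so the backup follows whatever
    # host and credentials the application itself is configured with.
    return ["pg_dump", "--format=custom", f"--file={out_file}", f"--dbname={database_url}"]


def redacted(database_url: str) -> str:
    """Return the URL with the password masked, safe to print in job logs."""
    parts = urlsplit(database_url)
    return f"{parts.scheme}://{parts.username}:***@{parts.hostname}{parts.path}"
```

One trade-off of this shape: the password ends up in the process arguments. A variant that exports it through the `PGPASSWORD` environment variable would avoid that, at the cost of parsing the URL into its parts.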

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 09:34:42 +00:00
Viktor Barzin
1bca799bb4 monitoring: give kube-state-metrics a 512Mi memory limit (Burstable)
Some checks failed
ci/woodpecker/push/default Pipeline failed
kube-state-metrics had no explicit resources, so the monitoring-namespace
LimitRange pinned it to requests=limits=256Mi (Guaranteed QoS). KSM idles
around 45Mi but momentarily spikes past 256Mi during a full object relist
(450+ pods, 150+ jobs, all secrets/endpoints) and gets OOMKilled. Each OOM
blacks out the KSM-exported series that ~10 alert rules read, so they all
fire false "<svc>Down" criticals at once and self-resolve when KSM recovers
~5 min later — exactly the alert storm seen at 2026-06-26 08:42 UTC.

Set explicit Burstable resources: keep the request low (64Mi, just above
idle) so we don't reserve memory we don't use, and raise only the limit to
512Mi to absorb the relist peak. No CPU limit, per the cluster-wide policy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 09:06:31 +00:00
Viktor Barzin
d105713ae7 fix(workstation): claude-auth-sync must merge, not overwrite, the shared Vault path
All checks were successful
ci/woodpecker/push/default Pipeline was successful
cas_backup did `vault kv put secret/workstation/claude-users/<user>`, a full
KV-v2 replace that rewrote the document with only its 3 OAuth keys. Because
`homelab vault setup` co-locates the user's vaultwarden_* credentials on that
same path, every six-hourly sync silently deleted them — so `homelab vault`
reported "not configured" within hours of each setup. (Reported as: homelab
vault "keeps getting reset / logged out", set up 3 times.)

Switch the backup to a merge: `kv patch -method=rw` (read+update, needs no
`patch` capability) when the path exists, and `kv put` only to create it on the
first backup. Add a regression test with a fake vault asserting a pre-existing
sibling key survives a backup, and document the merge requirement in the
renewal runbook.
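
The difference between the old replace and the new merge can be shown with an in-memory stand-in for the KV store. This is a sketch in Python, not the real `cas_backup` shell function or its regression test; `FakeKV`, `backup_tokens`, and the key names are hypothetical. The read-then-write shape mirrors what the message describes for `kv patch -method=rw`.

```python
class FakeKV:
    """In-memory stand-in for a KV store whose put() replaces the whole document."""

    def __init__(self):
        self._docs = {}

    def put(self, path, data):
        self._docs[path] = dict(data)  # full replace, like the old `kv put`

    def get(self, path):
        doc = self._docs.get(path)
        return dict(doc) if doc is not None else None


def backup_tokens(kv, path, tokens):
    """Write `tokens` to `path` without dropping keys other tools stored there."""
    current = kv.get(path)
    if current is None:
        kv.put(path, tokens)  # first backup: nothing to preserve yet
    else:
        kv.put(path, {**current, **tokens})  # read + update: sibling keys survive
```

With a pre-existing sibling key on the path, `backup_tokens` leaves it in place and only adds or updates the token keys, which is the property the regression test in this commit asserts.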

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:33:41 +00:00
Viktor Barzin
6f1951af93 fix(workstation): carry OS/sudo authz policy into managed-settings source + multi-tenancy doc
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ADR-0015's policy change was applied live to /etc/claude-code/managed-settings.json, but that file self-deploys from the repo source scripts/workstation/managed-settings.json via the hourly reconcile (sync_managed_config). Without updating the source the next reconcile would REVERT /etc to the old 'never read other homes' rule. This updates the source-of-truth claudeMd (now byte-identical to /etc) so the change is durable + canonical, and refresh_codex_mirror propagates it to every user's ~/.codex/AGENTS.md. Also notes the access-model change in the multi-tenancy architecture doc (pointer to ADR-0015).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:25:33 +00:00
Viktor Barzin
8121d8a4ac docs(adr): add ADR-0015 (OS/sudo is the authorization boundary), supersede ADR-0011 privacy norm
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor (owner) wants agents to stop refusing file reads the OS already permits. wizard holds passwordless root ((ALL) NOPASSWD: ALL), so the managed-settings rule 'never read another user's ~/.claude' was stricter than the OS itself. The managed-settings policy (/etc/claude-code/managed-settings.json) was updated out-of-band to defer to OS/sudo authorization with no extra prompt; backup kept at .bak-2026-06-26. This ADR records the decision, its symmetry across sudo-holders, and the larger blast radius.

ADR-0011's usage-telemetry design is unchanged; only the cross-user privacy norm it referenced is superseded. The original ask was to delete ADR-0011 — superseded instead to preserve the audit trail and the ADR-0012/0013 references.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:22:29 +00:00
Viktor Barzin
ebc8b6588f ESO: add force_conflicts to all ExternalSecret manifests (fleet sweep)
Some checks failed
ci/woodpecker/push/default Pipeline failed
The 2026-06-22 external-secrets v1 migration made the ESO controller the
server-side-apply owner of .spec.refreshInterval on every ExternalSecret, so any
stack defining one via kubernetes_manifest fails `terraform apply` with a
field-manager conflict the next time it's applied (instagram-poster + grafana hit
this on 2026-06-24; it was latent across the whole fleet). Add
field_manager { force_conflicts = true } to all 101 remaining ExternalSecret
manifests across 70 stacks, matching the fix already on grafana / woodpecker /
traefik / k8s-version-upgrade / instagram-poster. TF and ESO set the same value,
so it's stable (no perpetual drift). Defuses the landmine before each stack's
next apply trips it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 21:28:11 +00:00
Viktor Barzin
6c5288998f goldmane-trail: polish follow-ups #57/#59/#61/#62/#63 + digest→#alerts
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a
subagent workflow (distinct stacks in parallel, docs last):

- #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me,
  auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081
  (the operator's whisker NP default-denies ingress). calico stack.
- #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts
  (prometheus_chart_values.tpl) + cluster-health check #48.
- #59 service-identity labels on the multi-Service namespaces (monitoring's 5
  TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they
  update in-place.
- #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog,
  security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md.
  #62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the
  edge table (feeds code-8ywc; enforce-flips out of scope).

Also fixes the digest's Slack target: #security override 404s channel_not_found
because the shared alertmanager_slack_api_url webhook's app isn't a member of
#security (this likely also breaks alertmanager's slack-security receiver — flagged
in the runbook). Routed to #alerts (the webhook's working channel) until the app
is invited; verified a real digest run posts cleanly (360 edges).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 17:49:25 +00:00
Viktor Barzin
306cdd4cb3 state(dbaas): update encrypted state
2026-06-25 17:31:03 +00:00
Viktor Barzin
9c68d147e0 k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC)
Some checks failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline failed
Digging into "why did the apiserver crash" disproved the earlier OIDC
explanation. An isolated v1.35.6 apiserver repro with authentik reachable
initialises OIDC cleanly (oidc.go:313, no error) and runs fine — so the
--authentication-config -> --oidc-* revert is NOT what crashed it. etcd's
surviving crash-window log is the real cause: 1180 "apply request took too long"
warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as
kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the
shared sdc HDD (beads code-oflt).

A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full
~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
every etcd upgrade and never cleans it up — 145 dirs / 28GB had accumulated,
driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live
(73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate.
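
A preflight prune along these lines could look like the sketch below. It is a Python illustration rather than the actual step in `upgrade-step.sh`; the function name `prune_old_backups` is hypothetical, and using the directory's mtime as the age (rather than the timestamp in its name) is an assumption.

```python
import shutil
import time
from pathlib import Path


def prune_old_backups(tmp_dir: Path, max_age_days: float = 3, now=None) -> list[Path]:
    """Delete kubeadm-backup-etcd-* directories older than `max_age_days`.

    Hypothetical helper; age is taken from the directory mtime (an assumption).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for entry in sorted(tmp_dir.glob("kubeadm-backup-etcd-*")):
        # Only real directories are touched; symlinks and other names are left alone.
        if entry.is_dir() and not entry.is_symlink() and entry.stat().st_mtime < cutoff:
            shutil.rmtree(entry)
            removed.append(entry)
    return removed
```

Running it against `/etc/kubernetes/tmp` before each upgrade attempt would keep only the most recent few days of etcd backups, so the directory cannot grow without bound.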

Also corrected the OIDC handling: the kubeadm-config drift is real but only
breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the
chain's restore.sh + the kubeadm-config reconciliation) — it does not crash the
apiserver. So the preflight check is now an ALERT, not a block (was added on the
wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected.

Per Viktor: reclaim the disk and automate so the manual cleanup never recurs;
the durable IO fix remains code-oflt (etcd off the shared HDD).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 15:23:15 +00:00
Viktor Barzin
60a1cb9a25 k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Last night's autonomous 1.34->1.35 run reached the master control-plane phase
for the first time (preflight passed, etcd snapshot taken, etcd upgraded), then
the kube-apiserver upgrade to v1.35.6 crash-looped and kubeadm auto-rolled-back
to 1.34.9. The cluster stayed healthy but the master was left cordoned and the
chain wedged on in_flight.

Root cause: kubeadm upgrade regenerates the apiserver static-pod manifest from
the kubeadm-config ConfigMap. apiserver auth was switched on 2026-06-19 to a
structured multi-issuer --authentication-config (kubectl + dashboard SSO), but
kubeadm-config still carried the legacy single-issuer --oidc-* extraArgs, so the
regenerated manifest reverted structured auth and the new apiserver crash-looped.
Proven via `kubeadm upgrade diff`. The existing post-upgrade OIDC restore step
never ran because the upgrade itself never succeeded.

Fix:
- rbac/apiserver-oidc.tf: the remote script now also reconciles kubeadm-config
  (kubeadm init phase upload-config: drop --oidc-*, add --authentication-config)
  so a future kubeadm upgrade regenerates a correct manifest. Delivered to the
  cluster via the apiserver-oidc-restore ConfigMap the chain re-runs (CI needs no
  ssh key); trigger deliberately not script-hashed since CI cannot ssh.
- k8s-version-upgrade/upgrade-step.sh: new preflight gate runs `kubeadm upgrade
  diff` and BLOCKS+alerts (never drains the master) if --authentication-config
  would still be dropped.
- Post-mortem + runbook updated.

The live kubeadm-config was reconciled directly on the master and verified
(`kubeadm upgrade diff` now shows only the control-plane image bump), so tonight's
run can complete the 1.34->1.35 upgrade.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 14:16:04 +00:00
Viktor Barzin
c6bba1da6e home-assistant skill: refresh ha-london map (HAOS 2026.5.2, Cowboy revived, Overview redesign)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to redesign the ha-london dashboards and fix the broken integrations (the Cowboy one). The skill's ha-london knowledge map had drifted badly from reality, so this brings it current: it claimed HA 2025.9.1 on a docker-run container (it's HAOS 2026.5.2, managed); listed the now-dead jdejaegh Cowboy integration with sensor.bike_* entities (revived via elsbrock/cowboy-ha v1.2.0 -> entities are sensor.classic_performance_*); and didn't flag that met/metoffice/roomba/hildebrandglow are user-disabled (not broken) or that Tapo P100 is failing. Also documents the redesigned Overview (Home+More sections, Mushroom+mini-graph-card), the dashboard/view/card glossary, the parked-bike 'unknown battery' gotcha, and that london is API-only from the Sofia devvm (no SSH).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 22:03:15 +00:00
Viktor Barzin
b858561bd0 Merge remote-tracking branch 'origin/master'
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-24 20:59:39 +00:00
Viktor Barzin
a7704f46a6 deploy goldmane-edge-aggregator: durable who-talks-to-whom edge trail (#58, ADR-0014)
Infra side of ADR-0014: an mTLS gRPC consumer of Calico Goldmane's Flows API
that records the namespace-pair edge-set in CNPG and posts a daily new-edge
digest to #security. Adds the goldmane-edge-aggregator stack, the
pg-goldmane-edges Vault rotation role (Tier-0 vault state updated here), and the
namespace in the ghcr-credentials allowlist.

Cert: REUSES the operator-minted, Tigera-CA-signed whisker-backend client cert
(Goldmane verifies only the CA chain, not identity) instead of minting from the
Tigera CA private key. This avoids putting the CA key in TF state AND the
hashicorp/tls provider, which is incompatible with this repo's global
generate-providers/lockfile pattern (it broke every stack's lockfile).

Verified live: aggregator streaming flows, 174 edges in Postgres across 50x54
namespaces, db+slack ExternalSecrets synced, digest dry-run formats correctly,
private image pulls via the Kyverno-synced ghcr-credentials.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:59:39 +00:00
Viktor Barzin
aa510e3600 instagram-poster: force_conflicts on ESO manifests (fix apply)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The ESO v1 migration (2026-06-22) made the external-secrets controller own
.spec.refreshInterval via server-side apply, so terraform apply of the two
ExternalSecret manifests fails with a field-manager conflict (Woodpecker #348),
which blocked the replicas=0 scale-down from landing. Add force_conflicts=true
to both, matching the grafana/woodpecker/traefik fix applied to other stacks
the same day.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:49:53 +00:00
Viktor Barzin
53834deb24 instagram-poster: scale to 0 (unused, dead ExternalSecret)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor confirmed the Instagram Graph poster isn't used. Its ExternalSecret
has been dead on missing Vault keys (ig_graph_long_lived_token,
ig_business_account_id), so the deployment sat at 0/1 firing
DeploymentReplicasMismatch. Setting replicas=0 stops the alert and makes the
scale-down durable (a bare kubectl scale reverts on the next stack apply).
Re-set to 1 after minting a Meta long-lived token + populating the Vault keys.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:45:30 +00:00
Viktor Barzin
8dd9a3978d Merge remote-tracking branch 'forgejo/master' into wizard/homelab-vault
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-24 12:25:52 +00:00
Viktor Barzin
65b2df1222 fix(monitoring): force_conflicts on grafana_db_creds ExternalSecret
The external-secrets controller owns .spec.refreshInterval via SSA, so a plain
terraform apply of the monitoring stack conflicts. Latent until 2026-06-24 (the
homelab-vault loki-rules change was the first monitoring apply in a while and
surfaced it). force_conflicts lets TF win — same pattern as woodpecker/traefik/
k8s-version-upgrade stacks.
2026-06-24 12:25:36 +00:00
104 changed files with 5586 additions and 4084 deletions

@ -233,7 +233,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
 - **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
 - Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).
 - **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
-- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → `#security` Slack). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
+- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → posts to `#alerts` via the `slack-security` receiver, which keeps its `[SECURITY]` styling; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
 ## Security Posture (Wave 1 — locked 2026-05-18)
@ -241,9 +241,10 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se
- **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming. - **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming.
- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.) - **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.)
- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging. - **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts` (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out). Severity labels carried in the alert (critical/warning/info). No paging. The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`.
- **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred. - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. - **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
- **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
## Storage & Backup Architecture


@ -13,6 +13,8 @@
| authentik | Identity provider (SSO) | authentik |
| cloudflared | Cloudflare tunnel | cloudflared |
| authelia | Auth middleware (may be merged into ebooks or removed) | platform |
| goldmane | Calico 3.30 OSS flow aggregator (`goldmane.calico-system.svc:7443`, gRPC/mTLS). Stamps identity (ns/pod/workload/labels + allow-deny) on every flow from Felix into a ~60-min in-memory ring buffer — no etcd/API writes. East-west "who-talks-to-whom" source (ADR-0014). Enabled via operator CR (`kubectl_manifest.goldmane`). | calico |
| whisker | Calico 3.30 OSS live flow-observability UI (`whisker.calico-system.svc:8081`) at `whisker.viktorbarzin.me` (Authentik-gated, `auth=required` — no own login; additive NP ORs Traefik past the operator default-deny). ~60-min live view of Goldmane flows, NOT history. Enabled via operator CR (`kubectl_manifest.whisker`). | calico |
| monitoring | Prometheus/Grafana/Loki stack | monitoring |
## Storage & Security (Tier: cluster)
@ -37,6 +39,7 @@
## Active Use
| Service | Description | Stack |
|---------|-------------|-------|
| goldmane-edge-aggregator | Durable who-talks-to-whom audit trail (ADR-0014 / #58). Go service: `aggregate` Deployment streams Goldmane's gRPC `Flows.Stream` (mTLS) and upserts the low-cardinality namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`) into CNPG DB `goldmane_edges`; `goldmane-edges-digest` CronJob posts first-seen edges daily to `#alerts` (the `#security` channel was abandoned 2026-06-25 — shared webhook's app isn't a member of it). mTLS client cert REUSES the operator's `whisker-backend-key-pair` (re-apply if rotated). Tier-4-aux. Image `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (private). Runbook: [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md). | goldmane-edge-aggregator |
| mailserver | Email (docker-mailserver) | mailserver |
| shadowsocks | Proxy | shadowsocks |
| webhook_handler | Webhook processing | webhook_handler |
@ -161,3 +164,4 @@ procedures) are documented in `infra/docs/runbooks/`:
| pfSense + Unbound DNS | [pfsense-unbound.md](../../docs/runbooks/pfsense-unbound.md) |
| Mailserver PROXY-protocol / HAProxy | [mailserver-pfsense-haproxy.md](../../docs/runbooks/mailserver-pfsense-haproxy.md) |
| Technitium apply flow | [technitium-apply.md](../../docs/runbooks/technitium-apply.md) |
| Goldmane flow trail (east-west who-talks-to-whom) | [goldmane-flow-trail.md](../../docs/runbooks/goldmane-flow-trail.md) |


@ -11,8 +11,8 @@ description: |
There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
Always use Home Assistant for smart home control.
author: Claude Code
version: 2.0.0 version: 2.1.0
date: 2026-02-07 date: 2026-06-24
---
# Home Assistant Control
@ -395,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
## ha-london Knowledge Map
### Overview
- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi) - **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
- **Location**: London, UK
- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone) - **Platform**: Raspberry Pi 4, HA OS
- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access) - **Access from the Sofia devvm**: london is **remote**: `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
- **Config path**: `/config/` (requires `sudo` for file access) - **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
- **Config path**: `/config/`
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
- **Zone**: London (home)
### Dashboards (redesigned 2026-06-24)
**Glossary** (HA terms — keep distinct):
- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
- **Card** = a widget inside a view.
- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
- **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
- **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed` → `detailed` path typo fixed 2026-06-24.)
- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
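The WebSocket flow mentioned above can be sketched as a pair of JSON frames. This is a minimal illustration, not the script that built the dashboards: it assumes the usual Home Assistant handshake (the client answers `auth_required` with an `auth` frame) and that `lovelace/config/save` accepts `config` plus an optional `url_path`. The helper names and the tiny example config are hypothetical, and no network connection is made.

```python
import json
from typing import Optional


def auth_message(access_token: str) -> str:
    """First client frame: answers the server's auth_required prompt."""
    return json.dumps({"type": "auth", "access_token": access_token})


def save_dashboard_message(msg_id: int, config: dict, url_path: Optional[str] = None) -> str:
    """Build a lovelace/config/save command frame.

    Assumption: omitting url_path targets the default Overview dashboard,
    while a value such as "air-quality" targets that named dashboard.
    """
    frame = {"id": msg_id, "type": "lovelace/config/save", "config": config}
    if url_path is not None:
        frame["url_path"] = url_path
    return json.dumps(frame)


# Hypothetical one-view config; real views and cards would be far larger.
example_config = {"views": [{"title": "Home", "type": "sections", "sections": []}]}
frames = [
    auth_message("<long-lived token>"),
    save_dashboard_message(1, example_config, url_path="air-quality"),
]
```

Each string in `frames` would be sent in order over `wss://ha-london.viktorbarzin.me/api/websocket` by whichever WebSocket client is at hand.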
### Key Systems
#### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@ -424,10 +437,15 @@ Named plugs with power/energy tracking:
- PM1.0/2.5/4.0/10 particulate sensors
- VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
#### 3. Cowboy E-Bike #### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
- `sensor.bike_state_of_charge`: Battery % Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
- `sensor.bike_total_distance`: Total km - `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
- `sensor.bike_total_co2_saved`: CO2 saved (grams) - `sensor.classic_performance_remaining_range`: Range km
- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
#### 4. Uptime Monitoring (UptimeRobot)
- `sensor.blog`: blog uptime
@ -446,12 +464,17 @@ Named plugs with power/energy tracking:
- Scripts: `script.start_netflix`, `script.start_stremio`
- Scene: `scene.night` (turns off Livia + Michelle plugs)
### Custom Components ### Custom Components (HACS integrations)
- **cowboy**: Cowboy e-bike integration (HACS) - **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS) - **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
### HACS frontend cards (plugins)
- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
### Integrations
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
### AI / Voice Assistants
- 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@ -466,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL
- Anca arrival/departure notifications
- Night scene: turns off Livia + Michelle
### Docker Setup ### Platform (HAOS — ignore any legacy `docker run` snippet)
```bash ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~12 min and resets `sensor.uptime` (use that as the "back up" marker).
docker run -d --name homeassistant --privileged \
-e TZ=Europe/London \
-v /home/pi/docker/homeAssistant:/config \
-v /run/dbus:/run/dbus:ro \
--network=host --restart=unless-stopped \
homeassistant/home-assistant:2025.9
```
### SSH Access
```bash


@ -125,7 +125,7 @@ How a **Service** is named in flow/audit data — its **namespace** is the prima
_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
**Goldmane / Whisker**:
Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). Durable history requires emitting Goldmane flows to **Loki**; the in-memory buffer alone is not an audit trail. Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#alerts` digest (the `#security` channel was abandoned 2026-06-25). As-built: `docs/runbooks/goldmane-flow-trail.md`.
_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
### Storage


@ -5,6 +5,14 @@ exists to answer the question that drove the whole CLI — *which verbs are wort
adding next* — with data instead of one maintainer's habits (the earlier mining
covered a single user's ~51k commands, so the surface is shaped to that user).
> **Update (2026-06-26) — the cross-user privacy *norm* below is superseded by
> [ADR-0015](0015-os-is-the-authorization-boundary.md).** The prohibition this
> ADR leaned on ("reading another user's `~/.claude` is off-limits even for an
> owner in-session") no longer holds: the managed-settings policy now **defers
> to OS/sudo authorization**. The `usage top` telemetry design itself is
> unchanged and still current — only the "never read homes" framing in the
> third decision below is overtaken.
## Decisions
- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows - **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows


@ -27,3 +27,9 @@ As the Service count grows we want an audit-grade record of which Service talks
- **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
- **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
- **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
## As-built (2026-06-25)
Implemented across infra issues #57–#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30) — it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (all Slack consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it — see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity) — re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.
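The edge-set upsert and the Wave-1 allowlist derivation described in this section can be illustrated with a small self-contained sketch. It is not the aggregator's code: the real service is written in Go and writes to PostgreSQL (CNPG), whereas this uses Python's built-in `sqlite3` so it runs anywhere. The primary key on `(src_ns, dst_ns, action)`, the integer timestamps, and the function names are assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE edge (
    src_ns     TEXT    NOT NULL,
    dst_ns     TEXT    NOT NULL,
    action     TEXT    NOT NULL,
    first_seen INTEGER NOT NULL,
    last_seen  INTEGER NOT NULL,
    flow_count INTEGER NOT NULL,
    PRIMARY KEY (src_ns, dst_ns, action)  -- assumed uniqueness key
)
"""

UPSERT = """
INSERT INTO edge (src_ns, dst_ns, action, first_seen, last_seen, flow_count)
VALUES (?, ?, ?, ?, ?, 1)
ON CONFLICT (src_ns, dst_ns, action) DO UPDATE SET
    last_seen  = excluded.last_seen,
    flow_count = flow_count + 1
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database holding the edge table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def record_flow(conn: sqlite3.Connection, src_ns: str, dst_ns: str, action: str, ts: int) -> bool:
    """Upsert one flow into the edge set; return False if the flow is dropped."""
    # Self-edges and empty-namespace (external) flows are not stored.
    if not src_ns or not dst_ns or src_ns == dst_ns:
        return False
    conn.execute(UPSERT, (src_ns, dst_ns, action, ts, ts))
    return True


def allowed_destinations(conn: sqlite3.Connection, ns: str) -> list:
    """Internal half of a namespace's egress allowlist, as in the documented query."""
    rows = conn.execute(
        "SELECT DISTINCT dst_ns FROM edge WHERE src_ns = ? AND action = 'allow' ORDER BY dst_ns",
        (ns,),
    )
    return [row[0] for row in rows]
```

Because the first insert sets `first_seen` and later conflicts only touch `last_seen` and `flow_count`, the table stays at one row per namespace pair and action no matter how many flows arrive.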


@ -0,0 +1,57 @@
# OS is the authorization boundary: agents defer to Unix/sudo, not a stricter in-policy rule
Supersedes the cross-user privacy *norm* that the devvm managed-settings policy
carried and that ADR-0011 leaned on ("never read another user's home /
`~/.claude`, off-limits even for an owner in-session"). ADR-0011's actual
subject — `usage top` telemetry and its emit design — is unchanged and still
current; only the privacy prohibition it referenced is superseded here.
## Context
The devvm managed-settings policy (`/etc/claude-code/managed-settings.json`,
`claudeMd`) carried two rules that were, in practice, *stricter than the OS*:
"you are not the admin, do not escalate privileges" and "never read another
user's home directory, credentials, tokens, or `~/.claude`." The OS told a
different story: `wizard` holds `(ALL) NOPASSWD: ALL` — full passwordless root.
The kernel had already granted total read access; the policy was layering an
artificial refusal on top of an authorization the OS already permits, and the
"not the admin" framing was factually wrong for a NOPASSWD-root user.
Two honest ways to resolve the inconsistency: tighten sudo to match the policy,
or loosen the policy to match the OS. The owner chose the latter on 2026-06-26,
for analytics/debugging across the shared box.
## Decision
- **Authorization follows the OS, not this policy.** Agents may access whatever
their OS user can access — directly or via `sudo` where they hold sudo rights
— and must not impose restrictions stricter than the OS. On this box that
includes other users' home directories and `~/.claude` for users who hold
broad sudo.
- **No separate prompt or carve-out** for OS-authorized access. The Unix
permission model + sudoers is the single source of truth for who may read
what. Other homes are `0750`-owned, so a cross-home read necessarily transits
`sudo` and is therefore captured in the sudo/auth audit log.
- **Cluster/infra RBAC tiering is unchanged.** kubectl / Vault / infra access
stays scoped to each user's RBAC tier; "defer to the OS" is about OS-level
file access, not a licence to exceed cluster RBAC.
- **Scope is symmetric and multi-user.** The rule lives in the *shared*
managed-settings, so every user's agents defer to that user's own sudo grant.
Any user with broad sudo gets the same cross-home read capability over other
users' files. Accepted by the owner with that understanding; emo's and
ancamilea's `~/.claude` is now agent-readable by sudo-holders.
- **Takes effect in a fresh session.** managed-settings loads at session start;
the session that made the change keeps running under the old policy.
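The claim that a cross-home read of a `0750` home "necessarily transits `sudo`" follows from the classic Unix permission check, sketched below. This is a simplified model under stated assumptions: it ignores ACLs and Linux capabilities, treats uid 0 as bypassing the check, and the uids/gids in the usage note are hypothetical.

```python
import stat


def can_traverse(mode: int, owner_uid: int, owner_gid: int, uid: int, gids: set) -> bool:
    """Return True if the caller may enter a directory with the given mode.

    The kernel picks exactly one permission class (owner, group, or other)
    and tests only that class's search (x) bit.
    """
    if uid == 0:
        # Root bypasses discretionary checks on directories.
        return True
    if uid == owner_uid:
        return bool(mode & stat.S_IXUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IXGRP)
    return bool(mode & stat.S_IXOTH)
```

With a home owned by a hypothetical uid/gid 1001 at mode `0o750`, a caller with uid 1000 outside that group gets `False`, while the same call with uid 0 (that is, after `sudo`) gets `True`, which is why such reads show up in the sudo/auth log.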
## Consequences
- The privacy-preserving telemetry rationale in ADR-0011 (`usage top` as the
"cross-user analytics without reading homes" answer) remains useful but is no
longer the *only* sanctioned path; direct reads via `sudo` are now permitted.
- Larger blast radius: if an agent session running as a sudo-holder is
prompt-injected or otherwise compromised, it can now read every user's secrets
with no in-agent friction (sudo here is passwordless). The sudo/auth audit log
is the remaining accountability control.
- Reversible: restore the prior `claudeMd` bullets (backup kept at
`/etc/claude-code/managed-settings.json.bak-2026-06-26`) and start a fresh
session.


@ -286,7 +286,7 @@ Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (por
#### Security Alerts (Wave 1 — planned, beads `code-8ywc`)
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only). Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there). Same handling path as infra alerts; severity labels carried in the alert (critical/warning/info). The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
| # | Source | Event | Severity |
|---|---|---|---|
@ -318,9 +318,20 @@ IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-poli
Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out.
- **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m).
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages). - **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → posts to **`#alerts`** via the `slack-security` receiver, which keeps its `[SECURITY]` styling (Slack-only, no paging; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
- **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)
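The probe's pass/fail rule can be checked offline before adding a target. The sketch below reuses the regex quoted in the Mechanism bullet and applies it to a `Location` header value. It assumes an unanchored (search-style) match, which is how the alternation reads; the function name is hypothetical and this is not the blackbox-exporter implementation.

```python
import re
from typing import Optional

# Regex copied from the http_no_authentik_redirect module description.
AUTHENTIK_REDIRECT = re.compile(
    r"authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize"
)


def is_walled_off(location: Optional[str]) -> bool:
    """True if a response's Location header points at Authentik.

    A missing header (no redirect at all) or a redirect elsewhere,
    such as a short-link 302, is not a wall-off.
    """
    return location is not None and AUTHENTIK_REDIRECT.search(location) is not None
```

Feeding it the `%{redirect_url}` value printed by the `curl` check gives the same verdict the probe would reach for that URL.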
#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both route the default `slack-warning` receiver → **`#alerts`**.
| Alert | Expr (abridged) | For | Severity |
|---|---|---|---|
| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning |
| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning |
The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`).
#### Backup Alerts
- **PostgreSQLBackupStale**: >36h since last backup
- **MySQLBackupStale**: >36h since last backup


@ -541,7 +541,7 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1
**RBAC tiers:** `admin` (Viktor — cluster-admin, unlocked tree, secrets) · `power-user` (cluster-wide read-only, NO Secrets, via a dedicated `oidc-power-user-readonly` ClusterRole) · `namespace-owner` (admin in own namespace only). Each session acts as the user's **own** OIDC identity (kubelogin), never the admin's.
**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour. **Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **(2026-06-26: the managed `claudeMd` now defers OS-level file access to the OS/sudo — a user holding broad `sudo` may read other users' files incl. `~/.claude`; the mode-600 / no-symlink posture is unchanged but is no longer reinforced by an agent "never read other homes" rule. 
See [ADR-0015](../adr/0015-os-is-the-authorization-boundary.md).)** **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.
**Memory — homelab CLI hooks (rolled out 2026-06-21, deploy-fixed 2026-06-22):** the per-user `claude_memory` MCP was retired for the **homelab-memory hooks** — the reconcile's `install_memory` (re)installs four scripts into `~/.claude/hooks/` each run (`homelab-memory-recall.py` UserPromptSubmit recall, `auto-learn.py` Stop-hook extraction, `pre-compact-backup.sh`/`post-compact-recovery.sh`), wires them into `settings.json` if-absent + additive, and removes the old `claude_memory` MCP. **The provisioner binary itself now self-deploys from the repo** (step 0: `bash -n`-gated `install` + re-exec when `scripts/t3-provision-users.sh` differs from `/usr/local/bin/t3-provision-users`, guarded against re-exec loops / DRY_RUN mutation) — added after this very rollout sat committed-but-undeployed for a day (only the manual `setup-devvm.sh` had ever deployed the binary), so the hourly reconcile kept running the pre-memory version and emo/anca silently lost memory (recall + auto-learn never wired). A latent `set -e` abort in `install_memory` (a bare `[[ -d plugin-dir ]] && …` returning non-zero) was also fixed; it had killed the reconcile after the first user the first time it actually ran. The hooks need a `MEMORY_API_KEY` (or `CLAUDE_MEMORY_API_KEY`) in the user's `settings.json` env — the `homelab` CLI defaults the API URL, so **the key is the only hard requirement**; `install_memory` reuses an existing key and only WARNs if absent (it does NOT mint one — that's an admin Vault step, see Remaining). wizard + emo carry a key from their original MCP setup; **ancamilea is keyless → her memory no-ops until a key is minted.** (`auto-learn.py`'s passive store calls the API directly, so it additionally needs `*_API_URL` in env to avoid its local-SQLite fallback; recall + manual `homelab memory store` go through the URL-defaulting CLI and need only the key.)
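The latent `set -e` abort mentioned above is a general shell pitfall and can be reproduced in isolation. The sketch below is not the real `install_memory`; the function names and the directory path are hypothetical. It shows why a bare `[ -d dir ] && …` as the last command of a function stops an errexit script when the directory is absent, and why an explicit `if` does not.

```shell
# Hypothetical per-user step. The `&&` list is the function's last command,
# so the function itself returns 1 whenever the directory is absent.
install_step_fragile() {
  [ -d "$1" ] && echo "configuring $1"
}

# Same logic with an explicit `if`: when the condition is false and there is
# no `else`, the `if` returns 0, so a missing directory is simply skipped.
install_step_safe() {
  if [ -d "$1" ]; then
    echo "configuring $1"
  fi
}

# Run one variant under `set -e` in a subshell, then try to move on to the
# "next user". Prints "next-user-reached" only if errexit did not fire.
reconcile_demo() {
  ( set -e
    "$1" /nonexistent/plugin-dir
    echo "next-user-reached" ) 2>/dev/null
}
```

`reconcile_demo install_step_safe` prints `next-user-reached`, while `reconcile_demo install_step_fragile` prints nothing, because the subshell exits at the failed function call. That matches the reported symptom of the reconcile dying after the first user.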


@ -272,7 +272,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
The block below documents the locked design.
Response model: **(I) Slack-only, daily skim.** All security alerts post to **`#alerts`** via Alertmanager (the `slack-security` receiver keeps its distinct `[SECURITY/<sev>]` title styling so security-lane alerts still stand out). The dedicated `#security` channel was abandoned (2026-06-25): the shared `alertmanager_slack_api_url` incoming webhook's Slack app isn't a member of it, so a channel override there returns HTTP `404 channel_not_found`; everything consolidated to `#alerts`. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
#### Detection sources
@ -285,7 +285,7 @@ Response model: **(I) Slack-only, daily skim.** All security alerts land in a ne
#### Alert rules (16 total)
Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there; the dedicated `#security` channel was abandoned 2026-06-25 because the shared webhook's Slack app isn't a member of it). Same handling path as existing infra alerts: silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) carried in the alert.
**K8s API audit (K2-K9, 8 rules; K1 cluster-admin-grant intentionally skipped):**
@ -364,6 +364,69 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
- Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
- Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).
#### Deriving the per-namespace egress allowlist from the edge trail (Wave 1 W1.7)
The durable **east-west flow trail** (below) is now the preferred data source for
the *internal* (namespace-to-namespace) half of each Wave-1 egress allowlist —
faster and identity-stamped vs the original iptables-`LOG`→journald→Loki path
(ADR-0014: "Enforcement gains a better data source"). The unique observed
namespace pairs live in CNPG DB `goldmane_edges`, table `edge`. To derive the
namespaces a source is observed talking to (the `allow` set that seeds its
NetworkPolicy):
```sql
SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow' ORDER BY dst_ns;
```
The full SQL recipe (whole-cluster matrix, deny sanity-checks, the ≥7-day
observation caveat) is in
[runbooks/goldmane-flow-trail.md → Deriving the Wave-1 egress allowlist](../runbooks/goldmane-flow-trail.md#deriving-the-wave-1-egress-allowlist-from-the-edge-table-infra-62).
**External / public-internet egress is NOT in this table** (empty-namespace flows
are dropped) — for those destinations keep using the Calico flow-log observation
(the W1.6 snapshot, `wave1-egress-observation-2026-05-22.md`). This feeds the
existing observe-then-enforce effort (beads `code-8ywc`); **enforce-flips remain
out of scope** of the trail — it is observe-and-derive only.
### East-west flow observability (Goldmane / Whisker + edge trail) (ADR-0014)
The "who-talks-to-whom" data plane that succeeds raw iptables-`LOG` lines (which
carried no identity). **Service identity = the workload's namespace** (primary),
refined by a `service-identity` label in the few multi-Service namespaces
(`monitoring`, `kube-system`, `dbaas`). End-to-end trail, three layers:
1. **Calico Goldmane + Whisker** (`calico-system`) — Goldmane aggregates
identity-stamped flows (ns/pod/workload/labels + allow-deny + policy-trace)
streamed from Felix over gRPC into a **~60-min in-memory ring buffer** (no
etcd/API writes — the etcd-cost constraint that drove the design). **Whisker**
is its live web UI at `whisker.viktorbarzin.me` (Authentik-gated,
`auth = "required"` — Whisker has no own login; an additive NetworkPolicy ORs
Traefik past the operator's default-deny `whisker` NP). The ring buffer is
**not** a trail (lost on Goldmane restart). Enabled via operator CRs in
`stacks/calico/main.tf`; reversible toggle (Goldmane is OSS tech-preview).
2. **`goldmane-edge-aggregator`** (`stacks/goldmane-edge-aggregator`) — streams
Goldmane's gRPC `Flows.Stream` over **mTLS** and upserts the low-cardinality
namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,
flow_count)`) into CNPG DB `goldmane_edges`. Self-edges and empty-namespace
(public-internet) flows are dropped — in-cluster relationships only. The mTLS
client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`**
(Goldmane verifies CA-chain only, not identity) rather than copying the CA
private key into TF state — **re-apply the stack if the operator rotates that
Secret**.
3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to
**`#alerts`** (reuses the alert-digest webhook). All Slack now consolidates to
`#alerts`; the `#security` channel was abandoned 2026-06-25 because that
webhook's Slack app isn't a member of it (a `#security` override 404s). See
runbook.
The trail is **attribution-grade, not cryptographic** (reconstructs events in a
trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model
limit; east-west stays plaintext, no mTLS between app pods). Health is covered by
the **`AggregatorDown`** + **`DigestFailing`** alerts and cluster-health check #48
(see monitoring.md). Full as-built, query recipes, and troubleshooting:
[runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md). Decision:
[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md); glossary
`CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
### TLS & HTTP/3
**Traefik** handles TLS termination:


@ -0,0 +1,97 @@
# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
> Filename kept for inbound links. The originally-suspected cause (kubeadm-config
> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
> drift was a real *separate* latent bug fixed in the same change.
**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
the master control-plane phase for the first time — preflight passed, etcd
snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the
kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
static-pod-hash window across all internal retries, then auto-rolled-back to
v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
No data loss; no user-facing outage (the master carries control-plane taints, so
no workloads were displaced).
**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
static pods, i.e. the first time the upgrade pushes real write-IO at etcd.
## Root cause — etcd IO starvation on the shared HDD
The new kube-apiserver could not establish/keep a working connection to etcd
during the upgrade because **etcd was IO-starved**. etcd's surviving container log
from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows:
- **1,180** `apply request took too long` warnings in 16 minutes;
- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms),
clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
to bring the new apiserver up.
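A count like the one above can be reproduced from a saved copy of the etcd container log with a plain `grep`. The helper below is a minimal sketch; the function name is hypothetical, and it assumes the log is line-oriented text in which each warning carries the literal message `apply request took too long`.

```shell
# Count "apply request took too long" warnings in an etcd container log.
# Assumption: one log entry per line, as in a container log file on disk.
count_slow_applies() {
  # grep -c prints the number of matching lines but exits 1 when that number
  # is 0, so `|| true` keeps the helper usable under `set -e`.
  grep -c 'apply request took too long' "$1" || true
}
```

Dividing the result by the window length gives a rate: 1,180 warnings over 16 minutes is roughly 74 slow applies per minute.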
A reproduced 1.35.6 apiserver with no etcd dies with
`F instance.go:233 Error creating leases: error creating storage factory: context
deadline exceeded` — the same failure mode a multi-second etcd produces. etcd
lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
that spindle:
1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
2. kubeadm dumping a full **~400MB etcd DB backup** to
`/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
cleans them up), pushing master root fs to **73%**, above the 70% kubelet
image-GC threshold, so image GC churned during the drain too;
3. master-drain pod evictions.
### Correction — it was NOT the OIDC flag swap
`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
`--authentication-config` (structured multi-issuer OIDC) back to legacy
single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test
etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
were also ruled out.
## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
apiserver auth is configured in three places that must agree:
(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
+ `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest
(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM —
which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
the manifest from (3), so it would have reverted structured auth → **dashboard +
kubectl SSO break after a successful upgrade** (recoverable: the chain's
post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.
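Drift of this kind can be spotted before an upgrade by scanning the `kubeadm upgrade diff` output for a removed `--authentication-config` line. The sketch below assumes that output is a unified diff (removed lines prefixed with a single `-`); the function name is hypothetical and this is not the chain's actual preflight code.

```shell
# Read unified-diff text on stdin. Exit 0 if any removed line mentions
# --authentication-config (the regenerated manifest would drop the flag),
# exit 1 otherwise.
diff_drops_auth_config() {
  # '^-[^-]' matches removed lines while skipping the '---' file header.
  # -e is needed because the pattern itself contains leading dashes later on.
  grep -Eq -e '^-[^-].*--authentication-config'
}
```

A preflight could run `kubeadm upgrade diff v1.35.6 | diff_drops_auth_config` and emit a warning when it succeeds. Warning rather than blocking fits the recoverable nature of the SSO breakage described above.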
## Resolution
1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%.
2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps.
3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).
## Prevention (landed in this change)
| Gap | Fix |
|-----|-----|
| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
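The pruning rule in the first row of the table can be sketched as below. It is a minimal illustration under stated assumptions, not the contents of `upgrade-step.sh`: the function name is hypothetical, and `-mmin` is a GNU/BSD `find` extension assumed to be present on the node.

```shell
# Hypothetical prune helper: delete kubeadm scratch entries older than N days
# (default 3) directly under the given directory. Best-effort by design:
# every path through the function returns 0.
prune_kubeadm_scratch() {
  local dir="$1" days="${2:-3}"
  [ -d "$dir" ] || return 0
  # -mmin counts minutes; N days = N * 1440 minutes.
  # Only the two kubeadm scratch name patterns are touched.
  find "$dir" -mindepth 1 -maxdepth 1 \
    \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) \
    -mmin "+$((days * 1440))" -exec rm -rf -- {} + 2>/dev/null
  return 0
}
```

Called as `prune_kubeadm_scratch /etc/kubernetes/tmp 3`, it leaves recent backups and unrelated files alone. Swallowing errors is deliberate here, since a failed cleanup should not abort an upgrade preflight.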
## Lessons
- **Capture the failing component's own logs before concluding.** The `kubeadm
upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is
"what config changes," not "why it crashed."
- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
backup copy + drain) onto that spindle. code-oflt is the real fix.
- **Tools that leave per-operation scratch must be reaped.** kubeadm's
`/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
GC'd; 28GB had silently accumulated.
- **Out-of-band control-plane edits must be written back to kubeadm-config** — else
`kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).


@ -11,6 +11,11 @@ inference every six hours and backs up only the `claudeAiOauth` object to:
secret/workstation/claude-users/<os-user>
```
The backup **merges** into that path (`vault kv patch -method=rw`, falling back to
`kv put` only when the path does not exist yet), so keys that other tools
co-locate there — notably `homelab vault`'s `vaultwarden_*` credentials — survive.
A blind `kv put` here silently wiped them on every six-hourly run (fixed 2026-06-26).
The user's unrelated `mcpOAuth` credentials never leave their home directory.
Each renewal service has a distinct 32-day periodic Vault token, mode `0600`, at
`~/.config/claude-auth-sync/vault-token`. Its policy can access only that user's


@ -0,0 +1,301 @@
# Goldmane Flow Trail — east-west "who-talks-to-whom" observability
> As-built runbook for the Calico Goldmane + Whisker flow plane and the
> `goldmane-edge-aggregator` durable audit trail. Design + rationale:
> [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
> Glossary: `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
> Implements infra issues #57 (Whisker ingress), #58 (aggregator), #61
> (monitoring), #62 (egress allowlist queries), #63 (these docs).
## What the trail is
Three layers turn raw east-west traffic into a queryable, durable record of
which Service talks to which. **Service identity = the workload's namespace**
(primary), refined by a `service-identity` label in the few multi-Service
namespaces (`monitoring`, `kube-system`, `dbaas`) — see ADR-0014.
| Layer | Component | Lifetime | Where it lives |
|---|---|---|---|
| **Live map** | Calico **Goldmane** + **Whisker** | ~60-min in-memory ring buffer (lost on Goldmane restart) | `calico-system`; Whisker UI at `whisker.viktorbarzin.me` |
| **Durable trail** | `goldmane-edge-aggregator` (`aggregate` mode) | persistent | CNPG Postgres DB `goldmane_edges`, table `edge` |
| **Notification** | `goldmane-edges-digest` CronJob (`digest` mode) | daily | Slack `#alerts` |
**Goldmane** aggregates identity-stamped flows (namespace / pod / workload /
labels + allow-deny + policy-trace) streamed from Felix (the existing
`calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer —
**nothing is written to etcd or the K8s API** (the etcd-cost constraint that
drove the whole design). **Whisker** is its live web UI. Because the ring
buffer is *not* a trail (a Goldmane restart loses the window), the
`goldmane-edge-aggregator` consumes Goldmane's gRPC `Flows.Stream` API over
mTLS and upserts the unique **namespace-pair edge set** into Postgres; a daily
CronJob posts first-seen edges to Slack.
The edge set is deliberately **low-cardinality** — one row per
`(src_ns, dst_ns, action)`, *not* per-pod or per-port — so the table stays
small no matter how much traffic flows.
## Where the data lives
### Whisker UI — live, ~60 min
- `https://whisker.viktorbarzin.me` (Authentik-gated — Whisker ships no own
login; `auth = "required"`). Shows the live flow stream + a service graph for
roughly the last hour. Use it for "what is talking right now"; it is **not**
history.
- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
(HTTP), both in `calico-system`.
### CNPG `goldmane_edges` — durable
- Postgres DB `goldmane_edges` on the CNPG cluster
(`pg-cluster-rw.dbaas.svc.cluster.local:5432`). One table:
```
edge(src_ns text, dst_ns text, action text,
first_seen timestamptz, last_seen timestamptz, flow_count bigint,
PRIMARY KEY (src_ns, dst_ns, action))
```
- `action` is one of `allow` / `deny` / `pass` / `unspecified` (normalised Goldmane
action).
- **Self-edges (`src_ns == dst_ns`) and empty-namespace flows** (host-endpoint
/ public-internet) are **dropped** — the trail is about in-cluster service
relationships only. (Egress to the public internet is therefore NOT in this
table; it lives in the Wave-1 Calico flow-log path — see security.md.)
- A **"new edge"** = a row whose `first_seen` falls inside the digest window.
- Role `goldmane_edges` (Vault-rotated, 7-day) owns the DB. The `edge` table
is created idempotently by the aggregator at startup (canonical DDL also in
the repo at `migrations/0001_edge.sql`).
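The drop rules and the one-row-per-pair upsert described in the list above can be sketched as two small helpers. Both names are hypothetical, and the exact statement the aggregator issues is an assumption; only the table shape and its primary key come from the schema above.

```shell
# Decide whether a flow's namespace pair belongs in the edge table.
# Returns 0 (keep) or 1 (drop), mirroring the two drop rules.
keep_edge() {
  local src="$1" dst="$2"
  # Empty namespace on either side: host endpoint or public internet.
  [ -n "$src" ] && [ -n "$dst" ] || return 1
  # Self-edge: source and destination namespace are the same.
  [ "$src" != "$dst" ] || return 1
  return 0
}

# Print an upsert for one observed pair. The conflict target is the table's
# primary key, so a repeat observation updates the existing row.
# Values are interpolated unescaped, which is acceptable only in a sketch.
edge_upsert_sql() {
  local src="$1" dst="$2" action="$3" count="$4"
  cat <<SQL
INSERT INTO edge (src_ns, dst_ns, action, first_seen, last_seen, flow_count)
VALUES ('$src', '$dst', '$action', now(), now(), $count)
ON CONFLICT (src_ns, dst_ns, action) DO UPDATE
  SET last_seen = now(), flow_count = edge.flow_count + EXCLUDED.flow_count;
SQL
}
```

Because the conflict branch never touches `first_seen`, that column keeps the time of the first observation, which is what makes the "new edge" definition work.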
### Slack `#alerts` — daily digest
> **Channel note (2026-06-25):** posts to **`#alerts`**. The dedicated `#security` channel was abandoned — the shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of it, so a channel override there returns HTTP `404 channel_not_found`. Everything now posts to `#alerts` (this digest plus alertmanager's `slack-security` receiver, which keeps its `[SECURITY]` styling so security-lane alerts still stand out there).
- CronJob `goldmane-edges-digest` (08:00 Europe/London) posts edges first seen
in the last 24h. Quiet when there are none. Reuses the existing alert-digest
Slack incoming webhook (Vault `secret/viktor` → `alertmanager_slack_api_url`)
— no new webhook was created.
## How to enable / disable
### Goldmane + Whisker (the flow plane)
Operator CRs in **`stacks/calico/main.tf`** — NOT the Helm `goldmane`/`whisker`
flags (those stay `false`; the operator's own `installation`/`apiServer` are
operator-managed via the `goldmanes`/`whiskers.operator.tigera.io` CRDs):
- `kubectl_manifest.goldmane` (kind `Goldmane`) — creating it makes the operator
re-render `calico-node` with the `FELIX_FLOWLOGSGOLDMANESERVER` env (the
operator auto-wires Felix — **do NOT patch FelixConfiguration**), triggering a
supervised `calico-node` DaemonSet roll. Yields `Deployment` + `Service
goldmane:7443`.
- `kubectl_manifest.whisker` (kind `Whisker`, `depends_on` goldmane;
`notifications = Disabled`). Yields `Deployment` + `Service whisker:8081`.
**To disable:** delete those two CRs and re-apply `stacks/calico`. Reversible
toggle (Goldmane is tech-preview in OSS Calico 3.30 — the main standing risk per
ADR-0014).
### Whisker public ingress (infra #57)
Also in `stacks/calico/main.tf`:
- `module "ingress_whisker"` (`ingress_factory`, `auth = "required"`,
`dns_type = "proxied"`) → `whisker.viktorbarzin.me`.
- `kubernetes_network_policy_v1.whisker_allow_traefik` — **required alongside the
ingress**: the operator's own `whisker` NetworkPolicy (owned by the Whisker CR)
is `policyTypes: [Ingress]` with no rules = default-deny ingress to the pod.
This additive NP ORs in an allow for `namespaceSelector
kubernetes.io/metadata.name=traefik` on TCP 8081. Without it Traefik 502s.
### The aggregator + digest (the durable trail) — `stacks/goldmane-edge-aggregator`
A Tier-1 stack (PG state) mirroring the claude-memory pattern. `scripts/tg
apply` from `stacks/goldmane-edge-aggregator/`. It provisions: the namespace,
the mTLS client material, the Postgres DB-init Job, the `DATABASE_URL`
ExternalSecret (Vault static role `pg-goldmane-edges`), the Slack ExternalSecret,
the `aggregate` Deployment, and the `digest` CronJob. **To disable the trail
without touching the flow plane:** scale `deployment/goldmane-edge-aggregator` to
0 (transient) or remove the stack (permanent) — Goldmane/Whisker keep running.
Image: `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (PRIVATE) — the
`goldmane-edge-aggregator` namespace must be in the `ghcr-credentials` Kyverno
allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`,
`local.ghcr_private_namespaces`) or pulls 401. Code repo:
`~/code/goldmane-edge-aggregator` (see its `README.md` + `DEPLOY.md`).
## mTLS cert — the REUSE decision (cert-reuse gotcha)
The aggregator dials `goldmane:7443` over **mutual TLS**. Goldmane requires the
client cert to chain to the **Tigera CA**, but it does **NOT authorize by client
identity** — any Tigera-CA-signed cert is accepted.
Rather than copy the Tigera CA **private key** into Terraform state to mint our
own cert (a needless CA-key exposure; the `hashicorp/tls` provider also clashes
with this repo's global generate-providers/lockfile pattern), the stack
**REUSES the operator-minted, Tigera-CA-signed `whisker-backend-key-pair`
Secret** (`calico-system`), copying its `tls.crt`/`tls.key` into the
`goldmane-client-tls` Secret in the aggregator namespace. The CA *bundle* that
verifies Goldmane's serving cert (`tigera-ca-bundle` ConfigMap, key
`tigera-ca-bundle.crt`) is likewise copied verbatim (a ConfigMap can't be
cross-namespace-mounted).
> **GOTCHA — if the operator rotates `whisker-backend-key-pair`, re-apply
> `stacks/goldmane-edge-aggregator`** to re-sync the copied cert. Symptom of a
> stale copy: the `aggregate` pod logs TLS handshake / `Flows.Stream` failures
> and no `last_seen` updates land in the `edge` table. Hardening follow-up
> (noted in the stack): mint an own-identity cert in-namespace if Whisker is ever
> removed (which would delete the reused source Secret).
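One way to check for a stale copy before reading pod logs is to compare the certificate in the source Secret with the one in the copy. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and it assumes the copied Secret stores the certificate under the same `tls.crt` key as the source.

```shell
# Fetch the base64-encoded tls.crt of a Secret. Needs kubectl access.
# Arguments: namespace, secret name.
secret_crt_b64() {
  kubectl get secret -n "$1" "$2" -o jsonpath='{.data.tls\.crt}'
}

# Compare two base64 cert blobs. Prints "in-sync" or "stale" and returns 1
# when stale. An empty value counts as stale, since it means a lookup failed.
cert_copy_status() {
  if [ -n "$1" ] && [ "$1" = "$2" ]; then
    echo "in-sync"
  else
    echo "stale"
    return 1
  fi
}
```

A manual check would then be `cert_copy_status "$(secret_crt_b64 calico-system whisker-backend-key-pair)" "$(secret_crt_b64 goldmane-edge-aggregator goldmane-client-tls)"`; a `stale` result points at re-applying the stack.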
The Deployment leaves `GOLDMANE_HOST=goldmane.calico-system.svc.cluster.local:7443`
and the default cert/CA paths; the default ServerName (host sans port) is a SAN
on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` /
`GOLDMANE_TLS_INSECURE` override is needed.
## How to query who-talks-to-whom
`psql` into the DB (creds: Vault static role `static-creds/pg-goldmane-edges`, or
exec a CNPG pod). All queries are against the single `edge` table.
```sql
-- Everything talking to a namespace (inbound), most-active first
SELECT src_ns, action, flow_count, first_seen, last_seen
FROM edge WHERE dst_ns = '<ns>' ORDER BY flow_count DESC;
-- Everything a namespace talks TO (outbound)
SELECT dst_ns, action, flow_count, first_seen, last_seen
FROM edge WHERE src_ns = '<ns>' ORDER BY last_seen DESC;
-- New edges in the last 24h (what the digest reports)
SELECT src_ns, dst_ns, action, flow_count, first_seen
FROM edge WHERE first_seen > now() - interval '24 hours'
ORDER BY first_seen DESC;
-- Any DENIED edges (policy is dropping this pair)
SELECT src_ns, dst_ns, flow_count, last_seen
FROM edge WHERE action = 'deny' ORDER BY last_seen DESC;
-- Full edge set as a graph adjacency list
SELECT src_ns, dst_ns, action, flow_count FROM edge ORDER BY src_ns, dst_ns;
```
For the **live** (sub-hour) view including pod/port detail, use the Whisker UI —
the `edge` table intentionally aggregates that away.
## Deriving the Wave-1 egress allowlist from the edge table (infra #62)
The durable edge set is a faster, identity-stamped data source for the existing
**observe-then-enforce** egress effort (beads `code-8ywc`; snapshot
`docs/architecture/wave1-egress-observation-2026-05-22.md`) than the original
iptables-`LOG` → journald → Loki path (ADR-0014 consequence: "Enforcement gains
a better data source"). It replaces the *internal* (namespace-to-namespace) leg
of the allowlist; **external/public-internet egress is NOT in this table** (empty
dst namespace, dropped) — for those destinations keep using the Calico flow-log
path described in security.md.
**Per-namespace internal egress allowlist** — the set of in-cluster namespaces a
given source is *observed* talking to with `action='allow'`:
```sql
-- Internal egress allowlist for one namespace (feeds its NetworkPolicy)
SELECT DISTINCT dst_ns
FROM edge
WHERE src_ns = '<ns>' AND action = 'allow'
ORDER BY dst_ns;
```
```sql
-- Full internal egress matrix for all namespaces at once
SELECT src_ns, array_agg(DISTINCT dst_ns ORDER BY dst_ns) AS allowed_dst_ns
FROM edge
WHERE action = 'allow'
GROUP BY src_ns
ORDER BY src_ns;
```
```sql
-- Sanity: namespaces with a DENY edge already (policy is biting; investigate
-- before tightening further)
SELECT DISTINCT src_ns, dst_ns FROM edge WHERE action = 'deny';
```
**How this feeds enforcement (scope):** the derived `dst_ns` set is the
*internal* half of a namespace's egress allowlist — it tells you which
in-cluster namespaces to permit before flipping that namespace to default-deny.
The universal baseline (kube-dns :53, often dbaas :3306/:5432, redis :6379) and
the external destinations still come from the Wave-1 observation snapshot.
**Enforce-flips remain OUT OF SCOPE** here — this is observe-and-derive only;
the phased per-namespace default-deny rollout (starting `recruiter-responder`)
is tracked under `code-8ywc`. Cross-links:
[security.md → NetworkPolicy Default-Deny Egress](../architecture/security.md#networkpolicy-default-deny-egress-wave-1--observe-then-enforce-tier-34),
[wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md),
[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
> **Caveat (same as the Wave-1 snapshot):** an edge only exists if it was
> *observed*. A weekly CronJob or a 7-day Vault rotation may not have fired yet —
> collect ≥7 days of edges before treating a namespace's `allow` set as
> complete. The `first_seen` column tells you how long an edge has been known;
> the digest surfaces brand-new ones daily.
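To make the "feeds its NetworkPolicy" step concrete, the sketch below turns a list of observed `dst_ns` values into an egress NetworkPolicy manifest. It is illustrative only and covers the internal half alone: the function name, the policy name `allow-observed-egress`, and the empty `podSelector` are assumptions, and the universal baseline and external destinations would still have to be added by hand.

```shell
# Render an egress NetworkPolicy for namespace $1, allowing the namespaces
# listed one per line on stdin (for example the dst_ns column of the
# per-namespace query). Policy name and podSelector are illustrative.
render_egress_policy() {
  local ns="$1" dst
  cat <<YAML
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-observed-egress
  namespace: $ns
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
YAML
  # One `to` rule per observed destination namespace; blank lines are skipped.
  while IFS= read -r dst; do
    [ -n "$dst" ] || continue
    cat <<YAML
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: $dst
YAML
  done
}
```

For example, `psql -At -c "SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'" | render_egress_policy <ns>` prints a manifest for review. With no input the `egress` list stays empty, which under `policyTypes: [Egress]` denies all egress, so the output should be read before anything is applied.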
## Monitoring & health (infra #61)
The aggregator pod has **no `/metrics` endpoint** — health is inferred from
kube-state-metrics. Three complementary signals (memory ids 6598, 6599;
see also [monitoring.md → Security Alerts](../architecture/monitoring.md#security-alerts-wave-1--planned-beads-code-8ywc)):
| Signal | What | Where |
|---|---|---|
| **`AggregatorDown`** | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` for 15m → warning | Prometheus alert group `Network Observability (Goldmane)` in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`; routes `slack-warning` → `#alerts` |
| **`DigestFailing`** | `kube_job_status_failed{...job_name=~"goldmane-edges-digest.*"} > 0` within 24h, for 30m → warning | same alert group → `#alerts` |
| **cluster-health #48** | `check_goldmane_aggregator` reads the Deployment's `Available` condition (missing or not-Available → FAIL) | `scripts/cluster_healthcheck.sh` (human / `--quiet` / `--json` modes; emits `goldmane_aggregator`) |
The two alert layers are deliberately complementary: `AggregatorDown` →
**no new edges land** in the DB; `DigestFailing` → **edges still land but nobody
is told**. A freshness probe (#61b) was intentionally skipped — `AggregatorDown`
is the agreed floor.
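The Available-condition logic behind the cluster-health check can be approximated as below. This is a sketch, not the code in `scripts/cluster_healthcheck.sh`: both function names are hypothetical, and the real check's output format is not reproduced.

```shell
# Map the Deployment's Available condition status to a check result.
# Anything other than the literal "True" fails, including an empty string
# (condition missing, or the lookup itself failed).
classify_available() {
  if [ "$1" = "True" ]; then
    echo "PASS"
  else
    echo "FAIL"
  fi
}

# Needs kubectl access: read the condition, then classify it.
check_aggregator_sketch() {
  local status
  status=$(kubectl get deployment -n goldmane-edge-aggregator goldmane-edge-aggregator \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null) || status=""
  classify_available "$status"
}
```

Treating a missing condition as `FAIL` mirrors the "missing or not-Available → FAIL" rule in the table, so a deleted Deployment is reported the same way as an unhealthy one.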
## Troubleshooting
**Whisker UI 502 / unreachable.** The additive
`kubernetes_network_policy_v1.whisker_allow_traefik` is missing or the
operator's default-deny `whisker` NP regenerated — re-apply `stacks/calico`. A
brand-new ingress host is also invisible to LAN split-horizon until the hourly
`technitium-ingress-dns-sync` runs (memory #5349); test meanwhile with
`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
(expect a 302 to Authentik — the gate working).
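The check above can be wrapped in a small classifier. This is a sketch only: `gate_status` and its output labels are hypothetical names, and it assumes the first line of `curl -sSI` output is passed in as a single string.

```shell
# Hypothetical helper: classify the status line printed by `curl -sSI ... | head -1`.
gate_status() {
  local status_line="${1:-}" code
  status_line="${status_line%$'\r'}"   # curl header lines end in CR; drop it
  # The second whitespace-separated field is the status code ("HTTP/2 302" -> 302).
  code=$(awk '{print $2}' <<<"$status_line")
  case "$code" in
    302) echo "gate-working" ;;         # redirect to Authentik, as expected
    502) echo "backend-unreachable" ;;  # the Whisker UI 502 case described above
    "")  echo "no-response" ;;
    *)   echo "unexpected-$code" ;;
  esac
}

# Usage (needs network access, so it is left commented out here):
# gate_status "$(curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 \
#   https://whisker.viktorbarzin.me | head -1)"
```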
**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).
Common causes, in order:
1. **Stale mTLS cert** — the operator rotated `whisker-backend-key-pair`; re-apply
`stacks/goldmane-edge-aggregator` (see cert-reuse gotcha above). Symptom: TLS
handshake / `Flows.Stream` errors.
2. **Stale DB password** — the 7-day Vault rotation bounced the credential but
the pod kept the old one. The Deployment carries
`secret.reloader.stakater.com/reload: goldmane-edges-db-creds`; if it's not
restarting on rotation, verify the Reloader annotation and the ExternalSecret.
3. **Goldmane restarted** — the in-memory window was lost (expected); the stream
reconnects automatically and resumes upserting. No data loss in the DB
(only the sub-hour live window in Whisker is gone).
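A rough sketch of applying that ordering to a chunk of pod logs. The function name `guess_cause` is hypothetical, and the matched strings are assumptions: `Flows.Stream` and TLS come from the symptom noted in cause 1, while `password authentication failed` is the usual PostgreSQL wording and may not be what the aggregator actually logs.

```shell
# Hypothetical triage helper: guess the likely cause from aggregator log text.
# The patterns are heuristics, checked in the same order as the list above.
guess_cause() {
  local logs="${1:-}"
  if grep -qiE 'tls|Flows\.Stream' <<<"$logs"; then
    echo "stale-mtls-cert"
  elif grep -qi 'password authentication failed' <<<"$logs"; then
    echo "stale-db-password"
  else
    echo "unknown"
  fi
}

# Usage (needs cluster access, so it is left commented out here):
# guess_cause "$(kubectl logs -n goldmane-edge-aggregator \
#   deploy/goldmane-edge-aggregator --tail=200)"
```

A Goldmane restart (cause 3) recovers on its own, so the sketch does not try to detect it.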
**Digest never posts / `DigestFailing` firing.** Inspect the most recent
`goldmane-edges-digest-*` Job (`kubectl get jobs -n goldmane-edge-aggregator`;
`kubectl logs job/<name>`). The CronJob's `ttl_seconds_after_finished=86400` GCs
pods after a day, so check soon after a failed run. With `SLACK_WEBHOOK_URL`
empty the binary forces a dry-run (no post) — verify the `goldmane-edges-slack`
ExternalSecret resolved. A dry run / smoke test: run the image with `args:
["digest"]` + `DRY_RUN=1` to print the message instead of POSTing.
> Known state (2026-06-25): the digest CronJob's first Job **failed** and it has
> never successfully posted (`lastSuccessfulTime` empty) — the digest leg is the
> live gap; `DigestFailing` is catching it. Edges still land in the DB via the
> `aggregate` Deployment; only the `#alerts` digest notification is affected.
> Investigation/fix belongs to the aggregator slice (#58/#60), not monitoring.
**No edges at all in the table.** Confirm Goldmane is enabled
(`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the
`FELIX_FLOWLOGSGOLDMANESERVER` env; confirm the `goldmane-edges-db-init` Job
completed; confirm the aggregator pod is `Running` and not `ImagePullBackOff`
(ghcr allowlist).
## Related
- [ADR-0014 — Service identity & east-west observability](../adr/0014-service-identity-and-east-west-observability.md)
- [security.md — NetworkPolicy Default-Deny Egress + east-west flow observability](../architecture/security.md)
- [monitoring.md — east-west flow observability + alerts](../architecture/monitoring.md)
- [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md)
- `CONTEXT.md` glossary — **Service identity**, **Goldmane / Whisker**
- Code: `~/code/goldmane-edge-aggregator` (`README.md`, `DEPLOY.md`); stacks
`stacks/goldmane-edge-aggregator`, `stacks/calico`

@@ -41,6 +41,8 @@ Job 0 — preflight (pinned: k8s-node1)
 ├── halt-on-alert (kured-style ignore-list)
 ├── 24h-quiet baseline (no Ready transitions <24h ago)
 ├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume)
+├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block)
+├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups)
 ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
 ├── Trigger backup-etcd Job, wait, verify snapshot byte count
 ├── SSH master: containerd skew fix (if master < workers)
@@ -222,22 +224,34 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names
 
 ## Common Operations
 
-### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19)
+### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24)
 
 `kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
-and drops the `--authentication-config` flag**, silently disabling apiserver
-OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get
-401). This used to require a manual re-apply after **every** control-plane bump.
+from kubeadm-config**. apiserver auth uses a structured multi-issuer
+`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to
+still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade
+reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does
+NOT crash on this — verified by isolated repro; it's recoverable via the restore
+script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue —
+etcd IO starvation**, not this drift; post-mortem:
+`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`.
 
-**Now automated:** the `rbac` stack publishes its OIDC restore script to the
-`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's
-`phase_master` re-runs it on master immediately after `kubeadm upgrade apply`
-(while tigera-operator is still quiesced, so the flag-add apiserver restart can't
-crashloop the operator). It's idempotent, health-gates `/livez` with
-auto-rollback, and is **non-fatal** — a failure only lags SSO until the next rbac
-apply (the version upgrade itself already succeeded). So a chain-driven
-control-plane bump no longer breaks SSO. The master phase self-skips when master
-is already at target, so this only runs when master was actually upgraded.
+**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now
+**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting
+`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of
+its remote script. So kubeadm regenerates a **correct** manifest and the apiserver
+upgrades with a pure image bump — `kubeadm upgrade diff <target>` shows only the
+image change. Zero live impact (the CM is read only during an upgrade).
+
+**Backstops:**
+- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does
+NOT block — the drift only breaks SSO, which is recoverable) if
+`--authentication-config` would still be dropped.
+- The `rbac` stack still publishes its restore script to the
+`kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on
+master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with
+auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also*
+re-reconciles kubeadm-config. Self-skips when master is already at target.
 
 **Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the
 chain logged `WARN: --authentication-config absent after re-apply`:

@@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
 [[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
 KUBECTL=""
 JSON_RESULTS=()
-TOTAL_CHECKS=47
+TOTAL_CHECKS=48
 
 # Parallel execution settings. Each check function is self-contained — it
 # only reads cluster state and mutates the in-memory counters / JSON_RESULTS
@@ -3156,6 +3156,44 @@ PYEOF
     esac
 }
 
+# --- 48. Goldmane edge-aggregator availability ---
+#
+# The goldmane-edge-aggregator Deployment (ADR-0014 / infra #58) streams Calico
+# Goldmane flows into the goldmane_edges CNPG DB — the durable who-talks-to-whom
+# trail. The pod has NO /metrics endpoint, so its liveness can't be scraped;
+# this check reads the Deployment's Available condition directly so the trail
+# silently dying surfaces in the health board (mirrors the AggregatorDown
+# Prometheus alert). Missing Deployment / not-Available -> FAIL.
+check_goldmane_aggregator() {
+    section 48 "Goldmane Edge-Aggregator"
+    local ns="goldmane-edge-aggregator" dep="goldmane-edge-aggregator"
+    local avail desired ready
+
+    # One get; absent Deployment is a hard fail (the trail isn't deployed).
+    if ! $KUBECTL get deploy "$dep" -n "$ns" >/dev/null 2>&1; then
+        [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
+        fail "Deployment $ns/$dep not found — who-talks-to-whom edge trail is not running"
+        json_add "goldmane_aggregator" "FAIL" "deployment missing"
+        return 0
+    fi
+
+    avail=$($KUBECTL get deploy "$dep" -n "$ns" \
+        -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)
+    ready=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.status.readyReplicas}' 2>/dev/null)
+    desired=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.spec.replicas}' 2>/dev/null)
+    ready=${ready:-0}
+    desired=${desired:-0}
+
+    if [[ "$avail" == "True" ]]; then
+        pass "Edge-aggregator Available ($ready/$desired ready)"
+        json_add "goldmane_aggregator" "PASS" "${ready}/${desired} ready"
+    else
+        [[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
+        fail "Edge-aggregator NOT Available ($ready/$desired ready) — edge trail has stopped recording"
+        json_add "goldmane_aggregator" "FAIL" "${ready}/${desired} ready; Available=${avail:-unknown}"
+    fi
+}
+
 # --- Summary ---
 print_summary() {
     if [[ "$JSON" == true ]]; then
@@ -3224,7 +3262,7 @@ main() {
         check_monitoring_prom_am check_monitoring_vault check_monitoring_css
         check_external_replicas check_external_divergence check_pve_thermals
         check_pve_load check_external_traefik_5xx check_ha_status_dashboard
-        check_immich_search check_csi_ghost_drift
+        check_immich_search check_csi_ghost_drift check_goldmane_aggregator
     )
 
     # Auto-fix mutates cluster state inside individual checks — keep that

@@ -28,5 +28,61 @@ ok "accept own scoped Vault token" cas_vault_identity_ok token-devvm-claude-auth
 no "reject another user's token" cas_vault_identity_ok token-devvm-claude-auth-anca default,workstation-claude-anca
 no "reject wrong policy" cas_vault_identity_ok token-devvm-claude-auth-emo default,workstation-claude-anca
 
+# --- Regression: cas_backup must MERGE into the shared Vault path, preserving
+# sibling keys that other tools co-locate there (e.g. `homelab vault`'s
+# vaultwarden_* creds) — NOT overwrite the whole KV document. A blind `kv put`
+# wiped them every 6h (claude-auth-sync clobber, 2026-06-26).
+fakebin="$tmp/bin"; mkdir -p "$fakebin"
+store="$tmp/vault-store.json"
+cat > "$fakebin/vault" <<'FAKE'
+#!/usr/bin/env bash
+# Minimal KV-v2 fake backed by $VAULT_FAKE_STORE (a flat JSON object).
+[[ "$1" == kv ]] || { echo '{}'; exit 0; } # token lookup etc. -> ignore
+op="$2"; shift 2
+store="$VAULT_FAKE_STORE"
+case "$op" in
+  get)
+    for a in "$@"; do [[ "$a" == -field=* ]] && field="${a#-field=}"; done
+    if [[ "$*" == *-format=json* ]]; then
+      [[ -f "$store" ]] || { echo "No value found"; exit 2; }
+      jq -n --argjson d "$(cat "$store")" '{data:{data:$d}}'; exit 0
+    fi
+    [[ -f "$store" ]] || exit 2 # bare get == existence check
+    if [[ -n "${field:-}" ]]; then
+      v="$(jq -r --arg k "$field" '.[$k] // empty' "$store")"; [[ -n "$v" ]] || exit 1
+      printf '%s' "$v"; exit 0
+    fi
+    exit 0 ;;
+  put) echo '{}' > "$store" ;; # full replace
+  patch) [[ -f "$store" ]] || { echo "No value found"; exit 2; } ;; # merge (rw)
+  *) exit 1 ;;
+esac
+for a in "$@"; do
+  case "$a" in
+    -*|secret/*) continue ;; # flags + the path arg
+    *=*) k="${a%%=*}"; v="${a#*=}"
+      t="$(mktemp)"; jq --arg k "$k" --arg v "$v" '.[$k]=$v' "$store" > "$t" && mv "$t" "$store" ;;
+  esac
+done
+exit 0
+FAKE
+chmod +x "$fakebin/vault"
+
+CAS_VAULT_PATH="secret/workstation/claude-users/test"
+CAS_CREDENTIALS="$tmp/credentials.json"
+CAS_STATE_DIR="$tmp/state"
+_oldpath="$PATH"; PATH="$fakebin:$PATH"; export VAULT_FAKE_STORE="$store"
+
+printf '{"vaultwarden_master_password":"keep-me"}\n' > "$store" # pretend `homelab vault setup` ran
+ok "backup succeeds (existing doc)" cas_backup
+eq "merge preserves sibling key" keep-me "$(jq -r '.vaultwarden_master_password' "$store")"
+eq "merge writes claude oauth" access "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")"
+
+rm -f "$store" # fresh user: no doc yet
+ok "backup succeeds (creates doc)" cas_backup
+eq "create writes claude oauth" access "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")"
+PATH="$_oldpath"; unset VAULT_FAKE_STORE
+
 printf '\n%d passed, %d failed\n' "$pass" "$fail"
 (( fail == 0 ))

@@ -82,7 +82,17 @@ cas_backup() {
         return 1
     }
     expires="$(jq -r '.expiresAt' <<<"$oauth")"
-    vault kv put "$CAS_VAULT_PATH" \
+    # MERGE into the shared path so sibling keys other tools co-locate there
+    # (e.g. `homelab vault`'s vaultwarden_* creds) survive. `kv patch -method=rw`
+    # is read+update (needs no `patch` capability) but requires the secret to
+    # already exist, so create it with `kv put` on the very first backup only.
+    local -a write_cmd
+    if vault kv get "$CAS_VAULT_PATH" >/dev/null 2>&1; then
+        write_cmd=(vault kv patch -method=rw "$CAS_VAULT_PATH")
+    else
+        write_cmd=(vault kv put "$CAS_VAULT_PATH")
+    fi
+    "${write_cmd[@]}" \
         claude_ai_oauth_json="$oauth" \
         credential_expires_at_ms="$expires" \
         backed_up_at="$(date -Is)" >/dev/null || {

@@ -19,13 +19,29 @@ unpinned-CLI dependencies out of the hourly **root** reconcile.
 
 - `mattpocock/skills` (https://github.com/mattpocock/skills) — all except `find-skills`
 - `vercel-labs/skills` (https://github.com/vercel-labs/skills) — `find-skills`
+- **homelab-local, emo-PERSONALIZED**: `cluster-health` here is an
+**emo-specific variant**, not a copy of the canonical skill. It started as a
+copy of this repo's `.claude/skills/cluster-health/` but was rewritten on
+2026-06-26 to focus on ha-sofia + emo's Sofia devices (emo is the only entry
+in `SKILL_USERS`, a read-only power-user). The canonical admin skill
+(`.claude/skills/cluster-health/`) is the full 47-check version and is left
+untouched. **Do NOT `cp -a` the canonical copy over this one** — that would
+clobber the personalization. Maintain the two independently.
 
 ## Refreshing
 
-Re-snapshot from a current install and commit the diff:
+Re-snapshot the upstream skills from a current install and commit the diff:
 
 ```sh
 cp -a ~/.agents/skills/. scripts/workstation/claude-skills/
 ```
 
-Snapshot taken 2026-06-23.
+`cluster-health` is hand-maintained (emo variant) — it is **not** covered by the
+`cp -a` above and must **not** be overwritten from `.claude/skills/`. Edit it in
+place here when emo's needs change, then refresh his live copy (the provisioner's
+`install_skills()` is if-absent, so it won't update an existing `~/.agents/skills`
+copy — `cp` the new `SKILL.md` to `/home/emo/.agents/skills/cluster-health/` and
+`chown emo:emo`, or remove emo's copy and re-run the reconcile).
+
+Snapshot taken 2026-06-23 (upstream); `cluster-health` vendored 2026-06-26,
+personalized for emo 2026-06-26.

@ -0,0 +1,146 @@
---
name: cluster-health
description: |
Personalized for emo. Check whether the homelab Kubernetes cluster is
affecting ha-sofia or the Sofia smart-home devices it runs (Tuya devices,
the MPPT ATS, lights, climate, security, irrigation). Use when:
(1) "is ha-sofia ok", "are my devices / the ATS / the lights down",
(2) "is the cluster affecting Sofia / my devices",
(3) "check the cluster", "cluster health", "is everything running",
(4) a device on the Барзини → Статус dashboard looks offline.
Runs the cluster-wide healthcheck read-only and triages it by what
ha-sofia actually depends on; the rest of the cluster is the admin's area.
author: Claude Code
version: 3.0.0-emo
date: 2026-06-26
---
# Cluster Health — personalized for emo (ha-sofia focus)
## What you actually care about
You care about **ha-sofia** and the **Sofia smart-home devices** it runs —
the Tuya devices, the **MPPT ATS**, and the lights / climate / security /
irrigation on your **Барзини → Статус** dashboard. The wider Kubernetes
cluster matters to you **only when it's breaking something ha-sofia or your
devices depend on.** Anything else is the admin's (wizard's) area — note it in
one line and move on; don't chase it.
You have **read-only** cluster access. You can SEE everything but change
nothing — so when something on your chain is broken, the job is to confirm it
and hand it off, not to repair it.
## How ha-sofia depends on the cluster
ha-sofia itself runs at the house (HAOS at https://ha-sofia.viktorbarzin.me) —
**not** in the cluster. The cluster reaches it through exactly two things:
1. **tuya-bridge** (namespace `tuya-bridge`) — the REST API ha-sofia calls for
every Tuya device **and the MPPT ATS**. If it's unhealthy, your Tuya devices
+ ATS stop responding. **This is the #1 thing to check.**
2. **The path that carries ha-sofia ⇄ tuya-bridge and keeps ha-sofia
reachable**: cloudflared (tunnel) → Traefik (LB) → the ingress + TLS cert
for `tuya-bridge.viktorbarzin.me` and `ha-sofia.viktorbarzin.me`, plus
Technitium DNS. If any of these break, ha-sofia can't reach tuya-bridge and
you can't reach ha-sofia remotely.
Everything else in the cluster is unrelated to you unless it's hosting one of
those pods.
## Step 1 — run the healthcheck (read-only, with your HA token)
Your account can't read Vault, so load your own ha-sofia token first (it was
minted for you and lives at `~/.config/cluster-health/haos_token`). Then run
the script from YOUR clone, read-only:
```bash
cd /home/emo/code
export HOME_ASSISTANT_SOFIA_TOKEN="$(cat ~/.config/cluster-health/haos_token)"
bash scripts/cluster_healthcheck.sh --no-fix --quiet
# machine-readable instead:
# bash scripts/cluster_healthcheck.sh --no-fix --quiet --json | tee /tmp/cluster-health.json
```
- **Never pass `--fix`** — it deletes pods (a write); you're read-only and it
will fail.
- Exit codes: `0` healthy, `1` warnings, `2` failures.
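A small sketch of turning those exit codes into words, for use right after a run. The helper name `describe_exit` is hypothetical and not part of `cluster_healthcheck.sh`.

```shell
# Hypothetical helper: translate the documented exit codes into words.
describe_exit() {
  case "${1:-}" in
    0) echo "healthy" ;;
    1) echo "warnings" ;;
    2) echo "failures" ;;
    *) echo "unknown exit code: ${1:-}" ;;
  esac
}

# Usage (run from your clone; left commented out here):
# bash scripts/cluster_healthcheck.sh --no-fix --quiet; describe_exit "$?"
```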
With the token exported, the **ha-sofia checks run for you**:
26 Entity Availability · 27 Integration Health · 28 Automation Status ·
29 System Resources · **45 Status Dashboard** — your Барзини → Статус view,
classifying every device tile as OK / ⚠️ / Offline across Сигурност, Мрежа &
IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна. Check 30 also
covers the **tuya** exporter.
## Step 2 — triage the output by relevance to YOU
Read the PASS/WARN/FAIL summary, then split the WARN/FAIL items in two:
- **On your chain → this is what matters.** Anything touching: `tuya-bridge`,
`cloudflared`, `traefik`, DNS (check 21), the TLS cert / ingress for your two
hosts (checks 12, 22, 31, 32), or a **node** hosting those pods — plus all the
**ha-sofia** checks (26-29, 45) and the **tuya** exporter (30).
- **Not on your chain → one line, then drop it.** Summarise as "N unrelated
cluster issues (admin's area)" and don't investigate.
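The split can be sketched as a lookup on check numbers. The helper name `on_chain` is hypothetical; the numbers are the ones named in the list above (12, 21, 22, 26 to 29, 30, 31, 32, 45). It cannot tell whether a node-level problem affects a node hosting your pods, so that case still needs a manual look.

```shell
# Hypothetical helper: succeeds if a numbered check is on the ha-sofia chain.
on_chain() {
  case "${1:-}" in
    12|21|22|26|27|28|29|30|31|32|45) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: label a few check numbers.
for n in 14 21 45; do
  if on_chain "$n"; then
    echo "check $n: on your chain"
  else
    echo "check $n: admin's area"
  fi
done
```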
## Step 3 — read-only checks for your chain
All of these work with your read-only access:
```bash
# tuya-bridge — your devices + the ATS
kubectl get pods -n tuya-bridge
kubectl rollout status deploy/tuya-bridge -n tuya-bridge
kubectl logs -n tuya-bridge deploy/tuya-bridge --tail=50
# the reachability path ha-sofia uses
kubectl get pods -n cloudflared
kubectl get pods -n traefik
kubectl get ingress -A | grep -Ei 'tuya-bridge|ha-sofia'
# whole external path in one shot (DNS + tunnel + Traefik + cert):
curl -sI --max-time 10 https://tuya-bridge.viktorbarzin.me | head -1
# reachable -> HTTP/2 200 / 401 / 403 (any HTTP response = path is up)
# broken -> curl: timeout / could not resolve host
```
The fastest **device-level** signal is your own dashboard: open
**https://ha-sofia.viktorbarzin.me → Барзини → Статус**. If devices show
Offline / Разкачен / ⚠️ **but tuya-bridge is healthy**, the problem is at the
house (device power / Wi-Fi / the Sofia TP-Link network) — **not** the cluster.
## Step 4 — if something on your chain is broken
You can't fix the cluster (read-only), so **capture + hand off**:
```bash
kubectl describe pod -n tuya-bridge <pod>
kubectl logs -n tuya-bridge <pod> --previous --tail=200
```
Then file it for the admin with the **`/file-issue`** skill — e.g. *"ha-sofia
Tuya devices + ATS unresponsive; tuya-bridge pod CrashLooping"* with the output
above. cloudflared / Traefik / DNS outages are cluster-wide — the admin's
alerting is already firing, but file it so it's tracked from your side too.
## What will skip for you (expected — not failures)
A few checks need access your account doesn't have. They warn/skip — that's
normal, and **none of them are on your ha-sofia chain**:
- **Uptime Kuma (14)** — needs an admin password from Vault.
- **PVE host checks** — 36 (LVM snapshots), 43 (host thermals), 44 (host load),
and the Proxmox CSI ghost-disk check — all need root SSH to the Proxmox host.
- **`--fix`** — pod deletion (a write); not available to you.
(The ha-sofia checks are **not** in this list — your token makes them work.)
## Your ha-sofia token
- Stored at `~/.config/cluster-health/haos_token` (yours, mode 600).
- It's a **dedicated** long-lived token, named `emo-cluster-health` under
ha-sofia → your profile → **Long-Lived Access Tokens**. Revoking it there
affects only you.
- It currently carries admin-level HA scope (Home Assistant only lets a token
be minted for the account that created it, and it was minted via the admin
account). If it ever stops working, tell wizard and a fresh one can be minted.

@@ -1,4 +1,4 @@
 {
-  "claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner). Do not attempt to escalate privileges or reach another user's resources.\n- Secrets are per-user. Never read another user's home directory, credentials, tokens, or ~/.claude secrets. Your own secrets live in your home at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. Org red-lines on top:\n - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n - Push rejected by branch protection (user not whitelisted) → fall back to a <os-user>/<topic> branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n - Keep every clone on a clean master when done; tell the user in plain words what happened.\n - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.\n",
+  "claudeMd": "# Viktor Barzin homelab — shared multi-user Claude Code Workstation (devvm)\n\nYou are running as a specific OS user on a SHARED devvm Workstation, not as the admin. These org-wide rules apply to EVERY user and sit at the top of settings precedence (they cannot be overridden by a user's own config):\n\n- Respect your permission tier for cluster/infra. kubectl, Vault, and infra access are scoped to your RBAC tier (admin / power-user / namespace-owner) — don't try to exceed it. At the OS level, authorization follows the OS: where your user holds `sudo`, using it is permitted and audited.\n- OS access follows OS permissions, not a separate rule here. You may read whatever your OS user can read — directly or via `sudo` where you hold it — including other users' home directories, credentials, tokens, and `~/.claude`. Do not impose restrictions stricter than the OS; no extra prompt is required for OS-authorized access. Protect your own secrets at mode 600.\n- Infrastructure changes go through Terraform/Terragrunt — never direct kubectl apply/edit/patch. Committed stack changes are auto-applied by CI on push to master; verify the live result with your read-only kubectl.\n- The AGENT does ALL git mechanics silently — the user may not know git, so never ask them to commit, push, pull, or open anything, and never surface git jargon. Lifecycle (worktrees, landing, cleanup): ~/.claude/rules/execution.md. Org red-lines on top:\n - THE COMMIT MESSAGE IS THE AUDIT TRAIL — subject says WHAT changed; body says WHY in plain words (paraphrase the user's actual request).\n - Never use [ci skip] as a non-admin (it hides the change from the audit feed).\n - Push rejected by branch protection (user not whitelisted) → fall back to a <os-user>/<topic> branch + PR via the Forgejo API (token = password field in ~/.git-credentials).\n - Keep every clone on a clean master when done; tell the user in plain words what happened.\n - Full recipe: AGENTS.md → \"Non-admin workstation users\" in your infra clone.\n- Follow the engineering rules in ~/.claude/rules/ (execution, planning) and every CLAUDE.md in the repo tree.\n- Code lives under ~/code in one of two per-user layouts: either ~/code IS the git-crypt-LOCKED infra clone (single layout), or ~/code is a workspace directory of per-project clones — the locked infra clone at ~/code/infra plus other project repos alongside it. [ -d ~/code/.git ] means single. In locked infra clones secret files read as ciphertext — that is expected, not an error.\n",
   "model": "claude-opus-4-8"
 }

@@ -5,6 +5,9 @@ variable "tls_secret_name" {
 variable "nfs_server" { type = string }
 
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -5,6 +5,9 @@ variable "tls_secret_name" {
 variable "nfs_server" { type = string }
 
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"
@@ -42,6 +45,9 @@ data "kubernetes_secret" "eso_secrets" {
 # DB credentials from Vault database engine (rotated automatically)
 # Provides DATABASE_URL that auto-updates when password rotates
 resource "kubernetes_manifest" "db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -6,6 +6,9 @@
 # are non-secret and live in values.yaml. The reloader annotation rolls the
 # authentik pods if the password ever changes.
 resource "kubernetes_manifest" "authentik_email_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -601,6 +601,9 @@ resource "kubernetes_config_map" "beadboard_config" {
 # Pulls the claude-agent-service bearer token from Vault so BeadBoard can
 # dispatch agent jobs via the in-cluster HTTP API.
 resource "kubernetes_manifest" "beadboard_agent_service_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -28,6 +28,9 @@ resource "kubernetes_namespace" "broker_sync" {
 # trading212_api_keys JSON array of {account_id, account_type, api_key, name, currency}
 # imap_host, imap_user, imap_password, imap_directory for InvestEngine + Schwab email ingest
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind = "ExternalSecret"

@@ -212,3 +212,65 @@ resource "kubectl_manifest" "whisker" {
     spec = { notifications = "Disabled" }
   })
 }
+
+# ---------------------------------------------------------------------------
+# Gated public ingress for the Whisker UI (infra #57 / ADR-0014).
+#
+# whisker.viktorbarzin.me -> whisker:8081, Authentik-gated (auth="required":
+# Whisker ships NO own login; it's an admin observability UI, so Authentik
+# forward-auth is the only gate between strangers and the flow view). The
+# operator replicated `tls-secret` into calico-system already.
+#
+# TWO coupled pieces are required because the operator's own `whisker`
+# NetworkPolicy (owned by the Whisker CR above) sets policyTypes:[Ingress]
+# with NO ingress rules => default-deny on ingress to the whisker pod. The
+# additive NP below ORs in a Traefik allow (k8s NetworkPolicies are additive
+# across policies selecting the same pod), so we never edit the operator NP.
+module "ingress_whisker" {
+  source = "../../modules/kubernetes/ingress_factory"
+  dns_type = "proxied"
+  namespace = "calico-system"
+  name = "whisker"
+  service_name = "whisker"
+  port = 8081
+  auth = "required"
+  tls_secret_name = "tls-secret"
+  extra_annotations = {
+    "gethomepage.dev/enabled" = "true"
+    "gethomepage.dev/name" = "Whisker"
+    "gethomepage.dev/description" = "Calico flow observability (who-talks-to-whom)"
+    "gethomepage.dev/icon" = "calico.png"
+    "gethomepage.dev/group" = "Infrastructure"
+  }
+}
+
+# Additive NetworkPolicy: permit Traefik -> whisker:8081. ORs with the
+# operator's default-deny `whisker` NP (selecting the same pod) so Traefik
+# can reach the UI without touching the operator-owned policy.
+resource "kubernetes_network_policy_v1" "whisker_allow_traefik" {
+  metadata {
+    name = "whisker-allow-traefik"
+    namespace = "calico-system"
+  }
+  spec {
+    pod_selector {
+      match_labels = {
+        "app.kubernetes.io/name" = "whisker"
+      }
+    }
+    policy_types = ["Ingress"]
+    ingress {
+      from {
+        namespace_selector {
+          match_labels = {
+            "kubernetes.io/metadata.name" = "traefik"
+          }
+        }
+      }
+      ports {
+        port = "8081"
+        protocol = "TCP"
+      }
+    }
+  }
+}

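To make the additive behaviour described in the Whisker comment concrete, here is a minimal sketch of what an ingress default-deny policy looks like when written with the same Terraform resource type. It is an illustration only: the real `whisker` policy is created by the operator rather than by Terraform, and the resource name, policy name, and pod label below are assumptions chosen to mirror the allow policy in this diff.

```hcl
# Hypothetical illustration only: the operator owns the real `whisker` policy.
resource "kubernetes_network_policy_v1" "example_default_deny" {
  metadata {
    name      = "example-default-deny" # assumed name, not from the repo
    namespace = "calico-system"
  }
  spec {
    pod_selector {
      match_labels = {
        "app.kubernetes.io/name" = "whisker" # assumed to match the operator's selector
      }
    }
    # "Ingress" is listed but no `ingress` block follows, so no inbound
    # traffic is allowed by this policy on its own.
    policy_types = ["Ingress"]
  }
}
```

Because Kubernetes evaluates the union of all policies that select a pod, a policy of this shape combined with `whisker_allow_traefik` would admit only traffic from the `traefik` namespace on TCP 8081.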
@ -19,6 +19,9 @@ resource "kubernetes_namespace" "changedetection" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -41,6 +41,9 @@ resource "kubernetes_namespace" "chrome_service" {
 # --- Secrets (single-key extract: api_bearer_token) ---
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -49,6 +49,9 @@ resource "kubernetes_namespace" "ci_pipeline_health" {
 # billing on PRIVATE mirrors, which a future scoped read:packages rotation of
 # the alias could not do. Blast radius = this single-CronJob namespace.
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -38,6 +38,9 @@ resource "kubernetes_namespace" "claude_agent" {
 # --- Secrets ---
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -57,6 +57,9 @@ resource "kubernetes_service_account" "breakglass" {
 # DENIED this path (see stacks/vault/main.tf) so the shared, prompt-injectable
 # pod can never read it.
 resource "kubernetes_manifest" "external_secret_ssh" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"
@ -82,6 +85,9 @@ resource "kubernetes_manifest" "external_secret_ssh" {
 # Env secrets: the Anthropic OAuth token (shared with claude-agent-service,
 # same account) and the app bearer token (in-cluster/CLI fallback caller auth).
 resource "kubernetes_manifest" "external_secret_env" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -29,6 +29,9 @@ resource "kubernetes_namespace" "claude-memory" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"
@ -57,6 +60,9 @@ resource "kubernetes_manifest" "external_secret" {
 # DB credentials from Vault database engine (rotated every 24h)
 resource "kubernetes_manifest" "db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -5,6 +5,9 @@ variable "tls_secret_name" {
 variable "public_ip" { type = string }
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -23,6 +23,9 @@ resource "kubernetes_namespace" "dawarich" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -745,7 +745,10 @@ resource "kubernetes_deployment" "phpmyadmin" {
     labels = {
       "app" = "phpmyadmin"
       tier  = var.tier
+      # ADR-0014 service identity: dbaas is a multi-Service namespace, so the
+      # namespace alone can't attribute Goldmane flows. Value = the fronting
+      # Service name (kubernetes_service.phpmyadmin is named "pma").
+      "service-identity" = "pma"
     }
     annotations = {
       "reloader.stakater.com/search" = "true"
@ -762,6 +765,10 @@ resource "kubernetes_deployment" "phpmyadmin" {
       metadata {
         labels = {
           "app" = "phpmyadmin"
+          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
+          # disambiguating identity must live on the pod template (not just
+          # the Deployment metadata above). Not in selector, so no replace.
+          "service-identity" = "pma"
         }
       }
       spec {
@ -812,8 +819,19 @@ resource "kubernetes_deployment" "phpmyadmin" {
     }
   }
   lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+      # This Deployment is Keel-enrolled (keel.sh/policy=patch). Ignore the
+      # attributes Keel/Kyverno mutate at runtime so `terragrunt apply` (incl.
+      # the daily drift plan) doesn't fight them or revert the live image;
+      # canonical KEEL/KYVERNO lifecycle guard, matches linkwarden/chrome-service.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+      metadata[0].annotations["keel.sh/match-tag"],
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE: Keel manages tag updates
+      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
+    ]
   }
 }
@ -1499,6 +1517,10 @@ resource "kubernetes_deployment" "pgadmin" {
     }
     labels = {
       tier = var.tier
+      # ADR-0014 service identity: dbaas is a multi-Service namespace, so the
+      # namespace alone can't attribute Goldmane flows. Value = the fronting
+      # Service name (kubernetes_service.pgadmin is named "pgadmin").
+      "service-identity" = "pgadmin"
     }
   }
   spec {
@ -1514,6 +1536,10 @@ resource "kubernetes_deployment" "pgadmin" {
       metadata {
         labels = {
           app = "pgadmin"
+          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
+          # disambiguating identity must live on the pod template (not just
+          # the Deployment metadata above). Not in selector, so no replace.
+          "service-identity" = "pgadmin"
         }
       }
       spec {
@ -1568,8 +1594,20 @@ resource "kubernetes_deployment" "pgadmin" {
     }
   }
   lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+      # This Deployment is Keel-enrolled (keel.sh/policy=patch) and Keel has
+      # bumped the live image (dpage/pgadmin4:9.16). Ignore the Keel/Kyverno
+      # runtime-mutated attributes so `terragrunt apply` (incl. the daily drift
+      # plan) doesn't revert the image to bare `dpage/pgadmin4` or strip Keel's
+      # annotations; canonical guard, matches linkwarden/chrome-service.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+      metadata[0].annotations["keel.sh/match-tag"],
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE: Keel manages tag updates
+      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
+    ]
   }
 }
 resource "kubernetes_service" "pgadmin" {

@ -20,6 +20,9 @@ resource "kubernetes_namespace" "diun" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -20,6 +20,9 @@ resource "kubernetes_namespace" "ebooks" {
 # ExternalSecrets for all three sources
 resource "kubernetes_manifest" "calibre_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"
@ -47,6 +50,9 @@ resource "kubernetes_manifest" "calibre_external_secret" {
 }

 resource "kubernetes_manifest" "audiobookshelf_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"
@ -74,6 +80,9 @@ resource "kubernetes_manifest" "audiobookshelf_external_secret" {
 }

 resource "kubernetes_manifest" "servarr_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -33,6 +33,9 @@ resource "kubernetes_namespace" "f1-stream" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"
@ -62,6 +65,9 @@ resource "kubernetes_manifest" "external_secret" {
 # Pull the chrome-service bearer token into this namespace as a separate
 # Secret so the verifier can reach the in-cluster Playwright pool.
 resource "kubernetes_manifest" "chrome_service_client_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -53,6 +53,9 @@ resource "kubernetes_namespace" "fire_planner" {
 # Seed before applying:
 # secret/fire-planner -> property `recompute_bearer_token`
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"
@ -115,6 +118,9 @@ resource "kubernetes_manifest" "external_secret" {
 # Template builds the asyncpg DSN consumed by the FastAPI app + CronJob
 # as DB_CONNECTION_STRING.
 resource "kubernetes_manifest" "db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"
@ -159,6 +165,9 @@ resource "kubernetes_manifest" "db_external_secret" {
 # pg-sync sidecar populates `daily_account_valuation` etc. hourly; the
 # fire-planner ingest reads those tables via this role.
 resource "kubernetes_manifest" "wealthfolio_sync_db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"
@ -661,6 +670,9 @@ variable "run_examples_bulk_ingest" {
 # Reddit OAuth creds pulled from Vault secret/viktor.
 resource "kubernetes_manifest" "external_secret_examples_reddit" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"
@ -701,6 +713,9 @@ resource "kubernetes_manifest" "external_secret_examples_reddit" {
 # claude-agent-service bearer pulled separately so its rotation cadence
 # is decoupled from the Reddit creds.
 resource "kubernetes_manifest" "external_secret_examples_claude" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -6,6 +6,9 @@
 # (stacks/authentik/email-secret.tf): one credential, one rotation point. The
 # reloader annotation rolls the Forgejo pod if the password is ever rotated.
 resource "kubernetes_manifest" "forgejo_email_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -3,6 +3,9 @@ variable "tls_secret_name" {
   sensitive = true
 }
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -18,6 +18,9 @@ resource "kubernetes_namespace" "immich" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -57,16 +57,19 @@ resource "kubernetes_namespace" "goldmane_edge_aggregator" {
 # -----------------------------------------------------------------------------
 # The aggregator dials goldmane:7443 over mutual TLS. We mint a client cert
 # signed by the Tigera CA (the same CA that issues Goldmane's serving cert), so
-# Goldmane trusts the client and the client trusts Goldmane's server cert via
-# the published CA bundle.
-#
-# The Tigera CA private key lives in the `tigera-ca-private` Secret in
-# tigera-operator (Opaque; verified keys: tls.crt + tls.key). The stack's apply
-# identity needs RBAC get on that secret; see the Role/RoleBinding below.
-data "kubernetes_secret" "tigera_ca" {
+# Goldmane requires mutual TLS on :7443 and verifies the client cert chains to
+# the Tigera CA; it does NOT authorize by client identity, so ANY Tigera-CA-
+# signed cert is accepted. Rather than copy the Tigera CA PRIVATE KEY into TF
+# state to mint our own (a needless CA-key exposure; the hashicorp/tls provider
+# is also incompatible with this repo's global generate-providers/lockfile
+# pattern), we REUSE the operator-minted, Tigera-CA-signed client cert
+# `whisker-backend-key-pair` (calico-system). We never touch the CA key.
+# Trade-off: if the operator rotates that cert, re-apply to re-sync (hardening
+# follow-up: mint an own-identity cert in-namespace if Whisker is ever removed).
+data "kubernetes_secret" "whisker_backend" {
   metadata {
-    name      = "tigera-ca-private"
-    namespace = "tigera-operator"
+    name      = "whisker-backend-key-pair"
+    namespace = "calico-system"
   }
 }
@ -93,46 +96,11 @@ resource "kubernetes_config_map" "tigera_ca_bundle" {
   data = data.kubernetes_config_map.tigera_ca_bundle.data
 }

-# Client private key.
-resource "tls_private_key" "goldmane_client" {
-  algorithm = "RSA"
-  rsa_bits  = 2048
-}
-
-# CSR for the client cert. CN identifies the client; the service-DNS SAN mirrors
-# how Felix/whisker-backend present a client identity to Goldmane.
-resource "tls_cert_request" "goldmane_client" {
-  private_key_pem = tls_private_key.goldmane_client.private_key_pem
-  subject {
-    common_name  = "goldmane-edge-aggregator"
-    organization = "goldmane-edge-aggregator"
-  }
-  dns_names = [
-    "goldmane-edge-aggregator",
-    "goldmane-edge-aggregator.goldmane-edge-aggregator.svc.cluster.local",
-  ]
-}
-
-# Sign the CSR with the Tigera CA. 10-year validity (87600h): re-apply rotates
-# it well before expiry; a long horizon avoids surprise mTLS outages from an
-# unattended stack. The Tigera CA itself outlives this (operator-managed).
-resource "tls_locally_signed_cert" "goldmane_client" {
-  cert_request_pem      = tls_cert_request.goldmane_client.cert_request_pem
-  ca_private_key_pem    = data.kubernetes_secret.tigera_ca.data["tls.key"]
-  ca_cert_pem           = data.kubernetes_secret.tigera_ca.data["tls.crt"]
-  validity_period_hours = 87600 # 10y
-  early_renewal_hours   = 720   # re-sign on apply when <30d remain
-  allowed_uses = [
-    "client_auth",
-    "digital_signature",
-    "key_encipherment",
-  ]
-}
-
-# The minted client cert + key, mounted at TLS_CERT_PATH / TLS_KEY_PATH defaults
-# (/etc/goldmane-client-tls/tls.crt and .../tls.key).
+# Client cert + key for mTLS to goldmane:7443, mounted at TLS_CERT_PATH /
+# TLS_KEY_PATH defaults (/etc/goldmane-client-tls/tls.crt and .../tls.key).
+# Sourced verbatim from the operator's whisker-backend client key-pair (read
+# above), already Tigera-CA-signed, which is all Goldmane verifies. No CA key
+# is touched and no cross-namespace CA RBAC is needed.
 resource "kubernetes_secret" "goldmane_client_tls" {
   metadata {
     name      = "goldmane-client-tls"
@ -140,47 +108,8 @@ resource "kubernetes_secret" "goldmane_client_tls" {
   }
   type = "Opaque"
   data = {
-    "tls.crt" = tls_locally_signed_cert.goldmane_client.cert_pem
-    "tls.key" = tls_private_key.goldmane_client.private_key_pem
-  }
-}
-
-# Narrow RBAC so this stack's apply identity (and ESO/Reloader are unaffected)
-# can `get` the Tigera CA private key in tigera-operator. The data source above
-# reads it at apply time; this Role/RoleBinding documents + grants that access
-# rather than relying on cluster-admin. The subject is the same SA the other
-# Tier-1 stacks apply as (claude-agent/terraform-state for headless, the human
-# OIDC identity interactively); both are cluster-admin today, so this is
-# belt-and-braces / least-privilege intent for when apply identities tighten.
-resource "kubernetes_role" "read_tigera_ca" {
-  metadata {
-    name      = "goldmane-edge-aggregator-read-tigera-ca"
-    namespace = "tigera-operator"
-  }
-  rule {
-    api_groups     = [""]
-    resources      = ["secrets"]
-    resource_names = ["tigera-ca-private"]
-    verbs          = ["get"]
-  }
-}
-
-resource "kubernetes_role_binding" "read_tigera_ca" {
-  metadata {
-    name      = "goldmane-edge-aggregator-read-tigera-ca"
-    namespace = "tigera-operator"
-  }
-  role_ref {
-    api_group = "rbac.authorization.k8s.io"
-    kind      = "Role"
-    name      = kubernetes_role.read_tigera_ca.metadata[0].name
-  }
-  # The headless apply identity (claude-agent-service runs Tier-1 applies as the
-  # `terraform-state` Vault K8s role in the claude-agent namespace).
-  subject {
-    kind      = "ServiceAccount"
-    name      = "default"
-    namespace = "claude-agent"
+    "tls.crt" = data.kubernetes_secret.whisker_backend.data["tls.crt"]
+    "tls.key" = data.kubernetes_secret.whisker_backend.data["tls.key"]
   }
 }
@ -227,6 +156,11 @@ resource "kubernetes_job" "db_init" {
   timeouts {
     create = "2m"
   }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: Kyverno injects dns_config (ndots=2); ignore it so
+    # this idempotent Job isn't replaced (Jobs are immutable) on every apply.
+    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+  }
 }

 # ExternalSecret projecting the Vault-rotated (7-day) credential into a K8s
@ -234,6 +168,9 @@ resource "kubernetes_job" "db_init" {
 # place in the CNPG connection allowlist are added in stacks/vault/main.tf
 # (see this stack's terragrunt.hcl note). remoteRef key: static-creds/pg-goldmane-edges.
 resource "kubernetes_manifest" "db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"
@ -276,6 +213,9 @@ resource "kubernetes_manifest" "db_external_secret" {
 # into this namespace as SLACK_WEBHOOK_URL via an ExternalSecret (no new
 # webhook). The digest CronJob defaults to #security.
 resource "kubernetes_manifest" "slack_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"
@ -295,7 +235,7 @@ resource "kubernetes_manifest" "slack_external_secret" {
       data = [{
         secretKey = "SLACK_WEBHOOK_URL"
         remoteRef = {
-          key      = "monitoring"
+          key      = "viktor"
           property = "alertmanager_slack_api_url"
         }
       }]
@ -515,8 +455,13 @@ resource "kubernetes_cron_job_v1" "digest" {
               }
             }
             env {
               name = "SLACK_CHANNEL"
-              value = "#security"
+              # Posts to #alerts. The dedicated #security channel was abandoned
+              # 2026-06-25: the shared alertmanager_slack_api_url webhook's
+              # Slack app isn't a member of it (channel override 404s), so all
+              # Slack (incl. alertmanager's security-lane alerts) consolidated
+              # to #alerts. See docs/runbooks/goldmane-flow-trail.md.
+              value = "#alerts"
             }
resources { resources {

@ -5,6 +5,9 @@ variable "tls_secret_name" {
 variable "nfs_server" { type = string }
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -208,6 +208,9 @@ module "ingress" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -250,6 +250,9 @@ module "ingress_test" {
 }

 resource "kubernetes_manifest" "external_secret_db" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"
@ -284,6 +287,9 @@ resource "kubernetes_manifest" "external_secret_db" {
 }

 resource "kubernetes_manifest" "external_secret_kv" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -37,6 +37,9 @@ module "tls_secret" {
 # --- Secrets (ESO from Vault) ---
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -162,6 +162,9 @@ resource "kubernetes_resource_quota" "immich" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -20,6 +20,9 @@ resource "kubernetes_namespace" "insta2spotify" {
 }

 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

@ -35,6 +35,14 @@ resource "kubernetes_namespace" "instagram_poster" {
 # - immich_tag_instagram (optional; auto-resolved if missing)
 # - immich_tag_posted (optional; auto-resolved if missing)
 resource "kubernetes_manifest" "external_secret" {
+  # The external-secrets controller takes server-side-apply ownership of
+  # .spec.refreshInterval, so a plain TF apply conflicts. force_conflicts lets
+  # TF win (values match, so it's stable); same pattern as grafana/woodpecker/
+  # traefik/k8s-version-upgrade. Surfaced 2026-06-24 by the first IG apply since
+  # the ESO v1 migration (the scale-to-0 push).
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"
@ -139,6 +147,11 @@ resource "kubernetes_manifest" "external_secret" {
 # ESO refreshes the K8s Secret every 15m. `reloader.stakater.com/match`
 # bounces the pod when the password changes.
 resource "kubernetes_manifest" "benchmark_db_external_secret" {
+  # See external_secret above: ESO owns .spec.refreshInterval; force_conflicts
+  # lets the TF apply win instead of erroring on the field-manager conflict.
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"
@ -227,7 +240,11 @@ resource "kubernetes_deployment" "instagram_poster" {
   }
   spec {
-    replicas = 1
+    # Scaled to 0 (2026-06-24): Instagram Graph integration is unused and its
+    # ExternalSecret is dead (missing ig_graph_long_lived_token /
+    # ig_business_account_id in Vault secret/instagram-poster). Set back to 1
+    # after minting a Meta long-lived token and populating those keys.
+    replicas = 0
     # RWO PVC cannot rolling-update.
     strategy {
       type = "Recreate"

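The `field_manager { force_conflicts = true }` block that recurs throughout this diff can be seen in full context in the sketch below. It is a minimal, hypothetical ExternalSecret: the resource name, namespace, store name, and Vault key are placeholders for illustration, not values taken from this repository.

```hcl
# Hypothetical example; every name below is a placeholder.
resource "kubernetes_manifest" "example_external_secret" {
  # Let Terraform's server-side apply take over fields another manager owns
  # (in this diff, the ESO controller's claim on .spec.refreshInterval).
  field_manager {
    force_conflicts = true
  }

  manifest = {
    apiVersion = "external-secrets.io/v1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "example-secrets" # placeholder
      namespace = "example"         # placeholder
    }
    spec = {
      refreshInterval = "15m"
      secretStoreRef = {
        name = "vault-kv" # placeholder store name
        kind = "ClusterSecretStore"
      }
      target = {
        name = "example-secrets"
      }
      data = [{
        secretKey = "API_TOKEN"
        remoteRef = {
          key      = "example" # placeholder Vault path
          property = "api_token"
        }
      }]
    }
  }
}
```

As the comment in this diff notes, forcing the conflict is stable only because the value Terraform writes matches what the controller writes; if the two disagreed, each apply would flip the field back.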
@ -41,6 +41,9 @@ resource "kubernetes_namespace" "job_hunter" {
 # digest_to_address where the weekly digest goes
 # digest_from_address From: header for the digest
 resource "kubernetes_manifest" "external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"
@ -105,6 +108,9 @@ resource "kubernetes_manifest" "external_secret" {
 # DB credentials from Vault database engine (7-day rotation).
 # Template builds the asyncpg DSN consumed by the FastAPI app as DB_CONNECTION_STRING.
 resource "kubernetes_manifest" "db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"
@ -325,6 +331,9 @@ resource "kubernetes_service" "job_hunter" {
 # references it as $__env{JOB_HUNTER_PG_PASSWORD}. Reloader restarts
 # Grafana whenever ESO updates this secret (every 7d on rotation).
 resource "kubernetes_manifest" "grafana_job_hunter_db_external_secret" {
+  field_manager {
+    force_conflicts = true
+  }
   manifest = {
     apiVersion = "external-secrets.io/v1"
     kind       = "ExternalSecret"

View file

@ -5,6 +5,9 @@
# ----------------------------------------------------------------------------- # -----------------------------------------------------------------------------
resource "kubernetes_manifest" "oauth2_proxy_externalsecret" { resource "kubernetes_manifest" "oauth2_proxy_externalsecret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -416,6 +416,39 @@ phase_preflight() {
fi fi
fi fi
# 4b. apiserver-OIDC drift check (backstop for the rbac stack's kubeadm-config
# reconciliation). A `kubeadm upgrade` REGENERATES the apiserver manifest from
# kubeadm-config; if kubeadm-config still carries the legacy single-issuer
# --oidc-* args instead of --authentication-config, the regenerated apiserver
# loses structured multi-issuer auth → kubectl + dashboard SSO break AFTER the
# upgrade. This is RECOVERABLE (the apiserver does NOT crash — verified by an
# isolated repro 2026-06-24; the chain's post-master restore.sh re-adds the flag,
# and the rbac stack reconciles kubeadm-config so it won't recur) — so this is an
# ALERT, not a block. (NB the 2026-06-24 stall was NOT this — it was etcd IO
# starvation; see docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md.)
# Skip on an at-target master (resume — no apiserver regen).
if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
local apiserver_diff
apiserver_diff=$(ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" "sudo kubeadm upgrade diff v$TARGET_VERSION 2>/dev/null" || true)
if echo "$apiserver_diff" | grep -qE '^-[[:space:]].*--authentication-config'; then
slack "WARN preflight — kubeadm upgrade will DROP --authentication-config (kubeadm-config OIDC drift). SSO breaks post-upgrade until restore.sh re-adds it; re-apply the rbac stack to reconcile kubeadm-config. Proceeding (recoverable, not a crash)."
fi
fi
# 4c. Reclaim kubeadm scratch on master. `kubeadm upgrade apply` dumps a full
# ~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
# every etcd upgrade and NEVER cleans it up — 145 dirs / 28GB had accumulated by
# 2026-06-24, pushing master root fs to 73% (image-GC churn + extra write IO on
# the shared HDD where etcd lives — a contributor to the etcd IO starvation that
# stalled that run, see post-mortem). Real etcd backups go to NFS, so these are
# throwaway. Prune ones >3 days old (keeps a short rollback window). Best-effort;
# never aborts the chain.
if [ "$master_kubelet_v" != "$TARGET_VERSION" ]; then
ssh "${SSH_OPTS[@]}" "$(ssh_target k8s-master)" \
"sudo find /etc/kubernetes/tmp -maxdepth 1 -type d \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) -mtime +3 -exec rm -rf {} + 2>/dev/null; echo -n 'master root after prune: '; df -h / | awk 'NR==2{print \$5\" used, \"\$4\" free\"}'" \
|| echo "kubeadm-scratch prune skipped (ssh/df failed) — non-fatal"
fi
# 5. Push in-flight + started_timestamp metrics + ns annotations # 5. Push in-flight + started_timestamp metrics + ns annotations
$KUBECTL annotate ns "$NS" \ $KUBECTL annotate ns "$NS" \
"viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \ "viktorbarzin.me/k8s-upgrade-in-flight=$(date -u +%FT%TZ)" \

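The 4b check above hinges on one pattern: a removed line in the `kubeadm upgrade diff` output that mentions `--authentication-config`. Here is a minimal sketch of that matching logic, runnable without a cluster. The helper name `drops_auth_config`, the sample diff text, and the file paths in it are illustrative assumptions, not part of the upgrade script or real kubeadm output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exits 0 when a unified diff on stdin removes a line
# containing --authentication-config. Uses the same regex as step 4b.
drops_auth_config() {
  grep -qE '^-[[:space:]].*--authentication-config'
}

# Made-up diff fragment for illustration only (not real kubeadm output).
sample_diff='--- /etc/kubernetes/manifests/kube-apiserver.yaml
+++ new manifest
     - --allow-privileged=true
-    - --authentication-config=/etc/kubernetes/auth-config.yaml
+    - --oidc-issuer-url=https://issuer.example'

if printf '%s\n' "$sample_diff" | drops_auth_config; then
  echo "WARN: upgrade would drop --authentication-config"
fi
```

The `[[:space:]]` after the leading `-` keeps the `---` file header of a unified diff from matching, so only genuinely removed lines trigger the warning.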
View file

@ -304,6 +304,9 @@ resource "kubernetes_config_map" "kms_slack_notifier" {
} }
resource "kubernetes_manifest" "kms_slack_external_secret" { resource "kubernetes_manifest" "kms_slack_external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -29,6 +29,9 @@ resource "kubernetes_namespace" "linkwarden" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -57,6 +60,9 @@ resource "kubernetes_manifest" "external_secret" {
# DB credentials from Vault database engine (rotated every 24h) # DB credentials from Vault database engine (rotated every 24h)
resource "kubernetes_manifest" "db_external_secret" { resource "kubernetes_manifest" "db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -800,6 +800,9 @@ resource "kubernetes_service" "mailserver_proxy" {
# `EMAIL_MONITOR_IMAP_PASSWORD` so the CronJob can consume them via a single # `EMAIL_MONITOR_IMAP_PASSWORD` so the CronJob can consume them via a single
# `env_from { secret_ref {} }` block. # `env_from { secret_ref {} }` block.
resource "kubernetes_manifest" "email_roundtrip_monitor_secrets" { resource "kubernetes_manifest" "email_roundtrip_monitor_secrets" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -25,6 +25,9 @@ resource "kubernetes_namespace" "matrix" {
# flipped to false. The token stays in Vault so registration can be re-opened # flipped to false. The token stays in Vault so registration can be re-opened
# later (e.g. to add family) without regenerating it. # later (e.g. to add family) without regenerating it.
resource "kubernetes_manifest" "secrets_external_secret" { resource "kubernetes_manifest" "secrets_external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -130,6 +130,11 @@ resource "kubernetes_deployment" "blackbox_exporter" {
labels = { labels = {
app = "blackbox-exporter" app = "blackbox-exporter"
tier = var.tier tier = var.tier
# ADR-0014 service identity: monitoring is a multi-Service namespace, so
# the namespace alone can't attribute Goldmane flows. Value = the
# fronting Service name (kubernetes_service.blackbox_exporter is named
# "blackbox-exporter").
"service-identity" = "blackbox-exporter"
} }
annotations = { annotations = {
"reloader.stakater.com/search" = "true" "reloader.stakater.com/search" = "true"
@ -146,6 +151,10 @@ resource "kubernetes_deployment" "blackbox_exporter" {
metadata { metadata {
labels = { labels = {
app = "blackbox-exporter" app = "blackbox-exporter"
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
# disambiguating identity must live on the pod template (not just
# the Deployment metadata above). Not in the selector, so no replace.
"service-identity" = "blackbox-exporter"
} }
} }
spec { spec {

View file

@ -5,6 +5,11 @@ resource "kubernetes_deployment" "goflow2" {
labels = { labels = {
app = "goflow2" app = "goflow2"
tier = var.tier tier = var.tier
# ADR-0014 service identity: monitoring is a multi-Service namespace, so
# the namespace alone can't attribute Goldmane flows. Value = the
# fronting Service name (kubernetes_service.goflow2, the metrics svc; the
# goflow2-netflow NodePort is the same pod by another name).
"service-identity" = "goflow2"
} }
} }
spec { spec {
@ -18,6 +23,10 @@ resource "kubernetes_deployment" "goflow2" {
metadata { metadata {
labels = { labels = {
app = "goflow2" app = "goflow2"
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
# disambiguating identity must live on the pod template (not just
# the Deployment metadata above). Not in the selector, so no replace.
"service-identity" = "goflow2"
} }
} }
spec { spec {

View file

@ -71,6 +71,15 @@ resource "kubernetes_persistent_volume" "alertmanager_pv" {
# DB credentials from Vault database engine (rotated automatically) # DB credentials from Vault database engine (rotated automatically)
# Provides GF_DATABASE_PASSWORD that auto-updates when password rotates # Provides GF_DATABASE_PASSWORD that auto-updates when password rotates
resource "kubernetes_manifest" "grafana_db_creds" { resource "kubernetes_manifest" "grafana_db_creds" {
# The external-secrets controller takes server-side-apply ownership of
# .spec.refreshInterval, so a plain TF apply conflicts ("conflict with
# external-secrets ... .spec.refreshInterval"). force_conflicts lets TF win
# (values match, so it's stable); same pattern as the woodpecker/traefik/
# k8s-version-upgrade stacks. Surfaced 2026-06-24: the first monitoring apply
# in a while exposed this latent conflict (prior pushes were docs-only).
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
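Before forcing conflicts, it can help to see which field managers are recorded on the object. This is a sketch assuming `kubectl` access to the cluster. The helper name `field_managers` and the namespace and resource name in the usage line are placeholders, not values taken from this repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the distinct field managers recorded in
# .metadata.managedFields of an ExternalSecret.
field_managers() {
  local ns=$1 name=$2
  kubectl -n "$ns" get externalsecret "$name" --show-managed-fields \
    -o jsonpath='{range .metadata.managedFields[*]}{.manager}{"\n"}{end}' |
    sort -u
}

# Usage (placeholder names): field_managers monitoring grafana-db-creds
```

If both a Terraform manager and the external-secrets controller show up, both have written to the object, which is the situation `force_conflicts = true` resolves in Terraform's favor.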

View file

@ -47,6 +47,10 @@ resource "kubernetes_deployment" "idrac-redfish" {
labels = { labels = {
app = "idrac-redfish-exporter" app = "idrac-redfish-exporter"
tier = var.tier tier = var.tier
# ADR-0014 service identity: monitoring is a multi-Service namespace, so
# the namespace alone can't attribute Goldmane flows. Value = the
# fronting Service name (kubernetes_service.idrac-redfish-exporter).
"service-identity" = "idrac-redfish-exporter"
} }
annotations = { annotations = {
"reloader.stakater.com/search" = "true" "reloader.stakater.com/search" = "true"
@ -63,6 +67,10 @@ resource "kubernetes_deployment" "idrac-redfish" {
metadata { metadata {
labels = { labels = {
app = "idrac-redfish-exporter" app = "idrac-redfish-exporter"
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
# disambiguating identity must live on the pod template (not just
# the Deployment metadata above). Not in the selector, so no replace.
"service-identity" = "idrac-redfish-exporter"
} }
} }
spec { spec {

View file

@ -60,9 +60,10 @@ alertmanager:
receiver: slack-warning receiver: slack-warning
routes: routes:
# Wave 1 security lane — matches alerts that set `lane = "security"` # Wave 1 security lane — matches alerts that set `lane = "security"`
-# (K2-K9, V1-V7, S1 from Loki ruler). Routes to dedicated #security
-# channel regardless of severity. Defined first + continue: false so
-# security alerts never fall through to the generic #alerts channel.
+# (K2-K9, V1-V7, S1 from Loki ruler). Posts via the slack-security
+# receiver (distinct [SECURITY] styling) to #alerts; the dedicated
+# #security channel was abandoned 2026-06-25 (shared webhook can't reach
+# it). continue: false so they get the security-styled receiver.
- receiver: slack-security - receiver: slack-security
group_wait: 10s group_wait: 10s
group_interval: 1m group_interval: 1m
@ -235,7 +236,10 @@ alertmanager:
- name: slack-security - name: slack-security
slack_configs: slack_configs:
- send_resolved: true - send_resolved: true
-channel: "#security"
+# #security was abandoned 2026-06-25: the shared incoming webhook's
+# Slack app isn't a member of it (channel override 404s). Security-lane
+# alerts keep their distinct [SECURITY] styling but post to #alerts.
+channel: "#alerts"
color: '{{ if eq .Status "firing" }}{{ if eq (index .Alerts 0).Labels.severity "critical" }}danger{{ else }}warning{{ end }}{{ else }}good{{ end }}' color: '{{ if eq .Status "firing" }}{{ if eq (index .Alerts 0).Labels.severity "critical" }}danger{{ else }}warning{{ end }}{{ else }}good{{ end }}'
fallback: '{{ if eq .Status "firing" }}[SECURITY-{{ (index .Alerts 0).Labels.severity | toUpper }}]{{ else }}[RESOLVED]{{ end }}: {{ .GroupLabels.alertname }}' fallback: '{{ if eq .Status "firing" }}[SECURITY-{{ (index .Alerts 0).Labels.severity | toUpper }}]{{ else }}[RESOLVED]{{ end }}: {{ .GroupLabels.alertname }}'
title: '{{ if eq .Status "firing" }}[SECURITY/{{ (index .Alerts 0).Labels.severity | toUpper }}]{{ else }}[RESOLVED]{{ end }} {{ .GroupLabels.alertname }} ({{ .Alerts | len }})' title: '{{ if eq .Status "firing" }}[SECURITY/{{ (index .Alerts 0).Labels.severity | toUpper }}]{{ else }}[RESOLVED]{{ end }} {{ .GroupLabels.alertname }} ({{ .Alerts | len }})'
@ -253,6 +257,19 @@ alertmanager:
memory: 256Mi memory: 256Mi
limits: limits:
memory: 256Mi memory: 256Mi
# kube-state-metrics idles ~45Mi but briefly spikes past the monitoring-namespace
# LimitRange default (256Mi) during a full object relist (450+ pods, 150+ jobs, all
# secrets/endpoints), so it gets OOMKilled. Each OOM blacks out KSM-derived series
# for ~5min and cascades into a wall of false "<svc>Down" criticals that self-resolve
# (storm 2026-06-26 08:42). Burstable: low request (minimal reservation) + a 512Mi
# limit to absorb the relist peak. No CPU limit (cluster-wide policy).
kube-state-metrics:
resources:
requests:
cpu: 100m
memory: 64Mi
limits:
memory: 512Mi
prometheus-node-exporter: prometheus-node-exporter:
enabled: true enabled: true
resources: resources:
@ -1450,6 +1467,49 @@ serverFiles:
Remediation: right-size top reservers via Goldilocks (immich-server, Remediation: right-size top reservers via Goldilocks (immich-server,
frigate, prometheus, pg-cluster, paperless) or bump VM RAM on frigate, prometheus, pg-cluster, paperless) or bump VM RAM on
k8s-node2/k8s-node3 from 32GB → 48GB to match node1. k8s-node2/k8s-node3 from 32GB → 48GB to match node1.
# Goldmane edge-aggregator (ADR-0014 / infra #58, #61): the durable
# who-talks-to-whom trail. The aggregator pod has NO /metrics endpoint,
# so its health is inferred from kube-state-metrics signals — the trail
# must not silently die. Two failure modes are covered:
# - the aggregate Deployment stops consuming Goldmane's flow stream
# (AggregatorDown) → no new edges ever land in the goldmane_edges DB
# - the daily digest CronJob can't post new edges to Slack
# (DigestFailing) → edges still land but nobody is told.
# A freshness probe (max(last_seen) staleness) is intentionally NOT here:
# AggregatorDown is the agreed floor and needs no extra moving parts.
- name: Network Observability (Goldmane)
rules:
# Deployment has <1 available replica for 15m. kube-state-metrics
# keeps `kube_deployment_status_replicas_available` (metric-keep list
# in serverFiles below). The 15m window rides out a normal rollout /
# node drain without paging; a genuinely-dead aggregator means the
# edge trail has stopped recording and stays down.
- alert: AggregatorDown
expr: |
kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1
and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
for: 15m
labels:
severity: warning
annotations:
summary: "goldmane-edge-aggregator has no available replica — the who-talks-to-whom edge trail has stopped recording"
description: "The aggregate Deployment streams Calico Goldmane flows into the goldmane_edges CNPG DB. With 0 replicas, no new namespace-pair edges are captured. `kubectl -n goldmane-edge-aggregator describe deploy goldmane-edge-aggregator` + check the goldmane svc (calico-system) is reachable."
# The goldmane-edges-digest CronJob has a failed Job that started in
# the last 24h. Mirrors the generic JobFailed shape but scoped to the
# digest so it routes here. `for: 30m` rides out the apply/scrape
# transient; the digest runs daily so a real failure won't self-heal
# until the next run — surface it same-day rather than waiting 24h.
- alert: DigestFailing
expr: |
kube_job_status_failed{namespace="goldmane-edge-aggregator", job_name=~"goldmane-edges-digest.*"} > 0
and on(namespace, job_name)
(time() - kube_job_status_start_time{namespace="goldmane-edge-aggregator", job_name=~"goldmane-edges-digest.*"}) < 86400
for: 30m
labels:
severity: warning
annotations:
summary: "goldmane-edges-digest CronJob failing — new edges captured but not posted to #alerts"
description: "The daily edge digest Job {{ $labels.job_name }} failed. Edges may still be landing in the goldmane_edges DB but no one is being notified of new namespace-pairs. `kubectl -n goldmane-edge-aggregator logs job/{{ $labels.job_name }}`."
- name: Infrastructure Health - name: Infrastructure Health
rules: rules:
- alert: HomeAssistantDown - alert: HomeAssistantDown
@ -3190,7 +3250,8 @@ serverFiles:
# means blackbox's fail_if_header_matches caught a Location -> Authentik: # means blackbox's fail_if_header_matches caught a Location -> Authentik:
# a path-scoped `auth = "none"` carve-out was clobbered (TF revert, deploy, # a path-scoped `auth = "none"` carve-out was clobbered (TF revert, deploy,
# ingress_factory default flipping back to auth="required"). lane=security # ingress_factory default flipping back to auth="required"). lane=security
-# routes it to the #security Slack receiver (Slack-only, no paging).
+# routes it to the slack-security receiver, which posts to #alerts
+# (#security abandoned 2026-06-25; Slack-only, no paging).
- name: Authentik Walling Off - name: Authentik Walling Off
rules: rules:
- alert: AuthentikWallingOffPublicPath - alert: AuthentikWallingOffPublicPath
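The `DigestFailing` expression added earlier in this file combines two conditions: the Job reports failures, and it started less than 24 hours ago. The same logic is sketched below in plain shell arithmetic. The function name `digest_failing` and the timestamps are illustrative assumptions, and the rule's `for: 30m` hold is not modeled.

```shell
#!/usr/bin/env bash
# Hypothetical mirror of the DigestFailing expression: true only when the
# failure count is above zero AND the Job started under 86400s (24h) ago.
digest_failing() {
  local failed=$1 start_ts=$2 now_ts=$3
  (( failed > 0 && now_ts - start_ts < 86400 ))
}

now=1000000   # arbitrary epoch value for the example
digest_failing 1 $(( now - 3600 ))  "$now" && echo "recent failure: condition true"
digest_failing 1 $(( now - 90000 )) "$now" || echo "stale failure: filtered out"
```

The 24-hour cut-off is what stops old failed Jobs that are still listed from keeping the alert firing after a later run has succeeded.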

View file

@ -22,6 +22,10 @@ resource "kubernetes_deployment" "pve_exporter" {
namespace = kubernetes_namespace.monitoring.metadata[0].name namespace = kubernetes_namespace.monitoring.metadata[0].name
labels = { labels = {
tier = var.tier tier = var.tier
# ADR-0014 service identity: monitoring is a multi-Service namespace, so
# the namespace alone can't attribute Goldmane flows. Value = the
# fronting Service name (kubernetes_service.proxmox-exporter).
"service-identity" = "proxmox-exporter"
} }
} }
@ -37,6 +41,10 @@ resource "kubernetes_deployment" "pve_exporter" {
metadata { metadata {
labels = { labels = {
app = "proxmox-exporter" app = "proxmox-exporter"
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
# disambiguating identity must live on the pod template (not just
# the Deployment metadata above). Not in the selector, so no replace.
"service-identity" = "proxmox-exporter"
} }
} }

View file

@ -31,6 +31,10 @@ resource "kubernetes_deployment" "snmp-exporter" {
labels = { labels = {
app = "snmp-exporter" app = "snmp-exporter"
tier = var.tier tier = var.tier
# ADR-0014 service identity: monitoring is a multi-Service namespace, so
# the namespace alone can't attribute Goldmane flows. Value = the
# fronting Service name (kubernetes_service.snmp-exporter).
"service-identity" = "snmp-exporter"
} }
annotations = { annotations = {
"reloader.stakater.com/search" = "true" "reloader.stakater.com/search" = "true"
@ -47,6 +51,10 @@ resource "kubernetes_deployment" "snmp-exporter" {
metadata { metadata {
labels = { labels = {
app = "snmp-exporter" app = "snmp-exporter"
# ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
# disambiguating identity must live on the pod template (not just
# the Deployment metadata above). Not in the selector, so no replace.
"service-identity" = "snmp-exporter"
} }
} }
spec { spec {

View file

@ -26,6 +26,9 @@ resource "kubernetes_namespace" "n8n" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -53,6 +56,9 @@ resource "kubernetes_manifest" "external_secret" {
} }
resource "kubernetes_manifest" "external_secret_claude_agent" { resource "kubernetes_manifest" "external_secret_claude_agent" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -84,6 +90,9 @@ resource "kubernetes_manifest" "external_secret_claude_agent" {
# Shared secrets for the Immich -> Telegram -> Postiz -> Instagram pipeline. # Shared secrets for the Immich -> Telegram -> Postiz -> Instagram pipeline.
# Workflows in stacks/n8n/workflows/instagram-*.json reference these env vars. # Workflows in stacks/n8n/workflows/instagram-*.json reference these env vars.
resource "kubernetes_manifest" "external_secret_instagram_pipeline" { resource "kubernetes_manifest" "external_secret_instagram_pipeline" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -19,6 +19,9 @@ resource "kubernetes_namespace" "navidrome" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -21,6 +21,9 @@ resource "kubernetes_namespace" "netbox" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -58,6 +58,9 @@ resource "kubernetes_namespace" "nextcloud_todos" {
# DB user: created in dbaas (null_resource.pg_nextcloud_todos_db); password # DB user: created in dbaas (null_resource.pg_nextcloud_todos_db); password
# managed via the Vault database engine; see static-creds/pg-nextcloud-todos. # managed via the Vault database engine; see static-creds/pg-nextcloud-todos.
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -97,6 +100,9 @@ resource "kubernetes_manifest" "external_secret" {
# Pre-req in dbaas: CNPG cluster has DB `nextcloud_todos`, role # Pre-req in dbaas: CNPG cluster has DB `nextcloud_todos`, role
# `nextcloud_todos`, and Vault role `static-creds/pg-nextcloud-todos`. # `nextcloud_todos`, and Vault role `static-creds/pg-nextcloud-todos`.
resource "kubernetes_manifest" "db_external_secret" { resource "kubernetes_manifest" "db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -125,6 +125,9 @@ resource "kubernetes_namespace" "nextcloud" {
# other enrolled workload (immich, freshrss) is both correct and drift-free. # other enrolled workload (immich, freshrss) is both correct and drift-free.
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -154,6 +157,9 @@ resource "kubernetes_manifest" "external_secret" {
# DB credentials from Vault database engine (rotated every 24h) # DB credentials from Vault database engine (rotated every 24h)
# Nextcloud Helm chart reads password at runtime via existingSecret reference # Nextcloud Helm chart reads password at runtime via existingSecret reference
resource "kubernetes_manifest" "db_external_secret" { resource "kubernetes_manifest" "db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -4,6 +4,9 @@ variable "tls_secret_name" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -24,6 +24,9 @@ resource "kubernetes_namespace" "onlyoffice" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -37,6 +37,9 @@ module "tls_secret" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -5,6 +5,9 @@ variable "tls_secret_name" {
variable "nfs_server" { type = string } variable "nfs_server" { type = string }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -26,6 +26,9 @@ resource "kubernetes_namespace" "paperless_ai" {
# api_key M2M key between the Node UI and the Python RAG service. # api_key M2M key between the Node UI and the Python RAG service.
# custom_api_key placeholder bearer for llama-swap (no auth, field required). # custom_api_key placeholder bearer for llama-swap (no auth, field required).
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -28,6 +28,9 @@ resource "kubernetes_namespace" "paperless-mcp" {
# Paperless API token (MCP -> paperless). Synced from Vault to a K8s Secret # Paperless API token (MCP -> paperless). Synced from Vault to a K8s Secret
# by ESO; the pod reads it via secret_key_ref. # by ESO; the pod reads it via secret_key_ref.
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -34,6 +34,9 @@ resource "kubernetes_namespace" "paperless-ngx" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -58,6 +58,9 @@ resource "kubernetes_namespace" "payslip_ingest" {
# - `actualbudget_budget_sync_id` # - `actualbudget_budget_sync_id`
# (same as Viktor's sync_id) # (same as Viktor's sync_id)
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -133,6 +136,9 @@ resource "kubernetes_manifest" "external_secret" {
# DB credentials from Vault database engine (rotated every 7 days). # DB credentials from Vault database engine (rotated every 7 days).
# Template builds the asyncpg DSN consumed by the FastAPI app as DB_CONNECTION_STRING. # Template builds the asyncpg DSN consumed by the FastAPI app as DB_CONNECTION_STRING.
resource "kubernetes_manifest" "db_external_secret" { resource "kubernetes_manifest" "db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -450,6 +456,9 @@ resource "kubernetes_cron_job_v1" "actualbudget_payroll_sync" {
# references it as $__env{PAYSLIPS_PG_PASSWORD}. Reloader restarts # references it as $__env{PAYSLIPS_PG_PASSWORD}. Reloader restarts
# Grafana whenever ESO updates this secret (every 7d on rotation). # Grafana whenever ESO updates this secret (every 7d on rotation).
resource "kubernetes_manifest" "grafana_payslips_db_external_secret" { resource "kubernetes_manifest" "grafana_payslips_db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -28,6 +28,9 @@ resource "kubernetes_namespace" "phpipam" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -57,6 +60,9 @@ resource "kubernetes_manifest" "external_secret" {
} }
resource "kubernetes_manifest" "external_secret_pfsense_ssh" { resource "kubernetes_manifest" "external_secret_pfsense_ssh" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -86,6 +92,9 @@ resource "kubernetes_manifest" "external_secret_pfsense_ssh" {
} }
resource "kubernetes_manifest" "external_secret_admin" { resource "kubernetes_manifest" "external_secret_admin" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -19,6 +19,9 @@ resource "kubernetes_namespace" "plotting-book" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

View file

@ -72,6 +72,9 @@ resource "kubernetes_persistent_volume_claim" "uploads" {
# Helm-owned Secret resource intact. The chart's deployment already wires # Helm-owned Secret resource intact. The chart's deployment already wires
# this Secret in via `envFrom: secretRef: postiz-secrets`. # this Secret in via `envFrom: secretRef: postiz-secrets`.
resource "kubernetes_manifest" "external_secret_jwt" { resource "kubernetes_manifest" "external_secret_jwt" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -188,17 +191,18 @@ resource "kubernetes_service" "temporal" {
} }
# #
-# Backup CronJob: nightly pg_dump of the bundled postiz-postgresql to NFS.
+# Backup CronJob: nightly pg_dump of the postiz database to NFS.
 #
-# The bundled PostgreSQL StatefulSet uses local-path storage on the K8s node
-# OS disk (chart default), which is NOT covered by Layer 1 (LVM thin
-# snapshots) or Layer 2 (sda file backup) of the 3-2-1 pipeline. A pg_dump
-# CronJob writing to /srv/nfs/postiz-backup/ closes the gap: dumps land on
-# Proxmox host NFS, covered by inotify-driven offsite sync to Synology.
-# Three databases are dumped: postiz (app data), temporal (workflow engine),
-# temporal_visibility (workflow search). Bitnami chart-default credentials
-# are used (same creds the Postiz pod itself uses), scoped to the postiz
-# namespace via ClusterIP-only Services.
+# Postiz's database lives on the SHARED CNPG cluster
+# (pg-cluster-rw.dbaas.svc.cluster.local/postiz). The chart's bundled
+# PostgreSQL was dropped in the CNPG migration, so the old `postiz-postgresql`
+# host no longer resolves (this CronJob was failing on it for weeks:
+# BackupCronJobFailed; repointed 2026-06-26). The dump now connects via the
+# app's own DATABASE_URL (from the postiz-secrets Secret) so it always tracks
+# the live host + credentials. Dumps land on /srv/nfs/postiz-backup/, covered
+# by inotify-driven offsite sync to Synology, closing the gap (CNPG data PVCs
+# live in dbaas, excluded from the LVM-snapshot leg). Only the postiz app DB is
+# dumped here; temporal's DBs are not.
# #
module "nfs_backup_host" { module "nfs_backup_host" {
@ -248,10 +252,9 @@ resource "kubernetes_cron_job_v1" "postgres_backup" {
STATUS=0 STATUS=0
for db in postiz; do for db in postiz; do
echo "Dumping $db..." echo "Dumping $db..."
if PGPASSWORD=postiz-password pg_dump -h postiz-postgresql -U postiz \ if pg_dump -d "$DATABASE_URL" \
--format=custom --compress=6 \ --format=custom --compress=6 \
--file="$BACKUP_DIR/$db-$TIMESTAMP.dump" \ --file="$BACKUP_DIR/$db-$TIMESTAMP.dump"; then
"$db"; then
echo " OK: $db ($(du -h "$BACKUP_DIR/$db-$TIMESTAMP.dump" | cut -f1))" echo " OK: $db ($(du -h "$BACKUP_DIR/$db-$TIMESTAMP.dump" | cut -f1))"
else else
echo " FAIL: $db" >&2 echo " FAIL: $db" >&2
@ -268,6 +271,18 @@ resource "kubernetes_cron_job_v1" "postgres_backup" {
exit $STATUS exit $STATUS
EOT EOT
] ]
# Connect to the live CNPG database using the app's own
# DATABASE_URL (postgresql://postiz:@pg-cluster-rw.dbaas/postiz)
# instead of a hardcoded host/password, so the job survives credential changes.
env {
name = "DATABASE_URL"
value_from {
secret_key_ref {
name = "postiz-secrets"
key = "DATABASE_URL"
}
}
}
volume_mount { volume_mount {
name = "backup" name = "backup"
mount_path = "/backup" mount_path = "/backup"
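
The switch from a hardcoded host and password to the app's `DATABASE_URL` can be illustrated with a minimal Python sketch. It is not part of the change itself: the helper names (`redact`, `pg_dump_argv`), the sample URL, and the timestamp format are assumptions for illustration. It builds the same `pg_dump` argument list the CronJob uses and shows one way to print the target without leaking the password into job logs.

```python
from urllib.parse import urlsplit, urlunsplit


def redact(database_url: str) -> str:
    """Return the URL with any password masked, safe to print in job logs."""
    parts = urlsplit(database_url)
    if parts.password is None:
        return database_url
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    user = parts.username or ""
    netloc = f"{user}:***@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def pg_dump_argv(database_url: str, backup_dir: str, db: str, timestamp: str) -> list:
    """Build the pg_dump invocation from the diff as an argv list."""
    return [
        "pg_dump",
        "-d", database_url,  # full connection URI: host, user, password, dbname
        "--format=custom",
        "--compress=6",
        f"--file={backup_dir}/{db}-{timestamp}.dump",
    ]


if __name__ == "__main__":
    # Hypothetical value; the real one comes from the postiz-secrets Secret.
    url = "postgresql://postiz:example-password@pg-cluster-rw.dbaas/postiz"
    print(redact(url))
    print(pg_dump_argv(url, "/backup", "postiz", "20260626-0300"))
```

Because host, user, and password all travel inside the one URI, repointing the app at a new database repoints the backup as well, which is the property the comment above relies on.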

@ -207,6 +207,9 @@ resource "kubernetes_cluster_role_binding" "pve_snapshot_admin" {
# Creates K8s Secret "proxmox-csi-encryption" in kube-system from Vault KV. # Creates K8s Secret "proxmox-csi-encryption" in kube-system from Vault KV.
# Referenced by the proxmox-lvm-encrypted StorageClass for node-stage and node-expand. # Referenced by the proxmox-lvm-encrypted StorageClass for node-stage and node-expand.
resource "kubernetes_manifest" "external_secret_encryption" { resource "kubernetes_manifest" "external_secret_encryption" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

@ -10,16 +10,29 @@
# match the existing RBAC subjects (kind: User, name: <raw email>; group names # match the existing RBAC subjects (kind: User, name: <raw email>; group names
# verbatim). Do NOT add a prefix or existing bindings break. # verbatim). Do NOT add a prefix or existing bindings break.
# #
# DRIFT WARNING: this edits the kube-apiserver static-pod manifest on the single # DRIFT WARNING (and how it's now handled): apiserver auth lives in THREE places
# master. A `kubeadm upgrade` regenerates that manifest and DROPS this flag (this # that must stay in sync, because a `kubeadm upgrade` REGENERATES the static-pod
# is exactly how OIDC silently broke before: the flag was wiped and the # manifest from kubeadm-config:
# content-hash trigger never re-fired). After any k8s control-plane upgrade, # 1. /etc/kubernetes/pki/auth-config.yaml: the structured authn file
# re-apply the rbac stack to restore apiserver OIDC. See # 2. the live kube-apiserver static-pod manifest: references it via the flag
# docs/plans/2026-06-04-k8s-dashboard-sso-design.md. # 3. the kubeadm-config ClusterConfiguration CM: what kubeadm regenerates from
# Originally only (1)+(2) were managed, so every kubeadm upgrade rewrote the
# manifest from the STALE CM, reverting --authentication-config to single-issuer
# --oidc-* flags. The consequence is SSO breakage AFTER the upgrade: kubectl +
# dashboard lose multi-issuer auth (the apiserver does NOT crash on this, as verified
# by an isolated repro 2026-06-24; the 2026-06-24 v1.35 upgrade *stall* was a
# separate etcd IO-starvation issue, see
# docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md). The
# remote script below now ALSO reconciles (3) via `kubeadm init phase
# upload-config`, so a future kubeadm upgrade regenerates a CORRECT manifest. The
# k8s-version-upgrade chain additionally ALERTS (does not block, since SSO drift is
# recoverable) via `kubeadm upgrade diff` in preflight if --authentication-config
# would still be dropped.
# #
# SAFETY: the remote script health-gates on /livez and AUTO-ROLLS-BACK the # SAFETY: the remote script health-gates on /livez and AUTO-ROLLS-BACK the
# manifest from a timestamped backup if the apiserver does not recover, so a # manifest from a timestamped backup if the apiserver does not recover, so a
# malformed config cannot leave the single master down. # malformed config cannot leave the single master down. Reconciling kubeadm-config
# is zero-impact on the running cluster (the CM is only read during an upgrade).
variable "k8s_master_host" { variable "k8s_master_host" {
type = string type = string
@ -97,12 +110,55 @@ locals {
print('flag-inserted' if done else 'ANCHOR-NOT-FOUND') print('flag-inserted' if done else 'ANCHOR-NOT-FOUND')
PY PY
# Reconciles the kubeadm-config ClusterConfiguration's apiServer.extraArgs:
# drops the stale single-issuer --oidc-* args and ensures --authentication-config
# is present (anchored after --authorization-mode). Stdlib-only (the master is
# only guaranteed python3, not pyyaml/yq). Idempotent; preserves all other
# fields (etcd args, audit args, extraVolumes) verbatim. Exits 3 if the
# authorization-mode anchor is missing (fail loud, leave the CM untouched).
kubeadm_oidc_reconcile_py = <<-PY
import sys
lines = sys.stdin.read().split('\n')
out, i, n = [], 0, len(lines)
have_authn = any('name: authentication-config' in l for l in lines)
inserted = have_authn
while i < n:
ln = lines[i]; s = ln.strip()
if s.startswith('- name: oidc-'):
i += 2 if (i + 1 < n and lines[i + 1].strip().startswith('value:')) else 1
continue
out.append(ln)
if (not inserted) and s == '- name: authorization-mode':
indent = ln[:len(ln) - len(ln.lstrip())]
if i + 1 < n and lines[i + 1].strip().startswith('value:'):
out.append(lines[i + 1]); i += 2
else:
i += 1
out.append(indent + '- name: authentication-config')
out.append(indent + ' value: /etc/kubernetes/pki/auth-config.yaml')
inserted = True
continue
i += 1
if not inserted:
sys.stderr.write('ANCHOR-NOT-FOUND: authorization-mode\n'); sys.exit(3)
sys.stdout.write('\n'.join(out))
PY
# Whole remote operation, base64-embedded for byte-exact transfer (no # Whole remote operation, base64-embedded for byte-exact transfer (no
# heredoc/escaping hazards across SSH). # heredoc/escaping hazards across SSH).
apiserver_auth_remote_script = <<-SH apiserver_auth_remote_script = <<-SH
MANIFEST=/etc/kubernetes/manifests/kube-apiserver.yaml MANIFEST=/etc/kubernetes/manifests/kube-apiserver.yaml
AUTHCFG=/etc/kubernetes/pki/auth-config.yaml AUTHCFG=/etc/kubernetes/pki/auth-config.yaml
TS=$(date +%s) TS=$(date +%s)
# Manifest backups MUST live OUTSIDE /etc/kubernetes/manifests/: the kubelet
# treats EVERY file in that dir as a static pod, so a kube-apiserver.yaml.bak
# there becomes a SECOND apiserver static pod. On a kubeadm upgrade (when the
# real manifest's image changes) the two conflict, the kubelet flip-flops, the
# new apiserver never stabilises, and kubeadm rolls back with "static Pod hash
# did not change". This stalled the 1.34->1.35 upgrade for days (root cause found
# 2026-06-26; the old `cp "$MANIFEST" "$MANIFEST.bak"` planted it on 2026-06-18).
BAKDIR=/etc/kubernetes/apiserver-oidc-bak
sudo install -d -m 700 "$BAKDIR"
# 1. Write the structured AuthenticationConfiguration (hot-reloaded by the # 1. Write the structured AuthenticationConfiguration (hot-reloaded by the
# apiserver on change; mounted into the pod via the existing pki hostPath). # apiserver on change; mounted into the pod via the existing pki hostPath).
@ -112,7 +168,7 @@ locals {
# 2. Ensure the apiserver references it. Only touch the manifest (which triggers a restart) # 2. Ensure the apiserver references it. Only touch the manifest (which triggers a restart)
# when the flag is missing; otherwise the file write above hot-reloads. # when the flag is missing; otherwise the file write above hot-reloads.
if ! sudo grep -q -- '--authentication-config=' "$MANIFEST"; then if ! sudo grep -q -- '--authentication-config=' "$MANIFEST"; then
sudo cp "$MANIFEST" "$MANIFEST.bak.$TS" sudo cp "$MANIFEST" "$BAKDIR/kube-apiserver.yaml.$TS"
sudo sed -i '/--oidc-issuer-url/d;/--oidc-client-id/d;/--oidc-username-claim/d;/--oidc-groups-claim/d' "$MANIFEST" sudo sed -i '/--oidc-issuer-url/d;/--oidc-client-id/d;/--oidc-username-claim/d;/--oidc-groups-claim/d' "$MANIFEST"
echo '${base64encode(local.apiserver_flag_insert_py)}' | base64 -d | sudo python3 - "$MANIFEST" echo '${base64encode(local.apiserver_flag_insert_py)}' | base64 -d | sudo python3 - "$MANIFEST"
fi fi
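
The backup-location rule can be sketched in Python. This is an illustrative stand-in, not the shell the diff ships: `inside` and `newest_backup` are hypothetical helpers, and the sketch picks the newest backup by its epoch suffix, whereas the remote script uses `ls -t` (modification time).

```python
import os
import tempfile
from typing import Optional

MANIFESTS_DIR = "/etc/kubernetes/manifests"


def inside(path: str, directory: str) -> bool:
    """True if path lives under directory (where the kubelet would pick it up)."""
    path = os.path.abspath(path)
    directory = os.path.abspath(directory)
    return os.path.commonpath([path, directory]) == directory


def newest_backup(bakdir: str, prefix: str = "kube-apiserver.yaml.") -> Optional[str]:
    """Return the backup with the largest epoch suffix, or None if there is none."""
    candidates = []
    for name in os.listdir(bakdir):
        suffix = name[len(prefix):]
        if name.startswith(prefix) and suffix.isdigit():
            candidates.append((int(suffix), name))
    if not candidates:
        return None
    return os.path.join(bakdir, max(candidates)[1])


if __name__ == "__main__":
    # The old location sits inside the static-pod dir; the new one does not.
    print(inside("/etc/kubernetes/manifests/kube-apiserver.yaml.bak", MANIFESTS_DIR))
    print(inside("/etc/kubernetes/apiserver-oidc-bak/kube-apiserver.yaml.1", MANIFESTS_DIR))
    with tempfile.TemporaryDirectory() as bakdir:
        for ts in ("1750000000", "1750000100"):
            open(os.path.join(bakdir, "kube-apiserver.yaml." + ts), "w").close()
        print(os.path.basename(newest_backup(bakdir)))
```

Selecting by suffix rather than mtime is a deliberate difference in this sketch: it stays correct even if a backup file is touched later.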
@ -131,12 +187,36 @@ locals {
done done
if [ "$ok" != "1" ]; then if [ "$ok" != "1" ]; then
echo "kube-apiserver UNHEALTHY after change — rolling back" echo "kube-apiserver UNHEALTHY after change — rolling back"
# $BAKDIR is root-only (mode 700), so list it via sudo.
BAK=$(ls -t "$MANIFEST".bak.* 2>/dev/null | head -1) BAK=$(sudo sh -c 'ls -t "$1"/kube-apiserver.yaml.* 2>/dev/null | head -1' sh "$BAKDIR")
if [ -n "$BAK" ]; then sudo cp "$BAK" "$MANIFEST"; fi if [ -n "$BAK" ]; then sudo cp "$BAK" "$MANIFEST"; fi
for i in $(seq 1 60); do sleep 2; if curl -sk https://localhost:6443/livez 2>/dev/null | grep -q '^ok'; then break; fi; done for i in $(seq 1 60); do sleep 2; if curl -sk https://localhost:6443/livez 2>/dev/null | grep -q '^ok'; then break; fi; done
echo "rolled back to previous manifest"; exit 1 echo "rolled back to previous manifest"; exit 1
fi fi
echo "kube-apiserver healthy with multi-issuer --authentication-config" echo "kube-apiserver healthy with multi-issuer --authentication-config"
# 5. Reconcile kubeadm-config so a FUTURE `kubeadm upgrade` regenerates the
# apiserver manifest WITH --authentication-config instead of reverting to
# the stale single-issuer --oidc-* flags. Without this, kubeadm rewrote the
# manifest from kubeadm-config on every control-plane upgrade and multi-issuer
# SSO broke after the upgrade (the apiserver itself does not crash on this).
# Zero live impact (the CM is only read at upgrade time); idempotent;
# best-effort (the chain's `kubeadm upgrade diff` preflight gate is the
# backstop if this cannot run).
KC="sudo kubectl --kubeconfig /etc/kubernetes/admin.conf"
CC=$($KC -n kube-system get cm kubeadm-config -o jsonpath='{.data.ClusterConfiguration}' 2>/dev/null || true)
if [ -n "$CC" ] && { echo "$CC" | grep -q 'oidc-issuer-url' || ! echo "$CC" | grep -q 'authentication-config'; }; then
echo "Reconciling kubeadm-config (oidc-* -> authentication-config) so kubeadm upgrade keeps structured auth"
echo '${base64encode(local.kubeadm_oidc_reconcile_py)}' | base64 -d > /tmp/reconcile_kubeadm_oidc.py
if printf '%s' "$CC" | python3 /tmp/reconcile_kubeadm_oidc.py > /tmp/kubeadm-cc-new.yaml \
&& sudo kubeadm init phase upload-config kubeadm --config /tmp/kubeadm-cc-new.yaml; then
echo "kubeadm-config reconciled: future control-plane upgrades keep --authentication-config"
else
echo "WARN: kubeadm-config reconcile failed; the upgrade-chain preflight check will alert on the next upgrade"
fi
rm -f /tmp/reconcile_kubeadm_oidc.py /tmp/kubeadm-cc-new.yaml
else
echo "kubeadm-config already uses --authentication-config (no oidc drift)"
fi
SH SH
} }
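
To see what the kubeadm-config reconcile step does to `apiServer.extraArgs`, here is a Python sketch that wraps the same line-based logic in a function and runs it on a small input. The `SAMPLE` fragment is hypothetical, the function raises `ValueError` where the embedded script exits with status 3, and the two-space indent before `value:` is an assumption, since whitespace is collapsed in this view.

```python
AUTH_CONFIG_PATH = "/etc/kubernetes/pki/auth-config.yaml"


def reconcile(cluster_config: str) -> str:
    """Drop oidc-* extraArgs; ensure authentication-config follows authorization-mode."""
    lines = cluster_config.split("\n")
    out, i, n = [], 0, len(lines)
    inserted = any("name: authentication-config" in l for l in lines)
    while i < n:
        ln = lines[i]
        s = ln.strip()
        if s.startswith("- name: oidc-"):
            # Skip the name line and, if present, its value line.
            i += 2 if (i + 1 < n and lines[i + 1].strip().startswith("value:")) else 1
            continue
        out.append(ln)
        if not inserted and s == "- name: authorization-mode":
            indent = ln[: len(ln) - len(ln.lstrip())]
            if i + 1 < n and lines[i + 1].strip().startswith("value:"):
                out.append(lines[i + 1])
                i += 2
            else:
                i += 1
            out.append(indent + "- name: authentication-config")
            out.append(indent + "  value: " + AUTH_CONFIG_PATH)
            inserted = True
            continue
        i += 1
    if not inserted:
        raise ValueError("ANCHOR-NOT-FOUND: authorization-mode")
    return "\n".join(out)


# Hypothetical ClusterConfiguration fragment, for illustration only.
SAMPLE = "\n".join([
    "apiServer:",
    "  extraArgs:",
    "  - name: authorization-mode",
    "    value: Node,RBAC",
    "  - name: oidc-issuer-url",
    "    value: https://issuer.example",
    "  - name: oidc-client-id",
    "    value: kubernetes",
])

if __name__ == "__main__":
    once = reconcile(SAMPLE)
    print(once)
    print("idempotent:", reconcile(once) == once)
```

Running the function on its own output returns that output unchanged, which is the idempotency the comment on the embedded script claims.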
@ -155,6 +235,14 @@ resource "null_resource" "apiserver_oidc_config" {
} }
triggers = { triggers = {
# Intentionally hash ONLY the issuer config, NOT the remote script. CI applies
# the rbac stack with no ssh_private_key (var defaults to ""), so a re-run of
# this SSH provisioner in CI would fail; hence the null_resource must stay a
# no-op on a plain CI apply. Script changes (e.g. the 2026-06-24 kubeadm-config
# reconciliation) reach the cluster via the apiserver-oidc-restore ConfigMap
# below (a plain k8s resource, no ssh) which the upgrade chain re-runs. To force
# this provisioner to re-run after a script change, apply locally with
# `-replace` + TF_VAR_ssh_private_key (see docs/runbooks/k8s-version-upgrade.md).
auth_config = sha256(local.apiserver_auth_config_yaml) auth_config = sha256(local.apiserver_auth_config_yaml)
} }
} }
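
The trigger design can be illustrated with a short Python sketch, assuming hypothetical config and script strings. The function mirrors `sha256(local.apiserver_auth_config_yaml)` and deliberately ignores the remote script, so editing the script alone leaves the trigger unchanged and the provisioner does not re-run.

```python
import hashlib


def triggers(auth_config_yaml: str, remote_script: str) -> dict:
    """Mirror of the null_resource triggers map: only the issuer config is hashed."""
    # remote_script is intentionally unused, matching the comment in the diff.
    return {"auth_config": hashlib.sha256(auth_config_yaml.encode("utf-8")).hexdigest()}


if __name__ == "__main__":
    # Hypothetical contents, for illustration only.
    config_v1 = "issuers: [a]"
    config_v2 = "issuers: [a, b]"
    script_v1 = "echo old"
    script_v2 = "echo new-with-kubeadm-reconcile"

    base = triggers(config_v1, script_v1)
    print("script change re-runs provisioner:", triggers(config_v1, script_v2) != base)
    print("config change re-runs provisioner:", triggers(config_v2, script_v1) != base)
```

The trade-off is the one the comment states: a script-only change needs a local apply with `-replace` (or the restore ConfigMap path) to reach the master.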

@ -7,6 +7,9 @@ variable "redis_host" { type = string }
variable "mysql_host" { type = string } variable "mysql_host" { type = string }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -36,6 +39,9 @@ resource "kubernetes_manifest" "external_secret" {
# DB credentials from Vault database engine (rotated automatically) # DB credentials from Vault database engine (rotated automatically)
# Provides DB_CONNECTION_STRING that auto-updates when password rotates # Provides DB_CONNECTION_STRING that auto-updates when password rotates
resource "kubernetes_manifest" "db_external_secret" { resource "kubernetes_manifest" "db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -85,6 +91,9 @@ data "kubernetes_secret" "eso_secrets" {
# fresh node would also fail. ESO renders the dockerconfigjson server-side # fresh node would also fail. ESO renders the dockerconfigjson server-side
# (Sprig `b64enc`) so the PAT never sits in K8s in cleartext. # (Sprig `b64enc`) so the PAT never sits in K8s in cleartext.
resource "kubernetes_manifest" "dockerhub_pull_secret" { resource "kubernetes_manifest" "dockerhub_pull_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

@ -55,6 +55,9 @@ resource "kubernetes_namespace" "recruiter_responder" {
# Schema in CNPG: `recruiter_responder` (alembic creates on first migrate). # Schema in CNPG: `recruiter_responder` (alembic creates on first migrate).
# DB user: created via Vault database engine; see static-creds/pg-recruiter-responder. # DB user: created via Vault database engine; see static-creds/pg-recruiter-responder.
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -107,6 +110,9 @@ resource "kubernetes_manifest" "external_secret" {
# Pre-req in dbaas: CNPG cluster has DB `recruiter_responder`, role # Pre-req in dbaas: CNPG cluster has DB `recruiter_responder`, role
# `recruiter_responder`, and Vault role `static-creds/pg-recruiter-responder`. # `recruiter_responder`, and Vault role `static-creds/pg-recruiter-responder`.
resource "kubernetes_manifest" "db_external_secret" { resource "kubernetes_manifest" "db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

@ -41,6 +41,9 @@ module "tls_secret" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

@ -25,6 +25,9 @@ resource "kubernetes_namespace" "rybbit" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

@ -185,6 +185,9 @@ resource "kubernetes_service" "aiostreams" {
} }
resource "kubernetes_manifest" "probe_secrets" { resource "kubernetes_manifest" "probe_secrets" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

@ -5,6 +5,9 @@ variable "tls_secret_name" {
variable "nfs_server" { type = string } variable "nfs_server" { type = string }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

@ -21,6 +21,9 @@ resource "kubernetes_namespace" "shadowsocks" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

@ -20,6 +20,9 @@ resource "kubernetes_namespace" "speedtest" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

@ -16,6 +16,9 @@
# `secret/stem95su.rclone_conf`. A failed run surfaces as a failed Job. # `secret/stem95su.rclone_conf`. A failed run surfaces as a failed Job.
resource "kubernetes_manifest" "rclone_external_secret" { resource "kubernetes_manifest" "rclone_external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

@ -58,6 +58,9 @@ resource "kubernetes_namespace" "t3_afk" {
# (wired into ~/.gitconfig insteadOf rewrites in the container command). # (wired into ~/.gitconfig insteadOf rewrites in the container command).
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

@ -22,6 +22,9 @@ resource "kubernetes_namespace" "tandoor" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

@ -419,6 +419,9 @@ module "ingress" {
# ExternalSecret for Technitium MySQL password (Vault auto-rotation) # ExternalSecret for Technitium MySQL password (Vault auto-rotation)
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

@ -49,6 +49,9 @@ module "tls_secret" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -103,6 +106,9 @@ resource "kubernetes_manifest" "external_secret" {
# DB credentials from Vault database engine (rotated every 24h) # DB credentials from Vault database engine (rotated every 24h)
resource "kubernetes_manifest" "db_external_secret" { resource "kubernetes_manifest" "db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

@ -215,6 +215,9 @@ resource "kubernetes_namespace" "tripit" {
# Schema in CNPG: `tripit` (alembic creates tables on first migrate). # Schema in CNPG: `tripit` (alembic creates tables on first migrate).
# DB user: created via Vault database engine; see static-creds/pg-tripit. # DB user: created via Vault database engine; see static-creds/pg-tripit.
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -291,6 +294,9 @@ resource "kubernetes_manifest" "external_secret" {
# Pre-req in dbaas: CNPG cluster has DB `tripit`, role `tripit`, and Vault # Pre-req in dbaas: CNPG cluster has DB `tripit`, role `tripit`, and Vault
# role `static-creds/pg-tripit`. # role `static-creds/pg-tripit`.
resource "kubernetes_manifest" "db_external_secret" { resource "kubernetes_manifest" "db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

@ -14,6 +14,9 @@ resource "kubernetes_namespace" "tuya-bridge" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

@ -35,6 +35,9 @@ resource "kubernetes_namespace" "shlink" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -67,6 +70,9 @@ resource "kubernetes_manifest" "external_secret" {
# the deployment is migrated to use env_from with this secret, the plan-time # the deployment is migrated to use env_from with this secret, the plan-time
# kubernetes_secret can be removed. # kubernetes_secret can be removed.
resource "kubernetes_manifest" "db_external_secret" { resource "kubernetes_manifest" "db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

@ -21,6 +21,9 @@ resource "kubernetes_namespace" "wealthfolio" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -51,6 +54,9 @@ resource "kubernetes_manifest" "external_secret" {
# `pg-wealthfolio-sync` rotates this every 7 days; ExternalSecret refreshes # `pg-wealthfolio-sync` rotates this every 7 days; ExternalSecret refreshes
# the K8s Secret every 15m so the sidecar always has a valid password. # the K8s Secret every 15m so the sidecar always has a valid password.
resource "kubernetes_manifest" "wealthfolio_sync_db_external_secret" { resource "kubernetes_manifest" "wealthfolio_sync_db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"
@ -777,6 +783,9 @@ resource "kubernetes_cron_job_v1" "wealthfolio_sync" {
# below references it as $__env{WEALTH_PG_PASSWORD}. Reloader restarts # below references it as $__env{WEALTH_PG_PASSWORD}. Reloader restarts
# Grafana whenever ESO updates this secret (every 7d on rotation). # Grafana whenever ESO updates this secret (every 7d on rotation).
resource "kubernetes_manifest" "grafana_wealth_db_external_secret" { resource "kubernetes_manifest" "grafana_wealth_db_external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

@ -291,6 +291,9 @@ module "ingress" {
} }
resource "kubernetes_manifest" "external_secret" { resource "kubernetes_manifest" "external_secret" {
field_manager {
force_conflicts = true
}
manifest = { manifest = {
apiVersion = "external-secrets.io/v1" apiVersion = "external-secrets.io/v1"
kind = "ExternalSecret" kind = "ExternalSecret"

Some files were not shown because too many files have changed in this diff.