goldmane-trail: polish follow-ups #57/#59/#61/#62/#63 + digest→#alerts

Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a
subagent workflow (distinct stacks in parallel, docs last):

- #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me,
  auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081
  (the operator's whisker NP default-denies ingress). calico stack.
- #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts
  (prometheus_chart_values.tpl) + cluster-health check #48.
- #59 service-identity labels on the multi-Service namespaces (monitoring's 5
  TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they
  update in-place.
- #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog,
  security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md.
  #62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the
  edge table (feeds code-8ywc; enforce-flips out of scope).
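
For illustration only, the #57 allow rule (traefik->whisker:8081) might look roughly like the Terraform sketch below. This is not the code from this commit: the resource name, the "calico-system" namespace for whisker, the "k8s-app" pod label, and the "traefik" namespace name are all assumptions and would need to match the real cluster. It also assumes a plain Kubernetes NetworkPolicy is enough to open the path next to the operator-managed policy.

```hcl
# Hypothetical sketch: names, labels, and namespaces are assumptions,
# not the contents of this commit.
resource "kubernetes_network_policy" "allow_traefik_to_whisker" {
  metadata {
    name      = "allow-traefik-to-whisker"
    namespace = "calico-system" # assumed namespace of the whisker pods
  }

  spec {
    # Select the whisker pods (assumed label).
    pod_selector {
      match_labels = {
        "k8s-app" = "whisker"
      }
    }

    policy_types = ["Ingress"]

    ingress {
      # Only admit traffic from the namespace where Traefik runs
      # (assumed to be named "traefik").
      from {
        namespace_selector {
          match_labels = {
            "kubernetes.io/metadata.name" = "traefik"
          }
        }
      }

      # The whisker port named in the commit message.
      ports {
        port     = "8081"
        protocol = "TCP"
      }
    }
  }
}
```

Selecting on the namespace rather than on Traefik pod labels keeps the sketch independent of how the Traefik chart labels its pods; a tighter rule could add a pod_selector inside the same from block.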

Also fixes the digest's Slack target: the #security channel override fails with
404 channel_not_found because the app behind the shared
alertmanager_slack_api_url webhook isn't a member of #security (this likely also
breaks alertmanager's slack-security receiver; flagged in the runbook). The
digest is routed to #alerts (the webhook's working channel) until the app is
invited; verified that a real digest run posts cleanly (360 edges).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-25 17:49:25 +00:00
parent 306cdd4cb3
commit 6c5288998f
17 changed files with 626 additions and 11 deletions

@@ -130,6 +130,11 @@ resource "kubernetes_deployment" "blackbox_exporter" {
     labels = {
       app = "blackbox-exporter"
       tier = var.tier
+      # ADR-0014 service identity: monitoring is a multi-Service namespace, so
+      # the namespace alone can't attribute Goldmane flows. Value = the
+      # fronting Service name (kubernetes_service.blackbox_exporter is named
+      # "blackbox-exporter").
+      "service-identity" = "blackbox-exporter"
     }
     annotations = {
       "reloader.stakater.com/search" = "true"
@@ -146,6 +151,10 @@ resource "kubernetes_deployment" "blackbox_exporter" {
       metadata {
         labels = {
           app = "blackbox-exporter"
+          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
+          # disambiguating identity must live on the pod template (not just
+          # the Deployment metadata above). Not in the selector, so no replace.
+          "service-identity" = "blackbox-exporter"
         }
       }
       spec {

@@ -5,6 +5,11 @@ resource "kubernetes_deployment" "goflow2" {
     labels = {
       app = "goflow2"
       tier = var.tier
+      # ADR-0014 service identity: monitoring is a multi-Service namespace, so
+      # the namespace alone can't attribute Goldmane flows. Value = the
+      # fronting Service name (kubernetes_service.goflow2, the metrics svc; the
+      # goflow2-netflow NodePort is the same pod by another name).
+      "service-identity" = "goflow2"
     }
   }
   spec {
@@ -18,6 +23,10 @@ resource "kubernetes_deployment" "goflow2" {
       metadata {
         labels = {
           app = "goflow2"
+          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
+          # disambiguating identity must live on the pod template (not just
+          # the Deployment metadata above). Not in the selector, so no replace.
+          "service-identity" = "goflow2"
         }
       }
       spec {

@@ -47,6 +47,10 @@ resource "kubernetes_deployment" "idrac-redfish" {
     labels = {
       app = "idrac-redfish-exporter"
       tier = var.tier
+      # ADR-0014 service identity: monitoring is a multi-Service namespace, so
+      # the namespace alone can't attribute Goldmane flows. Value = the
+      # fronting Service name (kubernetes_service.idrac-redfish-exporter).
+      "service-identity" = "idrac-redfish-exporter"
     }
     annotations = {
       "reloader.stakater.com/search" = "true"
@@ -63,6 +67,10 @@ resource "kubernetes_deployment" "idrac-redfish" {
       metadata {
         labels = {
           app = "idrac-redfish-exporter"
+          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
+          # disambiguating identity must live on the pod template (not just
+          # the Deployment metadata above). Not in the selector, so no replace.
+          "service-identity" = "idrac-redfish-exporter"
         }
       }
       spec {

@@ -1450,6 +1450,49 @@ serverFiles:
                 Remediation: right-size top reservers via Goldilocks (immich-server,
                 frigate, prometheus, pg-cluster, paperless) or bump VM RAM on
                 k8s-node2/k8s-node3 from 32GB → 48GB to match node1.
+      # Goldmane edge-aggregator (ADR-0014 / infra #58, #61): the durable
+      # who-talks-to-whom trail. The aggregator pod has NO /metrics endpoint,
+      # so its health is inferred from kube-state-metrics signals — the trail
+      # must not silently die. Two failure modes are covered:
+      # - the aggregate Deployment stops consuming Goldmane's flow stream
+      # (AggregatorDown) → no new edges ever land in the goldmane_edges DB
+      # - the daily digest CronJob can't post new edges to Slack
+      # (DigestFailing) → edges still land but nobody is told.
+      # A freshness probe (max(last_seen) staleness) is intentionally NOT here:
+      # AggregatorDown is the agreed floor and needs no extra moving parts.
+      - name: Network Observability (Goldmane)
+        rules:
+          # Deployment has <1 available replica for 15m. kube-state-metrics
+          # keeps `kube_deployment_status_replicas_available` (metric-keep list
+          # in serverFiles below). The 15m window rides out a normal rollout /
+          # node drain without paging; a genuinely-dead aggregator means the
+          # edge trail has stopped recording and stays down.
+          - alert: AggregatorDown
+            expr: |
+              kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1
+                and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
+            for: 15m
+            labels:
+              severity: warning
+            annotations:
+              summary: "goldmane-edge-aggregator has no available replica — the who-talks-to-whom edge trail has stopped recording"
+              description: "The aggregate Deployment streams Calico Goldmane flows into the goldmane_edges CNPG DB. With 0 replicas, no new namespace-pair edges are captured. `kubectl -n goldmane-edge-aggregator describe deploy goldmane-edge-aggregator` + check the goldmane svc (calico-system) is reachable."
+          # The goldmane-edges-digest CronJob has a failed Job that started in
+          # the last 24h. Mirrors the generic JobFailed shape but scoped to the
+          # digest so it routes here. `for: 30m` rides out the apply/scrape
+          # transient; the digest runs daily so a real failure won't self-heal
+          # until the next run — surface it same-day rather than waiting 24h.
+          - alert: DigestFailing
+            expr: |
+              kube_job_status_failed{namespace="goldmane-edge-aggregator", job_name=~"goldmane-edges-digest.*"} > 0
+                and on(namespace, job_name)
+                  (time() - kube_job_status_start_time{namespace="goldmane-edge-aggregator", job_name=~"goldmane-edges-digest.*"}) < 86400
+            for: 30m
+            labels:
+              severity: warning
+            annotations:
+              summary: "goldmane-edges-digest CronJob failing — new edges captured but not posted to #security"
+              description: "The daily edge digest Job {{ $labels.job_name }} failed. Edges may still be landing in the goldmane_edges DB but no one is being notified of new namespace-pairs. `kubectl -n goldmane-edge-aggregator logs job/{{ $labels.job_name }}`."
       - name: Infrastructure Health
         rules:
           - alert: HomeAssistantDown

@@ -22,6 +22,10 @@ resource "kubernetes_deployment" "pve_exporter" {
     namespace = kubernetes_namespace.monitoring.metadata[0].name
     labels = {
       tier = var.tier
+      # ADR-0014 service identity: monitoring is a multi-Service namespace, so
+      # the namespace alone can't attribute Goldmane flows. Value = the
+      # fronting Service name (kubernetes_service.proxmox-exporter).
+      "service-identity" = "proxmox-exporter"
     }
   }
@@ -37,6 +41,10 @@ resource "kubernetes_deployment" "pve_exporter" {
       metadata {
         labels = {
           app = "proxmox-exporter"
+          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
+          # disambiguating identity must live on the pod template (not just
+          # the Deployment metadata above). Not in the selector, so no replace.
+          "service-identity" = "proxmox-exporter"
         }
       }

@@ -31,6 +31,10 @@ resource "kubernetes_deployment" "snmp-exporter" {
     labels = {
       app = "snmp-exporter"
       tier = var.tier
+      # ADR-0014 service identity: monitoring is a multi-Service namespace, so
+      # the namespace alone can't attribute Goldmane flows. Value = the
+      # fronting Service name (kubernetes_service.snmp-exporter).
+      "service-identity" = "snmp-exporter"
     }
     annotations = {
       "reloader.stakater.com/search" = "true"
@@ -47,6 +51,10 @@ resource "kubernetes_deployment" "snmp-exporter" {
       metadata {
         labels = {
           app = "snmp-exporter"
+          # ADR-0014: Goldmane/Felix stamps POD labels onto flows, so the
+          # disambiguating identity must live on the pod template (not just
+          # the Deployment metadata above). Not in the selector, so no replace.
+          "service-identity" = "snmp-exporter"
         }
       }
       spec {