infra/docs/architecture/monitoring.md
Viktor Barzin a048b37f60 security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE
## W1.1 — K8s API audit log shipping (LIVE)
- alloy.yaml: added control-plane toleration so Alloy DaemonSet runs on
  k8s-master node. Verified alloy-7zg7t scheduled on master, tailing
  /var/log/kubernetes/audit.log
- loki.tf "Security Wave 1" rule group: added K2-K9 alert rules
  (skipped K1 per Q7 decision):
  - K2 K8sSATokenFromUnexpectedIP
  - K3 K8sSensitiveSecretReadByUnexpectedActor
  - K4 K8sExecIntoSensitiveNamespace
  - K5 K8sMassDelete (>5 Pod/Secret/CM in 60s by single user)
  - K6 K8sAuditPolicyModified (kubeadm-config CM change)
  - K7 K8sClusterRoleWildcardCreated (verbs=* + resources=*)
  - K8 K8sAnonymousBindingGranted
  - K9 K8sViktorFromUnexpectedIP
- All rules use source-IP regex matching the wave-1 allowlist
  (10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc,
  100.64-127 tailnet) and `lane = "security"` → #security Slack route.
- Verified: kubectl-audit logs flowing in Loki query
  {job="kubernetes-audit"} returns events with node=k8s-master.
- Verified: /loki/api/v1/rules lists all K2-K9 + V1-V7 + S1.

## W1.5 — require-trusted-registries Enforce (LIVE)
- security-policies.tf: flipped Audit→Enforce with explicit allowlist
  built by `kubectl get pods -A -o jsonpath='{..image}'` enumeration.
- Removed `*/*` catch-all (which made Audit→Enforce a no-op).
- Pattern includes 15 explicit registries, 6 DockerHub library bare
  names, 56 DockerHub user repos.
- Verified by admission dry-run:
  - evilcorp.example/malware:v1 → BLOCKED with custom message
  - alpine:3.20 → ALLOWED (matches `alpine*`)
  - docker.io/library/alpine:3.20 → ALLOWED (matches `docker.io/*`)

## W1.6 — Calico flow logs (BLOCKED — Calico OSS limitation)
- Tried adding FelixConfiguration with flowLogsFileEnabled=true via
  kubectl_manifest in stacks/calico/main.tf
- Calico OSS rejected with "strict decoding error: unknown field
  spec.flowLogsFileEnabled" — these fields are Calico Enterprise/Tigera-only
- Removed the failed resource. Documented alternative paths in main.tf
  comment block: GNP with action=Log (iptables NFLOG → journal), Cilium
  migration, eBPF tooling, or Tigera Operator adoption.

## Docs updates
- security.md status table refreshed: W1.1/W1.2/W1.3/W1.4/W1.5 LIVE,
  W1.6/W1.7 blocked
- monitoring.md: Loki marked DEPLOYED (was incorrectly NOT-DEPLOYED in
  prior session before today's apply)

## Cleanup
- Removed stacks/kyverno/imports.tf (TF 1.5+ import blocks completed
  their job in the 2026-05-18 apply; should not stay in tree per TF docs)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:59 +00:00


Monitoring & Alerting Architecture

Overview

The monitoring stack provides observability for the home Kubernetes cluster through metrics collection (Prometheus), visualization (Grafana), log aggregation (Loki), alerting (Alertmanager), and uptime monitoring (Uptime Kuma). GPU metrics are collected via NVIDIA's dcgm-exporter. The system tracks infrastructure health, application performance, backup success, and resource utilization, and uses alert inhibition to reduce noise during cascading failures.

Architecture Diagram

graph TB
    subgraph "Metric Sources"
        K8S[Kubernetes API Server]
        NODES[Node Exporters]
        PODS[Application Pods]
        GPU[NVIDIA GPU via dcgm-exporter]
        UPS[UPS Exporter]
        NFS[NFS Exporter]
        EMAIL[Email Roundtrip Probe<br/>CronJob every 10m]
    end

    subgraph "Monitoring Stack (platform stack)"
        PROM[Prometheus<br/>Scrape & Store]
        LOKI[Loki<br/>Log Aggregation]
        AM[Alertmanager<br/>Alert Routing]
        GRAFANA[Grafana<br/>14+ Dashboards<br/>OIDC via Authentik]
        UPTIME[Uptime Kuma<br/>HTTP Monitors]
    end

    subgraph "Alert Flow"
        INHIBIT[Inhibition Rules<br/>Node Down → Suppress Pod Alerts]
        NOTIFY[Notifications]
    end

    K8S -->|ServiceMonitors| PROM
    NODES -->|Metrics| PROM
    PODS -->|Metrics| PROM
    PODS -->|Logs| LOKI
    GPU -->|GPU Metrics| PROM
    UPS -->|UPS Metrics| PROM
    NFS -->|NFS Metrics| PROM

    PROM -->|Query| GRAFANA
    PROM -->|Alerts| AM
    LOKI -->|Query| GRAFANA

    AM --> INHIBIT
    INHIBIT --> NOTIFY

    EMAIL -->|Pushgateway| PROM
    EMAIL -.->|Push| UPTIME
    PODS -.->|HTTP Health| UPTIME

Components

| Component | Version | Location | Purpose |
| --- | --- | --- | --- |
| Prometheus | Latest (Diun monitored) | stacks/monitoring/modules/monitoring/ | Metrics collection and storage, scrape configs for all services |
| Grafana | Latest (Diun monitored) | stacks/monitoring/modules/monitoring/ | Visualization, 14+ dashboards (API server, CoreDNS, GPU, UPS, etc.) |
| Loki | DEPLOYED 2026-05-18 (SingleBinary mode, 30d retention, 50Gi PVC on proxmox-lvm, ruler enabled → Alertmanager). Re-enabled from previous "operational overhead" disable. Ships logs via Alloy DaemonSet (now on all nodes including master after 2026-05-19 toleration add). | stacks/monitoring/modules/monitoring/ | Log aggregation and querying |
| Alertmanager | Latest (Diun monitored) | stacks/monitoring/modules/monitoring/ | Alert routing with cascade inhibitions |
| Uptime Kuma | Latest (Diun monitored) | stacks/uptime-kuma/ | Internal + external HTTP monitors, status page |
| External Monitor Sync | Python 3.12 | stacks/uptime-kuma/ | CronJob (10min) syncs [External] monitors from cloudflare_proxied_names |
| dcgm-exporter | Configurable resources | stacks/monitoring/modules/monitoring/ | NVIDIA GPU metrics collection |
| Email Roundtrip Probe | Python 3.12 | stacks/mailserver/modules/mailserver/ | E2E email delivery verification via Mailgun API + IMAP |
| Forgejo Registry Integrity Probe | Alpine 3.20 + curl/jq | stacks/monitoring/modules/monitoring/main.tf | CronJob every 15m: walks /v2/_catalog on forgejo.viktorbarzin.me (HTTP via in-cluster service), HEADs every tagged manifest + index child; emits registry_manifest_integrity_* metrics to Pushgateway. Replaces the legacy registry-integrity-probe against registry.viktorbarzin.me:5050 decommissioned in Phase 4 of forgejo-registry-consolidation 2026-05-07. |

How It Works

Metrics Collection

Prometheus scrapes metrics from all cluster components and applications using ServiceMonitor CRDs and scrape configs. Every new service deployed to the cluster receives:

  1. A Prometheus scrape configuration (via ServiceMonitor or static config)
  2. An Uptime Kuma HTTP monitor for internal health checks
  3. An external HTTP monitor (auto-created by external-monitor-sync for every .viktorbarzin.me ingress that has not opted out)

External Monitoring

The external-monitor-sync CronJob (every 10min, stacks/uptime-kuma/) ensures Uptime Kuma has [External] <service> monitors for externally-reachable ingresses. Discovery is opt-out: the script lists every ingress via the K8s API and creates a monitor for any host ending in .viktorbarzin.me, skipping only those annotated uptime.viktorbarzin.me/external-monitor: "false". Both ingress_factory and the reverse-proxy factory emit that annotation when the caller sets external_monitor = false; leaving it null omits the annotation, so the ingress stays monitored by default (important for helm-provisioned ingresses that don't go through our factories). The legacy cloudflare_proxied_names ConfigMap is used as a fallback if K8s API discovery fails.
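The opt-out discovery rule can be sketched in a few lines of Python. This is an illustrative model, not the actual sync script: the annotation key and domain suffix come from the text above, while the simplified ingress dict shape (`hosts`, `annotations`) and the function name are assumptions made for the example.

```python
from typing import Iterable

# Annotation key and suffix are taken from the description above.
# The ingress dict layout is a hypothetical simplification of the K8s object.
OPT_OUT_ANNOTATION = "uptime.viktorbarzin.me/external-monitor"
DOMAIN_SUFFIX = ".viktorbarzin.me"


def hosts_to_monitor(ingresses: Iterable[dict]) -> list[str]:
    """Return sorted, de-duplicated hosts that should get an external monitor."""
    hosts: set[str] = set()
    for ingress in ingresses:
        annotations = ingress.get("annotations") or {}
        # Opt-out semantics: only an explicit "false" disables monitoring.
        # A missing annotation means the ingress stays monitored.
        if annotations.get(OPT_OUT_ANNOTATION) == "false":
            continue
        for host in ingress.get("hosts", []):
            if host.endswith(DOMAIN_SUFFIX):
                hosts.add(host)
    return sorted(hosts)
```

Because a missing annotation counts as "monitor me", ingresses created by Helm charts need no extra configuration to be picked up.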

These monitors test the full external access path (DNS → Cloudflare → Tunnel → Traefik → Service) from inside the cluster. The status-page-pusher groups them as "External Reachability" and pushes an external_internal_divergence_count metric to Pushgateway when services are externally down but internally up. Alert ExternalAccessDivergence fires after 15min of divergence.
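As a rough illustration of what the divergence metric counts, the sketch below assumes the pusher has two maps of service name to up/down status, one for external monitors and one for internal ones. Both the function name and the input shape are hypothetical; only the "externally down but internally up" rule comes from the text.

```python
def divergence_count(external_up: dict[str, bool], internal_up: dict[str, bool]) -> int:
    """Count services that are down externally while still up internally."""
    return sum(
        1
        for service, ext_ok in external_up.items()
        # A service with no internal monitor is not counted as divergent.
        if not ext_ok and internal_up.get(service, False)
    )
```

A service that is down on both paths is a plain outage, not a divergence, so it does not contribute to the count.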

Data flows from targets through Prometheus storage to Grafana dashboards. Applications emit logs to stdout/stderr which are aggregated by Loki and queryable through Grafana's log viewer.

Alert Cascade Inhibition

Alertmanager uses inhibition rules to suppress dependent alerts and prevent alert storms during cascading failures:

graph LR
    NODE_DOWN[Node Down Alert] -->|Inhibits| POD_ALERTS[Pod Alerts on That Node]
    COMPLETED[Completed CronJob Pod] -->|Excluded from| POD_READY[Pod Not Ready Alerts]

When a node goes down, all pod-level alerts for pods scheduled on that node are suppressed, reducing noise and focusing attention on the root cause.
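The suppression logic can be modelled as a label-matching check. The sketch below is a simplified Python model of how an inhibition rule behaves, not the Alertmanager configuration itself; the alert names `NodeDown` and `PodNotReady` and the `node` label are assumptions chosen for illustration.

```python
def is_inhibited(
    target: dict[str, str],
    firing: list[dict[str, str]],
    source_alertname: str = "NodeDown",  # hypothetical source alert name
    equal: tuple[str, ...] = ("node",),  # hypothetical label that must match
) -> bool:
    """True if a firing source alert shares every `equal` label with the target."""
    for alert in firing:
        if alert.get("alertname") != source_alertname:
            continue
        # Simplification: a label missing from the target never matches.
        if all(name in target and target[name] == alert.get(name) for name in equal):
            return True
    return False
```

With this model, a pod alert carrying `node="k8s-node2"` is suppressed while a `NodeDown` alert for the same node is firing, and pod alerts on other nodes still notify.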

GPU Monitoring

NVIDIA GPU metrics are collected via dcgm-exporter with configurable resource limits (dcgmExporter.resources). Metrics include GPU utilization, memory usage, temperature, and power consumption.

Database Version Pinning

MySQL, PostgreSQL, and Redis images have Diun monitoring disabled to prevent automatic version updates that could cause compatibility issues. Version upgrades are manual and coordinated.

Configuration

Key Config Files

  • Monitoring Stack: stacks/platform/modules/monitoring/
    • Prometheus scrape configs and recording rules
    • Grafana dashboard definitions
    • Alertmanager routing and inhibition rules
    • Uptime Kuma configuration

Prometheus Scrape Configs

Every service must expose metrics and be registered in Prometheus via ServiceMonitor or static scrape config. Standard pattern:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
spec:
  selector:
    matchLabels:
      app: my-service
  endpoints:
  - port: metrics

Grafana Dashboards

14+ pre-configured dashboards covering:

  • Kubernetes API Server
  • CoreDNS
  • GPU metrics
  • UPS status
  • Node metrics
  • Pod resource usage
  • Application-specific metrics

Alert Definitions

Infrastructure Alerts

  • OOMKill: Container killed due to out-of-memory
  • PodReplicaMismatch: Deployment/StatefulSet replica count doesn't match desired
  • ClusterMemoryRequestsHigh: Cluster memory requests >85%
  • ContainerNearOOM: Container using >85% of memory limit
  • PodUnschedulable: Pod cannot be scheduled due to resource constraints
  • CPUTemp: CPU temperature threshold exceeded
  • SSDWrites: Excessive SSD write volume
  • NFSResponsiveness: NFS mount latency issues
  • UPSBattery: UPS battery charge low

Application Alerts

  • 4xx/5xx Error Rates: HTTP error rate threshold exceeded

Email Monitoring Alerts

  • EmailRoundtripFailing: E2E email probe returning failure for >30m
  • EmailRoundtripStale: No successful email round-trip in >80m (60m threshold + for:20m)
  • EmailRoundtripNeverRun: Email probe has never reported (40m)

Registry Integrity Alerts

  • RegistryManifestIntegrityFailure: Private registry serving 404 for manifests it advertises (orphan OCI-index children) — fires after 30m of registry_manifest_integrity_failures > 0. Remediation: rebuild affected image per docs/runbooks/registry-rebuild-image.md.
  • RegistryIntegrityProbeStale: Probe hasn't reported in >1h (CronJob broken)
  • RegistryCatalogInaccessible: Probe cannot fetch /v2/_catalog (auth failure or registry down)
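The check behind RegistryManifestIntegrityFailure can be sketched as a walk over the registry HTTP API. The real probe is described as Alpine + curl/jq, so this Python version is only a model of the same idea. It covers tagged manifests only (not OCI-index children), and the two injected callables `get_json` and `head_status` are hypothetical stand-ins for the HTTP client.

```python
from typing import Callable


def count_manifest_failures(
    get_json: Callable[[str], dict],
    head_status: Callable[[str], int],
) -> int:
    """Count advertised tags whose manifest does not answer HEAD with 200."""
    failures = 0
    # /v2/_catalog lists repositories, /v2/<repo>/tags/list lists their tags.
    for repo in get_json("/v2/_catalog").get("repositories", []):
        tags = get_json(f"/v2/{repo}/tags/list").get("tags") or []
        for tag in tags:
            if head_status(f"/v2/{repo}/manifests/{tag}") != 200:
                failures += 1
    return failures
```

Injecting the HTTP functions keeps the walk testable without a live registry; a real run would wrap an authenticated client around the registry base URL.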

The email monitoring system uses a CronJob (email-roundtrip-monitor, every 10 min) in the mailserver namespace that:

  1. Sends a test email via Mailgun HTTP API to smoke-test@viktorbarzin.me
  2. Email lands in the spam@ catch-all mailbox via MX delivery
  3. Verifies delivery via IMAP (searches by UUID marker in subject)
  4. Deletes the test email immediately
  5. Pushes metrics (email_roundtrip_success, email_roundtrip_duration_seconds, email_roundtrip_last_success_timestamp) to Prometheus Pushgateway
  6. Pushes status to Uptime Kuma E2E Push monitor
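Step 5 can be illustrated by rendering the three metrics in the Prometheus text exposition format that Pushgateway accepts. This is a sketch under assumptions: the metric names come from the list above, but the function name and the choice to omit the timestamp on failure are illustrative, not taken from the probe's source.

```python
def render_pushgateway_body(success: bool, duration_seconds: float, now: float) -> str:
    """Render probe results as text exposition lines for a Pushgateway push."""
    lines = [
        f"email_roundtrip_success {1 if success else 0}",
        f"email_roundtrip_duration_seconds {duration_seconds}",
    ]
    if success:
        # Assumption: the timestamp is only refreshed on success, so a
        # staleness alert can compare it against the current time.
        lines.append(f"email_roundtrip_last_success_timestamp {now}")
    return "\n".join(lines) + "\n"
```

If the body is sent with POST rather than PUT, Pushgateway replaces only metrics with the same names, so a failed run that omits the timestamp leaves the previous success timestamp in place.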

Uptime Kuma monitors: TCP SMTP (port 25) on 176.12.22.76 (external), IMAP (port 993) on 10.0.20.202, and Dovecot exporter metrics on port 9166.

Security Alerts (Wave 1 — planned, beads code-8ywc)

Routed via Loki ruler → Alertmanager → #security Slack receiver. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (job=kube-audit), Vault audit log (job=vault-audit), PVE sshd journald (job=sshd-pve), Calico flow logs (job=calico-flow, W1.6 only).

| # | Source | Event | Severity |
| --- | --- | --- | --- |
| K2 | kube-audit | SA token used from outside cluster | critical |
| K3 | kube-audit | Secret read in vault/sealed-secrets/external-secrets by non-allowlisted SA | critical |
| K4 | kube-audit | Exec into vault/kube-system/dbaas/cnpg-system pod by non-allowlisted user | warning |
| K5 | kube-audit | Mass delete (>5 Pod/Secret/CM in 60s) | critical |
| K6 | kube-audit | Audit policy itself modified | critical |
| K7 | kube-audit | New *,* ClusterRole created | warning |
| K8 | kube-audit | Anonymous binding granted | critical |
| K9 | kube-audit | me@viktorbarzin.me request from non-allowlist sourceIP | critical |
| V1 | vault-audit | Root token created | critical |
| V2 | vault-audit | Audit device disabled/modified | critical |
| V3 | vault-audit | Seal status changed | critical |
| V4 | vault-audit | Policy written/modified (allowlist Terraform actor) | warning |
| V5 | vault-audit | Auth failure spike >10/min | warning |
| V6 | vault-audit | Token with policies different from parent created | critical |
| V7 | vault-audit | Viktor's entity_id from non-allowlist remote_addr (requires x_forwarded_for_authorized_addrs) | critical |
| S1 | sshd-pve | sshd auth success from non-allowlist IP | critical |

K1 (cluster-admin grant) intentionally skipped — see security.md.
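The K5 threshold (more than 5 deletes in 60s by a single user) is a sliding-window count. The actual rule runs in the Loki ruler; the Python below is only a model of the same logic for reasoning about edge cases, and the `(timestamp, user)` event shape is an assumption.

```python
from collections import defaultdict, deque


def mass_delete_users(events, threshold=5, window_seconds=60):
    """Return users with more than `threshold` deletes inside any rolling window.

    `events` is an iterable of (timestamp_seconds, user) pairs sorted by time.
    """
    recent = defaultdict(deque)  # user -> timestamps still inside the window
    flagged = set()
    for ts, user in events:
        window = recent[user]
        window.append(ts)
        # Drop timestamps that have aged out of the rolling window.
        while window and ts - window[0] >= window_seconds:
            window.popleft()
        if len(window) > threshold:
            flagged.add(user)
    return flagged
```

Six deletes within a few seconds trip the check, while the same six spread over two minutes do not.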

Allowlist source-IP CIDRs (used by K2, K9, V7, S1): 10.0.20.0/22, 192.168.1.0/24, K8s pod CIDR, K8s service CIDR, Headscale tailnet. Policy: no public-IP access; all admin paths transit LAN or Headscale.
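The allowlist check can be expressed with Python's `ipaddress` module. The first two CIDRs are listed above; the pod and service CIDRs (10.10.0.0/16, 10.96.0.0/12) are taken from the accompanying commit message, and 100.64.0.0/10 is an assumed reading of its "100.64-127 tailnet" range. The deployed rules use source-IP regex matching in Loki, so this is a model for checking addresses by hand, not the rule itself.

```python
import ipaddress

# Assumption: CIDR values are combined from the doc and the commit message.
ALLOWLIST = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.20.0/22",    # LAN
        "192.168.1.0/24",  # LAN
        "10.10.0.0/16",    # K8s pod CIDR
        "10.96.0.0/12",    # K8s service CIDR
        "100.64.0.0/10",   # Headscale tailnet (assumed from "100.64-127")
    )
]


def is_allowlisted(source_ip: str) -> bool:
    """True if the source IP falls inside any allowlisted network."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in ALLOWLIST)
```

Any address for which `is_allowlisted` returns `False` is one that K2, K9, V7, or S1 would treat as unexpected.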

Disk write impact: estimated ~1-2 GB/day of additional writes after custom audit-policy tuning. Retention: 90d for security streams.

Backup Alerts

  • PostgreSQLBackupStale: >36h since last backup
  • MySQLBackupStale: >36h since last backup
  • EtcdBackupStale: >8d since last backup
  • VaultBackupStale: >8d since last backup
  • VaultwardenBackupStale: >8d since last backup
  • RedisBackupStale: >8d since last backup
  • PrometheusBackupStale: >32d since last backup
  • VaultwardenIntegrityFail: Backup integrity check failed
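The staleness thresholds above reduce to comparing "now minus last success" against a per-backup maximum age. The sketch below mirrors those thresholds in Python; keying the input by alert name and the function name are assumptions made to keep the example short, and the real alerts are Prometheus rules.

```python
HOUR = 3600
DAY = 24 * HOUR

# Thresholds copied from the alert list above.
STALE_AFTER_SECONDS = {
    "PostgreSQLBackupStale": 36 * HOUR,
    "MySQLBackupStale": 36 * HOUR,
    "EtcdBackupStale": 8 * DAY,
    "VaultBackupStale": 8 * DAY,
    "VaultwardenBackupStale": 8 * DAY,
    "RedisBackupStale": 8 * DAY,
    "PrometheusBackupStale": 32 * DAY,
}


def stale_alerts(last_success: dict[str, float], now: float) -> list[str]:
    """Return alert names whose last successful backup is older than allowed."""
    return sorted(
        alert
        for alert, max_age in STALE_AFTER_SECONDS.items()
        # Backups that never reported are skipped here; they need their own check.
        if alert in last_success and now - last_success[alert] > max_age
    )
```

Feeding it the last-success timestamps read from Prometheus shows which backups would be alerting at a given moment.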

Vault Paths

No direct Vault integration required for the monitoring stack (platform stack cannot depend on Vault due to circular dependency).

Decisions & Rationale

Why Prometheus over alternatives (InfluxDB, Graphite)?

  • Native Kubernetes integration via ServiceMonitor CRDs
  • Pull-based model reduces application complexity (no push agents)
  • Powerful query language (PromQL) for alerting and visualization
  • Industry standard for cloud-native monitoring

Why Grafana over Prometheus UI?

  • Superior visualization capabilities
  • OIDC authentication via Authentik for secure access
  • Multi-data-source support (Prometheus + Loki)
  • Rich dashboard ecosystem

Why Loki for logs?

  • Designed for Kubernetes log aggregation
  • Cost-effective (indexes metadata, not full log content)
  • Tight Grafana integration
  • LogQL query language similar to PromQL

Why Uptime Kuma?

  • Simple HTTP/TCP/Ping monitoring
  • Public status page for service availability
  • Lightweight compared to full APM solutions
  • Complements Prometheus for black-box monitoring

Why alert inhibition?

  • Prevents alert fatigue during cascading failures
  • Root cause focus (fix the node, not 50 pods)
  • Reduces on-call noise

Why exclude completed CronJob pods?

  • CronJobs naturally transition to Completed state
  • "Pod not ready" is expected and not actionable
  • Prevents false positive alerts

Why disable Diun for databases?

  • Version upgrades require migration planning
  • Breaking schema changes need coordination
  • Manual upgrade testing prevents production issues

Troubleshooting

Alert is firing but I don't see the issue

Check inhibition rules in Alertmanager. The alert may be suppressed due to a higher-level failure (e.g., node down suppressing pod alerts).

Grafana dashboards show no data

  1. Check Prometheus targets: run kubectl port-forward -n monitoring svc/prometheus 9090:9090, then open http://localhost:9090/targets
  2. Verify ServiceMonitor is created: kubectl get servicemonitor -A
  3. Check Prometheus logs for scrape errors: kubectl logs -n monitoring deployment/prometheus
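With the port-forward from step 1 active, the same information is available from the Prometheus HTTP API at /api/v1/targets. The helper below is a small sketch that lists targets whose health is not "up"; the function names are illustrative and the default base URL assumes the port-forward shown above.

```python
import json
import urllib.request


def unhealthy_targets(payload: dict) -> list[tuple[str, str, str]]:
    """Return (job, instance, lastError) for every active target that is not up."""
    rows = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            labels = target.get("labels", {})
            rows.append(
                (labels.get("job", ""), labels.get("instance", ""), target.get("lastError", ""))
            )
    return sorted(rows)


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the targets payload; assumes the port-forward is already running."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as response:
        return json.load(response)
```

Calling `unhealthy_targets(fetch_targets())` gives a quick list of failing scrapes and their last error without opening the UI.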

Loki logs not appearing

  1. Verify pod logs are going to stdout/stderr (not files)
  2. Check that Loki is receiving pod logs (Alloy ships them; Loki itself does not scrape): kubectl logs -n monitoring deployment/loki
  3. Ensure Grafana data source is configured correctly

Backup alert firing but backup exists

  1. Check backup timestamp in Prometheus: backup_last_success_timestamp_seconds{job="my-backup"}
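  2. Verify backup job completed successfully: list recent Jobs with kubectl get jobs -n backups, then read one with kubectl logs -n backups job/<job-name> (kubectl logs does not accept a cronjob/ reference, so target a Job the CronJob created)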
  3. Ensure backup job updates the Prometheus metric via pushgateway or ServiceMonitor

GPU metrics not showing

  1. Verify dcgm-exporter is running: kubectl get pods -n monitoring -l app=dcgm-exporter
  2. Check GPU node has NVIDIA drivers installed
  3. Verify dcgm-exporter has access to GPU: kubectl logs -n monitoring deployment/dcgm-exporter

Uptime Kuma monitor shows down but service is healthy

  1. Check network policies aren't blocking Uptime Kuma's pod
  2. Verify service endpoint is reachable from Uptime Kuma namespace
  3. Check Uptime Kuma logs: kubectl logs -n monitoring deployment/uptime-kuma