infra/docs/plans/2026-03-02-security-observability-design.md
Viktor Barzin db7ea58d5c [ci skip] add security observability layer design document
Tetragon-centric approach: eBPF runtime security, pfSense syslog
collection, CoreDNS query logging, Calico NetworkPolicies,
on-demand mitmproxy, unified Grafana security dashboard.
~625MB steady-state, <5GB budget.
2026-03-02 21:13:01 +00:00


Security Observability Layer — Design Document

Date: 2026-03-02 Status: Approved Approach: Tetragon-Centric (Approach A)

Problem Statement

The cluster has strong perimeter security (CrowdSec, Traefik middleware chain, Cloudflare WAF) and good monitoring (Prometheus, Loki, Grafana), but lacks:

  • Runtime security monitoring (syscall-level container activity)
  • Egress visibility (what pods connect to externally)
  • HTTPS inspection capability (even on-demand)
  • Network segmentation (no NetworkPolicies — any pod can reach any pod)
  • Firewall log centralization (pfSense logs not in Loki)
  • Unified security dashboard

Requirements

  • Threat model: Defense in depth — external attacks, compromised containers, lateral movement, data exfiltration
  • TLS inspection: Connection metadata (SNI/IP/bytes) by default, selective deep inspection on-demand
  • Alerting: Slack (existing channel)
  • Resource budget: <5GB RAM total for new tooling
  • Enforcement: Observe & alert now, enforce later
  • CNI: Calico (confirmed, with GlobalNetworkPolicy CRD support)

Architecture

                    ┌─────────────────────────────────────────────────┐
                    │              Existing Stack                      │
                    │  Prometheus ← scrape ← Tetragon metrics         │
                    │  Loki ← Alloy ← Tetragon event logs            │
                    │                   ← pfSense syslog              │
                    │                   ← CoreDNS query logs          │
                    │  Grafana ← Unified Security Dashboard           │
                    │  Alertmanager → Slack                           │
                    └─────────────────────────────────────────────────┘

    ┌─────────────┐   ┌──────────────────┐   ┌─────────────────────┐
    │  Tetragon   │   │ Kyverno Policy   │   │  mitmproxy          │
    │ (DaemonSet) │   │ Reporter (1 pod) │   │ (on-demand, 1 pod)  │
    │ eBPF agent  │   │                  │   │ HTTPS inspection    │
    │ per node    │   │ Violations →     │   │ for suspect pods    │
    │             │   │ Prometheus +     │   │                     │
    │ Monitors:   │   │ Grafana          │   │ Transparent proxy   │
    │ • processes │   │                  │   │ via NetworkPolicy   │
    │ • network   │   └──────────────────┘   └─────────────────────┘
    │ • files     │
    │ • syscalls  │   ┌──────────────────┐
    │             │   │ Inspektor Gadget │
    └─────────────┘   │ (temporary)      │
                      │ Auto-generate    │
                      │ NetworkPolicies  │
                      │ from observed    │
                      │ traffic baseline │
                      └──────────────────┘

    ┌────────────────────────────────────────────────┐
    │           Calico NetworkPolicies               │
    │  (Generated from baseline, enforced gradually) │
    │  Default deny egress + allow known connections │
    └────────────────────────────────────────────────┘

Data Flows

  1. Tetragon → Prometheus (metrics) + stdout → Alloy → Loki (events)
  2. pfSense → syslog UDP → Alloy syslog receiver → Loki
  3. CoreDNS (log plugin enabled) → stdout → Alloy → Loki
  4. Kyverno Policy Reporter → Prometheus (violation metrics)
  5. Grafana ← queries all sources → Unified Security Dashboard
  6. Alertmanager → Slack (security-specific alert rules)

Component Details

1. Tetragon (Runtime Security + Network Visibility)

Purpose: eBPF-based kernel-level monitoring of process execution, network connections, file access, and privilege escalation.

Deployment:

  • Helm chart: cilium/tetragon (CNCF project, part of Cilium ecosystem)
  • Type: DaemonSet on all 5 nodes
  • Resources: ~80-120MB RAM/node, ~50m CPU idle
  • Tier: 1-cluster
  • Namespace: tetragon
  • New stack: stacks/tetragon/

TracingPolicy CRDs (what to monitor):

| Policy | Detects | Severity |
| --- | --- | --- |
| Privilege escalation | setuid(0), setgid(0), dangerous capabilities | Critical |
| Reverse shell | Shell process with outbound connection to external IP | Critical |
| Crypto miner | Connections to mining pool ports (3333, 14444, etc.) | Warning |
| Container escape | mount syscalls, /proc/self/ns/* access, nsenter | Critical |
| Sensitive file access | Reads of /etc/shadow, K8s service account tokens | Warning |
| Unexpected egress | Outbound connections to non-private IPs (log all) | Info |
| Unexpected binaries | Shells spawning in non-shell containers | Warning |

Observe → Enforce path:

  • Start: TracingPolicy selectors with no enforcement action (observe + alert only)
  • Later: add a SIGKILL matchAction to the same policies so Tetragon can kill offending processes
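
As an illustration of the observe-only starting point, a policy along these lines could flag reads of /etc/shadow. This is a sketch modeled on the upstream Tetragon file-monitoring example; the policy name is a placeholder and field details should be checked against the deployed Tetragon version:

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: sensitive-file-read          # placeholder name
spec:
  kprobes:
    - call: "security_file_permission"   # LSM hook covering file reads/writes
      syscall: false
      args:
        - index: 0
          type: "file"                   # the file being accessed
        - index: 1
          type: "int"                    # requested access mask
      selectors:
        - matchArgs:
            - index: 0
              operator: "Equal"
              values:
                - "/etc/shadow"
```

Because no matchActions are set, Tetragon only emits events for matching accesses; flipping to enforcement later is a policy change, not a redeployment.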

Integration:

  • Prometheus metrics via pod annotations (auto-scraped by existing kubernetes-pods job)
  • Events as JSON to stdout → Alloy → Loki
  • New Prometheus alert rules for critical Tetragon events

2. pfSense Log Collection

Purpose: Centralize firewall logs into Loki for correlation with cluster security events.

Implementation:

  • Deploy a small syslog-receiver Deployment (1 replica) with a MetalLB LoadBalancer IP
  • Forward received syslog to Loki via loki.write
  • OR add loki.source.syslog to existing Alloy config
  • Configure pfSense: Status → System Logs → Settings → Remote Logging → point to syslog receiver IP:1514

Recommended approach: Dedicated syslog receiver Deployment (not Alloy DaemonSet) because:

  • Stable LoadBalancer IP for pfSense to target
  • Doesn't couple to a specific node
  • Can parse filterlog CSV format independently

Parse pfSense filterlog: Extract interface, action (pass/block), direction, source IP, dest IP, protocol, port into Loki labels.

Resource cost: ~50-100MB for the syslog receiver pod.
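
The Service side of the recommended approach could be sketched as below, assuming MetalLB ≥0.13 (which supports the `loadBalancerIPs` annotation); the namespace, labels, and IP address are hypothetical placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: syslog-receiver
  namespace: monitoring                                  # placeholder namespace
  annotations:
    metallb.universe.tf/loadBalancerIPs: "10.0.20.50"    # stable address for pfSense to target
spec:
  type: LoadBalancer
  selector:
    app: syslog-receiver                                 # must match the Deployment's pod labels
  ports:
    - name: syslog-udp
      port: 1514
      protocol: UDP
      targetPort: 1514
```

The fixed LoadBalancer IP is the whole point: pfSense's remote-logging target never changes even when the receiver pod is rescheduled to another node.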

3. CoreDNS Query Logging

Purpose: Detect DNS tunneling, C2 callbacks, unusual domain lookups.

Implementation: Uncomment the #log directive in the CoreDNS ConfigMap (stacks/platform/modules/technitium/main.tf).

Scope: Only enable on the main zone (.), NOT the viktorbarzin.lan zone (Technitium already logs those to MySQL).

Alert rules for Loki:

  • High NX domain rate from a single pod
  • DNS tunneling signatures (subdomain labels >40 chars)
  • Queries to known malicious TLDs

Resource cost: 0 additional (just increased log volume in Loki).
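
The first alert could be expressed as a Loki ruler rule roughly like the following. The `job` label and the threshold are assumptions and must match how Alloy labels the CoreDNS stream; per-client attribution would additionally require extracting a client label from the log line:

```yaml
groups:
  - name: dns-security
    rules:
      - alert: SuspiciousDNSQuery
        # Fires on a sustained burst of NXDOMAIN answers, a common
        # signature of C2 domain-generation algorithms.
        expr: |
          sum(count_over_time({job="coredns"} |= "NXDOMAIN" [5m])) > 50
        for: 5m
        labels:
          severity: warning
```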

4. NetworkPolicy Strategy (Calico)

Purpose: Restrict pod-to-pod and pod-to-external traffic using Calico NetworkPolicies.

Phased rollout:

| Phase | Action | Timeline |
| --- | --- | --- |
| Observe | Deploy Inspektor Gadget, capture 24-48h traffic baseline | Week 1 |
| Generate | kubectl gadget advise network-policy per namespace | Week 1 |
| Review | Convert to Terraform kubernetes_network_policy resources | Week 2 |
| Enforce (low-risk) | Apply to aux-tier namespaces first | Week 3 |
| Enforce (all) | Gradually apply to edge, cluster, core tiers | Week 4+ |

Key policies:

  • Default deny egress for aux-tier namespaces
  • Allow DNS (port 53) + known external endpoints per service
  • Block inter-namespace traffic except known dependencies (redis, postgresql, loki)
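
The first two bullets combine into a single standard NetworkPolicy along these lines (a sketch; the target namespace is a placeholder, and per-service external endpoints would be added as further egress rules):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress-allow-dns
  namespace: some-aux-namespace        # placeholder aux-tier namespace
spec:
  podSelector: {}                      # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    # Only DNS is allowed; all other egress is implicitly denied.
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Per the Review phase above, policies like this would ultimately live as kubernetes_network_policy resources in Terraform rather than raw manifests.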

Inspektor Gadget:

  • CNCF Sandbox project, ~80MB/node as DaemonSet
  • Temporary deployment — remove after baseline capture (~400MB total while running)
  • kubectl gadget advise network-policy auto-generates policies from observed traffic

Resource cost: 0 permanent (Calico already enforces). ~400MB temporary.

5. mitmproxy (On-Demand HTTPS Inspection)

Purpose: Deep HTTPS traffic inspection for specific suspicious pods during incident investigation.

Deployment:

  • Single-replica Deployment, scaled to 0 by default
  • Namespace: mitmproxy
  • New stack: stacks/mitmproxy/
  • Web UI at mitmproxy.viktorbarzin.lan (local-only access)

Usage workflow:

  1. Scale to 1: kubectl scale deployment mitmproxy --replicas=1 -n mitmproxy
  2. Apply a Calico NetworkPolicy restricting the suspect pod's egress to mitmproxy, so all outbound traffic must flow through the proxy
  3. Mount mitmproxy CA cert into target pod's trust store
  4. Inspect traffic via web UI
  5. Scale back to 0 when done

Resource cost: ~200MB when active, 0 when scaled to 0.
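
Step 2 of the workflow can be sketched as an egress policy that leaves the suspect pod no path except DNS and the mitmproxy namespace. The namespace and label names here are hypothetical; note that a NetworkPolicy constrains rather than transparently redirects traffic, so the pod still has to be pointed at the proxy (e.g. via HTTP_PROXY variables or node-level DNAT):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-via-mitmproxy
  namespace: suspect-namespace          # placeholder: namespace of the pod under investigation
spec:
  podSelector:
    matchLabels:
      investigation: "active"           # hypothetical label applied to the suspect pod
  policyTypes:
    - Egress
  egress:
    # Keep DNS working so the pod can still resolve names
    - ports:
        - protocol: UDP
          port: 53
    # Everything else is reachable only via the mitmproxy namespace
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: mitmproxy
```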

6. Kyverno Policy Reporter

Purpose: Surface Kyverno policy violations (currently in audit mode) in Grafana dashboards.

Deployment:

  • Add as sub-chart or separate Helm release in Kyverno stack
  • 1 replica Deployment
  • Exports metrics to Prometheus
  • ~50MB RAM

Integration:

  • Prometheus scrapes Policy Reporter metrics
  • Grafana dashboard shows violations by policy, namespace, severity

7. Unified Security Dashboard + Alert Rules

Grafana Dashboard layout:

| Row | Panels | Data Source |
| --- | --- | --- |
| Overview | Active CrowdSec bans, Tetragon alerts/24h, Kyverno violations/24h, pfSense blocks/24h | Prometheus |
| Attack Timeline | Combined time series of all security events | Prometheus |
| Runtime Security | Suspicious processes, privilege escalations, file access alerts | Loki (Tetragon) |
| Network | Top egress destinations by namespace, unusual DNS queries, pfSense blocks | Loki + Prometheus |
| Policy | Kyverno violations by policy/namespace/severity | Prometheus (Policy Reporter) |

New Prometheus Alert Rules:

| Alert | Trigger | Severity |
| --- | --- | --- |
| TetragonPrivilegeEscalation | setuid(0) in non-system container | Critical |
| TetragonReverseShell | Shell + outbound connection | Critical |
| TetragonCryptoMiner | Connection to mining pool ports | Warning |
| TetragonUnexpectedEgress | Pod → unexpected external IP | Warning |
| SuspiciousDNSQuery | High NX rate or long subdomains | Warning |
| PfSenseHighBlockRate | >100 blocks/min from single source | Warning |
| KyvernoViolationSpike | >10 violations in 5 minutes | Warning |
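
As a shape for these rules, the first could look roughly like the following. The metric and label names are assumptions: Tetragon exposes event counters (e.g. tetragon_events_total), but the exact labels depend on the deployed version and on how the TracingPolicies are named, so the expression must be validated against the live /metrics endpoint:

```yaml
groups:
  - name: security-tetragon
    rules:
      - alert: TetragonPrivilegeEscalation
        # Any matched privilege-escalation event in the last 5m fires the alert.
        expr: increase(tetragon_events_total{policy="privilege-escalation"}[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "setuid(0)/setgid(0) observed in a non-system container"
```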

Resource Budget

| Component | Type | Steady-State RAM | Notes |
| --- | --- | --- | --- |
| Tetragon | DaemonSet (5 nodes) | ~500MB | Runtime security + egress |
| Syslog receiver | Deployment (1) | ~75MB | pfSense logs |
| Kyverno Policy Reporter | Deployment (1) | ~50MB | Violation metrics |
| mitmproxy | Deployment (0/1) | 0 (200MB active) | On-demand only |
| CoreDNS logging | Config change | 0 | More Loki volume |
| Inspektor Gadget | Temporary DaemonSet | 0 (~400MB while running) | Removed after baseline |
| **Total steady-state** | | **~625MB** | Well under 5GB budget |

Implementation Phases

Phase 1: Core Observability (~625MB)

  1. Deploy Tetragon with TracingPolicy CRDs
  2. Enable CoreDNS query logging
  3. Deploy Kyverno Policy Reporter
  4. Add Prometheus alert rules for Tetragon events

Phase 2: Log Centralization (+0MB permanent)

  1. Deploy syslog receiver for pfSense logs
  2. Configure pfSense remote syslog
  3. Build unified Grafana security dashboard

Phase 3: Network Segmentation (+0MB permanent, ~400MB temporary)

  1. Deploy Inspektor Gadget temporarily
  2. Capture 24-48h traffic baseline
  3. Generate and review NetworkPolicies
  4. Apply policies gradually (aux → edge → cluster → core)
  5. Remove Inspektor Gadget

Phase 4: On-Demand Inspection (+0MB permanent)

  1. Deploy mitmproxy (scaled to 0)
  2. Document investigation workflow

New Terraform Stacks

  • stacks/tetragon/ — Helm chart + TracingPolicy CRDs + Prometheus rules
  • stacks/mitmproxy/ — On-demand HTTPS inspection proxy

Modified Stacks

  • stacks/platform/modules/monitoring/ — Alloy syslog or syslog receiver, Grafana dashboard, alert rules
  • stacks/platform/modules/technitium/ — CoreDNS log uncomment
  • stacks/platform/modules/kyverno/ — Policy Reporter sub-chart

Existing Stack (No Changes Needed)

  • CrowdSec (IDS/IPS with Traefik bouncer) — already covers external attack detection
  • Prometheus + Alertmanager — alert routing infrastructure ready
  • Loki + Alloy — log pipeline ready, just needs new sources
  • Caretta — eBPF service map complements Tetragon's process-level view
  • GoFlow2 — NetFlow data complements Tetragon's connection tracking
  • Calico — CNI with full NetworkPolicy enforcement ready