infra/docs/plans/2026-03-02-security-observability-design.md
Viktor Barzin db7ea58d5c [ci skip] add security observability layer design document
Tetragon-centric approach: eBPF runtime security, pfSense syslog
collection, CoreDNS query logging, Calico NetworkPolicies,
on-demand mitmproxy, unified Grafana security dashboard.
~625MB steady-state, <5GB budget.
2026-03-02 21:13:01 +00:00


Security Observability Layer — Design Document

Date: 2026-03-02 Status: Approved Approach: Tetragon-Centric (Approach A)

Problem Statement

The cluster has strong perimeter security (CrowdSec, Traefik middleware chain, Cloudflare WAF) and good monitoring (Prometheus, Loki, Grafana), but lacks:

  • Runtime security monitoring (syscall-level container activity)
  • Egress visibility (what pods connect to externally)
  • HTTPS inspection capability (even on-demand)
  • Network segmentation (no NetworkPolicies — any pod can reach any pod)
  • Firewall log centralization (pfSense logs not in Loki)
  • Unified security dashboard

Requirements

  • Threat model: Defense in depth — external attacks, compromised containers, lateral movement, data exfiltration
  • TLS inspection: Connection metadata (SNI/IP/bytes) by default, selective deep inspection on-demand
  • Alerting: Slack (existing channel)
  • Resource budget: <5GB RAM total for new tooling
  • Enforcement: Observe & alert now, enforce later
  • CNI: Calico (confirmed, with GlobalNetworkPolicy CRD support)

Architecture

                    ┌─────────────────────────────────────────────────┐
                    │              Existing Stack                      │
                    │  Prometheus ← scrape ← Tetragon metrics         │
                    │  Loki ← Alloy ← Tetragon event logs            │
                    │                   ← pfSense syslog              │
                    │                   ← CoreDNS query logs          │
                    │  Grafana ← Unified Security Dashboard           │
                    │  Alertmanager → Slack                           │
                    └─────────────────────────────────────────────────┘

    ┌─────────────┐   ┌──────────────────┐   ┌─────────────────────┐
    │  Tetragon   │   │ Kyverno Policy   │   │  mitmproxy          │
    │ (DaemonSet) │   │ Reporter (1 pod) │   │ (on-demand, 1 pod)  │
    │ eBPF agent  │   │                  │   │ HTTPS inspection    │
    │ per node    │   │ Violations →     │   │ for suspect pods    │
    │             │   │ Prometheus +     │   │                     │
    │ Monitors:   │   │ Grafana          │   │ Transparent proxy   │
    │ • processes │   │                  │   │ via NetworkPolicy   │
    │ • network   │   └──────────────────┘   └─────────────────────┘
    │ • files     │
    │ • syscalls  │   ┌──────────────────┐
    │             │   │ Inspektor Gadget │
    └─────────────┘   │ (temporary)      │
                      │ Auto-generate    │
                      │ NetworkPolicies  │
                      │ from observed    │
                      │ traffic baseline │
                      └──────────────────┘

    ┌────────────────────────────────────────────────┐
    │           Calico NetworkPolicies               │
    │  (Generated from baseline, enforced gradually) │
    │  Default deny egress + allow known connections │
    └────────────────────────────────────────────────┘

Data Flows

  1. Tetragon → Prometheus (metrics) + stdout → Alloy → Loki (events)
  2. pfSense → syslog UDP → Alloy syslog receiver → Loki
  3. CoreDNS (log plugin enabled) → stdout → Alloy → Loki
  4. Kyverno Policy Reporter → Prometheus (violation metrics)
  5. Grafana ← queries all sources → Unified Security Dashboard
  6. Alertmanager → Slack (security-specific alert rules)

Component Details

1. Tetragon (Runtime Security + Network Visibility)

Purpose: eBPF-based kernel-level monitoring of process execution, network connections, file access, and privilege escalation.

Deployment:

  • Helm chart: cilium/tetragon (CNCF project, part of Cilium ecosystem)
  • Type: DaemonSet on all 5 nodes
  • Resources: ~80-120MB RAM/node, ~50m CPU idle
  • Tier: 1-cluster
  • Namespace: tetragon
  • New stack: stacks/tetragon/

TracingPolicy CRDs (what to monitor):

| Policy | Detects | Severity |
| --- | --- | --- |
| Privilege escalation | setuid(0), setgid(0), dangerous capabilities | Critical |
| Reverse shell | Shell process with outbound connection to external IP | Critical |
| Crypto miner | Connections to mining pool ports (3333, 14444, etc.) | Warning |
| Container escape | mount syscalls, /proc/self/ns/* access, nsenter | Critical |
| Sensitive file access | Reads of /etc/shadow, K8s service account tokens | Warning |
| Unexpected egress | Outbound connections to non-private IPs (log all) | Info |
| Unexpected binaries | Shells spawning in non-shell containers | Warning |

Observe → Enforce path:

  • Start: TracingPolicy selectors with no enforcement action (observe + alert only)
  • Later: add a SIGKILL matchAction to the same policies so Tetragon can kill offending processes
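
As an illustration of the observe-only starting point, a policy along these lines could flag reads of /etc/shadow. This is a sketch modeled on the upstream Tetragon file-monitoring example; the policy name is a placeholder and field details should be checked against the deployed Tetragon version:

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: sensitive-file-read          # placeholder name
spec:
  kprobes:
    - call: "security_file_permission"   # LSM hook covering file reads/writes
      syscall: false
      args:
        - index: 0
          type: "file"                   # the file being accessed
        - index: 1
          type: "int"                    # requested access mask
      selectors:
        - matchArgs:
            - index: 0
              operator: "Equal"
              values:
                - "/etc/shadow"
```

Because no matchActions are set, Tetragon only emits events for matching accesses; flipping to enforcement later is a policy change, not a redeployment.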

Integration:

  • Prometheus metrics via pod annotations (auto-scraped by existing kubernetes-pods job)
  • Events as JSON to stdout → Alloy → Loki
  • New Prometheus alert rules for critical Tetragon events

2. pfSense Log Collection

Purpose: Centralize firewall logs into Loki for correlation with cluster security events.

Implementation:

  • Deploy a small syslog-receiver Deployment (1 replica) with a MetalLB LoadBalancer IP
  • Forward received syslog to Loki via loki.write
  • OR add loki.source.syslog to existing Alloy config
  • Configure pfSense: Status → System Logs → Settings → Remote Logging → point to syslog receiver IP:1514

Recommended approach: Dedicated syslog receiver Deployment (not Alloy DaemonSet) because:

  • Stable LoadBalancer IP for pfSense to target
  • Doesn't couple to a specific node
  • Can parse filterlog CSV format independently

Parse pfSense filterlog: Extract interface, action (pass/block), direction, source IP, dest IP, protocol, port into Loki labels.

Resource cost: ~50-100MB for the syslog receiver pod.
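
The Service side of the recommended approach could be sketched as below, assuming MetalLB ≥0.13 (which supports the `loadBalancerIPs` annotation); the namespace, labels, and IP address are hypothetical placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: syslog-receiver
  namespace: monitoring                                  # placeholder namespace
  annotations:
    metallb.universe.tf/loadBalancerIPs: "10.0.20.50"    # stable address for pfSense to target
spec:
  type: LoadBalancer
  selector:
    app: syslog-receiver                                 # must match the Deployment's pod labels
  ports:
    - name: syslog-udp
      port: 1514
      protocol: UDP
      targetPort: 1514
```

The fixed LoadBalancer IP is the whole point: pfSense's remote-logging target never changes even when the receiver pod is rescheduled to another node.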

3. CoreDNS Query Logging

Purpose: Detect DNS tunneling, C2 callbacks, unusual domain lookups.

Implementation: Uncomment the #log directive in the CoreDNS ConfigMap (stacks/platform/modules/technitium/main.tf).

Scope: Only enable on the main zone (.), NOT the viktorbarzin.lan zone (Technitium already logs those to MySQL).

Alert rules for Loki:

  • High NX domain rate from a single pod
  • DNS tunneling signatures (subdomain labels >40 chars)
  • Queries to known malicious TLDs

Resource cost: 0 additional (just increased log volume in Loki).
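
The first alert could be expressed as a Loki ruler rule roughly like the following. The `job` label and the threshold are assumptions and must match how Alloy labels the CoreDNS stream; per-client attribution would additionally require extracting a client label from the log line:

```yaml
groups:
  - name: dns-security
    rules:
      - alert: SuspiciousDNSQuery
        # Fires on a sustained burst of NXDOMAIN answers, a common
        # signature of C2 domain-generation algorithms.
        expr: |
          sum(count_over_time({job="coredns"} |= "NXDOMAIN" [5m])) > 50
        for: 5m
        labels:
          severity: warning
```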

4. NetworkPolicy Strategy (Calico)

Purpose: Restrict pod-to-pod and pod-to-external traffic using Calico NetworkPolicies.

Phased rollout:

| Phase | Action | Timeline |
| --- | --- | --- |
| Observe | Deploy Inspektor Gadget, capture 24-48h traffic baseline | Week 1 |
| Generate | kubectl gadget advise network-policy per namespace | Week 1 |
| Review | Convert to Terraform kubernetes_network_policy resources | Week 2 |
| Enforce (low-risk) | Apply to aux-tier namespaces first | Week 3 |
| Enforce (all) | Gradually apply to edge, cluster, core tiers | Week 4+ |

Key policies:

  • Default deny egress for aux-tier namespaces
  • Allow DNS (port 53) + known external endpoints per service
  • Block inter-namespace traffic except known dependencies (redis, postgresql, loki)
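
The first two bullets combine into a single standard NetworkPolicy along these lines (a sketch; the target namespace is a placeholder, and per-service external endpoints would be added as further egress rules):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress-allow-dns
  namespace: some-aux-namespace        # placeholder aux-tier namespace
spec:
  podSelector: {}                      # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    # Only DNS is allowed; all other egress is implicitly denied.
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Per the Review phase above, policies like this would ultimately live as kubernetes_network_policy resources in Terraform rather than raw manifests.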

Inspektor Gadget:

  • CNCF Sandbox project, ~80MB/node as DaemonSet
  • Temporary deployment — remove after baseline capture (~400MB total while running)
  • kubectl gadget advise network-policy auto-generates policies from observed traffic

Resource cost: 0 permanent (Calico already enforces). ~400MB temporary.

5. mitmproxy (On-Demand HTTPS Inspection)

Purpose: Deep HTTPS traffic inspection for specific suspicious pods during incident investigation.

Deployment:

  • Single-replica Deployment, scaled to 0 by default
  • Namespace: mitmproxy
  • New stack: stacks/mitmproxy/
  • Web UI at mitmproxy.viktorbarzin.lan (local-only access)

Usage workflow:

  1. Scale to 1: kubectl scale deployment mitmproxy --replicas=1 -n mitmproxy
  2. Apply a Calico NetworkPolicy restricting the suspect pod's egress to mitmproxy, so all outbound traffic must flow through the proxy
  3. Mount mitmproxy CA cert into target pod's trust store
  4. Inspect traffic via web UI
  5. Scale back to 0 when done

Resource cost: ~200MB when active, 0 when scaled to 0.
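
Step 2 of the workflow can be sketched as an egress policy that leaves the suspect pod no path except DNS and the mitmproxy namespace. The namespace and label names here are hypothetical; note that a NetworkPolicy constrains rather than transparently redirects traffic, so the pod still has to be pointed at the proxy (e.g. via HTTP_PROXY variables or node-level DNAT):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-via-mitmproxy
  namespace: suspect-namespace          # placeholder: namespace of the pod under investigation
spec:
  podSelector:
    matchLabels:
      investigation: "active"           # hypothetical label applied to the suspect pod
  policyTypes:
    - Egress
  egress:
    # Keep DNS working so the pod can still resolve names
    - ports:
        - protocol: UDP
          port: 53
    # Everything else is reachable only via the mitmproxy namespace
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: mitmproxy
```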

6. Kyverno Policy Reporter

Purpose: Surface Kyverno policy violations (currently in audit mode) in Grafana dashboards.

Deployment:

  • Add as sub-chart or separate Helm release in Kyverno stack
  • 1 replica Deployment
  • Exports metrics to Prometheus
  • ~50MB RAM

Integration:

  • Prometheus scrapes Policy Reporter metrics
  • Grafana dashboard shows violations by policy, namespace, severity

7. Unified Security Dashboard + Alert Rules

Grafana Dashboard layout:

| Row | Panels | Data Source |
| --- | --- | --- |
| Overview | Active CrowdSec bans, Tetragon alerts/24h, Kyverno violations/24h, pfSense blocks/24h | Prometheus |
| Attack Timeline | Combined time series of all security events | Prometheus |
| Runtime Security | Suspicious processes, privilege escalations, file access alerts | Loki (Tetragon) |
| Network | Top egress destinations by namespace, unusual DNS queries, pfSense blocks | Loki + Prometheus |
| Policy | Kyverno violations by policy/namespace/severity | Prometheus (Policy Reporter) |

New Prometheus Alert Rules:

| Alert | Trigger | Severity |
| --- | --- | --- |
| TetragonPrivilegeEscalation | setuid(0) in non-system container | Critical |
| TetragonReverseShell | Shell + outbound connection | Critical |
| TetragonCryptoMiner | Connection to mining pool ports | Warning |
| TetragonUnexpectedEgress | Pod → unexpected external IP | Warning |
| SuspiciousDNSQuery | High NX rate or long subdomains | Warning |
| PfSenseHighBlockRate | >100 blocks/min from single source | Warning |
| KyvernoViolationSpike | >10 violations in 5 minutes | Warning |
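
As a shape for these rules, the first could look roughly like the following. The metric and label names are assumptions: Tetragon exposes event counters (e.g. tetragon_events_total), but the exact labels depend on the deployed version and on how the TracingPolicies are named, so the expression must be validated against the live /metrics endpoint:

```yaml
groups:
  - name: security-tetragon
    rules:
      - alert: TetragonPrivilegeEscalation
        # Any matched privilege-escalation event in the last 5m fires the alert.
        expr: increase(tetragon_events_total{policy="privilege-escalation"}[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "setuid(0)/setgid(0) observed in a non-system container"
```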

Resource Budget

| Component | Type | Steady-State RAM | Notes |
| --- | --- | --- | --- |
| Tetragon | DaemonSet (5 nodes) | ~500MB | Runtime security + egress |
| Syslog receiver | Deployment (1) | ~75MB | pfSense logs |
| Kyverno Policy Reporter | Deployment (1) | ~50MB | Violation metrics |
| mitmproxy | Deployment (0/1) | 0 (200MB active) | On-demand only |
| CoreDNS logging | Config change | 0 | More Loki volume |
| Inspektor Gadget | Temporary DaemonSet | 0 (~400MB while running) | Removed after baseline |
| **Total steady-state** | | **~625MB** | Well under 5GB budget |

Implementation Phases

Phase 1: Core Observability (~625MB)

  1. Deploy Tetragon with TracingPolicy CRDs
  2. Enable CoreDNS query logging
  3. Deploy Kyverno Policy Reporter
  4. Add Prometheus alert rules for Tetragon events

Phase 2: Log Centralization (+0MB permanent)

  1. Deploy syslog receiver for pfSense logs
  2. Configure pfSense remote syslog
  3. Build unified Grafana security dashboard

Phase 3: Network Segmentation (+0MB permanent, ~400MB temporary)

  1. Deploy Inspektor Gadget temporarily
  2. Capture 24-48h traffic baseline
  3. Generate and review NetworkPolicies
  4. Apply policies gradually (aux → edge → cluster → core)
  5. Remove Inspektor Gadget

Phase 4: On-Demand Inspection (+0MB permanent)

  1. Deploy mitmproxy (scaled to 0)
  2. Document investigation workflow

New Terraform Stacks

  • stacks/tetragon/ — Helm chart + TracingPolicy CRDs + Prometheus rules
  • stacks/mitmproxy/ — On-demand HTTPS inspection proxy

Modified Stacks

  • stacks/platform/modules/monitoring/ — Alloy syslog or syslog receiver, Grafana dashboard, alert rules
  • stacks/platform/modules/technitium/ — CoreDNS log uncomment
  • stacks/platform/modules/kyverno/ — Policy Reporter sub-chart

Existing Stack (No Changes Needed)

  • CrowdSec (IDS/IPS with Traefik bouncer) — already covers external attack detection
  • Prometheus + Alertmanager — alert routing infrastructure ready
  • Loki + Alloy — log pipeline ready, just needs new sources
  • Caretta — eBPF service map complements Tetragon's process-level view
  • GoFlow2 — NetFlow data complements Tetragon's connection tracking
  • Calico — CNI with full NetworkPolicy enforcement ready