infra/stacks/calico/main.tf

134 lines
5.2 KiB
Terraform

[infra] Partial Calico adoption: namespaces only (Wave 5b)

## Context

Wave 5b of the state-drift consolidation plan. Calico has run this cluster's pod networking since 2024-07-30, installed via raw kubectl manifests: tigera-operator Deployment + ~20 CRDs + an Installation CR. The plan flagged Calico as HIGH BLAST because the operator + Installation CR sit on the critical path for pod scheduling; any mistake during adoption can break CNI and block new pods cluster-wide within seconds.

This session takes the safe sub-step: adopt only the three namespaces. Namespaces are label containers; TF managing their names + PSA labels cannot disrupt Calico networking.

Getting the operator, Installation CR, and CRDs under TF requires dedicated prep (picking the right `ignore_changes` fields to absorb operator-generated defaults in the Installation CR, decoupling from the embedded PSA labels applied at admission, and a low-traffic window). Deferred to `code-3ad`.

## This change

New Tier 1 stack `stacks/calico/` adopting via import `{}` blocks (Wave 8 convention, commit 8a99be11):

- `kubernetes_namespace.calico_system` ← id `calico-system`
- `kubernetes_namespace.calico_apiserver` ← id `calico-apiserver`
- `kubernetes_namespace.tigera_operator` ← id `tigera-operator`

Apply: `3 imported, 0 added, 0 changed, 0 destroyed.` Followed by a second `tg plan` that returns `No changes`. Zero cluster impact: namespaces stayed exactly as they were cluster-side.

### terragrunt dependency choice

Deliberately no `dependency "platform"` clause. Calico is lower in the stack than platform, so introducing a `platform → calico` or `calico → platform` edge would invite cycle-like pain on first bootstrap. The plan on this stack is always safe to run standalone.

### `ignore_changes` scope on each namespace

- `goldilocks.fairwinds.com/vpa-update-mode`: Kyverno ClusterPolicy stamp (Wave 3B sweep, commit 8b43692a).
- `pod-security.kubernetes.io/enforce` + `-version`: tigera-operator stamps these on `calico-system` + `calico-apiserver` to opt them out of PSA. These labels aren't surfaced by the kubernetes provider as part of the import (they arrive through a different field manager), so left unmanaged to keep the plan clean. `tigera-operator` ns doesn't get the PSA labels so they aren't ignored there.

## What is NOT in this change

- The three live workloads: `tigera-operator` Deployment in `tigera-operator` ns, `calico-kube-controllers`/`calico-node`/`calico-typha` workloads in `calico-system`, the `calico-apiserver` in `calico-apiserver`. These are all reconciled by the tigera-operator from the Installation CR, so importing them into TF is redundant with importing the CR itself.
- The `Installation` CR (`default`, apiVersion `operator.tigera.io/v1`): the user-authored minimal spec has since been filled to 104 lines of operator-generated defaults. Adopting it requires a well-scoped `ignore_changes` list on the `manifest` field. Separate follow-up `code-3ad`.
- `.sops.yaml` / `tier0_stacks` updates: the original plan suggested Tier 0 (local SOPS state) for the full Calico stack on the theory that "network underpins all". With only three namespaces in the stack, the argument doesn't hold: a failed Tier 1 plan on calico namespaces cannot break networking, so no need to pay the Tier 0 tax.

## Verification

```
$ cd stacks/calico && ../../scripts/tg plan
No changes. Your infrastructure matches the configuration.

$ kubectl get pods -n calico-system
NAME                          READY   STATUS    RESTARTS
calico-kube-controllers-...   1/1     Running   0
calico-node-...               1/1     Running   0
... (all healthy, pre-existing)
```

Follow-up: code-3ad for operator + Installation CR adoption (needs low-traffic window + ignore_changes scoping).

Closes: code-hl1 scope of Wave 5b (namespaces). Remaining subwave in code-3ad.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:52:56 +00:00
# Calico CNI
#
# Calico has underpinned this cluster's pod networking since 2024-07-30, installed
# as raw kubectl manifests (tigera-operator Deployment + CRDs + Installation CR).
# Bringing the full stack under Terraform has a high blast radius: the
# tigera-operator Deployment must never flap during node pressure or during an
# apply, because new pod scheduling breaks within seconds of a CNI outage.
#
# This stack (created 2026-04-18 Wave 5b) adopts the three namespaces only:
# calico-system, calico-apiserver, tigera-operator. The `tigera-operator`
# Deployment, the 20+ CRDs it manages, and the `Installation` CR itself are
# intentionally *not* adopted yet — they require a low-traffic window and a
# careful ignore_changes set to cover operator-generated defaults on the
# Installation CR. Follow-up tracked in beads code-3ad.
#
# The namespaces are safe to adopt (no networking impact — they're just label
# containers) and give TF an audit trail entry for the labels/tier Kyverno
# cares about.
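#
# Illustration only, kept commented out: the adoption went through import
# blocks of roughly this shape. This is a sketch rebuilt from the resource
# addresses and namespace IDs used in this stack, not a copy of the original
# blocks; one such block is assumed per namespace.
#
#   import {
#     to = kubernetes_namespace.calico_system
#     id = "calico-system"
#   }
#
#   import {
#     to = kubernetes_namespace.calico_apiserver
#     id = "calico-apiserver"
#   }
#
#   import {
#     to = kubernetes_namespace.tigera_operator
#     id = "tigera-operator"
#   }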
resource "kubernetes_namespace" "calico_system" {
  metadata {
    name = "calico-system"
    labels = {
      name = "calico-system"
      # calico-system namespace is managed by tigera-operator, so auto-update is
      # incompatible (operator reverts DaemonSet image from its Installation CR).
      # "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: the goldilocks-vpa-auto-mode ClusterPolicy stamps
    # the vpa-update-mode label on every namespace.
    # pod-security.kubernetes.io/* labels are applied by the tigera-operator
    # reconciler on calico-system + calico-apiserver for PSA 'privileged'.
    ignore_changes = [
      metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"],
      metadata[0].labels["pod-security.kubernetes.io/enforce"],
      metadata[0].labels["pod-security.kubernetes.io/enforce-version"],
    ]
  }
}
resource "kubernetes_namespace" "calico_apiserver" {
  metadata {
    name = "calico-apiserver"
    labels = {
      name = "calico-apiserver"
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1 + PSA labels applied by tigera-operator (see calico_system).
    ignore_changes = [
      metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"],
      metadata[0].labels["pod-security.kubernetes.io/enforce"],
      metadata[0].labels["pod-security.kubernetes.io/enforce-version"],
    ]
  }
}
resource "kubernetes_namespace" "tigera_operator" {
  metadata {
    name = "tigera-operator"
    labels = {
      name = "tigera-operator"
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
  }
}
security(wave1): W1.6 observe phase LIVE (Calico GNP action:Log pilot on recruiter-responder)

Replaces the abandoned FelixConfiguration.flowLogsFileEnabled approach (Calico Enterprise-only field, rejected by OSS v3.26) with the supported primitive: Calico GlobalNetworkPolicy with `action: Log`.

## Mechanics (verified end-to-end on 2026-05-19)

1. kubectl_manifest applies GNP `wave1-egress-observe-recruiter-responder` with `namespaceSelector: kubernetes.io/metadata.name == 'recruiter-responder'`, `types: [Egress]`, `egress: [{action: Log}, {action: Allow}]`.
2. Felix translates to iptables LOG rule in `cali-po-_ZEv_aILlvyT9fbgWN58` chain with prefix `calico-packet: ` log-level=5.
3. Linux kernel emits LOG entries to ring buffer with transport=kernel.
4. systemd-journald captures kernel transport entries.
5. Alloy DaemonSet ships journal to Loki with `job=node-journal,transport=kernel`.
6. LogQL: `{job="node-journal"} |~ "calico-packet"` returns entries showing SRC/DST/PROTO/PORT for every NEW egress connection.

## Verified output sample

`calico-packet: IN=cali6cfdec4abc1 OUT=ens18 MAC=... SRC=10.10.122.132 DST=9.9.9.9 LEN=60 TOS=0x00 PREC=0x00 TTL=...`

The Allow rule in the GNP keeps egress functional (recruiter-responder remained 1/1 Running through the apply; verified Python TCP connections to 1.1.1.1, 8.8.8.8, 9.9.9.9 succeed).

## Wave 1 status

W1.6 observation infra is LIVE for the recruiter-responder pilot. W1.7 remains pending: collect 1 week of `{job="node-journal"} |~ "calico-packet"` samples, build empirical egress allowlist, flip the GNP rules from `[Log, Allow]` to `[Allow <specific dests>, Deny]`.

Expand observation to additional namespaces by adding entries to `spec.namespaceSelector` (e.g. `kubernetes.io/metadata.name in {recruiter-responder,X,Y}`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:10:42 +00:00
# Wave 1 W1.6 (beads code-8ywc): observation phase via Calico GlobalNetworkPolicy
# `action: Log`. This is the supported primitive on Calico OSS v3.26 — the
# Calico-Enterprise FelixConfiguration.flowLogsFileEnabled approach is NOT
# accepted by the OSS CRD (verified 2026-05-19: "strict decoding error").
security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE

## W1.1: K8s API audit log shipping (LIVE)

- alloy.yaml: added control-plane toleration so Alloy DaemonSet runs on k8s-master node. Verified alloy-7zg7t scheduled on master, tailing /var/log/kubernetes/audit.log
- loki.tf "Security Wave 1" rule group: added K2-K9 alert rules (skipped K1 per Q7 decision):
  - K2 K8sSATokenFromUnexpectedIP
  - K3 K8sSensitiveSecretReadByUnexpectedActor
  - K4 K8sExecIntoSensitiveNamespace
  - K5 K8sMassDelete (>5 Pod/Secret/CM in 60s by single user)
  - K6 K8sAuditPolicyModified (kubeadm-config CM change)
  - K7 K8sClusterRoleWildcardCreated (verbs=* + resources=*)
  - K8 K8sAnonymousBindingGranted
  - K9 K8sViktorFromUnexpectedIP
- All rules use source-IP regex matching the wave-1 allowlist (10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc, 100.64-127 tailnet) and `lane = "security"` → #security Slack route.
- Verified: kubectl-audit logs flowing in Loki query {job="kubernetes-audit"} returns events with node=k8s-master.
- Verified: /loki/api/v1/rules lists all K2-K9 + V1-V7 + S1.

## W1.5: require-trusted-registries Enforce (LIVE)

- security-policies.tf: flipped Audit→Enforce with explicit allowlist built by `kubectl get pods -A -o jsonpath='{..image}'` enumeration.
- Removed `*/*` catch-all (which made Audit→Enforce a no-op).
- Pattern includes 15 explicit registries, 6 DockerHub library bare names, 56 DockerHub user repos.
- Verified by admission dry-run:
  - evilcorp.example/malware:v1 → BLOCKED with custom message
  - alpine:3.20 → ALLOWED (matches `alpine*`)
  - docker.io/library/alpine:3.20 → ALLOWED (matches `docker.io/*`)

## W1.6: Calico flow logs (BLOCKED, Calico OSS limitation)

- Tried adding FelixConfiguration with flowLogsFileEnabled=true via kubectl_manifest in stacks/calico/main.tf
- Calico OSS rejected with "strict decoding error: unknown field spec.flowLogsFileEnabled"; these fields are Calico Enterprise/Tigera-only
- Removed the failed resource. Documented alternative paths in main.tf comment block: GNP with action=Log (iptables NFLOG → journal), Cilium migration, eBPF tooling, or Tigera Operator adoption.

## Docs updates

- security.md status table refreshed: W1.1/W1.2/W1.3/W1.4/W1.5 LIVE, W1.6/W1.7 blocked
- monitoring.md: Loki marked DEPLOYED (was incorrectly NOT-DEPLOYED in prior session before today's apply)

## Cleanup

- Removed stacks/kyverno/imports.tf (TF 1.5+ import blocks completed their job in the 2026-05-18 apply; should not stay in tree per TF docs)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 06:37:54 +00:00
#
# How it works:
# - GNP selects pods by namespaceSelector
# - egress rule action=Log makes Felix program an iptables LOG rule (not
#   NFLOG); matching packets land in the kernel log / journald with prefix
#   "calico-packet:" on each node
# - Alloy DaemonSet already ships node-journal to Loki (job=node-journal)
# - LogQL query: {job="node-journal"} |= "calico-packet" surfaces egress flows
# - After ~1 week of observation, build the empirical per-namespace egress
# allowlist; then flip the same GNP to [Allow specific dests, Deny rest]
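#
# Illustration only, kept commented out: once the allowlist exists, the
# enforce-phase rule list inside `spec` might take roughly this shape. The
# destinations are hypothetical placeholders (kube-dns VIP plus a
# documentation-range address), not observed results, and the real list is
# expected to differ per namespace.
#
#   egress = [
#     # Allow DNS to the in-cluster resolver.
#     { action = "Allow", protocol = "UDP", destination = { nets = ["10.96.0.10/32"], ports = [53] } },
#     # Allow one explicitly approved external HTTPS destination.
#     { action = "Allow", protocol = "TCP", destination = { nets = ["203.0.113.10/32"], ports = [443] } },
#     # Everything else is dropped.
#     { action = "Deny" },
#   ]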
#
security(wave1): W1.6 expand observation from recruiter-responder pilot → tier 3+4 (82 namespaces)

## Change

- Replaced kubectl_manifest.wave1_egress_observe_recruiter_responder with kubectl_manifest.wave1_egress_observe_tier34
- namespaceSelector changed from `kubernetes.io/metadata.name == 'recruiter-responder'` to `tier in {"3-edge", "4-aux"}`, which covers 82 namespaces (17 tier-3-edge + 65 tier-4-aux)
- Legacy pilot GNP wave1-egress-observe-recruiter-responder kubectl-deleted (apply_only=true means TF rename does NOT destroy the live old resource; cleanup done manually)
- Tier 0/1/2 namespaces explicitly out of wave 1 observation per locked plan (cluster infra + GPU workloads, deferred)

## Verification (live cluster, 2026-05-19)

- 82 namespaces match `tier in (3-edge,4-aux)`
- Felix translated the new policy into iptables LOG rule in cali-po-* chain
- LogQL `{job="node-journal"} |~ "calico-packet"` returns real packet metadata from multiple namespaces with distinct destinations:
  - east-west pod-to-pod (10.10.108.48, 10.10.122.131)
  - in-cluster service VIP (10.96.0.10, kube-dns)
  - external (149.154.166.110, Telegram API from recruiter-responder)

## W1.7 next step (calendar-bound, ~1 week)

- Let observation run for ~1 week
- Aggregate distinct destinations per namespace via LogQL
- Build per-namespace egress allowlist module `tier3_egress_baseline`
- Flip GNP rules from `[Log, Allow]` to `[Allow <specific dests>, Deny]`
- Phased per-namespace as originally planned

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:14:16 +00:00
# Started with `recruiter-responder` as the pilot on 2026-05-19; expanded the
# same day to all tier 3+4 namespaces per the locked plan (tier 3-edge has
# 17 ns, tier 4-aux has 65 ns, 82 in total; all use Calico's WorkloadEndpoint
# policy path). Tier 0/1/2 stay out of observation in wave 1 (cluster infra +
# GPU workloads, deferred per the plan).
#
# `apply_only = true` on the kubectl_manifest means renaming the TF resource
# does NOT destroy the old GNP via TF — we kubectl delete the legacy pilot
# GNP after this applies to clean it up. (Tracked manually.)
resource "kubectl_manifest" "wave1_egress_observe_tier34" {
security(wave1): W1.6 observe phase LIVE — Calico GNP action:Log pilot on recruiter-responder Replaces the abandoned FelixConfiguration.flowLogsFileEnabled approach (Calico Enterprise-only field, rejected by OSS v3.26) with the supported primitive: Calico GlobalNetworkPolicy with `action: Log`. ## Mechanics (verified end-to-end on 2026-05-19) 1. kubectl_manifest applies GNP `wave1-egress-observe-recruiter-responder` with `namespaceSelector: kubernetes.io/metadata.name == 'recruiter-responder'`, `types: [Egress]`, `egress: [{action: Log}, {action: Allow}]`. 2. Felix translates to iptables LOG rule in `cali-po-_ZEv_aILlvyT9fbgWN58` chain with prefix `calico-packet: ` log-level=5. 3. Linux kernel emits LOG entries to ring buffer with transport=kernel. 4. systemd-journald captures kernel transport entries. 5. Alloy DaemonSet ships journal to Loki with `job=node-journal,transport=kernel`. 6. LogQL: `{job="node-journal"} |~ "calico-packet"` returns entries showing SRC/DST/PROTO/PORT for every NEW egress connection. ## Verified output sample `calico-packet: IN=cali6cfdec4abc1 OUT=ens18 MAC=... SRC=10.10.122.132 DST=9.9.9.9 LEN=60 TOS=0x00 PREC=0x00 TTL=...` The Allow rule in the GNP keeps egress functional (recruiter-responder remained 1/1 Running through the apply — verified Python TCP connections to 1.1.1.1, 8.8.8.8, 9.9.9.9 succeed). ## Wave 1 status W1.6 observation infra is LIVE for the recruiter-responder pilot. W1.7 remains pending: collect 1 week of `{job="node-journal"} |~ "calico-packet"` samples, build empirical egress allowlist, flip the GNP rules from `[Log, Allow]` to `[Allow <specific dests>, Deny]`. Expand observation to additional namespaces by adding entries to `spec.namespaceSelector` (e.g. `kubernetes.io/metadata.name in {recruiter-responder,X,Y}`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:10:42 +00:00
yaml_body = yamlencode({
apiVersion = "projectcalico.org/v3"
kind = "GlobalNetworkPolicy"
metadata = {
security(wave1): W1.6 expand observation from recruiter-responder pilot → tier 3+4 (82 namespaces) ## Change - Replaced kubectl_manifest.wave1_egress_observe_recruiter_responder with kubectl_manifest.wave1_egress_observe_tier34 - namespaceSelector changed from `kubernetes.io/metadata.name == 'recruiter-responder'` to `tier in {"3-edge", "4-aux"}` — covers 82 namespaces (17 tier-3-edge + 65 tier-4-aux) - Legacy pilot GNP wave1-egress-observe-recruiter-responder kubectl-deleted (apply_only=true means TF rename does NOT destroy the live old resource; cleanup done manually) - Tier 0/1/2 namespaces explicitly out of wave 1 observation per locked plan (cluster infra + GPU workloads, deferred) ## Verification (live cluster, 2026-05-19) - 82 namespaces match `tier in (3-edge,4-aux)` - Felix translated the new policy into iptables LOG rule in cali-po-* chain - LogQL `{job="node-journal"} |~ "calico-packet"` returns real packet metadata from multiple namespaces with distinct destinations: - east-west pod-to-pod (10.10.108.48, 10.10.122.131) - in-cluster service VIP (10.96.0.10 — kube-dns) - external (149.154.166.110 — Telegram API from recruiter-responder) ## W1.7 next step (calendar-bound, ~1 week) - Let observation run for ~1 week - Aggregate distinct destinations per namespace via LogQL - Build per-namespace egress allowlist module `tier3_egress_baseline` - Flip GNP rules from `[Log, Allow]` to `[Allow <specific dests>, Deny]` - Phased per-namespace as originally planned Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:14:16 +00:00
name = "wave1-egress-observe-tier34"
security(wave1): W1.6 observe phase LIVE — Calico GNP action:Log pilot on recruiter-responder Replaces the abandoned FelixConfiguration.flowLogsFileEnabled approach (Calico Enterprise-only field, rejected by OSS v3.26) with the supported primitive: Calico GlobalNetworkPolicy with `action: Log`. ## Mechanics (verified end-to-end on 2026-05-19) 1. kubectl_manifest applies GNP `wave1-egress-observe-recruiter-responder` with `namespaceSelector: kubernetes.io/metadata.name == 'recruiter-responder'`, `types: [Egress]`, `egress: [{action: Log}, {action: Allow}]`. 2. Felix translates to iptables LOG rule in `cali-po-_ZEv_aILlvyT9fbgWN58` chain with prefix `calico-packet: ` log-level=5. 3. Linux kernel emits LOG entries to ring buffer with transport=kernel. 4. systemd-journald captures kernel transport entries. 5. Alloy DaemonSet ships journal to Loki with `job=node-journal,transport=kernel`. 6. LogQL: `{job="node-journal"} |~ "calico-packet"` returns entries showing SRC/DST/PROTO/PORT for every NEW egress connection. ## Verified output sample `calico-packet: IN=cali6cfdec4abc1 OUT=ens18 MAC=... SRC=10.10.122.132 DST=9.9.9.9 LEN=60 TOS=0x00 PREC=0x00 TTL=...` The Allow rule in the GNP keeps egress functional (recruiter-responder remained 1/1 Running through the apply — verified Python TCP connections to 1.1.1.1, 8.8.8.8, 9.9.9.9 succeed). ## Wave 1 status W1.6 observation infra is LIVE for the recruiter-responder pilot. W1.7 remains pending: collect 1 week of `{job="node-journal"} |~ "calico-packet"` samples, build empirical egress allowlist, flip the GNP rules from `[Log, Allow]` to `[Allow <specific dests>, Deny]`. Expand observation to additional namespaces by adding entries to `spec.namespaceSelector` (e.g. `kubernetes.io/metadata.name in {recruiter-responder,X,Y}`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:10:42 +00:00
annotations = {
"security.viktorbarzin.me/wave" = "1"
security(wave1): W1.6 expand observation from recruiter-responder pilot → tier 3+4 (82 namespaces) ## Change - Replaced kubectl_manifest.wave1_egress_observe_recruiter_responder with kubectl_manifest.wave1_egress_observe_tier34 - namespaceSelector changed from `kubernetes.io/metadata.name == 'recruiter-responder'` to `tier in {"3-edge", "4-aux"}` — covers 82 namespaces (17 tier-3-edge + 65 tier-4-aux) - Legacy pilot GNP wave1-egress-observe-recruiter-responder kubectl-deleted (apply_only=true means TF rename does NOT destroy the live old resource; cleanup done manually) - Tier 0/1/2 namespaces explicitly out of wave 1 observation per locked plan (cluster infra + GPU workloads, deferred) ## Verification (live cluster, 2026-05-19) - 82 namespaces match `tier in (3-edge,4-aux)` - Felix translated the new policy into iptables LOG rule in cali-po-* chain - LogQL `{job="node-journal"} |~ "calico-packet"` returns real packet metadata from multiple namespaces with distinct destinations: - east-west pod-to-pod (10.10.108.48, 10.10.122.131) - in-cluster service VIP (10.96.0.10 — kube-dns) - external (149.154.166.110 — Telegram API from recruiter-responder) ## W1.7 next step (calendar-bound, ~1 week) - Let observation run for ~1 week - Aggregate distinct destinations per namespace via LogQL - Build per-namespace egress allowlist module `tier3_egress_baseline` - Flip GNP rules from `[Log, Allow]` to `[Allow <specific dests>, Deny]` - Phased per-namespace as originally planned Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:14:16 +00:00
"security.viktorbarzin.me/purpose" = "observe-then-enforce egress for tier 3-edge + 4-aux"
security(wave1): W1.6 observe phase LIVE — Calico GNP action:Log pilot on recruiter-responder Replaces the abandoned FelixConfiguration.flowLogsFileEnabled approach (Calico Enterprise-only field, rejected by OSS v3.26) with the supported primitive: Calico GlobalNetworkPolicy with `action: Log`. ## Mechanics (verified end-to-end on 2026-05-19) 1. kubectl_manifest applies GNP `wave1-egress-observe-recruiter-responder` with `namespaceSelector: kubernetes.io/metadata.name == 'recruiter-responder'`, `types: [Egress]`, `egress: [{action: Log}, {action: Allow}]`. 2. Felix translates to iptables LOG rule in `cali-po-_ZEv_aILlvyT9fbgWN58` chain with prefix `calico-packet: ` log-level=5. 3. Linux kernel emits LOG entries to ring buffer with transport=kernel. 4. systemd-journald captures kernel transport entries. 5. Alloy DaemonSet ships journal to Loki with `job=node-journal,transport=kernel`. 6. LogQL: `{job="node-journal"} |~ "calico-packet"` returns entries showing SRC/DST/PROTO/PORT for every NEW egress connection. ## Verified output sample `calico-packet: IN=cali6cfdec4abc1 OUT=ens18 MAC=... SRC=10.10.122.132 DST=9.9.9.9 LEN=60 TOS=0x00 PREC=0x00 TTL=...` The Allow rule in the GNP keeps egress functional (recruiter-responder remained 1/1 Running through the apply — verified Python TCP connections to 1.1.1.1, 8.8.8.8, 9.9.9.9 succeed). ## Wave 1 status W1.6 observation infra is LIVE for the recruiter-responder pilot. W1.7 remains pending: collect 1 week of `{job="node-journal"} |~ "calico-packet"` samples, build empirical egress allowlist, flip the GNP rules from `[Log, Allow]` to `[Allow <specific dests>, Deny]`. Expand observation to additional namespaces by adding entries to `spec.namespaceSelector` (e.g. `kubernetes.io/metadata.name in {recruiter-responder,X,Y}`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:10:42 +00:00
}
}
spec = {
security(wave1): W1.6 expand observation from recruiter-responder pilot → tier 3+4 (82 namespaces) ## Change - Replaced kubectl_manifest.wave1_egress_observe_recruiter_responder with kubectl_manifest.wave1_egress_observe_tier34 - namespaceSelector changed from `kubernetes.io/metadata.name == 'recruiter-responder'` to `tier in {"3-edge", "4-aux"}` — covers 82 namespaces (17 tier-3-edge + 65 tier-4-aux) - Legacy pilot GNP wave1-egress-observe-recruiter-responder kubectl-deleted (apply_only=true means TF rename does NOT destroy the live old resource; cleanup done manually) - Tier 0/1/2 namespaces explicitly out of wave 1 observation per locked plan (cluster infra + GPU workloads, deferred) ## Verification (live cluster, 2026-05-19) - 82 namespaces match `tier in (3-edge,4-aux)` - Felix translated the new policy into iptables LOG rule in cali-po-* chain - LogQL `{job="node-journal"} |~ "calico-packet"` returns real packet metadata from multiple namespaces with distinct destinations: - east-west pod-to-pod (10.10.108.48, 10.10.122.131) - in-cluster service VIP (10.96.0.10 — kube-dns) - external (149.154.166.110 — Telegram API from recruiter-responder) ## W1.7 next step (calendar-bound, ~1 week) - Let observation run for ~1 week - Aggregate distinct destinations per namespace via LogQL - Build per-namespace egress allowlist module `tier3_egress_baseline` - Flip GNP rules from `[Log, Allow]` to `[Allow <specific dests>, Deny]` - Phased per-namespace as originally planned Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:14:16 +00:00
order = 2000
selector = "all()"
namespaceSelector = "tier in {\"3-edge\", \"4-aux\"}"
types = ["Egress"]
egress = [
# Rule 1: log every new egress connection. The iptables LOG target writes to
# the kernel ring buffer, journald captures it, and Alloy ships it to Loki
# with job=node-journal,transport=kernel.
# LogQL: {job="node-journal"} |~ "calico-packet"
{ action = "Log" },
# Rule 2: allow everything (observation must NOT break workloads).
{ action = "Allow" },
]
}
})
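# apply_only: Terraform never deletes the live object, so renaming or removing
# this resource leaves the GNP in the cluster; delete it manually with kubectl.
apply_only = true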
}
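
# Sketch of the W1.7 enforce phase, kept commented out. This is an illustration
# under stated assumptions, not the planned implementation: the single
# destination below is the external address seen in the observation samples,
# standing in for the per-namespace allowlist that W1.7 has yet to build. The
# idea is to replace the [Log, Allow] pair with explicit Allow rules followed
# by a final Deny:
#
#   egress = [
#     # One observed external destination (hypothetical allowlist entry).
#     {
#       action      = "Allow"
#       destination = { nets = ["149.154.166.110/32"] }
#     },
#     # ...further Allow rules from the aggregated observations (DNS,
#     # in-cluster traffic, and so on) must precede the Deny.
#     { action = "Deny" },
#   ]
#
# Keeping { action = "Log" } directly ahead of the Deny during rollout would
# still record whatever the allowlist misses, since Log does not terminate
# rule evaluation.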
security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE ## W1.1 — K8s API audit log shipping (LIVE) - alloy.yaml: added control-plane toleration so Alloy DaemonSet runs on k8s-master node. Verified alloy-7zg7t scheduled on master, tailing /var/log/kubernetes/audit.log - loki.tf "Security Wave 1" rule group: added K2-K9 alert rules (skipped K1 per Q7 decision): - K2 K8sSATokenFromUnexpectedIP - K3 K8sSensitiveSecretReadByUnexpectedActor - K4 K8sExecIntoSensitiveNamespace - K5 K8sMassDelete (>5 Pod/Secret/CM in 60s by single user) - K6 K8sAuditPolicyModified (kubeadm-config CM change) - K7 K8sClusterRoleWildcardCreated (verbs=* + resources=*) - K8 K8sAnonymousBindingGranted - K9 K8sViktorFromUnexpectedIP - All rules use source-IP regex matching the wave-1 allowlist (10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc, 100.64-127 tailnet) and `lane = "security"` → #security Slack route. - Verified: kubectl-audit logs flowing in Loki query {job="kubernetes-audit"} returns events with node=k8s-master. - Verified: /loki/api/v1/rules lists all K2-K9 + V1-V7 + S1. ## W1.5 — require-trusted-registries Enforce (LIVE) - security-policies.tf: flipped Audit→Enforce with explicit allowlist built by `kubectl get pods -A -o jsonpath='{..image}'` enumeration. - Removed `*/*` catch-all (which made Audit→Enforce a no-op). - Pattern includes 15 explicit registries, 6 DockerHub library bare names, 56 DockerHub user repos. 
- Verified by admission dry-run: - evilcorp.example/malware:v1 → BLOCKED with custom message - alpine:3.20 → ALLOWED (matches `alpine*`) - docker.io/library/alpine:3.20 → ALLOWED (matches `docker.io/*`) ## W1.6 — Calico flow logs (BLOCKED — Calico OSS limitation) - Tried adding FelixConfiguration with flowLogsFileEnabled=true via kubectl_manifest in stacks/calico/main.tf - Calico OSS rejected with "strict decoding error: unknown field spec.flowLogsFileEnabled" — these fields are Calico Enterprise/Tigera-only - Removed the failed resource. Documented alternative paths in main.tf comment block: GNP with action=Log (iptables NFLOG → journal), Cilium migration, eBPF tooling, or Tigera Operator adoption. ## Docs updates - security.md status table refreshed: W1.1/W1.2/W1.3/W1.4/W1.5 LIVE, W1.6/W1.7 blocked - monitoring.md: Loki marked DEPLOYED (was incorrectly NOT-DEPLOYED in prior session before today's apply) ## Cleanup - Removed stacks/kyverno/imports.tf (TF 1.5+ import blocks completed their job in the 2026-05-18 apply; should not stay in tree per TF docs) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 06:37:54 +00:00
# CI retrigger 2026-05-16T13:42:57+00:00 — bulk enrollment apply (pipeline #689 killed)
# CI retrigger v2 2026-05-16T13:46:35+00:00
# CI retrigger v3 2026-05-16T14:06:39Z
# CI retrigger v4 2026-05-16T14:13:59Z
# CI retrigger v5 2026-05-16T23:10:38Z
# CI retrigger v6 2026-05-16T23:18:58Z