[infra] Partial Calico adoption: namespaces only (Wave 5b)
## Context
Wave 5b of the state-drift consolidation plan. Calico has run this cluster's
pod networking since 2024-07-30, installed via raw kubectl manifests —
tigera-operator Deployment + ~20 CRDs + an Installation CR. The plan
flagged Calico as HIGH BLAST because the operator + Installation CR sit on
the critical path for pod scheduling; any mistake during adoption can
break CNI and block new pods cluster-wide within seconds.
This session takes the safe sub-step: adopt only the three namespaces.
Namespaces are label containers — TF managing their names + PSA labels
cannot disrupt Calico networking. Getting the operator, Installation CR,
and CRDs under TF requires dedicated prep (picking the right
`ignore_changes` fields to absorb operator-generated defaults in the
Installation CR, decoupling from the embedded PSA labels applied at
admission, and a low-traffic window). Deferred to `code-3ad`.
## This change
New Tier 1 stack `stacks/calico/` adopting via import `{}` blocks
(Wave 8 convention, commit 8a99be11):
- `kubernetes_namespace.calico_system` ← id `calico-system`
- `kubernetes_namespace.calico_apiserver` ← id `calico-apiserver`
- `kubernetes_namespace.tigera_operator` ← id `tigera-operator`
Apply: `3 imported, 0 added, 0 changed, 0 destroyed.` A second `tg plan`
afterwards returns `No changes`. Zero cluster impact: the namespaces
stayed exactly as they were cluster-side.
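For reference, the three adoptions listed above could be written roughly as follows. This is a minimal sketch assuming Terraform 1.5 or later (where `import` blocks are available); the resource addresses and ids are the ones named in this message, while the file the blocks live in is an assumption.

```hcl
# Sketch only: the file placement inside stacks/calico/ is assumed,
# not taken from the commit. Addresses and ids come from the list above.
import {
  to = kubernetes_namespace.calico_system
  id = "calico-system"
}

import {
  to = kubernetes_namespace.calico_apiserver
  id = "calico-apiserver"
}

import {
  to = kubernetes_namespace.tigera_operator
  id = "tigera-operator"
}
```

Each `to` address needs a matching `resource` block in the same stack, otherwise the plan fails before anything is imported.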
### terragrunt dependency choice
Deliberately no `dependency "platform"` clause — Calico is lower in the
stack than platform, so introducing a `platform → calico` or
`calico → platform` edge would invite cycle-like pain on first
bootstrap. The plan on this stack is always safe to run standalone.
### `ignore_changes` scope on each namespace
- `goldilocks.fairwinds.com/vpa-update-mode` — Kyverno ClusterPolicy
stamp (Wave 3B sweep, commit 8b43692a).
- `pod-security.kubernetes.io/enforce` + `-version` — tigera-operator
stamps these on `calico-system` + `calico-apiserver` to opt them out
of PSA. These labels aren't surfaced by the kubernetes provider as
part of the import (they arrive through a different field manager),
so left unmanaged to keep the plan clean. `tigera-operator` ns
doesn't get the PSA labels so they aren't ignored there.
## What is NOT in this change
- The live workloads in the three namespaces: the `tigera-operator`
Deployment in the `tigera-operator` ns, the `calico-kube-controllers`/
`calico-node`/`calico-typha` workloads in `calico-system`, and the
`calico-apiserver` in `calico-apiserver`. These are all reconciled by the
tigera-operator from the Installation CR, so importing them into TF is
redundant with importing the CR itself.
- The `Installation` CR (`default`, apiVersion
`operator.tigera.io/v1`) — the user-authored minimal spec has since
been filled to 104 lines of operator-generated defaults. Adopting it
requires a well-scoped `ignore_changes` list on the `manifest` field.
Separate follow-up `code-3ad`.
- `.sops.yaml` / `tier0_stacks` updates — the original plan suggested
Tier 0 (local SOPS state) for the full Calico stack on the theory
that "network underpins all". With only three namespaces in the stack,
the argument doesn't hold: a failed Tier 1 plan on calico namespaces
cannot break networking, so no need to pay the Tier 0 tax.
## Verification
```
$ cd stacks/calico && ../../scripts/tg plan
No changes. Your infrastructure matches the configuration.
$ kubectl get pods -n calico-system
NAME READY STATUS RESTARTS
calico-kube-controllers-... 1/1 Running 0
calico-node-... 1/1 Running 0
... (all healthy, pre-existing)
```
Follow-up: code-3ad for operator + Installation CR adoption (needs
low-traffic window + ignore_changes scoping).
Closes: code-hl1 scope of Wave 5b (namespaces). Remaining subwave in code-3ad.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:52:56 +00:00
# Calico CNI
#
# Calico has underpinned this cluster's pod networking since 2024-07-30, installed
# as raw kubectl manifests (tigera-operator Deployment + CRDs + Installation CR).
# Bringing the full stack under Terraform is high-blast: the operator and its
# Deployment must never flap during node pressure or during any apply, because
# new pod scheduling breaks within seconds of a CNI outage.
#
# This stack (created 2026-04-18 Wave 5b) adopts the three namespaces only:
# calico-system, calico-apiserver, tigera-operator. The `tigera-operator`
# Deployment, the 20+ CRDs it manages, and the `Installation` CR itself are
# intentionally *not* adopted yet; they require a low-traffic window and a
# careful ignore_changes set to cover operator-generated defaults on the
# Installation CR. Follow-up tracked in beads code-3ad.
#
# The namespaces are safe to adopt (no networking impact, since they're just
# label containers) and give TF an audit trail entry for the labels/tier Kyverno
# cares about.

resource "kubernetes_namespace" "calico_system" {
  metadata {
    name = "calico-system"
    labels = {
      name = "calico-system"
      # calico-system namespace is managed by tigera-operator; auto-update is
      # incompatible (operator reverts DaemonSet image from its Installation CR).
      # "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode label on every namespace.
    # pod-security.kubernetes.io/* labels are applied by the tigera-operator
    # reconciler on calico-system + calico-apiserver for PSA 'privileged'.
    ignore_changes = [
      metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"],
      metadata[0].labels["pod-security.kubernetes.io/enforce"],
      metadata[0].labels["pod-security.kubernetes.io/enforce-version"],
    ]
  }
}

resource "kubernetes_namespace" "calico_apiserver" {
  metadata {
    name = "calico-apiserver"
    labels = {
      name = "calico-apiserver"
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1 + PSA labels applied by tigera-operator (see calico_system).
    ignore_changes = [
      metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"],
      metadata[0].labels["pod-security.kubernetes.io/enforce"],
      metadata[0].labels["pod-security.kubernetes.io/enforce-version"],
    ]
  }
}

resource "kubernetes_namespace" "tigera_operator" {
  metadata {
    name = "tigera-operator"
    labels = {
      name = "tigera-operator"
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
  }
}
security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE
## W1.1 — K8s API audit log shipping (LIVE)
- alloy.yaml: added control-plane toleration so Alloy DaemonSet runs on
k8s-master node. Verified alloy-7zg7t scheduled on master, tailing
/var/log/kubernetes/audit.log
- loki.tf "Security Wave 1" rule group: added K2-K9 alert rules
(skipped K1 per Q7 decision):
- K2 K8sSATokenFromUnexpectedIP
- K3 K8sSensitiveSecretReadByUnexpectedActor
- K4 K8sExecIntoSensitiveNamespace
- K5 K8sMassDelete (>5 Pod/Secret/CM in 60s by single user)
- K6 K8sAuditPolicyModified (kubeadm-config CM change)
- K7 K8sClusterRoleWildcardCreated (verbs=* + resources=*)
- K8 K8sAnonymousBindingGranted
- K9 K8sViktorFromUnexpectedIP
- All rules use source-IP regex matching the wave-1 allowlist
(10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc,
100.64-127 tailnet) and `lane = "security"` → #security Slack route.
- Verified: kubectl-audit logs flowing in Loki query
{job="kubernetes-audit"} returns events with node=k8s-master.
- Verified: /loki/api/v1/rules lists all K2-K9 + V1-V7 + S1.
## W1.5 — require-trusted-registries Enforce (LIVE)
- security-policies.tf: flipped Audit→Enforce with explicit allowlist
built by `kubectl get pods -A -o jsonpath='{..image}'` enumeration.
- Removed `*/*` catch-all (which made Audit→Enforce a no-op).
- Pattern includes 15 explicit registries, 6 DockerHub library bare
names, 56 DockerHub user repos.
- Verified by admission dry-run:
- evilcorp.example/malware:v1 → BLOCKED with custom message
- alpine:3.20 → ALLOWED (matches `alpine*`)
- docker.io/library/alpine:3.20 → ALLOWED (matches `docker.io/*`)
## W1.6 — Calico flow logs (BLOCKED — Calico OSS limitation)
- Tried adding FelixConfiguration with flowLogsFileEnabled=true via
kubectl_manifest in stacks/calico/main.tf
- Calico OSS rejected with "strict decoding error: unknown field
spec.flowLogsFileEnabled" — these fields are Calico Enterprise/Tigera-only
- Removed the failed resource. Documented alternative paths in main.tf
comment block: GNP with action=Log (iptables NFLOG → journal), Cilium
migration, eBPF tooling, or Tigera Operator adoption.
## Docs updates
- security.md status table refreshed: W1.1/W1.2/W1.3/W1.4/W1.5 LIVE,
W1.6/W1.7 blocked
- monitoring.md: Loki marked DEPLOYED (was incorrectly NOT-DEPLOYED in
prior session before today's apply)
## Cleanup
- Removed stacks/kyverno/imports.tf (TF 1.5+ import blocks completed
their job in the 2026-05-18 apply; should not stay in tree per TF docs)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 06:37:54 +00:00
# Wave 1 W1.6 (beads code-8ywc): Calico OSS does NOT support flow-log-to-file
# export via FelixConfiguration: `flowLogsFileEnabled` and related fields are
# Calico Enterprise / Tigera Cloud features and are rejected by the OSS API
# (verified 2026-05-19: "strict decoding error: unknown field spec.flowLogsFileEnabled").
#
# Alternative observe-then-enforce paths for W1.6/W1.7:
#   1. Calico GlobalNetworkPolicy with `action: Log` on tier 3+4. The Log action
#      writes to iptables NFLOG which lands in node syslog. Alloy already
#      scrapes journal, but the format needs parsing.
#   2. Cilium replacement with Hubble flow observability (large migration).
#   3. Tigera Operator + Calico Enterprise (commercial).
#   4. eBPF-based flow capture (e.g. inspektor-gadget, retina) sidecar approach.
#
# Wave 1 stops at this fork. The observe phase requires a further design choice,
# tracked under code-8ywc as a separate W1.6/W1.7 follow-up.

# CI retrigger 2026-05-16T13:42:57+00:00: bulk enrollment apply (pipeline #689 killed)
# CI retrigger v2 2026-05-16T13:46:35+00:00

# CI retrigger v3 2026-05-16T14:06:39Z

# CI retrigger v4 2026-05-16T14:13:59Z

# CI retrigger v5 2026-05-16T23:10:38Z

# CI retrigger v6 2026-05-16T23:18:58Z