security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE
## W1.1 — K8s API audit log shipping (LIVE)
- alloy.yaml: added a control-plane toleration so the Alloy DaemonSet runs on
  the k8s-master node. Verified alloy-7zg7t is scheduled on master and tailing
  /var/log/kubernetes/audit.log.
- loki.tf "Security Wave 1" rule group: added K2-K9 alert rules
  (skipped K1 per Q7 decision):
  - K2 K8sSATokenFromUnexpectedIP
  - K3 K8sSensitiveSecretReadByUnexpectedActor
  - K4 K8sExecIntoSensitiveNamespace
  - K5 K8sMassDelete (>5 Pod/Secret/CM in 60s by single user)
  - K6 K8sAuditPolicyModified (kubeadm-config CM change)
  - K7 K8sClusterRoleWildcardCreated (verbs=* + resources=*)
  - K8 K8sAnonymousBindingGranted
  - K9 K8sViktorFromUnexpectedIP
- All rules use source-IP regex matching against the wave-1 allowlist
  (10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc,
  100.64.0.0/10 tailnet) and `lane = "security"` → #security Slack route.
- Verified: Kubernetes audit logs are flowing into Loki; the query
  {job="kubernetes-audit"} returns events with node=k8s-master.
- Verified: /loki/api/v1/rules lists all K2-K9 + V1-V7 + S1.
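
As a rough illustration of what the allowlist in the rules above is meant to express, the sketch below checks a source IP against the same ranges with Python's `ipaddress` module. It is not the regex used in loki.tf. The name `is_allowlisted` is hypothetical, and treating the tailnet range as 100.64.0.0/10 (100.64.x.x through 100.127.x.x) is an assumption.

```python
import ipaddress

# Wave-1 allowlist ranges as listed in the commit message.
# Assumption: the tailnet entry corresponds to the block 100.64.0.0/10.
WAVE1_ALLOWLIST = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.20.0/22",    # listed first in the commit message
        "192.168.1.0/24",
        "10.10.0.0/16",    # pod CIDR
        "10.96.0.0/12",    # service CIDR
        "100.64.0.0/10",   # tailnet (assumed CIDR form)
    )
]


def is_allowlisted(source_ip: str) -> bool:
    """Return True if source_ip falls inside any allowlisted network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in WAVE1_ALLOWLIST)


if __name__ == "__main__":
    for ip in ("10.0.21.7", "100.100.4.2", "203.0.113.9"):
        print(ip, "allowed" if is_allowlisted(ip) else "UNEXPECTED")
```

A helper like this could serve as a cross-check when testing the alert regex against boundary addresses such as 10.0.23.255 and 10.0.24.1.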
## W1.5 — require-trusted-registries Enforce (LIVE)
- security-policies.tf: flipped Audit→Enforce with an explicit allowlist
  built by enumerating running images via
  `kubectl get pods -A -o jsonpath='{..image}'`.
- Removed the `*/*` catch-all (which made Audit→Enforce a no-op).
- Pattern includes 15 explicit registries, 6 DockerHub library bare
  names, 56 DockerHub user repos.
- Verified by admission dry-run:
  - evilcorp.example/malware:v1 → BLOCKED with custom message
  - alpine:3.20 → ALLOWED (matches `alpine*`)
  - docker.io/library/alpine:3.20 → ALLOWED (matches `docker.io/*`)
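
The dry-run results above can be approximated offline with a small glob check. This is only a sketch: it uses Python's `fnmatch` rather than Kyverno's own matcher, it includes just the two patterns named in this message, and the function name `is_trusted` is hypothetical.

```python
from fnmatch import fnmatchcase

# Only the two patterns quoted in the commit message; the real policy
# carries many more (15 registries, 6 bare names, 56 user repos).
ALLOWED_PATTERNS = ["alpine*", "docker.io/*"]


def is_trusted(image: str) -> bool:
    """Return True if the image reference matches any allowlist glob."""
    return any(fnmatchcase(image, pattern) for pattern in ALLOWED_PATTERNS)


if __name__ == "__main__":
    for image in (
        "evilcorp.example/malware:v1",
        "alpine:3.20",
        "docker.io/library/alpine:3.20",
    ):
        print(image, "ALLOWED" if is_trusted(image) else "BLOCKED")
```

Under these glob semantics a prefix pattern such as `alpine*` also matches any reference that merely starts with `alpine`, so the bare-name patterns may deserve a second look. Whether Kyverno's matcher behaves identically is not confirmed here.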
## W1.6 — Calico flow logs (BLOCKED — Calico OSS limitation)
- Tried adding a FelixConfiguration with flowLogsFileEnabled=true via
  kubectl_manifest in stacks/calico/main.tf.
- Calico OSS rejected it with "strict decoding error: unknown field
  spec.flowLogsFileEnabled"; these fields are Calico Enterprise/Tigera-only.
- Removed the failed resource. Documented alternative paths in the main.tf
  comment block: GNP with action=Log (iptables NFLOG → journal), Cilium
  migration, eBPF tooling, or Tigera Operator adoption.
## Docs updates
- security.md status table refreshed: W1.1/W1.2/W1.3/W1.4/W1.5 LIVE,
  W1.6/W1.7 blocked
- monitoring.md: Loki marked DEPLOYED (was incorrectly NOT-DEPLOYED in
  the prior session, before today's apply)
## Cleanup
- Removed stacks/kyverno/imports.tf: the TF 1.5+ import blocks completed
  their job in the 2026-05-18 apply, and the TF docs allow removing them
  once the import has been applied.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
parent 8ef4f06ac0
commit 669ba97078
7 changed files with 166 additions and 88 deletions
@@ -57,7 +57,7 @@ graph TB
 |-----------|---------|----------|---------|
 | Prometheus | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Metrics collection and storage, scrape configs for all services |
 | Grafana | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Visualization, 14+ dashboards (API server, CoreDNS, GPU, UPS, etc.) |
-| Loki | **NOT DEPLOYED as of 2026-05-18** — TF code exists in `stacks/monitoring/modules/monitoring/loki.tf` but `helm_release.loki` has a self-referencing `depends_on` that prevented apply. No `loki` Helm release in cluster, no Loki pods or Service. All "log aggregation" claims below are aspirational. Tracked under beads `code-146x`. | `stacks/monitoring/modules/monitoring/` | Log aggregation and querying (planned) |
+| Loki | **DEPLOYED 2026-05-18** (SingleBinary mode, 30d retention, 50Gi PVC on `proxmox-lvm`, ruler enabled → Alertmanager). Re-enabled from previous "operational overhead" disable. Ships logs via Alloy DaemonSet (now on all nodes including master after 2026-05-19 toleration add). | `stacks/monitoring/modules/monitoring/` | Log aggregation and querying |
 | Alertmanager | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Alert routing with cascade inhibitions |
 | Uptime Kuma | Latest (Diun monitored) | `stacks/uptime-kuma/` | Internal + external HTTP monitors, status page |
 | External Monitor Sync | Python 3.12 | `stacks/uptime-kuma/` | CronJob (10min) syncs `[External]` monitors from `cloudflare_proxied_names` |
@@ -175,13 +175,13 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
 |---|---|
 | W1.2 Vault `file` audit device | **LIVE** — `vault_audit.file` in `stacks/vault/main.tf:287`, writing to `/vault/audit/vault-audit.log` on `proxmox-lvm-encrypted` PVC |
 | W1.2 Vault `x_forwarded_for_authorized_addrs = 10.10.0.0/16` | **LIVE** — applied via `tg apply -target=helm_release.vault` on 2026-05-18; all 3 vault pods restarted cleanly |
-| W1.2 Vault audit log shipping to Loki | **PARTIAL** — `audit-tail` sidecar live in vault pods (emits JSON audit lines to stdout, viewable via `kubectl logs -n vault vault-X -c audit-tail`). Actual shipping to Loki BLOCKED on `code-146x` (Loki not deployed in the cluster despite TF code existing). |
-| W1.1 K8s API audit policy | **PENDING** — needs `stacks/infra` kubeadm-config templating |
-| W1.3 Source-IP anomaly rules (K9, V7, S1) | **PENDING** — gated on `code-146x` (Loki + Alloy not deployed) and W1.1 audit-policy codification |
-| W1.4 Kyverno security policies → Enforce | **CODE READY, APPLY BLOCKED** by `code-e2dp` (terraform-provider-kubernetes v3.1.0 crash on `kubernetes_manifest` plan) |
-| W1.5 Kyverno trusted-registries enforce | **CODE PARTIAL** (exclude list added; allowlist tightening + enforce-flip deferred until `code-e2dp` resolved) |
-| W1.6 Calico flow logs + log-only GNP | **BLOCKED** on `code-3ad` (Calico stack adopts only namespaces today; `Installation` CR + Felix config not under TF) |
-| W1.7 NetworkPolicy phased enforce | **BLOCKED** on W1.6 observation window |
+| W1.2 Vault audit log shipping to Loki | **LIVE** — `audit-tail` sidecar in vault pods + Alloy DaemonSet ships to Loki with `container="audit-tail"`. Verified via `{namespace="vault",container="audit-tail"}` LogQL query. |
+| W1.1 K8s API audit policy + shipping | **LIVE** — kube-apiserver audit policy was already configured (Metadata level, `/var/log/kubernetes/audit.log`, 7d retention). Alloy DaemonSet now tolerates control-plane taint, scrapes the audit log file, ships to Loki with `job=kubernetes-audit`. K2-K9 alert rules in Loki ruler. |
+| W1.3 Source-IP anomaly rules (K9, V7, S1) | **LIVE** (K9, V7); **S1 PENDING** — fires once promtail/Alloy on PVE host ships sshd journal with `job=sshd-pve`. |
+| W1.4 Kyverno security policies → Enforce | **LIVE** — 3 policies in Enforce mode with 35-namespace exclude list. |
+| W1.5 Kyverno trusted-registries → Enforce | **LIVE** — explicit allowlist (15 registries + 6 DockerHub library bare names + 56 DockerHub user repos). Verified by admission dry-run: `evilcorp.example/malware:v1` BLOCKED, `alpine:3.20` and `docker.io/library/alpine:3.20` ALLOWED. |
+| W1.6 Calico flow logs + log-only GNP | **BLOCKED** — Calico OSS doesn't support `FelixConfiguration.flowLogsFileEnabled` (Calico Enterprise/Tigera-only, rejected 2026-05-19 with "strict decoding error"). Alternative paths: Calico GlobalNetworkPolicy `action: Log` → iptables NFLOG → node journal, OR Cilium migration, OR Tigera Operator adoption. See stacks/calico/main.tf comment block. |
+| W1.7 NetworkPolicy phased enforce | **BLOCKED** on W1.6 observation-method decision |

 The block below documents the locked design.
