diff --git a/docs/architecture/security.md b/docs/architecture/security.md
index f2e31f49..6a6286ae 100644
--- a/docs/architecture/security.md
+++ b/docs/architecture/security.md
@@ -181,7 +181,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
 | W1.4 Kyverno security policies → Enforce | **LIVE** — 3 policies in Enforce mode with 35-namespace exclude list. |
 | W1.5 Kyverno trusted-registries → Enforce | **LIVE** — explicit allowlist (15 registries + 6 DockerHub library bare names + 56 DockerHub user repos). Verified by admission dry-run: `evilcorp.example/malware:v1` BLOCKED, `alpine:3.20` and `docker.io/library/alpine:3.20` ALLOWED. |
 | W1.6 Calico observe-phase (pilot: recruiter-responder) | **LIVE** (2026-05-19) — GlobalNetworkPolicy `wave1-egress-observe-recruiter-responder` with rules `[action:Log, action:Allow]`. FelixConfiguration.flowLogsFileEnabled approach abandoned (Calico Enterprise-only field, rejected by OSS v3.26). Log action emits iptables LOG with prefix `calico-packet: ` → kernel → journald → Alloy → Loki. Verified: `{job="node-journal"} \|~ "calico-packet"` returns real packet metadata (SRC/DST/PROTO). Expand to more namespaces by adding to `namespaceSelector`. |
-| W1.7 NetworkPolicy phased enforce | **PENDING** — needs ~1 week of W1.6 observation, then build empirical allowlist from Loki queries, flip GNP rules from `[Log, Allow]` to `[Allow specific dests, Deny rest]`. |
+| W1.7 NetworkPolicy phased enforce | **PARTIAL ANALYSIS** — first observation snapshot at `docs/architecture/wave1-egress-observation-2026-05-22.md` (36 source namespaces seen so far, 29 thin-profile candidates). Recommend continuing observation through 2026-05-29 (full week) before any enforce flip. Pilot enforce target: `recruiter-responder` (2 destinations only). `servarr` stays in Log+Allow indefinitely (BitTorrent P2P incompatible with static enforce). |
 
 The block below documents the locked design.
 
diff --git a/docs/architecture/wave1-egress-observation-2026-05-22.md b/docs/architecture/wave1-egress-observation-2026-05-22.md
new file mode 100644
index 00000000..1fc00a3f
--- /dev/null
+++ b/docs/architecture/wave1-egress-observation-2026-05-22.md
@@ -0,0 +1,141 @@
+# Wave 1 W1.6/W1.7 — Egress Observation Snapshot (2026-05-22)
+
+First analysis pass over the Calico GNP `wave1-egress-observe-tier34` data
+captured in Loki via `{job="node-journal"} |~ "calico-packet"`.
+
+**Data scope:** ~10000 flow log lines pulled from Loki over ~6h+24h windows.
+Loki caps queries at 5000 records so longer windows are sample-capped.
+
+**Coverage:** 36 source namespaces observed making egress (out of 82 selected
+by `tier in {3-edge, 4-aux}`). Namespaces missing from data are either idle,
+scaled to 0, or producing only intra-namespace traffic (which Calico Log
+captures from-workload but most pods in those namespaces talk locally).
+
+## Egress fan-out per namespace
+
+| Namespace | dests | pod-ns | svc | external |
+|---|---:|---:|---:|---:|
+| affine | 3 | 2 | 1 | 0 |
+| beads-server | 4 | 3 | 1 | 0 |
+| cyberchef | 2 | 1 | 1 | 0 |
+| dawarich | 3 | 2 | 1 | 0 |
+| default | 1 | 0 | 0 | 1 |
+| ebooks | 3 | 2 | 1 | 0 |
+| f1-stream | 16 | 2 | 1 | 13 |
+| forgejo | 2 | 1 | 1 | 0 |
+| hackmd | 2 | 1 | 1 | 0 |
+| homepage | 2 | 1 | 1 | 0 |
+| isponsorblocktv | 2 | 0 | 1 | 1 |
+| jsoncrack | 2 | 1 | 1 | 0 |
+| kms | 2 | 1 | 1 | 0 |
+| mailserver | 2 | 0 | 1 | 1 |
+| meshcentral | 2 | 2 | 0 | 0 |
+| n8n | 2 | 1 | 1 | 0 |
+| nextcloud | 5 | 2 | 1 | 2 |
+| onlyoffice | 2 | 1 | 1 | 0 |
+| openclaw | 18 | 4 | 1 | 13 |
+| paperless-ngx | 3 | 2 | 1 | 0 |
+| phpipam | 3 | 2 | 1 | 0 |
+| poison-fountain | 2 | 1 | 1 | 0 |
+| postiz | 9 | 8 | 1 | 0 |
+| realestate-crawler | 2 | 1 | 1 | 0 |
+| recruiter-responder | 2 | 0 | 1 | 1 |
+| rybbit | 2 | 1 | 1 | 0 |
+| send | 2 | 1 | 1 | 0 |
+| servarr | 134 | 2 | 2 | 130 |
+| speedtest | 2 | 1 | 1 | 0 |
+| status-page | 10 | 2 | 1 | 7 |
+| tandoor | 2 | 1 | 1 | 0 |
+| technitium | 5 | 2 | 1 | 2 |
+| trading-bot | 5 | 2 | 1 | 2 |
+| url | 2 | 1 | 1 | 0 |
+| website | 2 | 1 | 1 | 0 |
+| woodpecker | 8 | 2 | 1 | 5 |
+
+## Common patterns
+
+**Universal baseline** (every observed namespace makes these):
+- `kube-system/kube-dns` UDP/53 — DNS resolution
+- Often `dbaas` TCP/3306 (MySQL) or TCP/5432 (Postgres)
+- Often `redis` TCP/6379
+
+**Per-namespace specifics** (the part that varies):
+- External HTTPS to specific IPs (CDNs, APIs)
+- Internal pod-to-pod for service-specific clients
+
+## W1.7 rollout candidates (sorted by simplicity)
+
+**Tier A — trivial egress (recommend first wave):**
+
+`recruiter-responder` has the simplest profile of all observed:
+- `kube-system/kube-dns` :53/UDP
+- `99.83.136.103` :443/TCP (Telegram API)
+
+That's it. Two destinations. Perfect first enforce candidate.
+
+**Tier B — small egress (≤3 external + ≤5 internal, 29 namespaces):**
+
+affine, beads-server, cyberchef, dawarich, ebooks, forgejo, hackmd, homepage,
+isponsorblocktv, jsoncrack, kms, mailserver, meshcentral, n8n, nextcloud,
+onlyoffice, paperless-ngx, phpipam, poison-fountain, realestate-crawler,
+rybbit, send, speedtest, tandoor, technitium, trading-bot, url, website.
+
+These can be enforce'd in batches of 3-5/day after the recruiter-responder
+pilot proves out.
+
+**Tier C — moderate egress (5–18 external):**
+
+f1-stream (13 ext), openclaw (13 ext), woodpecker (5 ext), status-page (7 ext).
+Need per-IP allowlist or domain-based selectors.
+
+**Tier D — broad egress (do NOT enforce statically):**
+
+`servarr` has 130+ external IPs because it runs BitTorrent peer-to-peer.
+Static IP enforcement won't work; either leave in Log+Allow mode permanently
+or use a port-only allowlist (TCP+UDP 6881+random high ports outbound).
+
+## Important caveats before flipping to enforce
+
+1. **Observation horizon is too short.** Only ~6h of dense data and ~24h
+   total. CronJobs that run weekly, periodic Vault token rotations (7d),
+   external service maintenance windows, Keel auto-rollouts pulling new
+   image versions — all missed. Recommend collecting **at least 7 days**
+   before declaring an allowlist complete.
+
+2. **`servarr`** is fundamentally incompatible with static enforce — keep
+   in Log+Allow (or explicit deny for known-bad CIDRs only).
+
+3. **External IPs are dynamic.** Cloudflare-fronted services rotate IPs.
+   The recruiter-responder external IP `99.83.136.103` is one of Telegram's
+   API endpoints — Telegram has a CIDR range. Allowing single IPs will break
+   when DNS resolves to a different IP. Prefer Calico's `domains:` selector
+   (Calico OSS supports DNS-based egress allowlists via `dns_policy_resolver`)
+   OR allow the full Cloudflare/AWS CIDR range OR use a per-app egress
+   gateway.
+
+4. **The observation didn't capture intra-namespace traffic** by design —
+   the Calico Log rule fires on egress from workload endpoint, but
+   pod-to-same-namespace-pod traffic on the same node may bypass the
+   filter chain (varies). Real-world testing needed after enforce flip.
+
+## Suggested next-session sequencing
+
+1. **Continue observation for at least 7 days** before any enforce flip.
+   Compare data on 2026-05-29 vs today; if no new destinations show up,
+   the allowlist is stable.
+2. **First enforce: recruiter-responder.** GNP with allowlist =
+   {kube-dns, telegram CIDR, vault svc, eso svc}. Watch for breakage.
+3. **Tier B batch rollout** at 3-5 namespaces/day per Keel-style phased
+   rollout pattern (memory id=1972).
+4. **Tier C requires per-namespace investigation** — what are those
+   external IPs? Map to known services first.
+5. **servarr stays in Log+Allow** indefinitely (or migrate to dedicated
+   egress proxy).
+
+## Source data location
+
+- Loki LogQL: `{job="node-journal"} |~ "calico-packet"`
+- Pod IP → namespace map at observation time saved at
+  `/tmp/pod-ip-map.txt` on the analysis host (ephemeral).
+- Analysis scripts: `/tmp/analyze_flows2.py`, `/tmp/build_allowlist.py`.
+- Tracked under beads `code-8ywc` (W1.7).
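
The analysis scripts named under "Source data location" live in `/tmp` and are not part of this diff, so the sketch below is not their contents. It is a minimal illustration of how the per-namespace fan-out could be rebuilt from raw `calico-packet` lines. It assumes the standard kernel LOG `KEY=value` layout (`SRC=`, `DST=`, `PROTO=`, `DPT=`), a pod IP to namespace dict like the one saved in `/tmp/pod-ip-map.txt`, and placeholder cluster CIDRs (`10.244.0.0/16` for pods, `10.96.0.0/12` for services) that would need to be replaced with the real ranges. The names `parse_flow`, `is_external`, and `fan_out` are hypothetical.

```python
import ipaddress
import re
from collections import defaultdict

# Key=value fields written by the kernel LOG target (assumed standard layout).
FIELD_RE = re.compile(r"\b(SRC|DST|PROTO|DPT)=(\S+)")

# ASSUMPTION: placeholder pod and service CIDRs; substitute the cluster's real ranges.
CLUSTER_CIDRS = [
    ipaddress.ip_network("10.244.0.0/16"),
    ipaddress.ip_network("10.96.0.0/12"),
]


def parse_flow(line):
    """Return (src, dst, proto, dport) for a calico-packet line, else None."""
    if "calico-packet" not in line:
        return None
    fields = {}
    for key, value in FIELD_RE.findall(line):
        fields.setdefault(key, value)  # keep the first occurrence of each field
    if not {"SRC", "DST", "PROTO"} <= fields.keys():
        return None
    # DPT is absent for protocols without ports (for example ICMP).
    dport = int(fields["DPT"]) if fields.get("DPT", "").isdigit() else None
    return fields["SRC"], fields["DST"], fields["PROTO"], dport


def is_external(dst):
    """True when dst falls outside every assumed cluster CIDR."""
    addr = ipaddress.ip_address(dst)
    return not any(addr in net for net in CLUSTER_CIDRS)


def fan_out(lines, pod_ip_to_ns):
    """Map source namespace -> set of distinct (dst, proto, dport) tuples."""
    dests = defaultdict(set)
    for line in lines:
        flow = parse_flow(line)
        if flow is None:
            continue
        src, dst, proto, dport = flow
        namespace = pod_ip_to_ns.get(src)
        if namespace is not None:  # skip sources missing from the IP map
            dests[namespace].add((dst, proto, dport))
    return dict(dests)
```

Running `fan_out` over the 2026-05-29 pull and diffing each namespace's set against the 2026-05-22 result would show directly whether any new destinations appeared during the extra observation week.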