First analysis pass over Calico GNP wave1-egress-observe-tier34 data captured
in Loki since 2026-05-19. Pulled ~10000 flow log lines covering 36 source
namespaces (of 82 selected by tier 3+4). Analysis script outputs preserved
on the dev host at /tmp/{analyze_flows2,build_allowlist}.py.
## Findings
**Universal baseline (every observed ns):**
- DNS to kube-system/kube-dns UDP/53
- Often mysql.dbaas TCP/3306 or pg.dbaas TCP/5432
- Often redis.redis TCP/6379
**Rollout tiering by egress fan-out:**
- Tier A (recruiter-responder only): 2 destinations, ideal pilot
- Tier B (29 namespaces): ≤3 external IPs, ≤5 internal — batch rollout
- Tier C (4 namespaces: f1-stream/openclaw/woodpecker/status-page):
needs per-IP investigation
- Tier D (servarr): 130+ external IPs (BitTorrent P2P) — keep Log+Allow
permanently or move to dedicated egress proxy
## Caveats blocking immediate enforce
- Observation horizon too short: ~6h dense data, ~24h total. Need ≥7 days
to catch weekly CronJobs, Vault token rotations, Keel pulls.
- External IPs are dynamic (Cloudflare/AWS rotate). Static IP allowlists
will break — need DNS-based selectors or CIDR ranges.
- Some intra-namespace traffic bypasses the Calico filter chain.
## Recommended next steps
1. Continue observation through 2026-05-29 (full week). Compare destination
set day-over-day; if stable, allowlist is ready.
2. First enforce: recruiter-responder (allowlist = kube-dns + telegram CIDR
+ vault/ESO service IPs).
3. Tier B phased rollout at 3-5 ns/day after pilot proves out.
Full analysis: docs/architecture/wave1-egress-observation-2026-05-22.md
Tracked under beads code-8ywc.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wave 1 W1.6/W1.7 — Egress Observation Snapshot (2026-05-22)
First analysis pass over the Calico GNP wave1-egress-observe-tier34 data
captured in Loki via {job="node-journal"} |~ "calico-packet".
Data scope: ~10000 flow log lines pulled from Loki over ~6h and ~24h windows. Loki caps each query at 5000 records, so the longer window is a capped sample rather than a complete capture.
Coverage: 36 source namespaces observed making egress (out of 82 selected
by tier in {3-edge, 4-aux}). Namespaces missing from the data are either idle,
scaled to 0, or producing only intra-namespace traffic. The Calico Log rule
fires on from-workload egress, but most pods in those namespaces talk only locally.
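Because of the 5000-record cap, a longer observation could be pulled as several shorter queries instead of one capped one. A minimal sketch, assuming fixed-size sub-windows are acceptable; the `split_windows` helper and the one-hour step are illustrative and not taken from the analysis scripts:

```python
from datetime import datetime, timedelta, timezone


def split_windows(start, end, step):
    """Split [start, end) into consecutive sub-windows no longer than `step`."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    windows = []
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        windows.append((cursor, upper))
        cursor = upper
    return windows


if __name__ == "__main__":
    # Hypothetical 6h window, queried one hour at a time.
    end = datetime(2026, 5, 22, 12, 0, tzinfo=timezone.utc)
    start = end - timedelta(hours=6)
    for lower, upper in split_windows(start, end, timedelta(hours=1)):
        print(lower.isoformat(), "->", upper.isoformat())
```

Each sub-window would be issued as its own query. A sub-window that still returns exactly the cap is a hint that it needs a smaller step.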
## Egress fan-out per namespace
| Namespace | dests | pod-ns | svc | external |
|---|---|---|---|---|
| affine | 3 | 2 | 1 | 0 |
| beads-server | 4 | 3 | 1 | 0 |
| cyberchef | 2 | 1 | 1 | 0 |
| dawarich | 3 | 2 | 1 | 0 |
| default | 1 | 0 | 0 | 1 |
| ebooks | 3 | 2 | 1 | 0 |
| f1-stream | 16 | 2 | 1 | 13 |
| forgejo | 2 | 1 | 1 | 0 |
| hackmd | 2 | 1 | 1 | 0 |
| homepage | 2 | 1 | 1 | 0 |
| isponsorblocktv | 2 | 0 | 1 | 1 |
| jsoncrack | 2 | 1 | 1 | 0 |
| kms | 2 | 1 | 1 | 0 |
| mailserver | 2 | 0 | 1 | 1 |
| meshcentral | 2 | 2 | 0 | 0 |
| n8n | 2 | 1 | 1 | 0 |
| nextcloud | 5 | 2 | 1 | 2 |
| onlyoffice | 2 | 1 | 1 | 0 |
| openclaw | 18 | 4 | 1 | 13 |
| paperless-ngx | 3 | 2 | 1 | 0 |
| phpipam | 3 | 2 | 1 | 0 |
| poison-fountain | 2 | 1 | 1 | 0 |
| postiz | 9 | 8 | 1 | 0 |
| realestate-crawler | 2 | 1 | 1 | 0 |
| recruiter-responder | 2 | 0 | 1 | 1 |
| rybbit | 2 | 1 | 1 | 0 |
| send | 2 | 1 | 1 | 0 |
| servarr | 134 | 2 | 2 | 130 |
| speedtest | 2 | 1 | 1 | 0 |
| status-page | 10 | 2 | 1 | 7 |
| tandoor | 2 | 1 | 1 | 0 |
| technitium | 5 | 2 | 1 | 2 |
| trading-bot | 5 | 2 | 1 | 2 |
| url | 2 | 1 | 1 | 0 |
| website | 2 | 1 | 1 | 0 |
| woodpecker | 8 | 2 | 1 | 5 |
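In every row, dests equals pod-ns + svc + external (for example, f1-stream: 2 + 1 + 13 = 16). The sketch below shows one way such a table might be derived from classified flow records. The record shape `(source_ns, kind, destination)` and the `fan_out` helper are assumptions for illustration, not the actual output of `analyze_flows2.py`:

```python
from collections import defaultdict

KINDS = ("pod-ns", "svc", "external")


def fan_out(flows):
    """Count distinct destinations per source namespace, split by kind.

    `flows` is an iterable of (source_ns, kind, destination) tuples,
    where kind is one of KINDS. Repeated flows are counted once.
    """
    seen = defaultdict(lambda: {kind: set() for kind in KINDS})
    for source_ns, kind, destination in flows:
        if kind not in KINDS:
            raise ValueError(f"unknown destination kind: {kind!r}")
        seen[source_ns][kind].add(destination)

    table = {}
    for source_ns, by_kind in seen.items():
        row = {kind: len(dests) for kind, dests in by_kind.items()}
        row["dests"] = sum(row.values())
        table[source_ns] = row
    return table


if __name__ == "__main__":
    flows = [
        ("recruiter-responder", "svc", "kube-system/kube-dns:53/UDP"),
        ("recruiter-responder", "external", "99.83.136.103:443/TCP"),
        ("recruiter-responder", "external", "99.83.136.103:443/TCP"),  # repeat
    ]
    print(fan_out(flows))
```

Using sets per kind keeps the counts independent of how many log lines each destination produced, which matters when the sample is capped.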
## Common patterns
Universal baseline (every observed namespace makes these):
- `kube-system/kube-dns` UDP/53 (DNS resolution)
- Often `dbaas` TCP/3306 (MySQL) or TCP/5432 (Postgres)
- Often `redis` TCP/6379
Per-namespace specifics (the part that varies):
- External HTTPS to specific IPs (CDNs, APIs)
- Internal pod-to-pod for service-specific clients
## W1.7 rollout candidates (sorted by simplicity)
Tier A — trivial egress (recommend first wave):
recruiter-responder has the simplest profile of all observed:
- `kube-system/kube-dns:53/UDP`
- `99.83.136.103:443/TCP` (Telegram API)
That's it. Two destinations. Perfect first enforce candidate.
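A profile this small maps onto a short egress policy. The sketch below builds a Calico `GlobalNetworkPolicy` as a plain dict and prints it as JSON. The policy name, the `k8s-app == 'kube-dns'` label, and the `203.0.113.0/24` placeholder range are assumptions; the real Telegram range is not recorded here and would need to be looked up before use:

```python
import json


def egress_allowlist_policy(namespace, external_cidrs):
    """Build a GlobalNetworkPolicy dict allowing DNS plus the given CIDRs on TCP/443."""
    return {
        "apiVersion": "projectcalico.org/v3",
        "kind": "GlobalNetworkPolicy",
        "metadata": {"name": f"egress-allowlist-{namespace}"},  # hypothetical name
        "spec": {
            # Applies to workload endpoints in the given namespace.
            "selector": f"projectcalico.org/namespace == '{namespace}'",
            "types": ["Egress"],
            "egress": [
                {
                    "action": "Allow",
                    "protocol": "UDP",
                    "destination": {
                        "namespaceSelector": "kubernetes.io/metadata.name == 'kube-system'",
                        "selector": "k8s-app == 'kube-dns'",  # assumed pod label
                        "ports": [53],
                    },
                },
                {
                    "action": "Allow",
                    "protocol": "TCP",
                    "destination": {"nets": list(external_cidrs), "ports": [443]},
                },
            ],
        },
    }


if __name__ == "__main__":
    # 203.0.113.0/24 is a documentation-only placeholder, not a Telegram range.
    policy = egress_allowlist_policy("recruiter-responder", ["203.0.113.0/24"])
    print(json.dumps(policy, indent=2))
```

Taking the external ranges as a parameter keeps the single observed IP out of the policy body, in line with the caveat about dynamic external IPs.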
Tier B — small egress (≤3 external + ≤5 internal, 29 namespaces):
affine, beads-server, cyberchef, dawarich, ebooks, forgejo, hackmd, homepage, isponsorblocktv, jsoncrack, kms, mailserver, meshcentral, n8n, nextcloud, onlyoffice, paperless-ngx, phpipam, poison-fountain, realestate-crawler, rybbit, send, speedtest, tandoor, technitium, trading-bot, url, website.
These can be enforced in batches of 3-5/day after the recruiter-responder pilot proves out.
Tier C — moderate egress (5–13 external):
f1-stream (13 ext), openclaw (13 ext), woodpecker (5 ext), status-page (7 ext). These need a per-IP allowlist or domain-based selectors.
Tier D — broad egress (do NOT enforce statically):
servarr has 130+ external IPs because it runs BitTorrent peer-to-peer.
Static IP enforcement won't work; either leave in Log+Allow mode permanently
or use a port-only allowlist (TCP+UDP 6881+random high ports outbound).
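The tiering rules above can be written down as a small classifier. This is a sketch under stated assumptions: "internal" is taken to mean pod-ns + svc, the cutoff of 100 external IPs for Tier D is an illustrative choice (the source only says 130+), and Tier A is treated as a hand-picked pilot from within the Tier B thresholds rather than as a separate rule:

```python
def classify(pod_ns, svc, external, broad_cutoff=100):
    """Assign a rollout tier from per-namespace destination counts.

    broad_cutoff is an assumed threshold for "too broad to enforce statically".
    """
    internal = pod_ns + svc  # assumption: internal = pod-ns + svc
    if external >= broad_cutoff:
        return "D"
    if external <= 3 and internal <= 5:
        return "B"
    if external > 3:
        return "C"
    return "unclassified"  # few external IPs but many internal peers


if __name__ == "__main__":
    # (pod-ns, svc, external) copied from the fan-out table.
    rows = {
        "recruiter-responder": (0, 1, 1),
        "f1-stream": (2, 1, 13),
        "servarr": (2, 2, 130),
        "postiz": (8, 1, 0),
        "default": (0, 0, 1),
    }
    for name, counts in rows.items():
        print(name, classify(*counts))
```

Under these assumptions, `postiz` (8 pod-ns destinations) falls outside every tier and `default` meets the Tier B thresholds, although neither is named in the tier lists above. Both are worth an explicit decision before rollout.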
## Important caveats before flipping to enforce
- Observation horizon is too short. Only ~6h of dense data and ~24h total. Weekly CronJobs, periodic Vault token rotations (7d), external service maintenance windows, and Keel auto-rollouts pulling new image versions are all missed. Recommend collecting at least 7 days before declaring an allowlist complete.
- `servarr` is fundamentally incompatible with static enforce: keep it in Log+Allow (or add an explicit deny for known-bad CIDRs only).
- External IPs are dynamic. Cloudflare-fronted services rotate IPs. The recruiter-responder external IP `99.83.136.103` is one of Telegram's API endpoints, and Telegram has a CIDR range. Allowing single IPs will break when DNS resolves to a different IP. Prefer a DNS-based `domains:` selector where the installed Calico supports it (domain-based rules are documented for Calico Enterprise/Cloud; OSS support via `dns_policy_resolver` should be confirmed before relying on it), OR allow the full Cloudflare/AWS CIDR range, OR use a per-app egress gateway.
- The observation didn't capture intra-namespace traffic by design: the Calico Log rule fires on egress from the workload endpoint, but pod-to-pod traffic within the same namespace on the same node may bypass the filter chain (varies). Real-world testing is needed after the enforce flip.
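Moving from single IPs to CIDR ranges starts with checking which observed IPs already fall inside a known provider range. A minimal sketch using the standard-library `ipaddress` module; the `match_ranges` helper, the range name, and the addresses are placeholders from documentation-only blocks, not real provider data:

```python
import ipaddress


def match_ranges(ips, named_ranges):
    """Map each IP to the name of the first range containing it, or None."""
    networks = [
        (name, ipaddress.ip_network(cidr)) for name, cidr in named_ranges.items()
    ]
    result = {}
    for ip in ips:
        addr = ipaddress.ip_address(ip)
        result[ip] = next(
            (name for name, net in networks if addr in net), None
        )
    return result


if __name__ == "__main__":
    # Placeholder range and addresses (RFC 5737 documentation blocks).
    ranges = {"example-provider": "198.51.100.0/24"}
    observed = ["198.51.100.7", "192.0.2.10"]
    print(match_ranges(observed, ranges))
```

IPs that map to `None` are the ones that still need manual investigation, which is most of the Tier C work.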
## Suggested next-session sequencing
- Continue observation for at least 7 days before any enforce flip. Compare data on 2026-05-29 vs today; if no new destinations show up, the allowlist is stable.
- First enforce: recruiter-responder. GNP with allowlist = {kube-dns, telegram CIDR, vault svc, eso svc}. Watch for breakage.
- Tier B batch rollout at 3-5 namespaces/day per Keel-style phased rollout pattern (memory id=1972).
- Tier C requires per-namespace investigation — what are those external IPs? Map to known services first.
- servarr stays in Log+Allow indefinitely (or migrate to dedicated egress proxy).
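The stability check in the first step amounts to a set difference per namespace between two snapshots. A sketch, assuming each snapshot is a mapping from namespace to its observed destinations; the `new_destinations` helper and the sample data are illustrative:

```python
def new_destinations(baseline, later):
    """Return, per namespace, destinations seen in `later` but not in `baseline`."""
    added = {}
    for namespace, dests in later.items():
        extra = set(dests) - set(baseline.get(namespace, ()))
        if extra:
            added[namespace] = sorted(extra)
    return added


if __name__ == "__main__":
    # Hypothetical snapshots; 203.0.113.9 is a documentation-only address.
    day_1 = {"recruiter-responder": {"kube-dns:53/UDP", "99.83.136.103:443/TCP"}}
    day_8 = {
        "recruiter-responder": {
            "kube-dns:53/UDP",
            "99.83.136.103:443/TCP",
            "203.0.113.9:443/TCP",
        }
    }
    print(new_destinations(day_1, day_8))
```

An empty result is the "no new destinations" condition. A namespace that appears only in the later snapshot is reported with all of its destinations.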
## Source data location
- Loki LogQL: `{job="node-journal"} |~ "calico-packet"`
- Pod IP → namespace map at observation time saved at `/tmp/pod-ip-map.txt` on the analysis host (ephemeral).
- Analysis scripts: `/tmp/analyze_flows2.py`, `/tmp/build_allowlist.py`.
- Tracked under beads `code-8ywc` (W1.7).
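Since the pod IP map is ephemeral, it helps to record how a log line and that map combine into a flow record. The sketch below assumes the lines follow the usual kernel netfilter LOG layout (`SRC=`, `DST=`, `PROTO=`, `DPT=` tokens) and that the map has been loaded into a dict. The sample line, the pod IP, and the `parse_flow` helper are illustrative, not taken from the analysis scripts or the captured data:

```python
import re

# KEY=value tokens as emitted by the kernel LOG target; value may be empty.
_KV = re.compile(r"\b([A-Z]+)=(\S*)")


def parse_flow(line, pod_ns_by_ip):
    """Extract source namespace and destination from one log line, or None."""
    if "calico-packet" not in line:
        return None
    fields = dict(_KV.findall(line))
    if "SRC" not in fields or "DST" not in fields:
        return None
    port = fields.get("DPT")
    return {
        "source_ns": pod_ns_by_ip.get(fields["SRC"], "unknown"),
        "dst_ip": fields["DST"],
        "proto": fields.get("PROTO"),
        "dst_port": int(port) if port and port.isdigit() else None,
    }


if __name__ == "__main__":
    # Illustrative line and pod IP; real lines may carry extra fields.
    sample = (
        "May 22 10:00:00 node1 kernel: calico-packet: IN=cali0123456789a "
        "OUT=eth0 MAC= SRC=10.10.1.5 DST=99.83.136.103 LEN=60 TTL=63 "
        "PROTO=TCP SPT=45678 DPT=443 SYN"
    )
    print(parse_flow(sample, {"10.10.1.5": "recruiter-responder"}))
```

Source IPs missing from the map are kept as `unknown` rather than dropped, so pods that were rescheduled during the window remain visible in the output.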