infra/docs/architecture/wave1-egress-observation-2026-05-22.md
Viktor Barzin 3962513036 security(wave1): W1.7 analysis snapshot — observation data → allowlist plan
First analysis pass over Calico GNP wave1-egress-observe-tier34 data captured
in Loki since 2026-05-19. Pulled ~10000 flow log lines covering 36 source
namespaces (of 82 selected by tier 3+4). Analysis script outputs preserved
on the dev host at /tmp/{analyze_flows2,build_allowlist}.py.

## Findings

**Universal baseline (every observed ns):**
- DNS to kube-system/kube-dns UDP/53
- Often mysql.dbaas TCP/3306 or pg.dbaas TCP/5432
- Often redis.redis TCP/6379

**Rollout tiering by egress fan-out:**
- Tier A (recruiter-responder only): 2 destinations, ideal pilot
- Tier B (29 namespaces): ≤3 external IPs, ≤5 internal — batch rollout
- Tier C (4 namespaces: f1-stream/openclaw/woodpecker/status-page):
  needs per-IP investigation
- Tier D (servarr): 130+ external IPs (BitTorrent P2P) — keep Log+Allow
  permanently or move to dedicated egress proxy

## Caveats blocking immediate enforce
- Observation horizon too short: ~6h dense data, ~24h total. Need ≥7 days
  to catch weekly CronJobs, Vault token rotations, Keel pulls.
- External IPs are dynamic (Cloudflare/AWS rotate). Static IP allowlists
  will break — need DNS-based selectors or CIDR ranges.
- Some intra-namespace traffic bypasses the Calico filter chain.

## Recommended next steps
1. Continue observation through 2026-05-29 (full week). Compare destination
   set day-over-day; if stable, allowlist is ready.
2. First enforce: recruiter-responder (allowlist = kube-dns + telegram CIDR
   + vault/ESO service IPs).
3. Tier B phased rollout at 3-5 ns/day after pilot proves out.

Full analysis: docs/architecture/wave1-egress-observation-2026-05-22.md
Tracked under beads code-8ywc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 15:22:25 +00:00


# Wave 1 W1.6/W1.7 — Egress Observation Snapshot (2026-05-22)
First analysis pass over the Calico GNP `wave1-egress-observe-tier34` data
captured in Loki via `{job="node-journal"} |~ "calico-packet"`.

**Data scope:** ~10000 flow log lines pulled from Loki over ~6h and ~24h
windows. Loki caps each query at 5000 records, so longer windows return a
capped sample rather than every record.

**Coverage:** 36 source namespaces observed making egress (out of 82 selected
by `tier in {3-edge, 4-aux}`). The remaining 46 are either idle, scaled to 0,
or producing only intra-namespace traffic, which this observation does not
reliably capture (see caveat 4).
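
One way to reduce the sampling loss from the per-query cap is to split a long window into shorter slices and query each slice separately. The following is a minimal sketch, assuming the standard Loki `query_range` parameters (`query`, `start`, `end`, `limit`, `direction`); the helper name `slice_queries` and the slice length are illustrative and do not come from the analysis scripts.

```python
from datetime import datetime, timedelta, timezone

LOGQL = '{job="node-journal"} |~ "calico-packet"'
LIMIT = 5000  # per-query record cap described in the data scope


def slice_queries(start, end, step):
    """Return query_range parameter dicts covering [start, end) in step-sized slices."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    params = []
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        params.append({
            "query": LOGQL,
            # Loki accepts Unix epoch nanoseconds for start and end.
            "start": str(int(cursor.timestamp()) * 1_000_000_000),
            "end": str(int(upper.timestamp()) * 1_000_000_000),
            "limit": str(LIMIT),
            "direction": "forward",
        })
        cursor = upper
    return params


if __name__ == "__main__":
    begin = datetime(2026, 5, 21, tzinfo=timezone.utc)
    for q in slice_queries(begin, begin + timedelta(hours=3), timedelta(hours=1)):
        print(q["start"], q["end"])
```

Each dict would be sent as the query string of `GET /loki/api/v1/query_range`. A slice that returns exactly `LIMIT` rows is probably still truncated and should be re-queried with a shorter step.
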
## Egress fan-out per namespace
| Namespace | dests | pod-ns | svc | external |
|---|---:|---:|---:|---:|
| affine | 3 | 2 | 1 | 0 |
| beads-server | 4 | 3 | 1 | 0 |
| cyberchef | 2 | 1 | 1 | 0 |
| dawarich | 3 | 2 | 1 | 0 |
| default | 1 | 0 | 0 | 1 |
| ebooks | 3 | 2 | 1 | 0 |
| f1-stream | 16 | 2 | 1 | 13 |
| forgejo | 2 | 1 | 1 | 0 |
| hackmd | 2 | 1 | 1 | 0 |
| homepage | 2 | 1 | 1 | 0 |
| isponsorblocktv | 2 | 0 | 1 | 1 |
| jsoncrack | 2 | 1 | 1 | 0 |
| kms | 2 | 1 | 1 | 0 |
| mailserver | 2 | 0 | 1 | 1 |
| meshcentral | 2 | 2 | 0 | 0 |
| n8n | 2 | 1 | 1 | 0 |
| nextcloud | 5 | 2 | 1 | 2 |
| onlyoffice | 2 | 1 | 1 | 0 |
| openclaw | 18 | 4 | 1 | 13 |
| paperless-ngx | 3 | 2 | 1 | 0 |
| phpipam | 3 | 2 | 1 | 0 |
| poison-fountain | 2 | 1 | 1 | 0 |
| postiz | 9 | 8 | 1 | 0 |
| realestate-crawler | 2 | 1 | 1 | 0 |
| recruiter-responder | 2 | 0 | 1 | 1 |
| rybbit | 2 | 1 | 1 | 0 |
| send | 2 | 1 | 1 | 0 |
| servarr | 134 | 2 | 2 | 130 |
| speedtest | 2 | 1 | 1 | 0 |
| status-page | 10 | 2 | 1 | 7 |
| tandoor | 2 | 1 | 1 | 0 |
| technitium | 5 | 2 | 1 | 2 |
| trading-bot | 5 | 2 | 1 | 2 |
| url | 2 | 1 | 1 | 0 |
| website | 2 | 1 | 1 | 0 |
| woodpecker | 8 | 2 | 1 | 5 |
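
The three destination columns in the table above could be derived by bucketing each destination IP against the cluster's address ranges. This is a sketch only: the two CIDRs are hypothetical placeholders, and it assumes `pod-ns` and `svc` count distinct destination IPs inside the pod and service ranges, which the source does not state explicitly.

```python
import ipaddress

# Hypothetical cluster ranges; substitute the real pod and service CIDRs.
POD_CIDR = ipaddress.ip_network("10.10.0.0/16")
SVC_CIDR = ipaddress.ip_network("10.96.0.0/12")


def classify_dest(ip):
    """Bucket one destination IP as pod-ns, svc, or external."""
    addr = ipaddress.ip_address(ip)
    if addr in POD_CIDR:
        return "pod-ns"
    if addr in SVC_CIDR:
        return "svc"
    return "external"


def fan_out(dest_ips):
    """Count distinct destinations per column for one source namespace."""
    counts = {"dests": 0, "pod-ns": 0, "svc": 0, "external": 0}
    for ip in set(dest_ips):  # de-duplicate repeated flows to the same IP
        counts[classify_dest(ip)] += 1
        counts["dests"] += 1
    return counts


if __name__ == "__main__":
    # One service IP plus one external IP, the shape of the recruiter-responder row.
    print(fan_out(["10.96.0.10", "99.83.136.103", "99.83.136.103"]))
```
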
## Common patterns
**Universal baseline** (DNS from every observed namespace, plus shared data services in many):
- `kube-system/kube-dns` UDP/53 — DNS resolution
- Often `dbaas` TCP/3306 (MySQL) or TCP/5432 (Postgres)
- Often `redis` TCP/6379
**Per-namespace specifics** (the part that varies):
- External HTTPS to specific IPs (CDNs, APIs)
- Internal pod-to-pod for service-specific clients
## W1.7 rollout candidates (sorted by simplicity)
**Tier A — trivial egress (recommend first wave):**
`recruiter-responder` has the simplest profile of all observed:
- `kube-system/kube-dns` :53/UDP
- `99.83.136.103` :443/TCP (Telegram API)
That's it. Two destinations. Perfect first enforce candidate.
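
The two-destination profile above maps to a small egress allowlist. Below is a hedged sketch that builds a Calico `GlobalNetworkPolicy` manifest as a plain dict. The policy name, the `kubernetes.io/metadata.name` namespace label, and the `k8s-app == 'kube-dns'` selector are assumptions about this cluster, and the single observed IP is used as a `/32` only for illustration (the caveats below explain why a single IP is fragile).

```python
import json


def egress_allowlist_policy(namespace, external_nets):
    """Build a Calico GlobalNetworkPolicy manifest as a plain dict."""
    return {
        "apiVersion": "projectcalico.org/v3",
        "kind": "GlobalNetworkPolicy",
        # Hypothetical name; follow the cluster's own naming scheme.
        "metadata": {"name": f"wave1-egress-enforce-{namespace}"},
        "spec": {
            "namespaceSelector": f"kubernetes.io/metadata.name == '{namespace}'",
            "types": ["Egress"],
            "egress": [
                {   # DNS to kube-dns
                    "action": "Allow",
                    "protocol": "UDP",
                    "destination": {
                        "namespaceSelector": "kubernetes.io/metadata.name == 'kube-system'",
                        "selector": "k8s-app == 'kube-dns'",
                        "ports": [53],
                    },
                },
                {   # HTTPS to the listed external networks
                    "action": "Allow",
                    "protocol": "TCP",
                    "destination": {"nets": list(external_nets), "ports": [443]},
                },
            ],
        },
    }


if __name__ == "__main__":
    policy = egress_allowlist_policy("recruiter-responder", ["99.83.136.103/32"])
    print(json.dumps(policy, indent=2))  # JSON is also valid YAML input
```

The sketch covers only the two observed destinations. Any destinations that did not appear in the observation window, as well as the policy `order`, would need to be added before applying it.
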
**Tier B — small egress (≤3 external + ≤5 internal, 29 namespaces):**
affine, beads-server, cyberchef, dawarich, ebooks, forgejo, hackmd, homepage,
isponsorblocktv, jsoncrack, kms, mailserver, meshcentral, n8n, nextcloud,
onlyoffice, paperless-ngx, phpipam, poison-fountain, realestate-crawler,
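rybbit, send, speedtest, tandoor, technitium, trading-bot, url, website.
This list names 28 namespaces; `default` and `postiz` are in the fan-out
table but not assigned to a tier here (`postiz` has 9 internal destinations,
above the Tier B limit).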
These can be enforced in batches of 3-5 per day after the recruiter-responder
pilot proves out.
**Tier C — moderate egress (5-18 external):**
f1-stream (13 ext), openclaw (13 ext), woodpecker (5 ext), status-page (7 ext).
Need per-IP allowlist or domain-based selectors.
**Tier D — broad egress (do NOT enforce statically):**
`servarr` has 130+ external IPs because it runs BitTorrent peer-to-peer.
Static IP enforcement won't work; either leave it in Log+Allow mode permanently
or use a port-only allowlist (outbound TCP and UDP on 6881 plus random high ports).
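
The tier bands above can be expressed as a small classifier over the fan-out table. This is a sketch under stated assumptions: "internal" is taken to mean `pod-ns + svc`, the cut-off for Tier D (more than 18 external) is inferred rather than stated, and Tier A is treated as a hand-picked pilot, so it is not computed here. The sample rows are copied from the table.

```python
def rollout_tier(external, internal):
    """Map one namespace's fan-out counts to a rollout tier."""
    if external > 18:
        return "D"  # assumed cut-off; the source only says "broad egress"
    if 5 <= external <= 18:
        return "C"
    if external <= 3 and internal <= 5:
        return "B"
    return "unclassified"  # matches no stated band


# (pod-ns, svc, external) rows copied from the fan-out table
ROWS = {
    "affine": (2, 1, 0),
    "f1-stream": (2, 1, 13),
    "postiz": (8, 1, 0),
    "servarr": (2, 2, 130),
    "woodpecker": (2, 1, 5),
}

if __name__ == "__main__":
    for name, (pod_ns, svc, ext) in ROWS.items():
        print(name, rollout_tier(ext, pod_ns + svc))
```

Rows that fall between bands (for example 4 external destinations, or more than 5 internal) come back as `unclassified` and need a manual decision.
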
## Important caveats before flipping to enforce
1. **Observation horizon is too short.** Only ~6h of dense data and ~24h
total. CronJobs that run weekly, periodic Vault token rotations (7d),
external service maintenance windows, Keel auto-rollouts pulling new
image versions — all missed. Recommend collecting **at least 7 days**
before declaring an allowlist complete.
2. **`servarr`** is fundamentally incompatible with static enforce — keep
in Log+Allow (or explicit deny for known-bad CIDRs only).
3. **External IPs are dynamic.** Cloudflare-fronted services rotate IPs.
The recruiter-responder external IP `99.83.136.103` is one of Telegram's
API endpoints, and Telegram serves its API from a CIDR range rather than a
single address. Allowing single IPs will break when DNS resolves to a
different IP. Prefer one of: a DNS-based selector such as Calico's `domains:`
destination field, if the installed Calico edition supports it (DNS policy
has historically been documented as a Calico Enterprise/Cloud feature, so
confirm OSS support before relying on it); the full Cloudflare/AWS CIDR
range; or a per-app egress gateway.
4. **The observation may not have captured intra-namespace traffic.** The
Calico Log rule fires on egress from the workload endpoint, but traffic
between pods in the same namespace on the same node may bypass the filter
chain (this varies). Real-world testing is needed after each enforce flip.
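
Caveat 3 can be illustrated with the standard-library `ipaddress` module: a `/32` entry matches only the one observed address, while a wider range keeps matching when DNS returns a neighbouring address. The `/17` below is a hypothetical range chosen for the example, not Telegram's published allocation.

```python
import ipaddress


def is_allowed(ip, allowed_nets):
    """Return True if ip falls inside any of the allowed CIDR strings."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(net) for net in allowed_nets)


if __name__ == "__main__":
    single_ip = ["99.83.136.103/32"]
    wider = ["99.83.128.0/17"]  # hypothetical range, for illustration only
    print(is_allowed("99.83.136.103", single_ip))  # the observed address
    print(is_allowed("99.83.136.104", single_ip))  # a neighbouring address
    print(is_allowed("99.83.136.104", wider))
```
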
## Suggested next-session sequencing
1. **Continue observation for at least 7 days** before any enforce flip.
Compare data on 2026-05-29 vs today; if no new destinations show up,
the allowlist is stable.
2. **First enforce: recruiter-responder.** GNP with allowlist =
{kube-dns, telegram CIDR, vault svc, eso svc}. Watch for breakage.
3. **Tier B batch rollout** at 3-5 namespaces/day per Keel-style phased
rollout pattern (memory id=1972).
4. **Tier C requires per-namespace investigation** — what are those
external IPs? Map to known services first.
5. **servarr stays in Log+Allow** indefinitely (or migrate to dedicated
egress proxy).
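
The day-over-day comparison in step 1 reduces to a per-namespace set difference. A minimal sketch, assuming each snapshot is a mapping from namespace to a set of `(destination, protocol, port)` tuples; that shape is an assumption, since the output format of the analysis scripts is not shown here.

```python
def new_destinations(baseline, later):
    """Per namespace, destinations present in `later` but absent from `baseline`."""
    drift = {}
    for namespace, dests in later.items():
        added = set(dests) - set(baseline.get(namespace, ()))
        if added:
            drift[namespace] = added
    return drift


if __name__ == "__main__":
    day1 = {"recruiter-responder": {("kube-dns", "UDP", 53),
                                    ("99.83.136.103", "TCP", 443)}}
    day8 = {"recruiter-responder": {("kube-dns", "UDP", 53),
                                    ("99.83.136.103", "TCP", 443),
                                    ("vault", "TCP", 8200)}}  # hypothetical new flow
    print(new_destinations(day1, day8))
```

An empty result for a namespace across the full week is the stability signal described in step 1. Namespaces that first appear in the later snapshot are reported with all of their destinations.
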
## Source data location
- Loki LogQL: `{job="node-journal"} |~ "calico-packet"`
- Pod IP → namespace map at observation time saved at
`/tmp/pod-ip-map.txt` on the analysis host (ephemeral).
- Analysis scripts: `/tmp/analyze_flows2.py`, `/tmp/build_allowlist.py`.
- Tracked under beads `code-8ywc` (W1.7).
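
For reference, lines matched by the LogQL filter above could be reduced to flow tuples roughly as follows. This is not the content of `analyze_flows2.py`; it is a sketch that assumes the journal lines carry the usual netfilter LOG `KEY=value` fields (`SRC`, `DST`, `PROTO`, `DPT`), and the sample line is illustrative rather than copied from the cluster.

```python
import re
from collections import Counter

FIELD_RE = re.compile(r"\b(SRC|DST|PROTO|DPT)=(\S+)")


def parse_flow(line):
    """Extract (src, dst, proto, dport) from one log line, or None if it does not match."""
    if "calico-packet" not in line:
        return None
    fields = {}
    for key, value in FIELD_RE.findall(line):
        fields.setdefault(key, value)  # keep the first occurrence of each field
    if not {"SRC", "DST", "PROTO"} <= fields.keys():
        return None
    dport = int(fields["DPT"]) if "DPT" in fields else None  # ICMP has no port
    return fields["SRC"], fields["DST"], fields["PROTO"], dport


def count_flows(lines):
    """Count occurrences of each parsed flow tuple, skipping unparseable lines."""
    return Counter(flow for flow in map(parse_flow, lines) if flow is not None)


SAMPLE = ("kernel: calico-packet: IN=cali123 OUT=eth0 SRC=10.10.1.5 "
          "DST=99.83.136.103 LEN=60 TTL=63 PROTO=TCP SPT=40000 DPT=443 SYN")

if __name__ == "__main__":
    print(count_flows([SAMPLE, SAMPLE, "unrelated line"]))
```
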