

# Wave 1 W1.6/W1.7 — Egress Observation Snapshot (2026-05-22)
First analysis pass over the Calico GNP `wave1-egress-observe-tier34` data
captured in Loki via `{job="node-journal"} |~ "calico-packet"`.

**Data scope:** ~10000 flow log lines pulled from Loki over ~6h and ~24h windows.
Loki caps each query at 5000 records, so the longer window is a capped sample
rather than a complete capture.

**Coverage:** 36 source namespaces observed making egress (out of 82 selected
by `tier in {3-edge, 4-aux}`). Namespaces missing from the data are either idle,
scaled to 0, or producing only intra-namespace traffic (the Calico Log rule
fires on from-workload egress, but intra-namespace traffic is not captured;
see caveat 4 below).
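
One way to reduce the effect of the 5000-record cap is to split a long window into shorter sub-windows and query each separately. The sketch below only builds the per-window parameters for Loki's `query_range` endpoint (`query`, `start`, `end`, `limit`, `direction`); it makes no network calls. The helper names and the one-hour step are assumptions for illustration, not part of the original analysis.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterator, Tuple

LOKI_LIMIT = 5000  # per-query record cap noted in this document
LOGQL = '{job="node-journal"} |~ "calico-packet"'


def split_window(
    start: datetime, end: datetime, step: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (start, end) sub-windows that cover the full range."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield cursor, upper
        cursor = upper


def to_ns(ts: datetime) -> int:
    """Unix epoch in nanoseconds (sub-second precision is dropped)."""
    return int(ts.timestamp()) * 10**9


def query_params(start: datetime, end: datetime) -> Dict[str, object]:
    """Build query_range parameters for one sub-window."""
    return {
        "query": LOGQL,
        "start": to_ns(start),
        "end": to_ns(end),
        "limit": LOKI_LIMIT,
        "direction": "forward",
    }


if __name__ == "__main__":
    day_start = datetime(2026, 5, 22, tzinfo=timezone.utc)
    for lo, hi in split_window(day_start, day_start + timedelta(hours=24), timedelta(hours=1)):
        print(query_params(lo, hi))
```

A sub-window that still returns exactly 5000 records is probably truncated and would need a smaller step.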
## Egress fan-out per namespace
| Namespace | dests | pod-ns | svc | external |
|---|---:|---:|---:|---:|
| affine | 3 | 2 | 1 | 0 |
| beads-server | 4 | 3 | 1 | 0 |
| cyberchef | 2 | 1 | 1 | 0 |
| dawarich | 3 | 2 | 1 | 0 |
| default | 1 | 0 | 0 | 1 |
| ebooks | 3 | 2 | 1 | 0 |
| f1-stream | 16 | 2 | 1 | 13 |
| forgejo | 2 | 1 | 1 | 0 |
| hackmd | 2 | 1 | 1 | 0 |
| homepage | 2 | 1 | 1 | 0 |
| isponsorblocktv | 2 | 0 | 1 | 1 |
| jsoncrack | 2 | 1 | 1 | 0 |
| kms | 2 | 1 | 1 | 0 |
| mailserver | 2 | 0 | 1 | 1 |
| meshcentral | 2 | 2 | 0 | 0 |
| n8n | 2 | 1 | 1 | 0 |
| nextcloud | 5 | 2 | 1 | 2 |
| onlyoffice | 2 | 1 | 1 | 0 |
| openclaw | 18 | 4 | 1 | 13 |
| paperless-ngx | 3 | 2 | 1 | 0 |
| phpipam | 3 | 2 | 1 | 0 |
| poison-fountain | 2 | 1 | 1 | 0 |
| postiz | 9 | 8 | 1 | 0 |
| realestate-crawler | 2 | 1 | 1 | 0 |
| recruiter-responder | 2 | 0 | 1 | 1 |
| rybbit | 2 | 1 | 1 | 0 |
| send | 2 | 1 | 1 | 0 |
| servarr | 134 | 2 | 2 | 130 |
| speedtest | 2 | 1 | 1 | 0 |
| status-page | 10 | 2 | 1 | 7 |
| tandoor | 2 | 1 | 1 | 0 |
| technitium | 5 | 2 | 1 | 2 |
| trading-bot | 5 | 2 | 1 | 2 |
| url | 2 | 1 | 1 | 0 |
| website | 2 | 1 | 1 | 0 |
| woodpecker | 8 | 2 | 1 | 5 |
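
The table above can be reproduced in outline with a small counting pass. This is a minimal sketch, not the original `/tmp/analyze_flows2.py`. It assumes the `calico-packet` lines follow the usual netfilter LOG layout (`SRC=`, `DST=`, `PROTO=`, `DPT=` fields), that `pod-ns` counts distinct destination namespaces, and that `svc` and `external` count distinct destination IPs. The `pod_ns` and `svc_ips` inputs are hypothetical lookups built from cluster state.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set

FIELD_RE = re.compile(r"\b(SRC|DST|PROTO|DPT)=(\S+)")


def parse_flow(line: str) -> Optional[Dict[str, str]]:
    """Extract SRC/DST/PROTO/DPT from one log line, or None if incomplete."""
    fields = dict(FIELD_RE.findall(line))
    if "SRC" not in fields or "DST" not in fields:
        return None
    return fields


def fan_out(
    lines: Iterable[str], pod_ns: Dict[str, str], svc_ips: Set[str]
) -> Dict[str, Dict[str, int]]:
    """Count distinct destinations per source namespace, split by kind."""
    seen = defaultdict(lambda: {"pod-ns": set(), "svc": set(), "external": set()})
    for line in lines:
        flow = parse_flow(line)
        if flow is None:
            continue
        src_ns = pod_ns.get(flow["SRC"])
        if src_ns is None:
            continue  # source is not a known workload pod
        dst = flow["DST"]
        if dst in svc_ips:
            seen[src_ns]["svc"].add(dst)
        elif dst in pod_ns:
            seen[src_ns]["pod-ns"].add(pod_ns[dst])
        else:
            seen[src_ns]["external"].add(dst)
    result = {}
    for ns, kinds in seen.items():
        counts = {kind: len(values) for kind, values in kinds.items()}
        counts["dests"] = sum(counts.values())
        result[ns] = counts
    return result
```

With these definitions `dests` is always the sum of the other three columns, which matches every row of the table.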
## Common patterns
**Common baseline** (shared by most observed namespaces):
- `kube-system/kube-dns` UDP/53 for DNS resolution (nearly universal)
- Often `dbaas` TCP/3306 (MySQL) or TCP/5432 (Postgres)
- Often `redis` TCP/6379

**Per-namespace specifics** (the part that varies):
- External HTTPS to specific IPs (CDNs, APIs)
- Internal pod-to-pod for service-specific clients
## W1.7 rollout candidates (sorted by simplicity)
**Tier A - trivial egress (recommend first wave):**
`recruiter-responder` has the simplest profile of all observed:
- `kube-system/kube-dns` :53/UDP
- `99.83.136.103` :443/TCP (Telegram API)

That's it. Two destinations. Perfect first enforce candidate.
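
As a rough illustration, the two-destination profile could map onto a Calico `GlobalNetworkPolicy` like the one built below. This is a sketch under several assumptions: the policy name is hypothetical, the DNS pods are assumed to carry the common `k8s-app == "kube-dns"` label, and the `projectcalico.org/name` namespace label is used for selection. It covers only the two observed destinations, so anything else (for example Vault or ESO traffic that the short observation window missed) would be denied once the policy applies.

```python
import json
from typing import Dict, List


def egress_allowlist_policy(namespace: str, external_nets: List[str]) -> Dict:
    """Build an egress-only GlobalNetworkPolicy manifest as a plain dict."""
    return {
        "apiVersion": "projectcalico.org/v3",
        "kind": "GlobalNetworkPolicy",
        # Hypothetical name; choose one that fits the existing wave1 naming.
        "metadata": {"name": f"wave1-egress-enforce-{namespace}"},
        "spec": {
            "namespaceSelector": f'projectcalico.org/name == "{namespace}"',
            "types": ["Egress"],
            "egress": [
                {   # DNS to kube-dns (label is an assumption about the cluster)
                    "action": "Allow",
                    "protocol": "UDP",
                    "destination": {
                        "namespaceSelector": 'projectcalico.org/name == "kube-system"',
                        "selector": 'k8s-app == "kube-dns"',
                        "ports": [53],
                    },
                },
                {   # HTTPS to the observed external destination(s)
                    "action": "Allow",
                    "protocol": "TCP",
                    "destination": {"nets": external_nets, "ports": [443]},
                },
            ],
        },
    }


if __name__ == "__main__":
    policy = egress_allowlist_policy("recruiter-responder", ["99.83.136.103/32"])
    print(json.dumps(policy, indent=2))
```

The JSON output can be converted to YAML or applied directly, since Kubernetes-style tooling generally accepts JSON manifests. How the policy orders relative to the existing observe policy is not addressed here and would need checking.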
**Tier B - small egress (≤3 external + ≤5 internal, 28 namespaces):**
affine, beads-server, cyberchef, dawarich, ebooks, forgejo, hackmd, homepage,
isponsorblocktv, jsoncrack, kms, mailserver, meshcentral, n8n, nextcloud,
onlyoffice, paperless-ngx, phpipam, poison-fountain, realestate-crawler,
rybbit, send, speedtest, tandoor, technitium, trading-bot, url, website.
These can be enforced in batches of 3-5/day after the recruiter-responder
pilot proves out.
**Tier C - moderate egress (5-18 external):**
f1-stream (13 ext), openclaw (13 ext), woodpecker (5 ext), status-page (7 ext).
These need a per-IP allowlist or domain-based selectors.
**Tier D - broad egress (do NOT enforce statically):**
`servarr` has 130+ external IPs because it runs BitTorrent peer-to-peer.
Static IP enforcement won't work; either leave it in Log+Allow mode permanently
or use a port-only allowlist (TCP+UDP 6881 plus random high ports outbound).
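
The tier boundaries above can be written down as a small classifier over the table columns. This is a sketch, assuming "internal" means `pod-ns + svc` and that anything above 18 external destinations counts as Tier D; both are readings of the tier descriptions, not stated rules. Tier A is a hand-picked pilot rather than a threshold, so the function never returns it, and profiles that fit no tier return `None`.

```python
from typing import Optional


def classify_tier(pod_ns: int, svc: int, external: int) -> Optional[str]:
    """Map one table row to tier B, C or D, or None if no tier fits."""
    internal = pod_ns + svc  # assumption: "internal" = pod-ns + svc
    if external > 18:        # assumption: beyond Tier C's upper bound
        return "D"
    if external >= 5:
        return "C"
    if external <= 3 and internal <= 5:
        return "B"
    return None


if __name__ == "__main__":
    rows = {
        "servarr": (2, 2, 130),
        "f1-stream": (2, 1, 13),
        "woodpecker": (2, 1, 5),
        "affine": (2, 1, 0),
        "postiz": (8, 1, 0),
    }
    for name, counts in rows.items():
        print(name, classify_tier(*counts))
```

Under these assumptions `postiz` (9 internal destinations) falls outside every tier, which suggests it needs a manual decision.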
## Important caveats before flipping to enforce
1. **Observation horizon is too short.** Only ~6h of dense data and ~24h
total. CronJobs that run weekly, periodic Vault token rotations (7d),
external service maintenance windows, Keel auto-rollouts pulling new
image versions — all missed. Recommend collecting **at least 7 days**
before declaring an allowlist complete.
2. **`servarr`** is fundamentally incompatible with static enforce — keep
in Log+Allow (or explicit deny for known-bad CIDRs only).
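3. **External IPs are dynamic.** Cloudflare-fronted services rotate IPs.
The recruiter-responder external IP `99.83.136.103` is one of Telegram's
API endpoints, and Telegram serves its API from a CIDR range rather than a
single address. Allowing single IPs will break when DNS resolves to a
different IP. Prefer a domain-based egress rule (Calico's `domains:`
selector) if the installed Calico edition supports it. Confirm this before
relying on it: the `dns_policy_resolver` mechanism noted during analysis is
unverified, and domain-based rules have generally been a Calico
Enterprise/Cloud feature rather than part of Calico OSS. The alternatives
are to allow the full Cloudflare/AWS CIDR range or to use a per-app egress
gateway.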
4. **The observation didn't capture intra-namespace traffic** by design.
The Calico Log rule fires on egress from the workload endpoint, but
same-namespace pod-to-pod traffic on the same node may bypass the
filter chain (behaviour varies). Real-world testing is needed after the
enforce flip.
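
Caveat 3 can be checked mechanically: given the observed external IPs and a candidate allowlist, list the IPs the allowlist would not cover. The sketch below uses only the standard-library `ipaddress` module. The `/23` range in the example is a hypothetical placeholder, not Telegram's published range.

```python
import ipaddress
from typing import Iterable, List


def uncovered(observed_ips: Iterable[str], allowed_cidrs: Iterable[str]) -> List[str]:
    """Return observed IPs that fall outside every allowed CIDR."""
    nets = [ipaddress.ip_network(cidr) for cidr in allowed_cidrs]
    return [
        ip
        for ip in observed_ips
        if not any(ipaddress.ip_address(ip) in net for net in nets)
    ]


if __name__ == "__main__":
    observed = ["99.83.136.103", "99.83.137.5"]
    # A single-IP allowlist misses the second address.
    print(uncovered(observed, ["99.83.136.103/32"]))
    # A hypothetical wider range (placeholder only) covers both.
    print(uncovered(observed, ["99.83.136.0/23"]))
```

Running this against each new observation window gives an early signal that a single-IP allowlist has gone stale.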
## Suggested next-session sequencing
1. **Continue observation for at least 7 days** before any enforce flip.
Compare data on 2026-05-29 vs today; if no new destinations show up,
the allowlist is stable.
2. **First enforce: recruiter-responder.** GNP with allowlist =
{kube-dns, telegram CIDR, vault svc, eso svc}. Watch for breakage.
3. **Tier B batch rollout** at 3-5 namespaces/day per Keel-style phased
rollout pattern (memory id=1972).
4. **Tier C requires per-namespace investigation** — what are those
external IPs? Map to known services first.
5. **servarr stays in Log+Allow** indefinitely (or migrate to dedicated
egress proxy).
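
The stability check in step 1 amounts to a set difference between two snapshots. A minimal sketch follows, assuming each snapshot maps a namespace to a set of `(destination, protocol, port)` tuples; that structure is hypothetical and would need to match whatever the analysis scripts actually emit.

```python
from typing import Dict, Set, Tuple

Dest = Tuple[str, str, int]          # (destination, protocol, port)
Snapshot = Dict[str, Set[Dest]]      # namespace -> observed destinations


def new_destinations(baseline: Snapshot, later: Snapshot) -> Snapshot:
    """Return destinations present in `later` but absent from `baseline`."""
    added: Snapshot = {}
    for namespace, dests in later.items():
        extra = dests - baseline.get(namespace, set())
        if extra:
            added[namespace] = extra
    return added


if __name__ == "__main__":
    day0 = {"recruiter-responder": {("kube-system/kube-dns", "UDP", 53),
                                    ("99.83.136.103", "TCP", 443)}}
    day7 = {"recruiter-responder": {("kube-system/kube-dns", "UDP", 53),
                                    ("99.83.136.103", "TCP", 443),
                                    ("vault", "TCP", 8200)}}
    print(new_destinations(day0, day7))
```

An empty result after the 7-day window supports treating the allowlist as stable; any non-empty entry names the namespaces that need another look before enforcing. The `vault` entry in the example is illustrative only.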
## Source data location
- Loki LogQL: `{job="node-journal"} |~ "calico-packet"`
- Pod IP → namespace map at observation time saved at
`/tmp/pod-ip-map.txt` on the analysis host (ephemeral).
- Analysis scripts: `/tmp/analyze_flows2.py`, `/tmp/build_allowlist.py`.
- Tracked under beads `code-8ywc` (W1.7).