infra/.planning/research/ARCHITECTURE.md

# Architecture Research

**Domain:** Live stream aggregation and proxy service (F1 streaming)
**Researched:** 2026-02-23
**Confidence:** MEDIUM — HLS spec and proxy mechanics are HIGH confidence from RFC 8216 and Apple docs; extractor patterns are MEDIUM confidence from yt-dlp/streamlink analysis; system composition for this specific use-case is inferred from domain knowledge.

---

## Standard Architecture

### System Overview

```
┌───────────────────────────────────────────────────────────────────┐
│                        CLIENT LAYER                               │
│  ┌──────────────────────────────────────────────────────────┐     │
│  │   Svelte Frontend (schedule view, stream picker, player)  │     │
│  └────────────────────────────┬─────────────────────────────┘     │
└───────────────────────────────│───────────────────────────────────┘
                                │ HTTP/REST
┌───────────────────────────────▼───────────────────────────────────┐
│                        API LAYER                                   │
│  ┌──────────────────────────────────────────────────────────┐     │
│  │   Backend API (schedule, streams, health state)           │     │
│  └────────┬──────────────────┬──────────────────────────────┘     │
└───────────│──────────────────│────────────────────────────────────┘
            │                  │
            ▼                  ▼
┌───────────────────┐  ┌──────────────────────────────────────────┐
│   SCHEDULE        │  │         EXTRACTION LAYER                  │
│   SUBSYSTEM       │  │  ┌───────────┐  ┌───────────┐            │
│                   │  │  │ Extractor │  │ Extractor │  ...        │
│ Jolpica/OpenF1    │  │  │ Site A    │  │ Site B    │            │
│ API client        │  │  └─────┬─────┘  └─────┬─────┘            │
│                   │  │        │               │                   │
│ Cron: refresh     │  │  ┌─────▼───────────────▼──────────────┐   │
│ schedule          │  │  │   Extractor Registry / Dispatcher   │   │
└───────────────────┘  │  └─────────────────────┬──────────────┘   │
                        │                        │                   │
                        │  ┌─────────────────────▼──────────────┐   │
                        │  │   Stream Health Checker             │   │
                        │  │   (HEAD/partial GET on .m3u8 URLs)  │   │
                        │  └─────────────────────────────────────┘   │
                        └──────────────────────────────────────────┘
                                          │
                                          ▼ valid stream URLs
                        ┌──────────────────────────────────────────┐
                        │         PROXY LAYER                       │
                        │                                           │
                        │  Master Playlist Rewriter                 │
                        │  ┌────────────────────────────────────┐   │
                        │  │ GET /proxy?url=<encoded-m3u8>       │   │
                        │  │  → fetch upstream m3u8              │   │
                        │  │  → rewrite all URIs to proxy paths  │   │
                        │  │  → return modified playlist         │   │
                        │  └────────────────────────────────────┘   │
                        │                                           │
                        │  Segment Relay                            │
                        │  ┌────────────────────────────────────┐   │
                        │  │ GET /relay?url=<encoded-segment>    │   │
                        │  │  → upstream fetch with headers      │   │
                        │  │  → pipe response to client          │   │
                        │  └────────────────────────────────────┘   │
                        └──────────────────────────────────────────┘
                                          │
                                          ▼ piped bytes
                        ┌──────────────────────────────────────────┐
                        │         STORAGE / CACHE                   │
                        │  ┌─────────────────┐  ┌───────────────┐  │
                        │  │ In-memory cache  │  │   NFS mount   │  │
                        │  │ (stream links,   │  │ (schedule     │  │
                        │  │  health status)  │  │  snapshots,   │  │
                        │  └─────────────────┘  │  config)      │  │
                        │                        └───────────────┘  │
                        └──────────────────────────────────────────┘
```

---

### Component Responsibilities

| Component | Responsibility | Typical Implementation |
|-----------|----------------|------------------------|
| **Svelte Frontend** | Schedule display, stream picker UI, embedded HLS player | SvelteKit app; hls.js or Video.js for player |
| **Backend API** | Serves schedule, current stream list, health status to frontend | Python (FastAPI) or Node.js; REST endpoints |
| **Schedule Subsystem** | Polls Jolpica/OpenF1 API, normalises session data, stores locally | Async background task with cron interval |
| **Extractor Registry** | Maps site hostnames to extractor implementations; dispatches extraction | Plain dict/map of site-key → extractor class |
| **Per-Site Extractor** | Performs HTTP requests with session cookies/CSRF, parses HTML/JS, follows redirect chains, returns raw stream URL | Python class per site; uses `httpx`/`requests` + `BeautifulSoup`/`regex` |
| **Stream Health Checker** | Verifies extracted URLs are live (partial GET on m3u8, checks HTTP 200 + content-type) | Background poller; marks streams up/down in cache |
| **Proxy / Playlist Rewriter** | Fetches upstream m3u8, rewrites all embedded URIs to go through the service (`/proxy` for nested playlists, `/relay` for segments), returns modified playlist | Stateless HTTP handler; no buffering of media data |
| **Segment Relay** | Fetches upstream segments (MPEG-TS `.ts` or fragmented MP4, commonly `.m4s`/`.mp4`) and pipes bytes to client; forwards necessary headers | Streaming HTTP proxy (not buffered); forwards Range, Content-Type |
| **In-Memory Cache** | Stores current stream states and health, avoids redundant extraction on every client request | Python dict with TTL, or Redis (existing cluster Redis) |
| **NFS Storage** | Persists schedule snapshots, extractor configuration, optional diagnostics | NFS at `10.0.10.15` via existing pattern |

---

## Recommended Project Structure

```
f1-streams/
├── backend/
│   ├── api/
│   │   ├── routes/
│   │   │   ├── schedule.py     # GET /schedule
│   │   │   ├── streams.py      # GET /streams, POST /streams/refresh
│   │   │   └── proxy.py        # GET /proxy, GET /relay
│   │   └── main.py             # FastAPI app, lifespan hooks
│   ├── extractors/
│   │   ├── base.py             # Extractor ABC: extract() -> list[StreamURL]
│   │   ├── registry.py         # Map site-key -> extractor class
│   │   ├── site_a.py           # Site-A specific extractor
│   │   └── site_b.py           # Site-B specific extractor
│   ├── schedule/
│   │   ├── client.py           # Jolpica/OpenF1 API client
│   │   ├── models.py           # Session, Race pydantic models
│   │   └── poller.py           # Background cron task
│   ├── health/
│   │   └── checker.py          # Stream liveness verification
│   ├── proxy/
│   │   ├── playlist.py         # m3u8 fetch + URI rewriting
│   │   └── relay.py            # Segment pipe-through handler
│   ├── cache.py                # In-memory store with TTL
│   └── config.py               # Site list, polling intervals, NFS paths
├── frontend/
│   ├── src/
│   │   ├── routes/
│   │   │   ├── +page.svelte    # Schedule home
│   │   │   └── watch/
│   │   │       └── +page.svelte # Stream picker + player
│   │   ├── lib/
│   │   │   ├── api.ts           # Backend API client
│   │   │   ├── player.ts        # hls.js wrapper
│   │   │   └── schedule.ts      # Session time formatting
│   │   └── app.html
│   ├── static/
│   └── package.json
├── stacks/
│   └── f1-streams/
│       ├── main.tf
│       └── terragrunt.hcl
└── Dockerfile                   # Multi-stage: backend + frontend
```

### Structure Rationale

- **backend/extractors/**: One file per site; base class enforces interface. Adding a new site = add one file + register it. No change to core.
- **backend/proxy/**: Isolated from extraction. Proxy only knows about URLs — it does not care how they were found.
- **backend/schedule/**: Completely independent subsystem. Can fail without breaking stream delivery.
- **backend/health/**: Decoupled checker; stores results in cache, consulted by API on `/streams` requests.
- **frontend/**: Standard SvelteKit layout. Minimal — schedule + player, nothing else.
- **stacks/f1-streams/**: Single Terragrunt stack following existing pattern in repo.

---

## Architectural Patterns

### Pattern 1: Extractor Plugin Interface

**What:** Each site extractor implements a fixed interface (`extract(session_hint) -> list[StreamURL]`). The registry maps site keys to extractor classes. The dispatcher iterates the registry, calls each extractor, aggregates results.

**When to use:** Always — the number of sites will grow and their anti-scraping measures change independently. Isolation prevents one broken extractor from affecting others.

**Trade-offs:** Slightly more boilerplate per site, but each extractor is testable in isolation and replaceable without touching shared code.

**Example:**
```python
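from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class SessionHint:
    # Placeholder field; the real hint shape is not defined yet
    session_name: str

@dataclass
class StreamURL:
    # Placeholder fields; extend with quality, label, etc. as needed
    url: str
    site_key: str

class BaseExtractor(ABC):
    site_key: str  # e.g. "siteA"

    @abstractmethod
    async def extract(self, hint: SessionHint | None = None) -> list[StreamURL]:
        """Return list of live stream URLs found on this site."""
        ...

class SiteAExtractor(BaseExtractor):
    site_key = "siteA"

    async def extract(self, hint: SessionHint | None = None) -> list[StreamURL]:
        # 1. GET page, parse CSRF token from HTML
        # 2. POST with token to get obfuscated JSON
        # 3. Decode JS-obfuscated URL
        # 4. Follow redirects to final .m3u8
        ...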
```
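
The registry and dispatcher described above could be sketched as follows. This is an illustration under assumptions, not the project's implementation: `REGISTRY`, `register`, `dispatch`, and the two toy extractors are hypothetical names, and stream URLs are reduced to plain strings to keep the block self-contained. The point it shows is the isolation property: extractors run concurrently, and one that raises contributes an empty list instead of failing the whole run.

```python
import asyncio

# Hypothetical registry: site key -> extractor class.
REGISTRY: dict[str, type] = {}


def register(cls):
    """Class decorator that adds an extractor to the registry under its site_key."""
    REGISTRY[cls.site_key] = cls
    return cls


@register
class WorkingExtractor:
    site_key = "siteA"

    async def extract(self, hint=None) -> list[str]:
        return ["https://cdn.example.com/live/master.m3u8"]


@register
class BrokenExtractor:
    site_key = "siteB"

    async def extract(self, hint=None) -> list[str]:
        raise RuntimeError("site layout changed")


async def dispatch(hint=None) -> dict[str, list[str]]:
    """Run every registered extractor concurrently; a failing site yields []."""
    keys = list(REGISTRY)
    results = await asyncio.gather(
        *(REGISTRY[key]().extract(hint) for key in keys),
        return_exceptions=True,  # one failing site must not cancel the others
    )
    return {
        key: ([] if isinstance(result, BaseException) else result)
        for key, result in zip(keys, results)
    }


found = asyncio.run(dispatch())
```

In a real dispatcher the swallowed exception would presumably be logged per site so that a broken extractor is noticed rather than silently returning nothing.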

### Pattern 2: Playlist Rewriting Proxy

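**What:** The proxy layer fetches the upstream m3u8 and rewrites every URI inside it so that it points back through the service. Master → variant pointers are rewritten to `/proxy?url=<base64-encoded-original>`, so that the variant playlist is itself rewritten when the player fetches it; variant → segment pointers are rewritten to `/relay?url=<base64-encoded-original>`. The client never contacts upstream directly.

**When to use:** Always when proxying HLS. The player follows the URLs in the playlist; if those URLs point to the origin CDN, the proxy is bypassed for segment delivery.

**Trade-offs:** Adds ~1 hop latency per segment request. For a private service with 1-5 users, this is negligible. Benefit: hides origin, enables header injection (e.g., `Referer`), unified player experience.

**Example:**
```python
import urllib.parse
from base64 import urlsafe_b64encode


def rewrite_playlist(m3u8_text: str, base_url: str, proxy_base: str) -> str:
    """Rewrite URI lines in an m3u8 to go through the proxy (tag attributes are not handled)."""
    lines = []
    is_variant = False  # True when the next URI line follows #EXT-X-STREAM-INF
    for line in m3u8_text.splitlines():
        if line and not line.startswith("#"):
            # Resolve relative URL against the playlist's own URL
            absolute = urllib.parse.urljoin(base_url, line.strip())
            # b64encode needs bytes; the URL-safe alphabet avoids "+" and "/"
            token = urlsafe_b64encode(absolute.encode("utf-8")).decode("ascii")
            # Variant playlists must be rewritten too, so they go back through /proxy
            endpoint = "proxy" if is_variant else "relay"
            lines.append(f"{proxy_base}/{endpoint}?url={token}")
            is_variant = False
        else:
            is_variant = is_variant or line.startswith("#EXT-X-STREAM-INF")
            lines.append(line)
    return "\n".join(lines)
```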
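
The example above handles URI lines only. RFC 8216 also allows URIs inside tag attributes, for example `#EXT-X-KEY` (decryption keys), `#EXT-X-MAP` (fMP4 init segments), and `#EXT-X-MEDIA` (alternative renditions). If those are left untouched the player fetches them from upstream directly. The sketch below shows one way to cover them, together with the decoding step the `/proxy` and `/relay` handlers would need. The helper names (`encode_url`, `decode_url`, `rewrite_tag_uris`) and the example hosts are assumptions for illustration.

```python
import re
import urllib.parse
from base64 import urlsafe_b64decode, urlsafe_b64encode

# Matches URI="..." attributes inside playlist tags.
_URI_ATTR = re.compile(r'URI="([^"]+)"')

# Tags whose URI attribute points at another playlist rather than at media or keys.
_PLAYLIST_TAGS = ("#EXT-X-MEDIA:", "#EXT-X-I-FRAME-STREAM-INF:")


def encode_url(url: str) -> str:
    """Encode an upstream URL into a URL-safe base64 token."""
    return urlsafe_b64encode(url.encode("utf-8")).decode("ascii")


def decode_url(token: str) -> str:
    """Inverse of encode_url, used by the /proxy and /relay handlers."""
    return urlsafe_b64decode(token.encode("ascii")).decode("utf-8")


def rewrite_tag_uris(tag_line: str, base_url: str, proxy_base: str) -> str:
    """Rewrite every URI="..." attribute in a tag line to point at the service."""
    endpoint = "proxy" if tag_line.startswith(_PLAYLIST_TAGS) else "relay"

    def _replace(match: re.Match) -> str:
        absolute = urllib.parse.urljoin(base_url, match.group(1))
        return f'URI="{proxy_base}/{endpoint}?url={encode_url(absolute)}"'

    return _URI_ATTR.sub(_replace, tag_line)


playlist_url = "https://cdn.example.com/live/index.m3u8"
service = "https://f1.example.internal"
key_line = rewrite_tag_uris('#EXT-X-KEY:METHOD=AES-128,URI="key.bin"', playlist_url, service)
media_line = rewrite_tag_uris('#EXT-X-MEDIA:TYPE=AUDIO,URI="audio/en.m3u8"', playlist_url, service)
```

Because the handlers accept an arbitrary encoded URL, a deployment would likely also want to restrict which upstream hosts may be fetched; that check is outside this sketch.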

### Pattern 3: Background Polling with In-Memory Cache

**What:** Extraction and health checking run as background tasks on a schedule (e.g., every 2 minutes). Results are stored in a shared in-memory dict with timestamps. The API layer reads from cache and returns immediately — no per-request extraction.

**When to use:** Always — on-demand extraction per client request would be slow (2-10s per site) and would hammer the source sites.

**Trade-offs:** Cache staleness window (default 2 min). Acceptable for live sports: streams stay stable once live.

**Example:**
```python
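# cache.py
import time
from typing import NamedTuple

class CachedResult(NamedTuple):
    data: list          # list of StreamURL
    stored_at: float    # time.monotonic() at write time

_stream_cache: dict[str, CachedResult] = {}
CACHE_TTL_SECONDS = 120  # matches the 2-minute polling interval

def cache_is_fresh() -> bool:
    entry = _stream_cache.get("streams")
    return entry is not None and time.monotonic() - entry.stored_at < CACHE_TTL_SECONDS

async def get_streams() -> list:
    if cache_is_fresh():
        return _stream_cache["streams"].data
    # else trigger background refresh
    ...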
```

---

## Data Flow

### Stream Discovery Flow (background)

```
[Cron trigger: every 2 min]
        ↓
[Extractor Registry]
        ↓ (fan-out, concurrent)
[SiteA Extractor]   [SiteB Extractor]   [SiteN Extractor]
        ↓
[Raw stream URLs: list of .m3u8 candidates]
        ↓
[Health Checker: partial GET each URL]
        ↓ (filter: only HTTP 200 + HLS content-type, e.g. application/vnd.apple.mpegurl)
[Validated stream URLs]
        ↓
[Cache: store with timestamp + site metadata]
```
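
The filter step in this flow can be expressed as a pure function over the response status, content type, and first bytes of the body, which keeps it testable without network access. The sketch below is one possible rule, not a fixed requirement: it accepts a `206` as well as a `200` (a ranged GET may return either), and it falls back to the `#EXTM3U` signature because some servers label playlists with a generic content type. `looks_like_live_playlist` is a hypothetical name. `application/vnd.apple.mpegurl` and `audio/mpegurl` are the types named in RFC 8216; `application/x-mpegurl` is a non-standard variant seen in practice.

```python
from typing import Optional

HLS_CONTENT_TYPES = {
    "application/vnd.apple.mpegurl",
    "audio/mpegurl",
    "application/x-mpegurl",  # non-standard, but common
}


def looks_like_live_playlist(
    status: int, content_type: Optional[str], body_prefix: bytes
) -> bool:
    """Decide whether a health-check response looks like an HLS playlist."""
    if status not in (200, 206):
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type in HLS_CONTENT_TYPES:
        return True
    # Fallback: every valid playlist starts with the #EXTM3U tag.
    return body_prefix.lstrip().startswith(b"#EXTM3U")
```

A stricter checker could additionally parse the media playlist and confirm that its media sequence number advances between two polls, which distinguishes a live stream from a stale playlist that is still being served.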

### Client Playback Flow (per request)

```
[User opens /watch in browser]
        ↓
[Frontend GET /api/streams]
        ↓
[Backend reads cache → returns stream list (site, quality, label)]
        ↓
[User picks a stream]
        ↓
[Player requests: GET /proxy?url=<m3u8-url>]
        ↓
[Backend: fetch upstream m3u8, rewrite URIs → return modified m3u8]
        ↓
[Player follows variant playlist: GET /proxy?url=<variant-m3u8>]
        ↓
[Backend: rewrite segment URIs]
        ↓
[Player fetches segments: GET /relay?url=<segment>]
        ↓
[Backend: upstream fetch, pipe bytes → client]
        ↓
[Video plays in browser]
```

### Schedule Flow

```
[Cron: daily or on-demand]
        ↓
[Schedule Client: GET Jolpica API /ergast/f1/current.json]
        ↓
[Parse: races, session types, UTC timestamps]
        ↓
[Normalise: map to internal Session model]
        ↓
[Store: NFS JSON file + in-memory cache]
        ↓
[Frontend GET /api/schedule → displays session list]
```
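
The parse and normalise steps could look like the following sketch. It assumes the Ergast-compatible response shape (`MRData` → `RaceTable` → `Races`, with per-session objects such as `Qualifying` carrying `date` and `time`); the exact field names should be checked against a real Jolpica response. The `Session` dataclass and the sample payload are invented for illustration and do not reflect a real calendar.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Session:
    round: int
    race_name: str
    kind: str            # "Race", "Qualifying", "FirstPractice", ...
    start_utc: datetime


# Session objects assumed to be nested inside each race entry.
_SESSION_KEYS = ("FirstPractice", "SecondPractice", "ThirdPractice", "Sprint", "Qualifying")


def _parse_utc(date: str, time: str) -> datetime:
    """Combine e.g. '2026-01-02' and '14:00:00Z' into an aware UTC datetime."""
    naive = datetime.strptime(f"{date}T{time}", "%Y-%m-%dT%H:%M:%SZ")
    return naive.replace(tzinfo=timezone.utc)


def normalise(payload: dict) -> list[Session]:
    """Flatten races and their sub-sessions into a time-ordered list."""
    sessions = []
    for race in payload["MRData"]["RaceTable"]["Races"]:
        entries = {"Race": race}
        entries.update({key: race[key] for key in _SESSION_KEYS if key in race})
        for kind, entry in entries.items():
            if "date" not in entry or "time" not in entry:
                continue  # skip sessions without a confirmed start time
            start = _parse_utc(entry["date"], entry["time"])
            sessions.append(Session(int(race["round"]), race["raceName"], kind, start))
    return sorted(sessions, key=lambda s: s.start_utc)


# Made-up sample payload, for illustration only.
sample = {"MRData": {"RaceTable": {"Races": [{
    "round": "1",
    "raceName": "Example Grand Prix",
    "date": "2026-01-02", "time": "14:00:00Z",
    "Qualifying": {"date": "2026-01-01", "time": "15:00:00Z"},
}]}}}
sessions = normalise(sample)
```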

### Key Data Flows

1. **Extraction → Cache → API → Frontend**: All stream data originates from extractors, flows through the cache as the single source of truth, and is served read-only to the frontend. No frontend-triggered extraction.
2. **Client → Proxy → Upstream CDN**: The proxy is a pure pass-through relay. It does not store segments. Bytes from upstream go directly to client socket.
3. **Schedule API → NFS**: Schedule data is written to NFS on refresh so the pod can serve it immediately on restart without waiting for the next API poll.
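
The third flow, persisting the schedule so that a restarted pod can serve it immediately, might be sketched as below. `save_snapshot` and `load_snapshot` are hypothetical helpers. The write goes to a temporary file in the same directory and is then renamed over the target, so a reader should not see a half-written file; whether rename is fully atomic on a particular NFS setup is worth verifying.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_snapshot(path: Path, data: dict) -> None:
    """Write JSON to a temp file next to the target, then rename it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
        os.replace(tmp_name, path)
    except BaseException:
        os.unlink(tmp_name)  # the rename did not happen, so the temp file remains
        raise


def load_snapshot(path: Path) -> Optional[dict]:
    """Return the stored snapshot, or None if it is missing or unreadable."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None


# Usage against a temporary directory standing in for the NFS mount.
with tempfile.TemporaryDirectory() as mount:
    snapshot_path = Path(mount) / "schedule" / "current.json"
    before = load_snapshot(snapshot_path)
    save_snapshot(snapshot_path, {"season": "2026", "sessions": []})
    after = load_snapshot(snapshot_path)
```

On startup the service would call `load_snapshot` first and fall back to an API poll only when it returns `None`.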

---

## Component Boundaries

| Component | Owns | Does Not Own |
|-----------|------|--------------|
| Extractor (per site) | How to get stream URL from that site | Health checking, caching, proxying |
| Health Checker | Liveness state of each URL | How the URL was found |
| Proxy / Relay | Rewriting m3u8 URIs, piping bytes | Authentication with upstream (that's extractor's job) |
| Schedule Subsystem | F1 session calendar data | Stream availability for a given session |
| Backend API | Serving current state to frontend | Fetching or refreshing state |
| Frontend | User interaction, player | Any backend logic |

---

## Suggested Build Order (Phase Dependencies)

Phases build on one another in order: each phase depends only on the phases listed above it being stable.

```
Phase 1: Schedule Subsystem
    ↓ (F1 data available)
Phase 2: Extractor Framework + First Site Extractor
    ↓ (raw URLs available)
Phase 3: Health Checker
    ↓ (validated URLs available)
Phase 4: Proxy / Relay Layer
    ↓ (streams playable through service)
Phase 5: Frontend (schedule + player)
    ↓ (end-to-end usable)
Phase 6: Additional Site Extractors
    ↓ (stream coverage widened)
Phase 7: K8s Deployment (Terraform/Terragrunt stack)
```

**Rationale:**
- Schedule first: gives a testable data source with zero anti-scraping complexity.
- Extractor framework before specific sites: the base class and registry must exist before any site can plug in.
- Health checker before proxy: no point proxying dead streams; the checker filters the list fed to the proxy.
- Proxy before frontend: the frontend player needs a working `/proxy` endpoint to function.
- Frontend last of core: all backend components are independently testable via curl/httpie before a UI exists.
- Additional extractors after core is working: adding more sites is low-risk incremental work once the pattern is proven.
- Deployment last: deploy once the service works end-to-end locally; avoids debugging infra and app simultaneously.

---

## Anti-Patterns

### Anti-Pattern 1: On-Demand Extraction Per Client Request

**What people do:** Trigger extraction when the user clicks "show streams" in the browser.

**Why it's wrong:** Extraction takes 2-10 seconds per site (HTTP round trips, JS parsing, redirect following). With multiple sites queried one after another, this adds up to 10-30 seconds of wall time; even with concurrent extraction the user still waits for the slowest site. Source sites may rate-limit aggressive bursts. Multiple concurrent users would multiply the load.

**Do this instead:** Run extraction on a background schedule. Cache results. The API returns immediately from cache. The user sees streams in <100ms.

### Anti-Pattern 2: Single Extractor Handles All Sites

**What people do:** One big function with `if site == "A": ... elif site == "B": ...` branches.

**Why it's wrong:** Sites change their obfuscation methods independently. A change to Site A's extraction logic can accidentally break Site B. Testing is impossible in isolation. Adding Site C requires modifying a shared file.

**Do this instead:** One class per site, implementing a common interface. Changes to Site A's extractor never touch Site B's code.

### Anti-Pattern 3: Buffering Segments in Memory Before Sending

**What people do:** Download the entire `.ts` segment to memory, then serve it to the client.

**Why it's wrong:** HLS segments can be 2-10 MB each. With multiple concurrent viewers, memory pressure grows quickly. Introduces unnecessary latency (client waits for full download before first byte).

**Do this instead:** Pipe bytes from the upstream response directly to the client socket as they arrive (chunked transfer). The client starts receiving immediately, memory stays flat.
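
A minimal sketch of the piping approach, using only `asyncio` streams so that it runs without a web framework. `pipe` and `RecordingWriter` are hypothetical names, and the 64 KiB chunk size is an arbitrary choice. Memory use is bounded by one chunk per connection, and `drain()` makes the relay wait when the client reads slowly.

```python
import asyncio


async def pipe(reader, writer, chunk_size: int = 64 * 1024) -> int:
    """Copy bytes from reader to writer one chunk at a time; return bytes copied."""
    total = 0
    while True:
        chunk = await reader.read(chunk_size)
        if not chunk:  # upstream closed the stream
            break
        writer.write(chunk)
        await writer.drain()  # backpressure: wait if the client is slow
        total += len(chunk)
    return total


class RecordingWriter:
    """Stand-in for a client connection that records the chunks it receives."""

    def __init__(self) -> None:
        self.chunks: list[bytes] = []

    def write(self, data: bytes) -> None:
        self.chunks.append(data)

    async def drain(self) -> None:
        return None


async def _demo() -> tuple[int, int]:
    reader = asyncio.StreamReader()
    reader.feed_data(b"x" * 150_000)  # pretend this is an upstream segment
    reader.feed_eof()
    writer = RecordingWriter()
    total = await pipe(reader, writer)
    return total, max(len(chunk) for chunk in writer.chunks)


total_bytes, largest_chunk = asyncio.run(_demo())
```

With FastAPI and `httpx`, the same idea is usually expressed by passing an async byte iterator from the upstream response to a `StreamingResponse`, rather than by handling reader and writer objects directly.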

### Anti-Pattern 4: Hardcoding Site URLs and Tokens in Extractor Logic

**What people do:** Hardcode `BASE_URL = "https://site-a.example.com"` and referer/cookie values inside the extractor file.

**Why it's wrong:** Sites change domains and anti-scraping parameters frequently. When a site moves, you have to find and edit code rather than config.

**Do this instead:** Extractor reads its config (base URL, required headers, any known static tokens) from a config object injected at construction. The registry passes config to extractors at instantiation.
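
One way to express this injection, as a sketch: `ExtractorConfig`, `build_extractors`, and the shape of the raw config mapping are assumptions for illustration, and the URL is the same placeholder used above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExtractorConfig:
    """Per-site settings that change more often than the extraction logic."""
    base_url: str
    headers: dict[str, str] = field(default_factory=dict)


class SiteAExtractor:
    site_key = "siteA"

    def __init__(self, config: ExtractorConfig) -> None:
        self.config = config

    def page_url(self, path: str) -> str:
        """Build a site URL from the configured base, never from a literal."""
        return f"{self.config.base_url.rstrip('/')}/{path.lstrip('/')}"


def build_extractors(registry: dict, raw_config: dict) -> dict:
    """Instantiate every registered extractor that has a config entry."""
    return {
        key: cls(ExtractorConfig(**raw_config[key]))
        for key, cls in registry.items()
        if key in raw_config
    }


# The raw mapping would normally be loaded from a config file.
raw = {
    "siteA": {
        "base_url": "https://site-a.example.com/",
        "headers": {"Referer": "https://site-a.example.com/"},
    }
}
extractors = build_extractors({"siteA": SiteAExtractor}, raw)
```

When a site moves domain, only the `base_url` entry changes. Sites without a config entry are skipped, which also gives a simple way to disable an extractor without a code change.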

---

## Integration Points

### External Services

| Service | Integration Pattern | Notes |
|---------|---------------------|-------|
| Jolpica F1 API (`api.jolpi.ca/ergast/f1/`) | REST GET, poll daily | No API key required; backwards-compatible Ergast endpoints; schedule data available |
| OpenF1 API (`api.openf1.org/`) | REST GET, poll as needed | No API key; 3 req/s rate limit; 2023+ data only; useful for session status (live/upcoming) |
| Upstream streaming sites (Site A, B, N) | HTTP GET/POST with session cookies, CSRF tokens | Per-site; no shared pattern; treated as black boxes by the framework |
| Upstream CDN (HLS segments) | HTTP GET with Range support | Proxy relays bytes; must forward `Referer` and sometimes `Origin` headers or CDN rejects |
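
The header handling noted for the upstream CDN could be sketched as follows. `build_upstream_headers` is a hypothetical helper, and the allow-list of forwarded client headers is an assumption: only `Range` is passed through here, while the `Referer` and `Origin` values come from the stream's own metadata and not from the browser.

```python
from typing import Mapping, Optional

# Client request headers that are safe and useful to pass upstream.
_FORWARDED = {"range": "Range"}


def build_upstream_headers(
    client_headers: Mapping[str, str],
    referer: str,
    origin: Optional[str] = None,
) -> dict[str, str]:
    """Build the header set for an upstream segment or playlist request."""
    headers: dict[str, str] = {}
    for name, value in client_headers.items():
        canonical = _FORWARDED.get(name.lower())
        if canonical is not None:
            headers[canonical] = value
    # Injected values the upstream CDN may check; never taken from the client.
    headers["Referer"] = referer
    if origin:
        headers["Origin"] = origin
    return headers


upstream_headers = build_upstream_headers(
    {"range": "bytes=0-1023", "Cookie": "session=abc"},
    referer="https://site-a.example.com/",
)
```

Dropping everything outside the allow-list keeps the viewer's own cookies and headers from leaking to the upstream site.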

### Internal Boundaries

| Boundary | Communication | Notes |
|----------|---------------|-------|
| Extractor → Cache | Indirect, via the dispatcher (single write) | Extractors do not call the cache directly; the dispatcher aggregates their results, then writes once |
| API → Cache | Direct read | Synchronous, O(1) |
| API → Proxy | Not direct — frontend calls `/proxy` endpoint, which is part of the same backend process | Can be split into separate service later if needed |
| Proxy → Upstream CDN | Outbound HTTP | Must preserve session headers; upstream CDN may check Referer/Origin |
| Schedule Poller → NFS | File write (JSON) | On pod restart, reads NFS before first API poll |

---

## Scaling Considerations

This is a single-user or small-group private service. Scaling is not a primary concern, but here are the natural pressure points:

| Scale | Architecture Adjustments |
|-------|--------------------------|
| 1-5 concurrent viewers | Single backend pod, in-memory cache, direct pipe relay — fully sufficient |
| 10-20 concurrent viewers | Same architecture; segment relay becomes the bandwidth bottleneck (each viewer streams independently) — add HLS caching proxy (nginx) in front of relay |
| 50+ concurrent viewers | Segment relay load increases linearly; consider a CDN or caching layer for segments; extraction/health remain unchanged |

### Scaling Priorities

1. **First bottleneck:** Outbound bandwidth on segment relay. Each viewer pulls full bitrate independently through the service. At private-use scale this is negligible (1-5 viewers).
2. **Second bottleneck:** Cache divergence if multiple pods are deployed: each pod holds its own in-process cache, so different pods can return different stream lists. Solved by using the existing cluster Redis instead of an in-process dict, but unnecessary until horizontal scaling.

---

## Sources

- HLS specification: RFC 8216 (IETF) — playlist structure, master/media playlist relationship, segment mechanics (HIGH confidence)
- HLS proxy pattern: Apple Developer Documentation (conceptual), corroborated by yt-dlp extractor framework analysis (MEDIUM confidence)
- yt-dlp plugin architecture: github.com/yt-dlp/yt-dlp README + docs (MEDIUM confidence)
- OpenF1 API: openf1.org official page — endpoints, rate limits, data coverage (HIGH confidence)
- Jolpica F1 API: github.com/jolpica/jolpica-f1 — Ergast compatibility, availability (MEDIUM confidence)
- System composition for this domain: inference from domain patterns, corroborated by extractor tool analysis (MEDIUM confidence)
