fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]

6d224861 came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00 · 2026-06-09 08:45:33 +00:00 · fd0f4a0365
commit fd0f4a0365
parent 6d224861c4
1166 changed files with 358546 additions and 0 deletions
--- a/.planning/PROJECT.md
+++ b/.planning/PROJECT.md
@ -0,0 +1,66 @@
+# F1 Streaming Service
+
+## What This Is
+
+A private F1 streaming aggregation service that auto-scrapes specific streaming sites, extracts actual video source URLs through custom per-site extractors (bypassing obfuscation, CSRF, and redirect chains), and proxies the streams through a unified Svelte web app. Deployed on the existing K8s cluster.
+
+## Core Value
+
+When an F1 session is live, users open one URL and immediately see working streams — no hunting for links across sketchy sites.
+
+## Requirements
+
+### Validated
+
+- ✓ Kubernetes cluster with ingress, NFS storage, monitoring — existing
+- ✓ Cloudflare DNS and TLS — existing
+- ✓ CI/CD pipeline (Woodpecker) — existing
+- ✓ Terraform/Terragrunt deployment pattern — existing
+
+### Active
+
+- [ ] Auto-scrape configured streaming sites for live F1 stream links
+- [ ] Custom per-site extractors to bypass obfuscation (CSRF tokens, JS rendering, redirect chains) and extract final video source URLs
+- [ ] Stream health checks — verify extracted streams are actually live and working before displaying
+- [ ] Stream proxying/relay through the service for unified playback
+- [ ] Auto-pull F1 race schedule from official data (Ergast/OpenF1 API)
+- [ ] Cover all F1 sessions: FP1-3, Qualifying, Sprint, Race, pre/post shows, press conferences
+- [ ] Svelte web app with schedule view, stream picker, and embedded video player
+- [ ] Deploy as a service on the existing K8s cluster
+
+### Out of Scope
+
+- User authentication — security by obscurity (private URL, not publicly discoverable)
+- Community features (chat, comments, voting) — just streams
+- DVR/recording — live viewing only
+- Mobile app — web-only
+- Official F1TV integration — unofficial re-streams only
+
+## Context
+
+- Stream sites have anti-scraping protections: CSRF tokens, JavaScript-rendered pages, obfuscated video URLs, redirect chains
+- Custom extractors per site are preferred over headless browser for efficiency and reliability
+- User will provide the specific sites to scrape — not a discovery/search problem
+- F1 calendar data available via Ergast API (ergast.com/mrd/) and OpenF1 API
+- HLS (m3u8) is the most common stream format on these sites
+- Existing infra supports Svelte apps (user's preferred frontend framework)
+
+## Constraints
+
+- **Frontend**: Svelte — user preference for all new web apps
+- **Deployment**: K8s cluster via Terraform/Terragrunt stack pattern
+- **Storage**: NFS at 10.0.10.15 for any persistent data
+- **No auth**: Rely on non-discoverable URL, no Authentik integration needed
+- **Extractors**: Custom per-site logic, no headless browser dependency
+
+## Key Decisions
+
+| Decision | Rationale | Outcome |
+|----------|-----------|---------|
+| Custom per-site extractors over headless browser | More efficient, reliable, and lighter on resources | — Pending |
+| No authentication | Private community, security by obscurity sufficient | — Pending |
+| Proxy streams through service | Unified player experience, hides source from end users | — Pending |
+| All sessions coverage | Users want full weekend + extras, not just race day | — Pending |
+
+---
+*Last updated: 2026-02-23 after initialization*
--- a/.planning/REQUIREMENTS.md
+++ b/.planning/REQUIREMENTS.md
@ -0,0 +1,106 @@
+# Requirements: F1 Streaming Service
+
+**Defined:** 2026-02-23
+**Core Value:** When an F1 session is live, users open one URL and immediately see working streams — no hunting for links.
+
+## v1 Requirements
+
+Requirements for initial release. Each maps to roadmap phases.
+
+### Schedule
+
+- [ ] **SCHED-01**: System auto-pulls F1 race calendar with all official sessions (FP1-3, Qualifying, Sprint, Race) from OpenF1/Jolpica API
+
+### Extraction
+
+- [ ] **EXTR-01**: Extractor framework with plugin-per-site pattern — each site is an independent extractor class
+- [ ] **EXTR-02**: Extractors bypass site protections (CSRF tokens, redirect chains, JS-computed URLs) to get final HLS/m3u8 source URLs
+- [ ] **EXTR-03**: Background polling scrapes configured sites periodically, caches results in-memory
+- [ ] **EXTR-04**: Auto-refresh expired CDN tokens mid-stream without interrupting playback
+- [ ] **EXTR-05**: Fallback ordering across multiple sources — rank by reliability, try next on failure
+
+### Proxy
+
+- [ ] **PRXY-01**: HLS proxy with full m3u8 URL rewriting at all playlist levels (master → variant → segments)
+- [ ] **PRXY-02**: CORS headers on all proxy endpoints for browser playback
+- [ ] **PRXY-03**: Chunked segment relay — stream bytes through, never buffer full segments in memory
+- [ ] **PRXY-04**: Quality selection — expose available stream variants, let users pick quality
+- [ ] **PRXY-05**: CDN token refresh loop to keep streams alive during 2+ hour sessions
+
+### Health
+
+- [ ] **HLTH-01**: Pre-display verification — check extracted streams are live and playable before showing to users
+- [ ] **HLTH-02**: Dead stream marking — tag broken/offline streams so users don't click them
+- [ ] **HLTH-03**: Quality metrics — track bitrate, buffering ratio, and latency per active stream
+
+### Frontend
+
+- [ ] **FRNT-01**: Stream picker — display available streams per live session, user selects one
+- [ ] **FRNT-02**: Embedded HLS player using hls.js for in-browser playback
+- [ ] **FRNT-03**: Multi-stream layout — watch multiple streams side by side (e.g., race feed + onboard camera)
+
+### Deployment
+
+- [ ] **DEPL-01**: K8s deployment via Terragrunt stack following existing infra patterns
+- [ ] **DEPL-02**: NFS storage for persistent data (schedule cache, extractor config)
+
+## v2 Requirements
+
+Deferred to future release. Tracked but not in current roadmap.
+
+### Schedule
+
+- **SCHED-02**: Session countdown timer and live/upcoming/past status indicators
+- **SCHED-03**: Pre/post shows, press conferences in schedule (requires per-site detection)
+
+### Frontend
+
+- **FRNT-04**: Live timing overlay with sector times and positions
+
+## Out of Scope
+
+Explicitly excluded. Documented to prevent scope creep.
+
+| Feature | Reason |
+|---------|--------|
+| User authentication | Security by obscurity, private URL |
+| Community features (chat, comments) | Just streams, not a social platform |
+| DVR/recording | Live viewing only |
+| Mobile app | Web-only |
+| Official F1TV integration | Unofficial re-streams only |
+| Headless browser extraction | Custom per-site extractors are lighter and more reliable |
+
+## Traceability
+
+Which phases cover which requirements. Updated during roadmap creation.
+
+| Requirement | Phase | Status |
+|-------------|-------|--------|
+| SCHED-01 | Phase 2 | Pending |
+| EXTR-01 | Phase 3 | Pending |
+| EXTR-02 | Phase 3 | Pending |
+| EXTR-03 | Phase 3 | Pending |
+| EXTR-04 | Phase 6 | Pending |
+| EXTR-05 | Phase 4 | Pending |
+| PRXY-01 | Phase 5 | Pending |
+| PRXY-02 | Phase 5 | Pending |
+| PRXY-03 | Phase 5 | Pending |
+| PRXY-04 | Phase 6 | Pending |
+| PRXY-05 | Phase 6 | Pending |
+| HLTH-01 | Phase 4 | Pending |
+| HLTH-02 | Phase 4 | Pending |
+| HLTH-03 | Phase 4 | Pending |
+| FRNT-01 | Phase 7 | Pending |
+| FRNT-02 | Phase 7 | Pending |
+| FRNT-03 | Phase 8 | Pending |
+| DEPL-01 | Phase 1 | Pending |
+| DEPL-02 | Phase 1 | Pending |
+
+**Coverage:**
+- v1 requirements: 19 total
+- Mapped to phases: 19
+- Unmapped: 0
+
+---
+*Requirements defined: 2026-02-23*
+*Last updated: 2026-02-23 after roadmap creation — all 19 requirements mapped*
--- a/.planning/ROADMAP.md
+++ b/.planning/ROADMAP.md
@ -0,0 +1,138 @@
+# Roadmap: F1 Streaming Service
+
+## Overview
+
+Build a private F1 stream aggregation service from the ground up: first the Kubernetes
+deployment stack, then the F1 schedule subsystem, then the per-site extraction pipeline,
+then health checking and fallback ordering, then the HLS proxy and relay layer, then
+CDN token lifecycle management, and finally the Svelte frontend. Each phase delivers
+a verifiable, independently testable capability that the next phase depends on. The
+system is complete when a user opens one URL during a live F1 session and immediately
+sees working, proxied streams with a functioning embedded player.
+
+## Phases
+
+**Phase Numbering:**
+- Integer phases (1, 2, 3): Planned milestone work
+- Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED)
+
+Decimal phases appear between their surrounding integers in numeric order.
+
+- [ ] **Phase 1: Infrastructure and Deployment** - Terragrunt stack on K8s with NFS storage — service exists on the cluster
+- [ ] **Phase 2: F1 Schedule Subsystem** - Pull, persist, and serve the F1 race calendar from OpenF1/Jolpica API
+- [ ] **Phase 3: Extractor Framework and First Site** - Plugin registry, BaseExtractor interface, first working site extractor with background polling
+- [ ] **Phase 4: Stream Health and Fallback** - Pre-display health verification, dead stream marking, quality metrics, and fallback ordering
+- [ ] **Phase 5: HLS Proxy Core** - CORS-transparent m3u8 proxy with full URI rewriting and chunked segment relay
+- [ ] **Phase 6: CDN Token Lifecycle and Quality** - Token refresh loops for long-running sessions and quality variant selection
+- [ ] **Phase 7: Frontend Core — Schedule, Picker, and Player** - SvelteKit app with schedule view, stream picker, and embedded hls.js player
+- [ ] **Phase 8: Multi-Stream Layout** - Side-by-side stream viewing for watching multiple feeds simultaneously
+
+## Phase Details
+
+### Phase 1: Infrastructure and Deployment
+**Goal**: The F1 service exists on the Kubernetes cluster, is reachable at its domain, and has NFS storage mounted — ready to run application code.
+**Depends on**: Nothing (first phase)
+**Requirements**: DEPL-01, DEPL-02
+**Success Criteria** (what must be TRUE):
+  1. A request to the service's public URL returns a non-error HTTP response from the cluster
+  2. The Terragrunt stack applies cleanly from a fresh checkout with no manual cluster intervention
+  3. The NFS volume is mounted inside the running pod and a file written to it survives a pod restart
+  4. Woodpecker CI pipeline exists and triggers on push to the service's directory
+**Plans**: 2 plans
+Plans:
+- [ ] 01-01-PLAN.md — Create FastAPI backend app, Dockerfile, and build/push Docker image
+- [ ] 01-02-PLAN.md — Update Terraform deployment, apply stack, verify NFS, add CI pipeline
+
+### Phase 2: F1 Schedule Subsystem
+**Goal**: The system automatically fetches the full F1 race calendar and serves it as structured data — users can see all sessions for the current season with correct times.
+**Depends on**: Phase 1
+**Requirements**: SCHED-01
+**Success Criteria** (what must be TRUE):
+  1. The `/schedule` API endpoint returns all races for the current season with session types (FP1-3, Qualifying, Sprint, Race) and UTC-correct timestamps
+  2. Schedule data persists to NFS and is served correctly after a pod restart without re-fetching the API
+  3. APScheduler triggers a background refresh of schedule data at least once daily without manual intervention
+  4. A race that has already occurred shows a "past" status and an upcoming race shows "upcoming" status
+**Plans**: TBD
+
+### Phase 3: Extractor Framework and First Site
+**Goal**: The extractor plugin system is in place and at least one site extractor returns a valid, live HLS URL — proving the end-to-end extraction architecture.
+**Depends on**: Phase 2
+**Requirements**: EXTR-01, EXTR-02, EXTR-03
+**Success Criteria** (what must be TRUE):
+  1. The extractor registry lists all registered site extractors and dispatches to the correct one by site key
+  2. The first site extractor returns a working m3u8 URL that plays when pasted into VLC, including passing any CSRF or token requirements
+  3. Background polling runs automatically on the APScheduler, re-extracts streams at a configured interval, and caches results in Redis with a TTL
+  4. Adding a second extractor requires only creating a new class file and registering it — no changes to the dispatcher or other extractors
+  5. Extractor failures are logged with enough detail to identify exactly which step failed (request, token parse, URL extraction)
+**Plans**: TBD
+
+### Phase 4: Stream Health and Fallback
+**Goal**: Only verified-live streams reach users, broken streams are flagged, and when multiple sources exist the system automatically tries the next one on failure.
+**Depends on**: Phase 3
+**Requirements**: HLTH-01, HLTH-02, HLTH-03, EXTR-05
+**Success Criteria** (what must be TRUE):
+  1. The `/streams` API endpoint only returns streams that have passed a HEAD/partial-GET liveness check within the last health-check interval
+  2. A stream that returns a non-200 or empty playlist is marked as dead and excluded from the API response without manual intervention
+  3. The `/streams` response includes bitrate and liveness metadata per stream so the frontend can display stream quality
+  4. When configured with multiple sources for the same session, the API returns them in reliability-ranked order (most recently verified first)
+**Plans**: TBD
+
+### Phase 5: HLS Proxy Core
+**Goal**: The proxy layer converts raw CDN HLS URLs into browser-playable same-origin URLs with full CORS support — a stream URL from the extractor can be played in any browser via the proxy.
+**Depends on**: Phase 4
+**Requirements**: PRXY-01, PRXY-02, PRXY-03
+**Success Criteria** (what must be TRUE):
+  1. Fetching `/proxy?url=<master-m3u8>` returns an m3u8 where every URI at every level (master, variant, segment) points back through the `/relay` endpoint — zero requests escape to the original CDN domain
+  2. A browser playing a proxied stream completes all preflight CORS checks without errors, including the `Range` header
+  3. Segment relay streams bytes to the browser as chunked transfer with no full-segment buffering — peak memory per active stream stays under 5 MB
+  4. The proxy correctly handles both master playlists (multi-variant) and media playlists (single-variant) without special-casing at the caller
+**Plans**: TBD
+
+### Phase 6: CDN Token Lifecycle and Quality
+**Goal**: Streams stay alive for full 2+ hour F1 sessions without user intervention, and users can select video quality when multiple variants are available.
+**Depends on**: Phase 5
+**Requirements**: EXTR-04, PRXY-04, PRXY-05
+**Success Criteria** (what must be TRUE):
+  1. A stream that has been playing for 90 minutes continues without interruption — the background token refresh loop re-extracts and updates the cached URL before the CDN token expires
+  2. The `/streams` response exposes available quality variants (resolution labels) for streams that provide multi-variant playlists
+  3. Selecting a different quality variant via the API returns a proxied URL for that specific variant stream
+  4. Token refresh failures are logged and surface in stream health status without crashing the relay or affecting other active streams
+**Plans**: TBD
+
+### Phase 7: Frontend Core — Schedule, Picker, and Player
+**Goal**: Users can open the service in a browser, see the F1 session schedule, pick a live stream from the available sources, and watch it in an embedded player on the same page.
+**Depends on**: Phase 6
+**Requirements**: FRNT-01, FRNT-02
+**Success Criteria** (what must be TRUE):
+  1. The schedule page lists all upcoming and past sessions grouped by Grand Prix, with correct local-timezone display and live/upcoming/past badges
+  2. Clicking a live session shows a stream picker with available sources labeled by site name and liveness status
+  3. Selecting a stream loads and begins playing it in the embedded hls.js player without leaving the page
+  4. The player recovers from transient network errors automatically and displays a clear error message only on unrecoverable failure
+  5. The app is usable on a desktop browser without requiring any browser extension or plugin
+**Plans**: TBD
+
+### Phase 8: Multi-Stream Layout
+**Goal**: Users can watch multiple streams side by side simultaneously — for example, the main race feed alongside a specific driver onboard camera.
+**Depends on**: Phase 7
+**Requirements**: FRNT-03
+**Success Criteria** (what must be TRUE):
+  1. The user can add a second stream to the view and both play simultaneously in a split-screen layout without audio or video interference between streams
+  2. The layout adapts gracefully when two streams are loaded — each player gets equal visible area and independent controls
+  3. Removing one stream from the multi-stream view does not interrupt the other stream
+**Plans**: TBD
+
+## Progress
+
+**Execution Order:**
+Phases execute in numeric order: 1 → 2 → 3 → 4 → 5 → 6 → 7 → 8
+
+| Phase | Plans Complete | Status | Completed |
+|-------|----------------|--------|-----------|
+| 1. Infrastructure and Deployment | 0/2 | Planning complete | - |
+| 2. F1 Schedule Subsystem | 0/TBD | Not started | - |
+| 3. Extractor Framework and First Site | 0/TBD | Not started | - |
+| 4. Stream Health and Fallback | 0/TBD | Not started | - |
+| 5. HLS Proxy Core | 0/TBD | Not started | - |
+| 6. CDN Token Lifecycle and Quality | 0/TBD | Not started | - |
+| 7. Frontend Core — Schedule, Picker, and Player | 0/TBD | Not started | - |
+| 8. Multi-Stream Layout | 0/TBD | Not started | - |
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@ -0,0 +1,65 @@
+# Project State
+
+## Project Reference
+
+See: .planning/PROJECT.md (updated 2026-02-23)
+
+**Core value:** When an F1 session is live, users open one URL and immediately see working streams — no hunting for links.
+**Current focus:** All 8 phases complete — deployed and verified
+
+## Current Position
+
+Phase: 8 of 8 (Multi-Stream Layout) — COMPLETE
+Status: Deployed and verified at https://f1.viktorbarzin.me
+Last activity: 2026-02-24 — All phases deployed, frontend routing fixed, full verification passed
+
+Progress: [██████████] 100%
+
+## Phase Completion Summary
+
+| Phase | Name | Status | Image |
+|-------|------|--------|-------|
+| 1 | Infrastructure & Deployment | Complete | v2.0.1 |
+| 2 | F1 Schedule Subsystem | Complete | v2.0.3 |
+| 3 | Extractor Framework | Complete | v3.0.0 |
+| 4 | Stream Health Checker | Complete | v5.0.0 |
+| 5 | HLS Proxy & Relay | Complete | v5.0.0 |
+| 6 | CDN Token Lifecycle | Complete | v5.0.0 |
+| 7 | SvelteKit Frontend | Complete | v5.0.0 |
+| 8 | Multi-Stream Layout | Complete | v5.0.0 |
+
+## Verified Endpoints
+
+- `/health` — 200 OK
+- `/` — 200 (SvelteKit schedule page)
+- `/watch` — 200 (multi-stream player)
+- `/schedule` — 200 (24 races, 2026 season)
+- `/streams` — 200 (3 demo streams)
+- `/extractors` — 200
+- `/streams/active` — 200
+- `/proxy?url=...` — 200 (HLS m3u8 rewriting)
+- `/relay?url=...` — streaming (chunked segment relay)
+
+## Accumulated Context
+
+### Decisions
+
+- Custom per-site extractors over headless browser
+- No authentication — security by obscurity
+- Proxy streams through service for unified player
+- APScheduler in-process (no Celery)
+- Kaniko for in-cluster Docker builds (Docker Desktop unavailable)
+- v5.0.0 tag to bypass pull-through cache (10.0.20.10 caches stale :latest)
+- Catch-all FastAPI route for SvelteKit SPA (adapter-static generates {page}.html, not {page}/index.html)
+
+### Known Issues
+
+- Pull-through cache at 10.0.20.10 caches Docker tags aggressively — must use new tags to deploy updates
+- Only demo extractor exists — real streaming site extractors need to be built
+- Woodpecker CI webhook may not be configured for f1-stream builds
+
+## Session Continuity
+
+Last session: 2026-02-24
+Stopped at: All 8 phases deployed and verified
+Next steps: Add real streaming site extractors (Phase 3 expansion)
--- a/.planning/codebase/ARCHITECTURE.md
+++ b/.planning/codebase/ARCHITECTURE.md
@ -0,0 +1,165 @@
+# Architecture
+
+**Analysis Date:** 2026-02-23
+
+## Pattern Overview
+
+**Overall:** Terragrunt-based IaC with per-service state isolation, using Kubernetes as the primary platform and Proxmox for VM infrastructure.
+
+**Key Characteristics:**
+- Monorepo containing ~70 service stacks with independent state files
+- Declarative, GitOps-driven infrastructure using Terraform + Terragrunt
+- DRY provider/backend configuration via root `terragrunt.hcl`
+- Clear layering: platform (core/cluster services) → application stacks → shared modules
+- Service decoupling with explicit dependencies via `dependency` blocks
+- Resource governance through Kubernetes tier system (0-core through 4-aux)
+
+## Layers
+
+**Platform Layer (`stacks/platform/main.tf`):**
+- Purpose: Core infrastructure services that enable all application stacks (22 modules)
+- Location: `stacks/platform/`
+- Contains: MetalLB, DBaaS, Redis, Traefik, Technitium DNS, Headscale VPN, Authentik SSO, RBAC, CrowdSec, Prometheus/Grafana/Loki monitoring, nginx reverse proxy, mailserver, GPU node configuration, Kyverno policy engine
+- Depends on: Kubernetes cluster (declared via `stacks/infra` dependency), External secrets in `terraform.tfvars`
+- Used by: All application stacks declare `dependency "platform"` to ensure platform is applied first
+
+**Infrastructure Layer (`stacks/infra/main.tf`):**
+- Purpose: VM template provisioning and Proxmox resource management
+- Location: `stacks/infra/`
+- Contains: K8s node templates via cloud-init, docker-registry VM, Proxmox VM lifecycle
+- Depends on: Proxmox API credentials
+- Used by: Platform stack depends on it to ensure infrastructure is ready
+
+**Application Stacks (~70 services):**
+- Purpose: User-facing and supplementary services (Nextcloud, Immich, Matrix, Ollama, etc.)
+- Location: `stacks/<service>/main.tf` (102 total stacks)
+- Contains: Kubernetes namespaces, Helm releases, raw Kubernetes resources (Deployments, StatefulSets, Services, PersistentVolumes)
+- Depends on: Platform stack, shared TLS secret via `modules/kubernetes/setup_tls_secret`, optional NFS volumes
+- Used by: Self-contained; declared dependencies control execution order
+
+**Shared Modules:**
+- **Kubernetes utilities** (`modules/kubernetes/`):
+  - `ingress_factory/`: Reusable Traefik ingress + service template with anti-AI scraping, CrowdSec integration, rate limiting, authentication support
+  - `setup_tls_secret/`: TLS certificate secret setup in namespaces
+- **Terraform modules** (`modules/`):
+  - `create-template-vm/`: Ubuntu cloud-init template VM provisioning (K8s and non-K8s variants)
+  - `create-vm/`: VM instance creation from templates
+  - `docker-registry/`: Docker registry pull-through cache configuration
+
+## Data Flow
+
+**Infrastructure Provisioning Flow:**
+
+1. **Initialize**: Root `terragrunt.hcl` loads `terraform.tfvars` globally, generates provider/backend configs
+2. **Infra Stack Apply**: `stacks/infra/` creates/updates Proxmox VMs and Kubernetes node templates
+3. **Platform Apply**: `stacks/platform/` applies all ~22 core services (depends on infra stack)
+4. **Service Apply**: Individual `stacks/<service>/` apply their resources (depend on platform stack)
+
+Example dependency chain for Nextcloud:
+```
+stacks/infra/main.tf (VMs)
+  ↓ (dependency)
+stacks/platform/main.tf (Traefik, Redis, DBaaS, etc.)
+  ↓ (dependency)
+stacks/nextcloud/main.tf (Nextcloud Helm chart + storage)
+```
+
+**State Management:**
+- Each stack has isolated state at `state/stacks/<service>/terraform.tfstate`
+- Root `terragrunt.hcl` defines local backend: `path = "${get_repo_root()}/state/${path_relative_to_include()}/terraform.tfstate"`
+- Variables flow from `terraform.tfvars` → each stack's `terraform` block → Terraform execution
+- Variables a stack does not declare are ignored without failing the run (Terraform 1.x only warns about undeclared values in a var file)
+
+**Configuration Flow:**
+1. User edits `terraform.tfvars` (encrypted via git-crypt)
+2. Each stack includes root terragrunt config: `include "root" { path = find_in_parent_folders() }`
+3. Root config injects `terraform.tfvars` as `required_var_files`
+4. Stack-specific `main.tf` declares which variables it uses
+
+## Key Abstractions
+
+**Tier System:**
+- Purpose: Resource governance via Kubernetes PriorityClasses, LimitRanges, ResourceQuotas
+- Tiers: `0-core` (critical: ingress, DNS, auth) → `4-aux` (optional workloads)
+- Applied via: Kyverno policy engine in `stacks/platform/modules/kyverno/`
+- Usage: Every namespace/pod gets labeled with tier; Kyverno generates corresponding LimitRange + ResourceQuota
+
+**Service Factory Pattern:**
+- Purpose: Multi-tenant/multi-instance services (Actual Budget, Freedify)
+- Pattern: Parent stack (`stacks/<service>/main.tf`) creates namespace + TLS secret, then calls `factory/` module multiple times
+- Examples: `stacks/actualbudget/main.tf` calls `factory/` for viktor, anca, emo instances
+- Each instance: Separate pod, service, NFS share, Cloudflare DNS entry
+
+**Ingress Factory (`modules/kubernetes/ingress_factory/`):**
+- Purpose: DRY, opinionated Traefik ingress pattern with security defaults
+- Variables: `name`, `namespace`, `port`, `host`, `protected`, `anti_ai_scraping` (default true)
+- Provides: Service, Ingress, CrowdSec exemptions, rate limiting, Authentik ForwardAuth integration, anti-AI middleware
+- Anti-AI layers: Bot blocking → X-Robots-Tag → Trap links → Tarpit → Poison content cache
+
+**NFS Volume Pattern:**
+- Purpose: Persistent storage for stateful services
+- Pattern: Inline NFS volumes in pod specs (preferred over PV/PVC)
+- Server: `10.0.10.15` (TrueNAS)
+- Paths: `/mnt/main/<service>` or `/mnt/main/<service>/<instance>`
+- Used by: ~60 services; registered in `secrets/nfs_directories.txt` (git-crypt encrypted)
+
+## Entry Points
+
+**Terragrunt Root (`terragrunt.hcl`):**
+- Location: `/Users/viktorbarzin/code/infra/terragrunt.hcl`
+- Triggers: `cd stacks/<service> && terragrunt plan/apply --non-interactive`
+- Responsibilities: Load providers, backend, `terraform.tfvars`, set kube config path
+
+**Platform Stack (`stacks/platform/main.tf`):**
+- Location: `stacks/platform/main.tf` (1000+ lines)
+- Triggers: Applied before any service stack to ensure platform services exist
+- Responsibilities: 22 module instantiations, tier definition, variable collection from tfvars
+
+**Service Stacks (`stacks/<service>/main.tf`):**
+- Location: `stacks/<service>/main.tf` (27–456 lines, avg ~130)
+- Triggers: `terragrunt apply --non-interactive` in service directory
+- Responsibilities: Create namespace, setup TLS, instantiate Helm charts or raw K8s resources, configure storage
+
+**Proxmox/Infra Stack (`stacks/infra/main.tf`):**
+- Location: `stacks/infra/main.tf` (200+ lines)
+- Triggers: Applied first to ensure VM infrastructure is available
+- Responsibilities: VM template creation, VM instance lifecycle, containerd mirror config
+
+**Factory Module (`stacks/<service>/factory/main.tf`):**
+- Location: `stacks/actualbudget/factory/main.tf`, `stacks/freedify/factory/main.tf`
+- Triggers: Called multiple times from parent `main.tf` with different `name` parameter
+- Responsibilities: Single-instance deployment (pod, service, NFS share, ingress)
+
+## Error Handling
+
+**Strategy:** Declarative state reconciliation (Terraform/Kubernetes watch loop). No imperative error recovery.
+
+**Patterns:**
+- **Helm deployments**: `atomic = true` for rollback on failure
+- **Terraform apply**: `--non-interactive` to prevent hanging on prompts
+- **Cloud-init VM provisioning**: Embedded error logging in scripts; check `/var/log/cloud-init-output.log` on VM
+- **Dependencies**: Explicit `dependency` blocks prevent applying child before parent
+- **Validation**: `terraform plan` executed by CI before apply
+- **Secrets**: git-crypt ensures state and secrets are encrypted when checked into the repo, preventing accidental plaintext commits
+
+## Cross-Cutting Concerns
+
+**Logging:** Loki + Alloy (DaemonSet collects container logs) configured in `stacks/platform/modules/monitoring/`
+
+**Validation:**
+- Terraform validation: `terraform validate` in CI/CD pipeline
+- HCL formatting: `terraform fmt -recursive`
+- Kyverno policies: Enforce resource requests, tier labels, pod security standards
+
+**Authentication:**
+- **Kubernetes API**: OIDC via Authentik (issuer: `https://authentik.viktorbarzin.me/application/o/kubernetes/`)
+- **Traefik Ingress**: Authentik ForwardAuth when `protected = true` in ingress_factory
+- **TLS**: Shared secret injected into all namespaces via `setup_tls_secret` module
+
+**Rate Limiting:** Traefik middleware `default-rate-limit` (applied by ingress_factory unless `skip_default_rate_limit = true`)
+
+**Anti-AI Scraping:** 5-layer defense (bot blocking → headers → trap links → tarpit → poison content) applied via `anti_ai_scraping = true` (default) in ingress_factory; disable per-service with `anti_ai_scraping = false`
+
+---
+
+*Architecture analysis: 2026-02-23*
--- a/.planning/codebase/CONCERNS.md
+++ b/.planning/codebase/CONCERNS.md
@ -0,0 +1,244 @@
+# Codebase Concerns
+
+**Analysis Date:** 2026-02-23
+
+## Tech Debt
+
+**MySQL Backup Rotation Not Implemented:**
+- Issue: Backup rotation logic exists (comment at `stacks/platform/modules/dbaas/main.tf:196`) but is incomplete. Backup size noted as 11MB, rotation deferred.
+- Files: `stacks/platform/modules/dbaas/main.tf` (lines 196-206)
+- Impact: Backup directory could grow unbounded; no automated retention policy enforced. Manual cleanup required.
+- Fix approach: Implement full rotation schedule using `find -mtime +N` or migrate to external backup solution (Velero, pgbackrest). Set up CronJob with proper retention (e.g., 14-day backups).
+
+**PostgreSQL Major Version Upgrade Blocked:**
+- Issue: Comment at `stacks/platform/modules/dbaas/main.tf:718` indicates that moving to PostgreSQL 17.2 requires running `pg_upgrade` against the data directory, which is not implemented.
+- Files: `stacks/platform/modules/dbaas/main.tf` (line 718)
+- Impact: Cannot upgrade PostgreSQL from 16-master to 17.2. When an upgrade is needed, it requires a manual `pg_upgrade` procedure with high downtime risk.
+- Fix approach: Implement pg_upgrade CronJob or StatefulSet init container that performs in-place upgrade. Test migration path with backup first.
+
+**TP-Link Gateway Reverse Proxy Not Functional:**
+- Issue: Reverse proxy module for TP-Link gateway marked as "Not working yet" at `stacks/platform/modules/reverse_proxy/main.tf:91`.
+- Files: `stacks/platform/modules/reverse_proxy/main.tf` (lines 91-95)
+- Impact: Gateway access over HTTPS or HTTP routing non-functional. Unknown scope of impact on dependent services.
+- Fix approach: Either complete reverse proxy implementation (Traefik/Nginx config) or document why it's disabled. Clarify if gateway is still accessible via HTTP or LAN-only.
+
+**WireGuard Firewall Rules Incomplete:**
+- Issue: Client firewall restrictions not written at `terraform.tfvars:430`. Only placeholder exists.
+- Files: `terraform.tfvars` (lines 430-434)
+- Impact: No network isolation between VPN clients and cluster-internal services (10.32.0.0/12). All connected clients can access cluster APIs if firewall rules are not enforced at kernel level.
+- Fix approach: Define explicit iptables rules for each client role (e.g., "allow DNS only", "deny cluster access"). Test with `iptables -L -v`. Consider VPN network segmentation if multiple trust levels exist.
+
+## Known Bugs & Issues
+
+**Immich Database Compatibility Mismatch:**
+- Symptoms: Custom PostgreSQL image version mismatch between Immich PostgreSQL pod and dbaas PostgreSQL. Immich uses `ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0`, while dbaas PostgreSQL is 16-master with PostGIS/PgVector mix.
+- Files: `stacks/immich/main.tf` (lines 76-77, 276), `stacks/platform/modules/dbaas/main.tf` (line 717)
+- Trigger: If Immich database is migrated to shared dbaas PostgreSQL, extension version incompatibility will cause failures.
+- Workaround: Keep Immich on isolated PostgreSQL 15 with Immich-specific extensions. If consolidation needed, test extension compatibility first.
+
+**Realestate-Crawler Latest Image Tag Ignores Updates:**
+- Symptoms: `realestate-crawler-ui` uses `image = "viktorbarzin/immoweb:latest"` with `lifecycle { ignore_changes = [spec[0].template[0].spec[0].container[0].image] }`.
+- Files: `stacks/real-estate-crawler/main.tf` (lines 64, 79-82)
+- Trigger: New versions of `immoweb:latest` will never be deployed. Terraform ignores image updates; manual image pull/push required.
+- Workaround: Use Diun annotations to track image updates. Consider using version-pinned tags instead of `:latest`. Remove `ignore_changes` if auto-updates desired.
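+
+A minimal sketch of the version-pinned alternative, assuming the image publishes versioned tags. The tag `1.4.2`, the namespace, and the labels are hypothetical placeholders, not values from the repository:
+
+```hcl
+# Hypothetical sketch: tag, namespace, and labels are placeholders.
+resource "kubernetes_deployment" "realestate_crawler_ui" {
+  metadata {
+    name      = "realestate-crawler-ui"
+    namespace = "realestate-crawler"
+  }
+
+  spec {
+    replicas = 1
+
+    selector {
+      match_labels = { app = "realestate-crawler-ui" }
+    }
+
+    template {
+      metadata {
+        labels = { app = "realestate-crawler-ui" }
+      }
+
+      spec {
+        container {
+          name = "realestate-crawler-ui"
+          # Pinned tag: editing this value is what triggers a rollout.
+          image = "viktorbarzin/immoweb:1.4.2"
+        }
+      }
+    }
+  }
+
+  # No lifecycle.ignore_changes on the image, so Terraform applies tag bumps.
+}
+```
+
+With the tag pinned and `ignore_changes` removed, an image upgrade becomes an ordinary reviewed change to `main.tf`.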
+
+## Security Considerations
+
+**OpenClaw Has Cluster-Admin Permissions:**
+- Risk: OpenClaw ServiceAccount granted unrestricted `cluster-admin` ClusterRoleBinding at `stacks/openclaw/main.tf:41-54`.
+- Files: `stacks/openclaw/main.tf` (lines 34-55)
+- Current mitigation: None at the application layer. `dangerouslyDisableDeviceAuth = true` in config (line 89) disables device auth, so protection relies entirely on network access control.
+- Recommendations:
+  - Scope OpenClaw RBAC to specific namespaces/resources needed for skill execution (e.g., `get/list/watch pods`, `exec into pods`, `apply resources in specific namespaces`).
+  - Re-enable device auth or implement mTLS between OpenClaw and operators.
+  - Audit OpenClaw logs for unauthorized API calls (enable API server audit logs).
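+
+One way to scope the permissions is a namespaced Role bound to the OpenClaw ServiceAccount instead of `cluster-admin`. The sketch below is illustrative only: the namespace, ServiceAccount name, and the exact verbs are assumptions and would need to match what OpenClaw skills actually require.
+
+```hcl
+# Hypothetical sketch: namespace, names, and verbs are assumptions.
+resource "kubernetes_role" "openclaw_skills" {
+  metadata {
+    name      = "openclaw-skills"
+    namespace = "openclaw"
+  }
+
+  # Read-only access to pods and their logs.
+  rule {
+    api_groups = [""]
+    resources  = ["pods", "pods/log"]
+    verbs      = ["get", "list", "watch"]
+  }
+
+  # Exec into pods is granted separately so it can be removed on its own.
+  rule {
+    api_groups = [""]
+    resources  = ["pods/exec"]
+    verbs      = ["create"]
+  }
+}
+
+resource "kubernetes_role_binding" "openclaw_skills" {
+  metadata {
+    name      = "openclaw-skills"
+    namespace = "openclaw"
+  }
+
+  role_ref {
+    api_group = "rbac.authorization.k8s.io"
+    kind      = "Role"
+    name      = kubernetes_role.openclaw_skills.metadata[0].name
+  }
+
+  subject {
+    kind      = "ServiceAccount"
+    name      = "openclaw"
+    namespace = "openclaw"
+  }
+}
+```
+
+Repeating the Role and RoleBinding per target namespace keeps the blast radius explicit, at the cost of more Terraform resources.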
+
+**Git-Crypt Key Mounted as ConfigMap:**
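+- Risk: git-crypt key at `stacks/openclaw/main.tf:68-76` stored as plain-text ConfigMap data. ConfigMaps are not intended for confidential data: any subject with read access to ConfigMaps in the namespace can read the key, and `kubectl describe` prints the value in plain text.
+- Files: `stacks/openclaw/main.tf` (lines 68-76)
+- Current mitigation: None; the key gets none of the protections typically applied to Secrets (tighter RBAC, optional encryption at rest).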
+- Recommendations:
+  - Move git-crypt key to Kubernetes Secret instead of ConfigMap.
+  - Add RBAC policy restricting secret read to openclaw namespace only.
+  - Consider external secret management (Authentik-backed secret injection, Sealed Secrets).
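+
+A rough sketch of the first two recommendations, assuming the key is supplied base64-encoded through a sensitive variable. The variable name, Secret name, and namespace are hypothetical:
+
+```hcl
+# Hypothetical sketch: variable, Secret name, and namespace are assumptions.
+variable "openclaw_git_crypt_key_b64" {
+  type      = string
+  sensitive = true # keeps the value out of plan/apply output
+}
+
+resource "kubernetes_secret" "git_crypt_key" {
+  metadata {
+    name      = "git-crypt-key"
+    namespace = "openclaw"
+  }
+
+  type = "Opaque"
+
+  # binary_data expects base64-encoded content, which suits a binary key file.
+  binary_data = {
+    "git-crypt.key" = var.openclaw_git_crypt_key_b64
+  }
+}
+
+# Allows reading only this one Secret, not every Secret in the namespace.
+resource "kubernetes_role" "git_crypt_key_reader" {
+  metadata {
+    name      = "git-crypt-key-reader"
+    namespace = "openclaw"
+  }
+
+  rule {
+    api_groups     = [""]
+    resources      = ["secrets"]
+    resource_names = [kubernetes_secret.git_crypt_key.metadata[0].name]
+    verbs          = ["get"]
+  }
+}
+```
+
+The Role still needs a RoleBinding to the consuming ServiceAccount; the value also remains in Terraform state, so state storage must be protected accordingly.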
+
+**SSH Private Key Stored as Secret:**
+- Risk: SSH private key for OpenClaw stored at `stacks/openclaw/main.tf:57-66` as unencrypted Secret. Readable by any pod with secret access.
+- Files: `stacks/openclaw/main.tf` (lines 57-66)
+- Current mitigation: Secret only readable by openclaw namespace (if RBAC enforced); encryption at rest not confirmed.
+- Recommendations:
+  - Rotate SSH key regularly; consider using ed25519 keys (shorter, stronger).
+  - Audit Secret access via Kubernetes audit logs.
+  - Use external secret store (HashiCorp Vault, Bitwarden) instead of native Secrets.
+
+**WireGuard VPN Clients Unrestricted:**
+- Risk: VPN clients can reach all cluster-internal services (10.32.0.0/12) unless firewall rules defined. No per-client segmentation.
+- Files: `terraform.tfvars` (lines 430-434)
+- Current mitigation: Attempted iptables rules commented out; not enforced.
+- Recommendations:
+  - Define explicit client restrictions in WireGuard firewall script (uncomment/complete lines 433-434).
+  - Implement deny-by-default firewall (drop all, then allow specific routes).
+  - Consider separate WireGuard interfaces for different trust levels (admin vs. guest).
+
+**Multiple `:latest` Image Tags in Production:**
+- Risk: 17 services use `:latest` tags (e.g., `nextcloud`, `kms`, `calibre`, `speedtest`, `rybbit`, `wealthfolio`, `cyberchef`, `coturn`, `immich-frame`, `health`, others).
+- Files: Multiple stacks (identified via grep for `:latest`).
+- Current mitigation: Diun annotations track updates but don't auto-pull; `:latest` is a mutable tag, so the image version actually running is not recorded in code.
+- Recommendations:
+  - Pin all production images to specific semantic versions (e.g., `ghcr.io/foo/bar:v1.2.3`, not `:latest`).
+  - Use Diun to track new releases and trigger automated testing in staging.
+  - Update CI/CD pipeline to require version tags for production deployments.
+
+## Performance Bottlenecks
+
+**Insufficient Health Probes on Critical Services:**
+- Problem: Only 14 services have liveness/readiness probes out of 70+ services. Missing probes on databases (MySQL, PostgreSQL, Redis), ingress, auth.
+- Files: All stacks (identified via grep: 14 instances of liveness/readiness out of 70+ services).
+- Cause: Without probes, Kubernetes restarts a container only when its process exits, so hung or degraded services go undetected and cascading failures stay silent.
+- Improvement path: Add `livenessProbe`, `readinessProbe`, and `startupProbe` to all stateful services (databases, message queues, auth providers). Use TCP/HTTP probes appropriate to each service.
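+
+As an illustration of the three probe types in the Terraform Kubernetes provider, here is a sketch for a PostgreSQL-style container. All names, the image tag, and the timings are placeholders, and environment and volume configuration is omitted:
+
+```hcl
+# Hypothetical sketch: names, image, and timings are placeholders.
+resource "kubernetes_deployment" "example_db" {
+  metadata {
+    name      = "example-db"
+    namespace = "example"
+  }
+
+  spec {
+    replicas = 1
+
+    selector {
+      match_labels = { app = "example-db" }
+    }
+
+    template {
+      metadata {
+        labels = { app = "example-db" }
+      }
+
+      spec {
+        container {
+          name  = "postgres"
+          image = "postgres:16.4"
+
+          # Allows up to 30 x 10s for startup before liveness checks begin.
+          startup_probe {
+            tcp_socket {
+              port = 5432
+            }
+            period_seconds    = 10
+            failure_threshold = 30
+          }
+
+          # Removes the pod from Service endpoints while it cannot serve queries.
+          readiness_probe {
+            exec {
+              command = ["pg_isready", "-U", "postgres"]
+            }
+            period_seconds = 10
+          }
+
+          # Restarts the container if the port stops accepting connections.
+          liveness_probe {
+            tcp_socket {
+              port = 5432
+            }
+            period_seconds    = 20
+            failure_threshold = 3
+          }
+        }
+      }
+    }
+  }
+}
+```
+
+Keeping the liveness probe cheaper and more tolerant than the readiness probe reduces the chance of restart loops under load.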
+
+**Pod Disruption Budgets Missing:**
+- Problem: Only 2 services have PodDisruptionBudget resources (identified via grep). Node evictions (updates, failures) can cause service degradation.
+- Files: All stacks (need comprehensive PodDisruptionBudget coverage).
+- Cause: PDBs are optional; many assume single-replica stateless services won't need them.
+- Improvement path: Add PDB with `minAvailable: 1` to all services with `replicas > 1`. For single-replica services, ensure they're marked as non-critical (lower PriorityClass) or accept downtime during node maintenance.
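+
+A PodDisruptionBudget for a multi-replica service could look roughly like the following; the name, namespace, and label selector are hypothetical:
+
+```hcl
+# Hypothetical sketch: name, namespace, and labels are placeholders.
+resource "kubernetes_pod_disruption_budget_v1" "example" {
+  metadata {
+    name      = "example"
+    namespace = "example"
+  }
+
+  spec {
+    # At least one pod must stay up during voluntary disruptions (e.g., drains).
+    min_available = "1"
+
+    selector {
+      match_labels = { app = "example" }
+    }
+  }
+}
+```
+
+Note that `min_available = "1"` on a single-replica workload blocks node drains entirely, which is why the text above limits it to `replicas > 1`.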
+
+**Resource Requests Sparse, Limits Missing:**
+- Problem: Many services lack explicit resource requests/limits. Kyverno auto-generates defaults but CPU limits often too low for bursty workloads (Immich ML, Ollama, Ebook2Audiobook).
+- Files: Multiple stacks (e.g., `stacks/immich/main.tf`, `stacks/ebook2audiobook/main.tf`, `stacks/ollama/main.tf`).
+- Cause: Request/limit tuning requires load testing; defaults used instead.
+- Improvement path: Run load tests on GPU workloads (Immich ML, Ollama) to determine sustained CPU/memory. Set requests to P50 usage, limits to P99. Monitor via Prometheus and adjust quarterly.
+
+**Large Terraform Modules (900+ lines):**
+- Problem: `stacks/platform/modules/dbaas/main.tf` is 916 lines; `stacks/immich/main.tf` is 660 lines; others > 450 lines.
+- Files: `stacks/platform/modules/dbaas/main.tf` (916 lines), `stacks/platform/modules/nvidia/main.tf` (658 lines), `stacks/platform/modules/kyverno/resource-governance.tf` (809 lines).
+- Cause: Monolithic resource definitions; hard to navigate and test.
+- Improvement path: Split large modules into sub-modules (e.g., `dbaas/` → `mysql/`, `postgresql/`, `pgadmin/`, `backups/`). Use Terraform workspaces for per-database configuration.
+
+## Fragile Areas
+
+**Immich Machine Learning GPU Dependency:**
+- Files: `stacks/immich/main.tf` (lines 380-450).
+- Why fragile: GPU workload (`immich-machine-learning-cuda`) requires Tesla T4 on k8s-node1. If GPU becomes unavailable (hardware failure, driver issues), ML inference fails silently (no fallback). Single GPU point of failure.
+- Safe modification: Add `nodeAffinity` to prefer GPU but allow non-GPU fallback (degraded mode). Implement health checks on GPU availability (`nvidia-smi` probe). Test GPU failure scenario before production use.
+- Test coverage: No tests for GPU unavailability; assumes GPU always available.
+
+**Nextcloud Backup/Restore Procedures Manual:**
+- Files: `stacks/nextcloud/main.tf` (backup.sh and restore.sh ConfigMaps).
+- Why fragile: Backup/restore scripts are ConfigMap-based; no automation. Restoration requires manual `kubectl exec` and script execution. No tested recovery procedure.
+- Safe modification: Implement automated backup via Velero or CSI snapshots. Test restore procedure monthly via staged environment.
+- Test coverage: No automated backup validation; scripts untested.
+
+**NFS Dependency for Data Persistence:**
+- Files: 126 references to NFS volumes across all stacks.
+- Why fragile: All stateful data depends on NFS server at `10.0.10.15`. If NFS becomes unavailable, all services lose access to their data immediately (no local caches). No fallback storage.
+- Safe modification: Tune NFS client-side caching (attribute caching via Linux mount options such as `ac,acregmin=3600`; caching file data additionally requires FS-Cache via the `fsc` option). Monitor NFS availability via Prometheus alerts (mount point offline). Test NFS failover procedure (if replica NFS exists).
+- Test coverage: No chaos engineering tests for NFS unavailability.
+
+**Istio Injection Disabled Cluster-Wide:**
+- Files: `stacks/real-estate-crawler/main.tf` (line 19): `"istio-injection" : "disabled"` on namespace labels.
+- Why fragile: No service mesh observability. Debugging pod-to-pod communication requires manual tracing (tcpdump). No mutual TLS between services.
+- Safe modification: Enable Istio on non-critical services first (e.g., realestate-crawler). Monitor resource overhead. Gradually roll out to production.
+- Test coverage: No mTLS validation; assumes all pods on same network are trusted.
+
+**PostgreSQL Custom Image Not Tracked:**
+- Files: `stacks/platform/modules/dbaas/main.tf` (line 717): `image = "viktorbarzin/postgres:16-master"`.
+- Why fragile: Custom build on Docker Hub with PostGIS + PgVector extensions. No version tag; the `16-master` tag is mutable. Upstream extension versions unknown.
+- Safe modification: Pin to semantic version (e.g., `:16.4-postgis3.4-pgvector0.8`). Build images locally with Dockerfile tracked in git. Test extension versions against application requirements.
+- Test coverage: No tests for extension availability or version compatibility.
+
+## Scaling Limits
+
+**Single-Replica Critical Services:**
+- Current capacity: Immich server (1 replica), PostgreSQL databases (1 replica), Redis (1 instance), Traefik (varies).
+- Limit: Node failure causes immediate service outage. Kubernetes default takes 5+ minutes to reschedule pod.
+- Scaling path: Increase critical service replicas to 3 (quorum). Add pod anti-affinity to spread across nodes. Implement PodDisruptionBudget with `minAvailable: 2`.
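+
+For a stateless service, the replica and anti-affinity part of this path might be sketched as below. The names and image are hypothetical; databases additionally need replication configured at the application level, which this fragment does not cover.
+
+```hcl
+# Hypothetical sketch: names and image are placeholders.
+resource "kubernetes_deployment" "example_web" {
+  metadata {
+    name      = "example-web"
+    namespace = "example"
+  }
+
+  spec {
+    replicas = 3
+
+    selector {
+      match_labels = { app = "example-web" }
+    }
+
+    template {
+      metadata {
+        labels = { app = "example-web" }
+      }
+
+      spec {
+        affinity {
+          pod_anti_affinity {
+            # Hard rule: never schedule two replicas on the same node.
+            required_during_scheduling_ignored_during_execution {
+              label_selector {
+                match_labels = { app = "example-web" }
+              }
+              topology_key = "kubernetes.io/hostname"
+            }
+          }
+        }
+
+        container {
+          name  = "web"
+          image = "nginx:1.27"
+        }
+      }
+    }
+  }
+}
+```
+
+A hard anti-affinity rule needs at least as many schedulable nodes as replicas; otherwise the extra pods stay Pending.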
+
+**GPU Capacity Bottleneck:**
+- Current capacity: 1 Tesla T4 GPU on k8s-node1.
+- Limit: Immich ML + Ebook2Audiobook + Ollama all compete for a single GPU. Queue time 10+ minutes for GPU-bound inference tasks.
+- Scaling path: Add second GPU (e.g., T4 or RTX 3090) to k8s-node1. Implement GPU scheduling via NVIDIA GPU Operator. Monitor GPU utilization (target 70-80%).
+
+**NFS Storage Capacity:**
+- Current capacity: `/mnt/main/` mounted on TrueNAS (size unknown; typically 4-8TB in home setups).
+- Limit: Immich (image library), Calibre (ebooks), Dawarich (location history) grow unbounded. When storage full, writes fail; services degrade.
+- Scaling path: Monitor NFS capacity monthly (`df -h`). Set up Prometheus alert at 80% capacity. Plan for annual storage growth based on user behavior (e.g., 100GB Immich/month).
+
+**MySQL/PostgreSQL Connection Pool:**
+- Current capacity: PgBouncer at `dbaas/pgbouncer` provides connection pooling. Default pool size likely 100-200 connections.
+- Limit: Many simultaneous connections (Nextcloud, Affine, Gramps Web, Authentik) can exceed pool. New connections queue or fail.
+- Scaling path: Monitor PgBouncer pool utilization (Prometheus metric `pgbouncer_pools_used_connections`). Increase pool size if > 80% utilization. Consider read replicas for read-heavy workloads.
+
+**API Rate Limiting & Bandwidth:**
+- Current capacity: Services exposed via Traefik ingress. No global rate limiting documented.
+- Limit: External tools (Immich mobile app, ebook2audiobook processing) can spike bandwidth. DoS-like behavior possible.
+- Scaling path: Implement Traefik rate limiting middleware (Prometheus-aware). Add Cloudflare rate limiting on public domains. Monitor egress bandwidth.
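+
+A rate-limiting middleware could be declared through the Traefik `Middleware` CRD, for example via `kubernetes_manifest`. The limits, name, and namespace below are placeholders, and the `apiVersion` assumes a Traefik release that serves the `traefik.io` API group (older releases use `traefik.containo.us/v1alpha1`):
+
+```hcl
+# Hypothetical sketch: name, namespace, and limits are placeholders.
+resource "kubernetes_manifest" "rate_limit" {
+  manifest = {
+    apiVersion = "traefik.io/v1alpha1"
+    kind       = "Middleware"
+
+    metadata = {
+      name      = "rate-limit"
+      namespace = "traefik"
+    }
+
+    spec = {
+      rateLimit = {
+        average = 50   # sustained requests per period, per source
+        burst   = 100  # short spikes allowed above the average
+        period  = "1s"
+      }
+    }
+  }
+}
+```
+
+The middleware takes effect only once it is referenced from the routers or ingresses it should protect.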
+
+## Dependencies at Risk
+
+**Redis Stack `:latest` Tag:**
+- Risk: `stacks/platform/modules/redis/main.tf` uses `image = "redis/redis-stack:latest"`. Redis Stack is actively developed; breaking changes possible.
+- Impact: Unexpected version upgrade could introduce incompatibilities with clients expecting specific command set or module versions.
+- Migration plan: Pin to a specific stable Redis Stack release tag (not `:latest` and not a release candidate). Test version upgrades in staging first. Monitor Redis logs for deprecated command warnings.
+
+**Immich `:latest` or Floating Tag:**
+- Risk: `stacks/immich/main.tf` pins to `v2.5.6` but Immich frequently releases patch versions. Database migrations can cause downtime.
+- Impact: If Immich version upgrades without testing, database migrations could fail or hang (no rollback mechanism).
+- Migration plan: Keep the pin on an exact patch version (e.g., `v2.5.6`, not `v2.5`). Test Immich upgrades in staging first. Take a backup before upgrading.
+
+**Unsupported MySQL 9.2.0:**
+- Risk: `stacks/platform/modules/dbaas/main.tf` specifies `image = "mysql:9.2.0"`. MySQL 9.2 is an Innovation release, not an LTS release.
+- Impact: Innovation releases are supported only until the next Innovation release ships, so 9.2.0 receives no further bug or security fixes. No long-term support.
+- Migration plan: Migrate to MySQL 8.4 LTS. Moving from 9.2 to 8.4 is a downgrade, so plan for a logical dump and restore instead of an in-place switch. Test data migration first. Plan for gradual rollout.
+
+**Python Timeouts in Monitoring Scripts:**
+- Risk: `stacks/platform/modules/nvidia/main.tf` uses hardcoded `timeout=10` for HTTP requests and subprocess calls. Slow network conditions will fail.
+- Impact: GPU monitoring will fail if network is slow or unavailable. Silent failures possible.
+- Migration plan: Implement exponential backoff and retry logic (e.g., `tenacity` library). Increase timeout to 30s for unreliable networks. Log timeouts for debugging.
+
+## Missing Critical Features
+
+**No Disaster Recovery Plan:**
+- Problem: Backup procedures exist (Nextcloud, MySQL) but no tested recovery procedure. No runbook for cluster disaster.
+- Blocks: If cluster data lost, recovery would be manual and time-consuming. No RTO/RPO defined.
+- Impact: Risk of data loss; recovery could take more than 24 hours.
+
+**No Secrets Rotation Policy:**
+- Problem: SSH keys, API tokens, database passwords stored in git-crypt and tfvars. No automated rotation schedule.
+- Blocks: If key leaked, manual intervention required to rotate across all services.
+- Impact: Leaked credentials persist until discovery.
+
+**No Cross-Cluster Failover:**
+- Problem: Single Kubernetes cluster on Proxmox. No HA cluster or backup cluster.
+- Blocks: Cluster-wide failure (network partition, hypervisor crash) causes total outage.
+- Impact: RTO > 1 hour (manual intervention to restart hypervisor or re-provision).
+
+## Test Coverage Gaps
+
+**No Infrastructure Testing:**
+- What's not tested: Terraform applies, Helm charts, manifests only validated via `terraform plan`. No `terratest`, no functional tests of deployed services.
+- Files: All stacks (no test files found).
+- Risk: Typos, variable misconfigurations, missing dependencies not caught until production apply.
+- Priority: High — add `terratest` to validate Terraform. Test critical paths (database connection, ingress routing).
+
+**No Chaos Engineering Tests:**
+- What's not tested: Pod evictions, node failures, NFS unavailability, network partitions.
+- Files: All stacks (no chaos tests found).
+- Risk: Cascading failures and data loss scenarios not validated. Assumptions about resilience untested.
+- Priority: High — run monthly chaos tests (Gremlin, Chaos Toolkit). Document recovery procedures.
+
+**No Backup Restoration Tests:**
+- What's not tested: Nextcloud backups, MySQL backups. Restore procedures exist but never executed.
+- Files: `stacks/nextcloud/main.tf`, `stacks/platform/modules/dbaas/main.tf`.
+- Risk: Backups corrupt or unusable when needed. RPO > 24 hours if discovery slow.
+- Priority: High — monthly restore-to-staging test. Automate backup validation.
+
+**No Security Scanning for Vulnerabilities:**
+- What's not tested: Container images for CVEs, Terraform for security anti-patterns (hardcoded secrets, overpermissive RBAC).
+- Files: All stacks, all container images.
+- Risk: Known vulnerabilities deployed to production. No supply chain security.
+- Priority: Medium — integrate Trivy/Snyk into CI/CD. Scan images weekly; alert on high CVEs.
+
+---
+
+*Concerns audit: 2026-02-23*
--- a/.planning/codebase/CONVENTIONS.md
+++ b/.planning/codebase/CONVENTIONS.md
@ -0,0 +1,192 @@
+# Coding Conventions
+
+**Analysis Date:** 2026-02-23
+
+## Naming Patterns
+
+**Terraform Files:**
+- `main.tf` - Primary resource definitions and module calls
+- `terragrunt.hcl` - Stack-specific Terragrunt configuration
+- `variables.tf` - Variable declarations for a stack
+- `providers.tf` - Generated by Terragrunt root `terragrunt.hcl`
+- `backend.tf` - Generated by Terragrunt for state backend configuration
+
+**Terraform Variables:**
+- snake_case for variable names: `var.tls_secret_name`, `var.dbaas_root_password`
+- snake_case for resource names: `resource "kubernetes_namespace" "nextcloud"`
+- snake_case for local values: `local.tiers`
+- UPPERCASE for environment-like globals in shell: `KUBECONFIG_PATH`, `PASS_COUNT`
+
+**Resource/Module Names:**
+- kebab-case for Kubernetes resources: `nextcloud`, `whiteboard`, `kms-web-page`
+- Leading underscore for prefixed resource names (internal/private pattern): resource names with underscores are module-internal
+- Descriptive names matching functionality: `kubernetes_namespace`, `kubernetes_deployment`, `helm_release`
+
+**Shell Functions:**
+- snake_case for function names: `parse_args()`, `count_lines()`, `check_nodes()`
+- UPPERCASE for utility color variables: `RED`, `GREEN`, `YELLOW`, `BLUE`, `BOLD`, `NC`
+
+**Go Package/Test Names:**
+- Package-level test functions: `TestContainsVideoMarkers()`, `TestIsDirectVideoContentType()`
+- Table-driven test pattern with struct fields: `name`, `body`, `ct`, `want`
+
+## Code Style
+
+**Terraform Formatting:**
+- Use `terraform fmt -recursive` for consistent formatting
+- No explicit linter/formatter config file (tflint/terraform-lint not present)
+- Indentation: 2 spaces (standard Terraform convention)
+- Multi-line strings use heredoc syntax: `<<EOT ... EOT` for YAML/config blocks
+
+**Bash Script Style:**
+- Shebang: `#!/usr/bin/env bash`
+- Safety flags: `set -euo pipefail` (exit on error, undefined vars, pipe failures)
+- Comments use `# ---` separators both as section dividers and to group related variables/functions
+- One-liner functions defined as: `function_name() { [[ condition ]] && action; }`
+- Multiline functions use explicit function body with local keyword for variables
+
+**Terraform Style:**
+- Comments for major sections use `# =============================================================================`
+- Comments for subsections use `# --------- -------`
+- Inline comments explain why, not what: `# anything secret is fine` (explaining arbitrary choice)
+- Module calls include comments above describing purpose: `# --- Core ---`, `# --- dbaas ---`
+
+## Import Organization
+
+**Terraform:**
+- Locals (tier definitions) defined at top of main.tf
+- Variables declared in order: core/required first, then by feature area (dbaas, traefik, etc.)
+- Modules called after variables, grouped by functional area with comment headers
+- Resources defined after modules
+
+**Go:**
+- Standard imports from `testing` package
+- No grouping (single simple import)
+
+**Bash:**
+- Source definitions at top (colors, globals, helper functions)
+- Argument parsing in dedicated `parse_args()` function
+- Main logic organized by check sections with `section()` calls
+
+## Error Handling
+
+**Terraform:**
+- No explicit error handling (declarative; errors cause apply failure)
+- Dependency management via `depends_on` for explicit ordering
+- `dependency` blocks in terragrunt for cross-stack dependencies
+- `skip_outputs = true` used when only needing ordering, not outputs
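+
+An ordering-only dependency in a stack's `terragrunt.hcl` might look like the sketch below. The `../platform` path and the block label are assumptions based on the `stacks/` layout, not copied from the repository:
+
+```hcl
+# Hypothetical sketch of stacks/<service>/terragrunt.hcl.
+include "root" {
+  # Picks up the root terragrunt.hcl that generates providers.tf and backend.tf.
+  path = find_in_parent_folders()
+}
+
+dependency "platform" {
+  config_path = "../platform"
+
+  # Only the apply order matters here, so the platform outputs are not read.
+  skip_outputs = true
+}
+```
+
+With `skip_outputs = true`, Terragrunt still orders the stacks in `run-all` operations but does not fetch the dependency's outputs.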
+
+**Bash:**
+- Inline error checks: `output=$(command 2>&1) || { fail "message"; return 0; }`
+- `set -euo pipefail` prevents silent failures and undefined var issues
+- Error status captured: `$?` implicit via `||` pattern
+- Graceful degradation with fallback values or skip-able steps
+
+**Go:**
+- Standard testing error reporting: `t.Errorf()` with formatted messages
+- Table-driven test pattern allows multiple related test cases
+- Error messages include actual vs expected: `got = %v, want = %v`
+
+## Logging
+
+**Framework:** Not formally configured; uses `echo` and `echo -e` for output
+
+**Bash Logging Patterns:**
+- Color-coded output with status prefixes: `${BLUE}[INFO]${NC}`, `${GREEN}[PASS]${NC}`, `${YELLOW}[WARN]${NC}`, `${RED}[FAIL]${NC}`
+- Helper functions: `info()`, `pass()`, `warn()`, `fail()` - each increments counters and respects `--quiet` flag
+- Section headers: `section()` for verbose output, `section_always()` for always-shown sections
+- Conditional logging: functions check `$JSON`, `$QUIET` flags and skip output as needed
+- JSON output option available via `json_add()` for machine-readable logging
+- Detail strings accumulated in variables for final reporting
+
+**Terraform Logging:**
+- Relies on Terraform's built-in CLI output
+- Human-readable variable values in descriptions (Terraform renders these on errors)
+
+## Comments
+
+**When to Comment:**
+
+Terraform:
+- Section dividers: Major logical groups separated by `# =============================================================================`
+- Feature group headers: `# --- Feature Name ---` before variable/module blocks
+- Commented-out code: Temporarily disabled resources/modules include explanation (e.g., "Do not use until issue #X is solved")
+- Clarifying arbitrary choices: `# anything secret is fine` explains non-obvious variable usage
+
+Bash:
+- Function-level comments: Each check function has purpose on first line
+- Complex logic: Comments before conditional blocks explain intent
+- Inline comments for edge cases: `# Skip nodes where metrics are not yet available`
+- Header comments: Scripts include usage documentation at top
+
+**JSDoc/TSDoc:**
+- Not used in this codebase (Terraform, Bash, Go only)
+
+## Function Design
+
+**Size:**
+- Terraform modules typically 20-50 lines for simple services, variable declaration blocks 30-100+ lines
+- Bash functions average 20-40 lines, check functions 10-30 lines
+- Go test functions 10-60 lines (table + loop)
+
+**Parameters:**
+- Terraform: via `variable` declarations and module input variables
+- Bash: positional parameters passed via `$1`, `$2`, etc. with validation in `parse_args()`
+- Go: test functions accept `*testing.T` parameter
+
+**Return Values:**
+- Terraform: no explicit returns; resource state is the "return"
+- Bash: `return 0` for success; values are returned via `echo` output; non-zero status codes signal errors
+- Go: functions tested for boolean returns or calculated values
+
+**Variables:**
+- Terraform: module variables, locals, and resource attributes (computed values)
+- Bash: Global state tracked via counters (`PASS_COUNT`, `WARN_COUNT`, `FAIL_COUNT`), local variables in functions with `local` keyword
+- Go: table-driven tests use struct fields (no getter/setter pattern)
+
+## Module Design
+
+**Exports:**
+- Terraform: outputs typically omitted unless another stack depends on them (implicit via dependency blocks)
+- Modules called with `source = "./modules/<name>"` or `source = "../../modules/kubernetes/<name>"`
+- Module version pinning used for Terraform registry modules: `version = "0.1.5"`
+
+**Barrel Files:**
+- Not applicable (no aggregating re-exports in this codebase)
+- Directories: `stacks/<service>/` is a unit, `stacks/platform/modules/<service>/` groups related modules
+
+**Module Organization:**
+- Single responsibility per module directory
+- Each module typically contains: `main.tf` (resources) and optional `variables.tf` for input variables
+- Shared Kubernetes utility modules in `modules/kubernetes/`: `ingress_factory/`, `setup_tls_secret/`
+- Platform services grouped in `stacks/platform/modules/<service>/`
+
+## Special Patterns
+
+**Locals for Configuration:**
+- Tier definitions centralized as map in locals (each service defines same tiers locally)
+  ```hcl
+  locals {
+    tiers = {
+      core    = "0-core"
+      cluster = "1-cluster"
+      gpu     = "2-gpu"
+      edge    = "3-edge"
+      aux     = "4-aux"
+    }
+  }
+  ```
+- Tier applied to `kubernetes_namespace` labels and `priority_class_name` for resource governance
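+
+Applying a tier from the `tiers` map above to a namespace could look roughly like this. The label key `tier` is an assumption; the repository may use a different key:
+
+```hcl
+# Hypothetical sketch: the label key "tier" is an assumption.
+resource "kubernetes_namespace" "nextcloud" {
+  metadata {
+    name = "nextcloud"
+
+    labels = {
+      # Resolves to "3-edge" with the tiers map shown above.
+      tier = local.tiers.edge
+    }
+  }
+}
+```
+
+Because each service defines the same `tiers` map locally, a tier rename has to be repeated in every stack.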
+
+**Inline Config Blocks:**
+- YAML/config data stored in `<<EOT ... EOT` heredoc blocks within `data` maps
+- Example: MetalLB address pool config in ConfigMap data
+
+**File Inclusion:**
+- `templatefile()` used for dynamic YAML values: `templatefile("${path.module}/chart_values.yaml", { var1 = value })`
+- `file()` used for static file content in ConfigMap data
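+
+The two functions differ in whether the file content is rendered with variables first. A small sketch, in which the chart repository, file names, and the `host` template variable are hypothetical:
+
+```hcl
+# Hypothetical sketch: chart, file names, and template variable are placeholders.
+resource "helm_release" "example" {
+  name       = "example"
+  namespace  = "example"
+  repository = "https://charts.example.com"
+  chart      = "example"
+
+  values = [
+    # Rendered at plan time; ${host} inside the YAML is replaced.
+    templatefile("${path.module}/chart_values.yaml", {
+      host = "example.viktorbarzin.me"
+    })
+  ]
+}
+
+resource "kubernetes_config_map" "example" {
+  metadata {
+    name      = "example-config"
+    namespace = "example"
+  }
+
+  data = {
+    # Read verbatim; no interpolation is applied to the file content.
+    "config.yaml" = file("${path.module}/config.yaml")
+  }
+}
+```
+
+Using `file()` for static content avoids having to escape `${...}` sequences that belong to the target application's own config syntax.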
+
+---
+
+*Convention analysis: 2026-02-23*
--- a/.planning/codebase/INTEGRATIONS.md
+++ b/.planning/codebase/INTEGRATIONS.md
@ -0,0 +1,210 @@
+# External Integrations
+
+**Analysis Date:** 2026-02-23
+
+## APIs & External Services
+
+**Cloudflare:**
+- DNS management (public domain `viktorbarzin.me`)
+- Tunnel for public HTTPS access
+- Account ID: `cloudflare_account_id` in tfvars
+- SDK/Client: `cloudflare/cloudflare` Terraform provider v4.52.5
+- Auth: API token stored in `cloudflare_api_key`, email in `cloudflare_email`, zone ID in `cloudflare_zone_id`, tunnel ID in `cloudflare_tunnel_id`
+- Implementation: `stacks/platform/modules/cloudflared/` deploys Cloudflare tunnel daemon
+
+**GitHub:**
+- Git repository hosting and CI/CD webhook source
+- Webhook endpoint: `https://webhook.viktorbarzin.me/` (handled by `stacks/webhook_handler/`)
+- Auth: Git token in `webhook_handler_git_token` (terraform.tfvars)
+- User: `webhook_handler_git_user` (terraform.tfvars)
+- SSH key: `webhook_handler_ssh_key` for Git operations (secret in K8s)
+
+**Facebook Messenger:**
+- Chatbot integration via webhook
+- Webhook endpoint: `https://webhook.viktorbarzin.me/` (receives webhook_handler_fb_*)
+- Auth tokens: `webhook_handler_fb_verify_token`, `webhook_handler_fb_page_token`, `webhook_handler_fb_app_secret` (all in tfvars)
+
+**Slack:**
+- Alert routing and notifications
+- Webhook URL: `alertmanager_slack_api_url` (terraform.tfvars)
+- Integration: Alertmanager alerts from `stacks/platform/modules/monitoring/` sent to Slack
+- CrowdSec integration: Security events to Slack via `stacks/platform/modules/crowdsec/`
+
+**Hetrix Tools:**
+- Uptime monitoring service
+- Status page redirects: `https://hetrixtools.com/r/38981b548b5d38b052aca8d01285a3f3/` and `https://hetrixtools.com/r/2ba9d7a5e017794db0fd91f0115a8b3b/`
+- Implementation: Traefik middleware redirect in `stacks/platform/modules/monitoring/main.tf`
+
+**Tiny Tuya:**
+- Smart device control via tuya-bridge
+- Auth: `tiny_tuya_service_secret` (terraform.tfvars)
+
+**Mailgun:**
+- SMTP relay for outgoing mail (primary relay host)
+- Relay: `[smtp.eu.mailgun.org]:587` (Postfix DEFAULT_RELAY_HOST)
+- Auth: SASL credentials in `sasl_passwd` (mailserver config)
+- Alternative: SendGrid (commented out, previously used)
+
+**Home Assistant:**
+- Home automation integration
+- API token: `haos_api_token` (terraform.tfvars)
+- Access: `https://ha-london.viktorbarzin.me`, `https://ha-sofia.viktorbarzin.me`
+
+**Proxmox:**
+- Virtualization platform for VM provisioning
+- Host: `192.168.1.127:8006` (`proxmox_pm_api_url`)
+- Auth: API token ID `terraform-prov@pve!terrform-prov`, secret in tfvars
+- Provider: `telmate/proxmox` v3.0.2-rc07
+- Access: IDRAC credentials for physical server monitoring (`idrac_host`, `idrac_username`, `idrac_password`)
+
+## Data Storage
+
+**Databases:**
+- MySQL 9.2.0
+  - Connection: `mysql.dbaas.svc.cluster.local:3306` (K8s internal)
+  - Client: Direct port access (no ORM in core infrastructure)
+  - Root password: `dbaas_root_password` (tfvars)
+  - Storage: NFS PV at `/mnt/main/mysql`
+
+- PostgreSQL 16.4-bullseye (with PostGIS + PGVector)
+  - Connection: `postgresql.dbaas:5432` (K8s internal)
+  - Connection via PgBouncer: `pgbouncer.authentik:6432` (Authentik only)
+  - Root password: `dbaas_postgresql_root_password` (tfvars)
+  - Root password for pgbouncer: `pgbouncer_root_password` (tfvars)
+  - Admin UI: PgAdmin at `pma.viktorbarzin.me`
+  - PgAdmin password: `dbaas_pgadmin_password` (tfvars)
+  - Storage: NFS PV at `/mnt/main/postgresql`
+
+**File Storage:**
+- NFS (Primary)
+  - Host: `10.0.10.15` (TrueNAS)
+  - Mount path: `/mnt/main/`
+  - Subdirectories: per-service (e.g., `/mnt/main/immich/`, `/mnt/main/affine/`, `/mnt/main/mailserver/`, etc.)
+  - Configuration: `secrets/nfs_directories.txt` (git-crypt encrypted)
+  - Export script: `secrets/nfs_exports.sh` (updates TrueNAS exports)
+
+**Caching:**
+- Redis (`redis/redis-stack:latest`)
+  - Connection: `redis.redis.svc.cluster.local` (K8s internal, no explicit port in code)
+  - Databases: DB 2 (Gramps Web broker), DB 3 (Gramps Web rate limiting)
+  - Storage: Persistent volume for data durability
+  - Implementation: `stacks/platform/modules/redis/main.tf`
+
+## Authentication & Identity
+
+**Auth Provider:**
+- Authentik (self-hosted OIDC/OAuth2 identity provider)
+  - URL: `https://authentik.viktorbarzin.me`
+  - API: `/api/v3/` endpoint
+  - Token: `authentik_api_token` (terraform.tfvars)
+  - Database: PostgreSQL via `postgresql.dbaas:5432` (also PgBouncer at `pgbouncer.authentik:6432`)
+  - Secret key: `authentik_secret_key` (terraform.tfvars)
+  - Postgres password: `authentik_postgres_password` (terraform.tfvars)
+  - K8s OIDC: Issuer `https://authentik.viktorbarzin.me/application/o/kubernetes/`, client `kubernetes` (public)
+  - Implementation: `stacks/platform/modules/authentik/main.tf` + Helm chart
+  - Traefik integration: Forward auth via protected = true in ingress_factory
+
+**RBAC:**
+- Kubernetes API auth via Authentik OIDC
+- SSH keys: `ssh_private_key` (terraform.tfvars)
+- Implementation: `stacks/platform/modules/rbac/` + `stacks/platform/modules/k8s-portal/`
+
+## Monitoring & Observability
+
+**Error Tracking:**
+- None detected - alerts routed to Slack instead
+
+**Metrics:**
+- Prometheus - Time series database
+  - Scrape endpoints: cluster nodes, services, Proxmox IDRAC, Tuya devices, Home Assistant
+  - Implementation: `stacks/platform/modules/monitoring/`
+  - Health check: CronJob monitors prometheus-server pod and alerts to `https://webhook.viktorbarzin.me/fb/message-viktor` if down
+
+**Logs:**
+- Loki 3.6.5 (single binary) + Alloy v1.13.0 (DaemonSet collector)
+  - Retention: 7 days
+  - Storage: NFS PV at `/mnt/main/loki/loki` (15Gi), WAL on tmpfs (2Gi)
+  - Alerting: HighErrorRate, PodCrashLoopBackOff, OOMKilled (ConfigMap `loki-alert-rules`)
+
+**Visualization:**
+- Grafana
+  - Database: PostgreSQL via dbaas
+  - Admin password: `grafana_admin_password` (tfvars)
+  - DB password: `grafana_db_password` (tfvars)
+
+**Status Pages:**
+- Hetrix Tools (external uptime monitoring)
+- Uptime Kuma (self-hosted, `stacks/platform/modules/uptime-kuma/`)
+
+## CI/CD & Deployment
+
+**Hosting:**
+- Proxmox 8.x (hypervisor)
+- Kubernetes 1.34.2 (application platform)
+- Cloudflare Tunnel (public ingress)
+
+**CI Pipeline:**
+- Woodpecker CI (self-hosted, `stacks/woodpecker/`)
+  - Hosted at: `https://ci.viktorbarzin.me`
+  - Config: `.woodpecker/` in repo root
+  - Triggers: Git push, scheduled jobs
+  - Applies platform stack automatically on merge to master
+
+**GitOps:**
+- Webhook-handler service: receives GitHub webhooks, triggers deployments
+  - Endpoint: `https://webhook.viktorbarzin.me/`
+  - Auth: Secret token `webhook_handler_secret` (tfvars)
+  - Can update K8s deployments via RBAC
+  - Implementation: `stacks/webhook_handler/main.tf`, image `viktorbarzin/webhook-handler:latest`
+
+## Environment Configuration
+
+**Required env vars (terraform.tfvars - git-crypt encrypted):**
+- `cloudflare_api_key`, `cloudflare_email`, `cloudflare_zone_id`, `cloudflare_tunnel_id`, `cloudflare_tunnel_token`
+- `dbaas_root_password`, `dbaas_postgresql_root_password`, `dbaas_pgadmin_password`
+- `authentik_secret_key`, `authentik_postgres_password`, `authentik_api_token`
+- `proxmox_pm_api_url`, `proxmox_pm_api_token_id`, `proxmox_pm_api_token_secret`
+- `alertmanager_slack_api_url`, `alertmanager_account_password`
+- `webhook_handler_secret`, `webhook_handler_fb_verify_token`, `webhook_handler_fb_page_token`, `webhook_handler_fb_app_secret`, `webhook_handler_git_token`, `webhook_handler_git_user`, `webhook_handler_ssh_key`
+- `vaultwarden_smtp_password`, `mailserver_accounts`, `postfix_account_aliases`, `sasl_passwd`
+- `crowdsec_enroll_key`, `crowdsec_db_password`, `crowdsec_dash_api_key`, `crowdsec_dash_machine_id`, `crowdsec_dash_machine_password`
+- `headscale_config`, `headscale_acl`
+- `monitoring_idrac_username`, `monitoring_idrac_password`, `tiny_tuya_service_secret`, `haos_api_token`, `pve_password`, `grafana_admin_password`, `grafana_db_password`
+- `k8s_users` (map of SSH keys for K8s RBAC)
+
+**Secrets location:**
+- Primary: `terraform.tfvars` (git-crypt encrypted at rest, decrypted during `terragrunt apply`)
+- K8s Secrets: Created by Terraform from tfvars into namespaces (see `stacks/platform/modules/*/main.tf`)
+- TLS certificates: `secrets/` directory (symlinked into stacks as `secrets/` → `../../secrets`)
+
+## Webhooks & Callbacks
+
+**Incoming (Webhook endpoints):**
+- GitHub webhooks: `https://webhook.viktorbarzin.me/` (deployment triggers)
+- Facebook Messenger webhooks: `https://webhook.viktorbarzin.me/` (chatbot messages)
+- Health alerts: CronJob sends to `https://webhook.viktorbarzin.me/fb/message-viktor` if Prometheus is down
+
+**Outgoing:**
+- Alertmanager → Slack webhook: `alertmanager_slack_api_url`
+- CrowdSec → Slack webhook: same as alertmanager
+- Hetrix Tools status pages: redirect middleware instead of direct integration
+
+## Integration Patterns
+
+**Terraform Secrets Injection:**
+- Template pattern: `templatefile("${path.module}/values.yaml", { var1 = var.value1, ... })`
+- Direct env injection: K8s ConfigMap/Secret created from tfvars variables
+- Example: `stacks/platform/modules/crowdsec/main.tf` renders Helm values with interpolated secrets
+
+**Internal Service Discovery:**
+- DNS: Services accessible via `<name>.<namespace>.svc.cluster.local`
+- Examples: `mysql.dbaas.svc.cluster.local`, `redis.redis.svc.cluster.local`, `postgresql.dbaas.svc.cluster.local`
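The naming scheme above can be sketched in a few lines of Python. This is only an illustration: `service_fqdn` is a hypothetical helper, not something that exists in the repo, and it assumes the default `cluster.local` cluster domain that appears in the examples.

```python
def service_fqdn(name: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name for a Service (hypothetical helper)."""
    # Pattern from the list above: <name>.<namespace>.svc.<cluster domain>
    return f"{name}.{namespace}.svc.{cluster_domain}"


if __name__ == "__main__":
    # The three examples listed in the document.
    for svc, ns in [("mysql", "dbaas"), ("redis", "redis"), ("postgresql", "dbaas")]:
        print(service_fqdn(svc, ns))
```

A helper like this is mostly useful for generating connection strings consistently; the cluster domain is a parameter because it is an assumption here rather than something the source confirms.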
+
+**External Service Access:**
+- Cloudflare Tunnel: Provides public HTTPS for services over outbound-only connections (no inbound ports need to be exposed to the internet)
+- Traefik Ingress: Routes external traffic to internal K8s services
+- Technitium (internal DNS) for `.lan` domain resolution
+
+---
+
+*Integration audit: 2026-02-23*
--- a/.planning/codebase/STACK.md
+++ b/.planning/codebase/STACK.md
@ -0,0 +1,129 @@
+# Technology Stack
+
+**Analysis Date:** 2026-02-23
+
+## Languages
+
+**Primary:**
+- HCL (HashiCorp Configuration Language) - Terraform/Terragrunt infrastructure definitions
+- Bash - Scripting and cluster management (`scripts/` directory)
+- YAML - Kubernetes resource definitions and configuration
+- Python - Monitoring and utility scripts in `stacks/platform/modules/`
+- TypeScript/JavaScript - k8s-portal frontend and webhook-handler (`stacks/platform/modules/k8s-portal/`, `stacks/webhook_handler/`)
+
+**Secondary:**
+- Go - Various utilities
+- Dockerfile - Container image definitions across stacks
+
+## Runtime
+
+**Environment:**
+- Kubernetes v1.34.2 (5 nodes: k8s-master + k8s-node1-4)
+- Linux (Ubuntu cloud images on Proxmox VMs)
+- Bash shell for automation
+
+**Package Manager:**
+- npm (Node.js) - for k8s-portal web UI development
+  - Lockfile: `package-lock.json` present
+- pip (Python) - for utility scripts
+- Terraform/Terragrunt - manages all infrastructure dependencies
+
+## Frameworks
+
+**Core:**
+- Terraform 1.x - Infrastructure-as-Code orchestration
+- Terragrunt - State isolation wrapper around Terraform (`terragrunt.hcl` in each stack)
+- Kubernetes - Container orchestration (kubectl, Helm, kustomize patterns)
+
+**Testing:**
+- Playwright ^1.58.2 - E2E testing framework (root `package.json`)
+
+**Build/Dev:**
+- Helm 3.1.1 - Kubernetes package manager (provider version via Terraform)
+- Svelte - Frontend framework for k8s-portal (`stacks/platform/modules/k8s-portal/files/` Node.js project)
+
+## Key Dependencies
+
+**Critical:**
+- hashicorp/kubernetes (3.0.1) - Kubernetes API provider
+- hashicorp/helm (3.1.1) - Helm release management
+- telmate/proxmox (3.0.2-rc07) - Proxmox VM management (`stacks/infra/`)
+- cloudflare/cloudflare (4.52.5) - DNS and tunnel management (`stacks/platform/modules/cloudflared/`)
+- hashicorp/null (3.2.4) - Utility provider for local operations
+- hashicorp/random (3.8.1) - Random value generation
+
+**Infrastructure:**
+- MySQL 9.2.0 - Relational database (`stacks/platform/modules/dbaas/`)
+- PostgreSQL 16.4-bullseye - Primary database with PostGIS/PGVector (`stacks/platform/modules/dbaas/`)
+- Redis/redis-stack:latest - In-memory cache and broker (`stacks/platform/modules/redis/`)
+- Headscale 0.23.0 - WireGuard control plane (`stacks/platform/modules/headscale/`)
+
+**Observability:**
+- Prometheus - Metrics collection and alerting
+- Grafana - Metrics visualization and dashboards
+- Loki 3.6.5 - Log aggregation (from user instructions)
+- Alloy v1.13.0 - Log collector (from user instructions)
+
+**API Gateway & Ingress:**
+- Traefik 3.x - Ingress controller and reverse proxy (`stacks/platform/modules/traefik/`)
+- MetalLB - Load balancer for Kubernetes service IPs (`stacks/platform/modules/metallb/`)
+
+**Security:**
+- Authentik - Identity Provider/OIDC (`stacks/platform/modules/authentik/`)
+- Vaultwarden 1.35.2 - Password manager (`stacks/platform/modules/vaultwarden/`)
+- CrowdSec - Intrusion detection and IP reputation (`stacks/platform/modules/crowdsec/`)
+- Kyverno - Policy enforcement and governance (`stacks/platform/modules/kyverno/`)
+
+**Container Images Registry:**
+- docker.io - Docker Hub public images
+- ghcr.io - GitHub Container Registry (Headscale UI, Immich, etc.)
+- quay.io - Quay.io registry (inferred from mirror config)
+- registry.k8s.io - Kubernetes images
+- Local pull-through cache at `10.0.20.10` (ports 5000/5010/5020/5030/5040)
+
+## Configuration
+
+**Environment:**
+- `terraform.tfvars` (git-crypt encrypted) - All secrets, API keys, DNS records, passwords
+- Environment variables injected into Kubernetes pods via ConfigMap/Secret
+- Kubeconfig: `config` file in repo root (referenced as `$PWD/config` in terragrunt)
+
+**Build:**
+- `terragrunt.hcl` (root) - DRY Terraform provider and backend configuration
+- `stacks/<service>/terragrunt.hcl` - Per-stack overrides
+- `stacks/<service>/main.tf` - Kubernetes/Proxmox resource definitions
+- `.terraform.lock.hcl` - Provider version lock (Terraform 1.x)
+- `.terraform/` - Downloaded providers cached locally
+
+**Secrets:**
+- `secrets/` directory (git-crypt encrypted)
+- TLS certificates and keys in `secrets/` (symlinked from stacks)
+- OpenDKIM keys for mailserver
+- NFS export configuration in `secrets/nfs_directories.txt`
+
+## Platform Requirements
+
+**Development:**
+- Terraform 1.x CLI
+- Terragrunt CLI (uses `terragrunt apply --non-interactive`)
+- kubectl configured with kubeconfig at `$PWD/config`
+- git-crypt for secret decryption
+- curl, bash, standard Unix utilities
+
+**Production:**
+- Kubernetes 1.34.2+ cluster (5 nodes, 192 GB+ total memory)
+- Proxmox 8.x hypervisor (`stacks/infra/` provisions VMs)
+- NFS storage: TrueNAS at `10.0.10.15` with exports at `/mnt/main/`
+- Docker registry pull-through cache at `10.0.20.10`
+- Cloudflare DNS (public domain `viktorbarzin.me`)
+- Technitium DNS (internal domain `viktorbarzin.lan`)
+
+**Networking:**
+- Kubernetes pod CIDR: managed by cluster
+- Service IPs: 10.0.20.200-10.0.20.220 (MetalLB layer 2)
+- Internal DNS: Technitium at cluster IP
+- External DNS: Cloudflare tunnel + traditional DNS records
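As a quick sanity check on the MetalLB range listed above, the following sketch counts the addresses in the pool and tests membership. It assumes the range is inclusive at both ends; `pool_addresses` and `in_pool` are illustrative names, not part of the repo.

```python
import ipaddress

# Range taken from the "Service IPs" line above; inclusive bounds are an assumption.
POOL_START = ipaddress.IPv4Address("10.0.20.200")
POOL_END = ipaddress.IPv4Address("10.0.20.220")


def pool_addresses(start=POOL_START, end=POOL_END):
    """Return every address in the inclusive range [start, end]."""
    return [ipaddress.IPv4Address(i) for i in range(int(start), int(end) + 1)]


def in_pool(ip: str, start=POOL_START, end=POOL_END) -> bool:
    """True if `ip` falls inside the inclusive pool range."""
    return start <= ipaddress.IPv4Address(ip) <= end


if __name__ == "__main__":
    print(len(pool_addresses()), "addresses available for LoadBalancer services")
    # The registry cache address from the document lies outside the pool.
    print("10.0.20.10 in pool:", in_pool("10.0.20.10"))
```

Under the inclusive-range assumption the pool holds 21 addresses, which bounds how many LoadBalancer services can hold a dedicated IP at once.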
+
+---
+
+*Stack analysis: 2026-02-23*
--- a/.planning/codebase/STRUCTURE.md
+++ b/.planning/codebase/STRUCTURE.md
@ -0,0 +1,255 @@
+# Codebase Structure
+
+**Analysis Date:** 2026-02-23
+
+## Directory Layout
+
+```
+/Users/viktorbarzin/code/infra/
+├── .claude/                     # Project-level Claude knowledge (skills, reference docs)
+├── .git/                        # Git repository metadata
+├── .git-crypt/                  # git-crypt encryption keys
+├── .planning/codebase/          # GSD codebase analysis documents
+├── .terraform/                  # Terraform cache (gitignored)
+├── .woodpecker/                 # CI/CD pipeline definitions
+├── cli/                         # Custom CLI tools (bash/python scripts)
+├── diagram/                     # Infrastructure diagram sources
+├── docs/                        # Documentation (deployment guides, design docs)
+├── modules/                     # Shared Terraform modules (Proxmox, K8s utilities)
+├── playbooks/                   # Ansible playbooks (infrastructure setup)
+├── scripts/                     # Maintenance scripts (healthcheck, DNS updates, etc.)
+├── secrets/                     # git-crypt encrypted files (NFS dirs, TLS certs, SSH keys)
+├── stacks/                      # Terragrunt stacks (platform + ~70 service stacks)
+├── state/                       # Terraform state files (local backend, gitignored)
+├── terragrunt.hcl              # Root Terragrunt config (DRY provider/backend setup)
+├── terraform.tfvars            # All variables + secrets (git-crypt encrypted, ~48KB)
+├── config                      # Kubernetes config (kubeconfig file)
+├── README.md                   # Project overview
+└── package.json               # Node.js deps (minimal; mostly for cli tools)
+```
+
+## Directory Purposes
+
+**`.claude/`:**
+- Purpose: Project-level Claude knowledge and execution skills
+- Contains: `skills/` (setup-project, authentik workflows), `reference/` (inventory tables, API patterns)
+- Key files: `CLAUDE.md` (this file's counterpart with full infrastructure context)
+
+**`.planning/codebase/`:**
+- Purpose: GSD codebase analysis output directory
+- Contains: `ARCHITECTURE.md`, `STRUCTURE.md` (this file), and focus-specific docs
+- Auto-generated: Yes (by /gsd:map-codebase)
+
+**`modules/`:**
+- Purpose: Reusable Terraform modules for VM creation and Kubernetes utilities
+- Contains:
+  - `create-template-vm/`: Cloud-init Ubuntu template VM provisioning (K8s + non-K8s)
+  - `create-vm/`: VM instance creation from templates with cloud-init injection
+  - `docker-registry/`: Docker registry pull-through cache setup
+  - `kubernetes/`: K8s-specific utilities (ingress_factory, setup_tls_secret)
+
+**`stacks/`:**
+- Purpose: Terragrunt stacks with isolated state and per-service configuration
+- Contains: 1 platform stack + ~70 application stacks
+- Structure: Each stack is a directory with `terragrunt.hcl` + `main.tf` + optional `factory/` (for multi-instance services)
+
+**`stacks/platform/`:**
+- Purpose: Core infrastructure services (22 modules)
+- Contains: Modules for MetalLB, DBaaS, Redis, Traefik, DNS, VPN, auth, monitoring, security
+- Key subdirs: `modules/` (platform-specific modules like traefik, authentik, monitoring)
+
+**`stacks/infra/`:**
+- Purpose: Proxmox VM template and instance provisioning
+- Contains: K8s node templates, docker-registry VM, Proxmox provider configuration
+
+**`stacks/<service>/`:**
+- Purpose: Single application stack with isolated state
+- Pattern: `terragrunt.hcl` (includes root, declares dependencies) + `main.tf` (resources) + optional `factory/` + optional `chart_values.yaml`
+- Examples: `nextcloud/`, `immich/`, `matrix/`, `actualbudget/` (multi-tenant), etc.
+
+**`secrets/`:**
+- Purpose: git-crypt encrypted sensitive files
+- Contains: TLS certificates/keys, NFS export list, SSH keys, DKIM keys, Postfix config
+- Key files:
+  - `nfs_directories.txt`: List of NFS shares (sorted); regenerate exports with `nfs_exports.sh`
+  - `tls/`: TLS certificate chain and keys
+  - `mailserver/`: OpenDKIM keys, Postfix SASL creds
+
+**`scripts/`:**
+- Purpose: Operational and maintenance automation
+- Key scripts:
+  - `cluster_healthcheck.sh`: 24-point cluster health status
+  - `renew2.sh`: TLS certificate renewal via certbot + Cloudflare
+  - `setup_certs.sh`: Initial certificate setup
+  - `pve_*`: Proxmox management scripts
+  - `ha_*`: Home Assistant integration scripts
+
+**`docs/`:**
+- Purpose: Design and deployment documentation
+- Contains: High-level architecture diagrams, deployment guides, troubleshooting
+
+**`cli/`:**
+- Purpose: Custom CLI utilities
+- Contains: Python/bash scripts for common operations (DNS management, NFS, etc.)
+
+## Key File Locations
+
+**Entry Points:**
+- `terragrunt.hcl`: Root Terragrunt config; invoked by `terragrunt apply` in any stack directory
+- `stacks/platform/main.tf`: Platform stack; applies 22 core modules
+- `stacks/infra/main.tf`: Infrastructure stack; creates VM templates and docker-registry VM
+
+**Configuration:**
+- `terraform.tfvars`: Central variables file (~48KB, git-crypt encrypted). Used by all stacks. Contains: Cloudflare credentials, DNS records, service secrets, TLS secret name
+- `stacks/<service>/terragrunt.hcl`: Stack-specific Terragrunt config (includes root, declares `dependency` blocks)
+- `stacks/platform/modules/<service>/main.tf`: Platform module implementation (22 modules)
+
+**Core Logic:**
+- `stacks/platform/main.tf`: 1000+ lines; instantiates all 22 platform modules
+- `stacks/<service>/main.tf`: 30–450 lines; creates namespaces, Helm releases, Kubernetes resources
+- `stacks/<service>/factory/main.tf`: Multi-instance service pattern; called multiple times with different parameters
+- `modules/kubernetes/ingress_factory/main.tf`: Traefik ingress + service template with security defaults
+
+**Testing & Validation:**
+- `.woodpecker/`: CI/CD pipeline (runs the platform stack apply on merge to master)
+- `scripts/cluster_healthcheck.sh`: Manual cluster health validation
+
+**Kubernetes & Cluster Config:**
+- `config`: Kubeconfig file for cluster access
+- Namespace pattern: One namespace per service stack
+- TLS secret: `tls-secret` injected into all namespaces via `setup_tls_secret` module
+
+## Naming Conventions
+
+**Files:**
+- `main.tf`: Primary Terraform resource file per stack
+- `terragrunt.hcl`: Terragrunt-specific configuration (includes root, dependencies)
+- `terraform.tfvars`: Global variables (git-crypt encrypted)
+- `chart_values.yaml`: Helm chart values template (uses templatefile for variable substitution)
+- `*_values.tpl`: Helm values template (evaluated with templatefile)
+- `.terraform.lock.hcl`: Provider lock file (one per stack)
+
+**Directories:**
+- `stacks/<service>/`: Kebab-case service names (e.g., `real-estate-crawler`, `k8s-dashboard`)
+- `stacks/platform/modules/<service>/`: Kebab-case module names
+- `state/stacks/<service>/`: Mirrored state directory structure
+- `secrets/`: Single top-level directory for all encrypted files
+- `modules/kubernetes/`, `modules/create-template-vm/`: Category-based grouping
+
+**Terraform Resources:**
+- **Kubernetes**: `kubernetes_*` (namespace, deployment, service, configmap, etc.)
+- **Helm**: `helm_release` (Helm chart deployments)
+- **Local files**: `local_file` (for generated scripts and configs)
+- **Module calls**: `module "<short-name>"` (e.g., `module "traefik"`, `module "redis"`)
+
+**Variables:**
+- Snake_case: `tls_secret_name`, `crowdsec_api_key`, `nextcloud_db_password`
+- Service-prefixed: `<service>_<attribute>` (e.g., `authentik_secret_key`, `mailserver_accounts`)
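The directory and variable conventions above lend themselves to a small lint check. The sketch below is hypothetical (no such check is described in the source) and assumes lowercase ASCII letters and digits only; `is_kebab_case` and `is_snake_case` are illustrative names.

```python
import re

# Assumed definitions: lowercase alphanumeric words joined by a single separator.
KEBAB_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
SNAKE_RE = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)*")


def is_kebab_case(name: str) -> bool:
    """True for names like 'real-estate-crawler' or 'k8s-dashboard'."""
    return KEBAB_RE.fullmatch(name) is not None


def is_snake_case(name: str) -> bool:
    """True for names like 'tls_secret_name' or 'authentik_secret_key'."""
    return SNAKE_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for stack in ["real-estate-crawler", "k8s-dashboard", "webhook_handler"]:
        print(f"{stack}: kebab-case={is_kebab_case(stack)}")
    for var in ["tls_secret_name", "authentik_secret_key"]:
        print(f"{var}: snake_case={is_snake_case(var)}")
```

A single-word name such as `redis` satisfies both patterns, while a directory such as `stacks/webhook_handler/` would not pass the kebab-case check, so a real lint would probably need an allowlist for existing exceptions.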
+
+## Where to Add New Code
+
+**New Service Stack:**
+1. Create `stacks/<service>/` directory
+2. Add `terragrunt.hcl`:
+   ```hcl
+   include "root" {
+     path = find_in_parent_folders()
+   }
+   dependency "platform" {
+     config_path = "../platform"
+     skip_outputs = true
+   }
+   ```
+3. Create `main.tf` with:
+   - Variable declarations for required inputs from `terraform.tfvars`
+   - `locals { tiers = { ... } }` (copy from existing stack)
+   - `kubernetes_namespace` resource with tier label
+   - `module "tls_secret"` call to `../../modules/kubernetes/setup_tls_secret`
+   - Service-specific resources (Helm releases, Deployments, etc.)
+4. Add Cloudflare DNS records in `terraform.tfvars` if needed
+5. Optionally create a `secrets/` symlink: `ln -s ../../secrets secrets`
+6. Apply: `cd stacks/<service> && terragrunt apply --non-interactive`
+
+**Multi-Tenant Service (using Factory Pattern):**
+1. Create parent stack: `stacks/<service>/main.tf` with namespace + TLS setup
+2. Create `stacks/<service>/factory/main.tf` with single-instance logic
+3. In parent, call factory multiple times:
+   ```hcl
+   module "instance1" {
+     source = "./factory"
+     name = "instance1"
+     # ... other params
+   }
+   ```
+4. Example: `stacks/actualbudget/` has factory instantiated for viktor, anca, emo
+
+**New Platform Module:**
+1. Create `stacks/platform/modules/<service>/` directory
+2. Add `main.tf` with resources (Helm chart, namespace, ConfigMaps, etc.)
+3. Add `variables.tf` or declare variables in `main.tf`
+4. In `stacks/platform/main.tf`, add module call:
+   ```hcl
+   module "<service>" {
+     source = "./modules/<service>"
+     tier = local.tiers.<tier>
+     # ... pass required variables
+   }
+   ```
+5. Add variable declarations in `stacks/platform/main.tf`
+
+**New Shared Module:**
+1. Create `modules/kubernetes/<module_name>/` or `modules/terraform/<module_name>/`
+2. Add `main.tf` with reusable resources
+3. Declare clear variable inputs and output any useful values
+4. Call from service stacks: `module "<name>" { source = "../../modules/kubernetes/<module_name>" ... }`
+
+**Utilities & Scripts:**
+- Shared helpers: `scripts/` directory
+- Custom CLI tools: `cli/` directory
+- CI/CD pipelines: `.woodpecker/`
+
+## Special Directories
+
+**`state/`:**
+- Purpose: Terraform state files (local backend)
+- Generated: Yes (automatically by Terragrunt)
+- Committed: No (gitignored; backed up separately)
+- Structure: `state/stacks/<service>/terraform.tfstate`
+
+**`secrets/`:**
+- Purpose: git-crypt encrypted secrets and sensitive config
+- Generated: No (managed manually or via scripts)
+- Committed: Yes (encrypted via git-crypt)
+- Contents: TLS certs, SSH keys, NFS export list, mailserver config, DKIM keys
+
+**`.terraform/`:**
+- Purpose: Terraform provider cache
+- Generated: Yes (by Terraform during init)
+- Committed: No (gitignored)
+
+**`node_modules/`:**
+- Purpose: Node.js dependencies for CLI tools
+- Generated: Yes (by npm install)
+- Committed: No (gitignored; use lockfile)
+
+## File Patterns & Imports
+
+**Terragrunt Patterns:**
+- Include root: `include "root" { path = find_in_parent_folders() }`
+- Declare dependencies: `dependency "platform" { config_path = "../platform"; skip_outputs = true }`
+- Variable access: `var.<name>` in `main.tf` (variables sourced from `terraform.tfvars`)
+
+**Kubernetes Resource Patterns:**
+- Namespace per service: `kubernetes_namespace.<service>` with tier label
+- Helm releases: `helm_release.<chart_name>` with `templatefile` for values
+- Inline NFS volumes: `volume { name = "data"; nfs { server = "10.0.10.15"; path = "/mnt/main/<service>" } }`
+- TLS injection: Every stack calls `module "tls_secret"` to populate namespace secret
+
+**Module Call Pattern:**
+- Standard: `module "<name>" { source = "./modules/<module>" ... }`
+- Platform modules: `source = "./modules/<service>"`
+- Shared modules: `source = "../../modules/kubernetes/<module>"`
+
+---
+
+*Structure analysis: 2026-02-23*
--- a/.planning/codebase/TESTING.md
+++ b/.planning/codebase/TESTING.md
@ -0,0 +1,279 @@
+# Testing Patterns
+
+**Analysis Date:** 2026-02-23
+
+## Test Framework
+
+**Language-Specific Runners:**
+
+**Go:**
+- Runner: `go test` (standard library `testing` package)
+- Config: No config file (uses built-in conventions)
+- Run Commands:
+  ```bash
+  go test ./...                    # Run all tests
+  go test -v ./...                 # Verbose output
+  go test -run TestContains ./...  # Run specific test
+  go test -cover ./...             # Show coverage
+  ```
+
+**Bash:**
+- Runner: Custom shell scripts in `scripts/`
+- No formal test framework; uses `set -euo pipefail` for error handling
+- Manual health checks via `bash scripts/cluster_healthcheck.sh`
+
+**Terraform:**
+- Framework: No automated testing detected (no terraform test files, no tftest.hcl)
+- Validation: Manual `terraform validate`, `terraform plan`, visual inspection
+- Integration: `terragrunt apply` validates the configuration before executing changes
+
+## Test File Organization
+
+**Location:**
+- Go tests: Co-located with source code: `<service>/files/internal/scraper/validate_test.go`
+- Shell/Infrastructure: No test files (manual validation/health checks only)
+
+**Naming:**
+- Go: `*_test.go` suffix
+- Script tests: `.sh` for check/validation scripts
+
+**Structure:**
+```
+stacks/f1-stream/files/internal/scraper/
+├── main.go
+├── validate.go
+└── validate_test.go           # Test file co-located
+```
+
+## Test Structure
+
+**Go Table-Driven Tests:**
+
+```golang
+func TestContainsVideoMarkers(t *testing.T) {
+	tests := []struct {
+		name string
+		body string
+		want bool
+	}{
+		{
+			name: "video tag",
+			body: `<div><video src="stream.mp4"></video></div>`,
+			want: true,
+		},
+		// ... more test cases
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			got := containsVideoMarkers(tt.body)
+			if got != tt.want {
+				t.Errorf("containsVideoMarkers(%q) = %v, want %v", truncate(tt.body, 60), got, tt.want)
+			}
+		})
+	}
+}
+```
+
+**Patterns:**
+- Slice of anonymous structs with `name`, input fields, and `want` for expected result
+- Loop with `t.Run(tt.name, ...)` for individual test case execution and reporting
+- Descriptive test case names: `"video tag"`, `"HLS manifest reference"`, `"empty string"`
+- Positive cases are listed first and negative cases after them, separated by comments
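The same table-driven shape carries over to other languages. Below is a minimal Python sketch using `unittest` sub-tests. The `contains_video_markers` function here is a hypothetical stand-in that only looks for a `<video` tag or an `.m3u8` reference; the real Go function may check more. The three cases are the ones quoted in this document.

```python
import unittest


def contains_video_markers(body: str) -> bool:
    """Hypothetical stand-in: detect a <video> tag or an HLS manifest reference."""
    lowered = body.lower()
    return "<video" in lowered or ".m3u8" in lowered


# One row per case: (name, input, expected result).
CASES = [
    # Positive cases
    ("video tag", '<div><video src="stream.mp4"></video></div>', True),
    ("HLS manifest reference", 'var url = "https://cdn.example.com/live.m3u8";', True),
    # Negative cases
    ("empty string", "", False),
]


class ContainsVideoMarkersTest(unittest.TestCase):
    def test_table(self):
        for name, body, want in CASES:
            # subTest plays the role of t.Run: each row is reported separately.
            with self.subTest(name=name):
                self.assertEqual(contains_video_markers(body), want)
```

Saved as a module, this can be run with `python -m unittest`; adding a case means adding one row to `CASES`, which is the main appeal of the pattern.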
+
+**Bash Health Check Structure:**
+```bash
+check_nodes() {
+    section 1 "Node Status"
+    local nodes not_ready versions unique_versions detail=""
+
+    nodes=$($KUBECTL get nodes --no-headers 2>&1) || { fail "Cannot reach cluster"; json_add "node_status" "FAIL" "Cannot reach cluster"; return 0; }
+    # ... processing
+    if [[ -n "$not_ready" ]]; then
+        fail "NotReady nodes: $not_ready"
+        json_add "node_status" "FAIL" "$detail"
+    elif [[ "$unique_versions" -gt 1 ]]; then
+        warn "Version mismatch..."
+        json_add "node_status" "WARN" "$detail"
+    else
+        pass "All nodes Ready..."
+        json_add "node_status" "PASS" "$detail"
+    fi
+}
+```
+
+**Patterns:**
+- Each check function follows same structure: setup → validation → status reporting
+- Status reported via `pass()`, `warn()`, `fail()` helper functions
+- JSON output optional via `json_add()` for programmatic consumption
+- Error handling inline with `||` fallback and graceful degradation
+
+## Mocking
+
+**Framework:**
+- Go: No mocking framework detected (table-driven tests use real function calls)
+- Bash: External commands mocked implicitly (KUBECONFIG override, kubectl invocation through `$KUBECTL` variable)
+
+**Patterns (Go):**
+- No mock objects or stubs
+- Real function behavior tested directly
+- Test data provided as input in struct fields
+
+**Patterns (Bash):**
+```bash
+# Kubeconfig override allows testing against different clusters
+KUBECTL="kubectl --kubeconfig $KUBECONFIG_PATH"
+nodes=$($KUBECTL get nodes --no-headers 2>&1) || { fail "Cannot reach cluster"; return 0; }
+```
+
+**What NOT to Mock:**
+- Core functionality being tested (test actual behavior)
+- Standard library functions (test integration)
+
+**What to Mock (Bash):**
+- External kubectl calls via variable indirection: allows `KUBECONFIG` override
+- Conditional output by flag: `--json`, `--quiet` flags change output, not behavior
+
+## Fixtures and Factories
+
+**Test Data (Go):**
+- Inline strings in struct fields: HTML content, MIME types
+- Examples from `validate_test.go`:
+  ```golang
+  {
+      name: "HLS manifest reference",
+      body: `var url = "https://cdn.example.com/live.m3u8";`,
+      want: true,
+  },
+  ```
+
+**Location:**
+- Embedded directly in test file as struct field values
+- No separate fixture files or factories
+
+**Bash Fixtures:**
+- Real cluster fixtures: tests run against actual Kubernetes cluster
+- No data files; tests fetch live state via kubectl
+
+## Coverage
+
+**Requirements:** None enforced (no coverage thresholds, targets, or CI/CD gates detected)
+
+**View Coverage (Go):**
+```bash
+go test -cover ./...              # Show coverage percentages
+go test -coverprofile=coverage.out ./...
+go tool cover -html=coverage.out  # Open HTML report
+```
+
+**Note:** Coverage tools not integrated into CI/CD pipeline; manual check only.
+
+## Test Types
+
+**Unit Tests (Go):**
+- Scope: Single function validation
+- Approach: Table-driven with parameterized inputs
+- Example: `TestContainsVideoMarkers()` tests HTML content detection
+- Example: `TestIsDirectVideoContentType()` tests MIME type classification
+- In file: `stacks/f1-stream/files/internal/scraper/validate_test.go`
+
+**Integration Tests:**
+- Bash health checks (`scripts/cluster_healthcheck.sh`) serve as integration tests
+- Tests 24 separate checks against live Kubernetes cluster:
+  - Node status and readiness
+  - Node resource utilization
+  - Container metrics
+  - Pod crash loops
+  - Persistent volume health
+  - DNS resolution
+  - Networking
+  - RBAC
+  - Log aggregation
+- Can run with `--fix` flag for auto-remediation
+- Can output JSON for CI integration
+
+**E2E Tests:**
+- Not formally implemented
+- Manual validation via Terragrunt apply → cluster state verification
+
+**Infrastructure Testing:**
+- Terraform: `terraform validate` and `terraform plan` provide syntax/logic validation
+- Application health: Manual checks via scripts and cluster_healthcheck.sh
+- No automated test suite for infrastructure code
+
+## Common Patterns
+
+**Async Testing (Go):**
+- Not applicable (synchronous function testing only)
+
+**Error Testing (Go):**
+```golang
+{
+    name: "empty string",
+    body: "",
+    want: false,
+},
+```
+- Negative test cases included in same table
+- Error/edge cases named descriptively: `"empty string"`, `"reddit link page"`
+- Expected failure behavior verified: `want: false` for invalid inputs
+
+**Error Reporting (Go):**
+```golang
+t.Errorf("containsVideoMarkers(%q) = %v, want %v", truncate(tt.body, 60), got, tt.want)
+```
+- Formatted message includes: function name, input (truncated), actual, expected
+- Test name automatically prefixed by `t.Run(tt.name, ...)`
+
+**Status Reporting (Bash):**
+- Color-coded status: `${GREEN}[PASS]${NC}`, `${YELLOW}[WARN]${NC}`, `${RED}[FAIL]${NC}`
+- Counter incremented per status
+- Optional quiet mode (`--quiet`) suppresses PASS output
+- Optional JSON output (`--json`) for CI integration
+- Summary printed at end: `$PASS_COUNT/$WARN_COUNT/$FAIL_COUNT`
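To make the reporting flow above concrete, here is a rough Python model of it. It is not the script's actual implementation: the class name `Reporter`, the method names, and the JSON field names are all assumptions, and color codes are omitted.

```python
import json


class Reporter:
    """Hypothetical model of the pass/warn/fail reporting described above."""

    def __init__(self, quiet: bool = False):
        self.quiet = quiet
        self.counts = {"PASS": 0, "WARN": 0, "FAIL": 0}
        self.results = []  # collected entries for optional JSON output

    def _record(self, status: str, check: str, detail: str) -> None:
        self.counts[status] += 1
        self.results.append({"check": check, "status": status, "detail": detail})
        # Quiet mode suppresses PASS lines only; WARN and FAIL are always printed.
        if not (self.quiet and status == "PASS"):
            print(f"[{status}] {detail}")

    def ok(self, check: str, detail: str) -> None:
        self._record("PASS", check, detail)

    def warn(self, check: str, detail: str) -> None:
        self._record("WARN", check, detail)

    def fail(self, check: str, detail: str) -> None:
        self._record("FAIL", check, detail)

    def summary(self) -> str:
        # Same ordering as the summary line: PASS/WARN/FAIL
        c = self.counts
        return f"{c['PASS']}/{c['WARN']}/{c['FAIL']}"

    def to_json(self) -> str:
        return json.dumps(self.results)


if __name__ == "__main__":
    r = Reporter(quiet=True)
    r.ok("node_status", "All nodes Ready")
    r.warn("node_versions", "Version mismatch")
    r.fail("dns", "Cannot resolve service name")
    print("Summary:", r.summary())
```

Keeping the counters and the JSON records in one place means the quiet and JSON flags only change what is emitted, not which checks run, which matches the behavior described for `--quiet` and `--json`.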
+
+## Running Tests
+
+**Go Tests:**
+```bash
+# From service directory containing *_test.go
+go test -v ./...
+```
+
+**Bash Health Checks:**
+```bash
+# Comprehensive checks
+bash scripts/cluster_healthcheck.sh
+
+# Quiet mode (WARN/FAIL only)
+bash scripts/cluster_healthcheck.sh --quiet
+
+# Auto-fix mode
+bash scripts/cluster_healthcheck.sh --fix
+
+# JSON output
+bash scripts/cluster_healthcheck.sh --json
+
+# Custom kubeconfig
+bash scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
+```
+
+**Terraform Validation:**
+```bash
+# Format check
+terraform fmt -check -recursive
+
+# Syntax validation
+terraform validate
+
+# Plan without apply
+terraform plan
+
+# From stack directory
+cd stacks/<service> && terragrunt plan
+cd stacks/<service> && terragrunt apply --non-interactive
+```
+
+---
+
+*Testing analysis: 2026-02-23*
--- a/.planning/config.json
+++ b/.planning/config.json
@ -0,0 +1,12 @@
+{
+  "mode": "yolo",
+  "depth": "comprehensive",
+  "parallelization": false,
+  "commit_docs": true,
+  "model_profile": "quality",
+  "workflow": {
+    "research": true,
+    "plan_check": true,
+    "verifier": true
+  }
+}
--- a/.planning/phases/01-infrastructure-and-deployment/01-01-PLAN.md
+++ b/.planning/phases/01-infrastructure-and-deployment/01-01-PLAN.md
@ -0,0 +1,173 @@
+---
+phase: 01-infrastructure-and-deployment
+plan: 01
+type: execute
+wave: 1
+depends_on: []
+files_modified:
+  - stacks/f1-stream/files/backend/main.py
+  - stacks/f1-stream/files/backend/requirements.txt
+  - stacks/f1-stream/files/Dockerfile
+  - stacks/f1-stream/files/redeploy.sh
+autonomous: true
+requirements:
+  - DEPL-01
+
+must_haves:
+  truths:
+    - "A Docker image viktorbarzin/f1-stream:v2.0.0 exists and can be pulled"
+    - "The image starts a FastAPI server on port 8000 that responds to GET /health with 200"
+    - "The image is based on python:3.13-slim-bookworm and runs without errors"
+  artifacts:
+    - path: "stacks/f1-stream/files/backend/main.py"
+      provides: "FastAPI app with health endpoint"
+      contains: "/health"
+    - path: "stacks/f1-stream/files/backend/requirements.txt"
+      provides: "Python dependencies"
+      contains: "fastapi"
+    - path: "stacks/f1-stream/files/Dockerfile"
+      provides: "Single-stage Docker build for Python FastAPI"
+      contains: "python:3.13-slim-bookworm"
+    - path: "stacks/f1-stream/files/redeploy.sh"
+      provides: "Build, push, restart script"
+      contains: "docker build"
+  key_links:
+    - from: "stacks/f1-stream/files/Dockerfile"
+      to: "stacks/f1-stream/files/backend/main.py"
+      via: "COPY backend/ into image"
+      pattern: "COPY.*backend"
+    - from: "stacks/f1-stream/files/Dockerfile"
+      to: "stacks/f1-stream/files/backend/requirements.txt"
+      via: "pip install requirements"
+      pattern: "pip install.*requirements"
+---
+
+<objective>
+Create a minimal FastAPI backend application with a health endpoint and build a Docker image for it. This replaces the existing Go-based f1-stream application with the new Python/FastAPI stack.
+
+Purpose: Provide a deployable container image that the Terraform stack (Plan 02) will reference. The health endpoint proves the service is running correctly.
+Output: Docker image `viktorbarzin/f1-stream:v2.0.0` pushed to Docker Hub, containing a working FastAPI server.
+</objective>
+
+<execution_context>
+@/Users/viktorbarzin/.claude/get-shit-done/workflows/execute-plan.md
+@/Users/viktorbarzin/.claude/get-shit-done/templates/summary.md
+</execution_context>
+
+<context>
+@.planning/PROJECT.md
+@.planning/ROADMAP.md
+@.planning/STATE.md
+@.planning/research/STACK.md
+
+# Existing files to replace/modify:
+@stacks/f1-stream/files/Dockerfile
+@stacks/f1-stream/files/redeploy.sh
+</context>
+
+<tasks>
+
+<task type="auto">
+  <name>Task 1: Create FastAPI backend application with health endpoint</name>
+  <files>stacks/f1-stream/files/backend/main.py, stacks/f1-stream/files/backend/requirements.txt</files>
+  <action>
+Create the directory `stacks/f1-stream/files/backend/`.
+
+Create `stacks/f1-stream/files/backend/requirements.txt` (`fastapi` is pinned; `uvicorn[standard]` is left unpinned):
+```
+fastapi==0.132.0
+uvicorn[standard]
+```
+
+Create `stacks/f1-stream/files/backend/main.py` with a minimal FastAPI application:
+- Import FastAPI
+- Create app instance with title "F1 Streams"
+- Add `GET /health` endpoint that returns `{"status": "ok"}`
+- Add `GET /` root endpoint that returns `{"service": "f1-streams", "version": "2.0.0"}`
+- Add an `if __name__ == "__main__"` block that runs uvicorn on host 0.0.0.0 port 8000
+
+This is intentionally minimal -- just enough to prove the deployment works. Later phases will add schedule, extractor, and proxy routes.
+
+Do NOT add any other dependencies or routes beyond the health check and root. Keep it simple.
+  </action>
+  <verify>
+Run `python3 -c "import ast; ast.parse(open('stacks/f1-stream/files/backend/main.py').read()); print('Syntax OK')"` to verify the Python file is valid.
+Verify requirements.txt exists and contains fastapi and uvicorn.
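
A check along the following lines could confirm whether every requirement carries an exact pin. This is only a sketch and not part of the plan: the helper name `unpinned_requirements` is hypothetical, and it treats only an exact `==` specifier as a pin.

```python
from typing import List


def unpinned_requirements(requirements_text: str) -> List[str]:
    """Return requirement lines that lack an exact '==' version pin."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


# The requirements.txt content listed in this task.
example = "fastapi==0.132.0\nuvicorn[standard]\n"
print(unpinned_requirements(example))
```

For the file in this task the helper reports `uvicorn[standard]`, which has no version specifier.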
+  </verify>
+  <done>
+`backend/main.py` exists with a valid FastAPI app that has `/health` and `/` endpoints. `requirements.txt` lists fastapi and uvicorn.
+  </done>
+</task>
+
+<task type="auto">
+  <name>Task 2: Create Dockerfile and build/push the container image</name>
+  <files>stacks/f1-stream/files/Dockerfile, stacks/f1-stream/files/redeploy.sh</files>
+  <action>
+Replace the existing Go Dockerfile at `stacks/f1-stream/files/Dockerfile` with a Python-based Dockerfile:
+
+```dockerfile
+FROM python:3.13-slim-bookworm
+
+WORKDIR /app
+
+COPY backend/requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+COPY backend/ ./backend/
+
+EXPOSE 8000
+
+CMD ["uvicorn", "backend.main:app", "--host", "0.0.0.0", "--port", "8000"]
+```
+
+Key points:
+- Single-stage build (no build stage needed for Python -- interpreted language)
+- Use `python:3.13-slim-bookworm` as base (from stack research)
+- Install deps first for Docker layer caching
+- Expose port 8000 (uvicorn's default port, different from the old Go app's 8080)
+- Run via uvicorn pointing to `backend.main:app`
+
+Update `stacks/f1-stream/files/redeploy.sh`:
+```bash
+#!/usr/bin/env bash
+set -e
+
+docker build -t viktorbarzin/f1-stream:v2.0.0 -t viktorbarzin/f1-stream:latest .
+docker push viktorbarzin/f1-stream:v2.0.0
+docker push viktorbarzin/f1-stream:latest
+kubectl -n f1-stream rollout restart deployment f1-stream
+```
+
+Then build and push the image from the `stacks/f1-stream/files/` directory. Do not run `redeploy.sh` as-is for this step: run only the docker build and push commands and skip the kubectl rollout, which happens after Terraform apply in Plan 02.
+
+Build command: `cd stacks/f1-stream/files && docker build -t viktorbarzin/f1-stream:v2.0.0 -t viktorbarzin/f1-stream:latest . && docker push viktorbarzin/f1-stream:v2.0.0 && docker push viktorbarzin/f1-stream:latest`
+
+IMPORTANT: The old Go application files (main.go, go.mod, go.sum, internal/, node_modules/, package.json, package-lock.json, index.html, static/) should be removed from `stacks/f1-stream/files/` since they are no longer needed. Keep only: Dockerfile, redeploy.sh, and backend/.
+  </action>
+  <verify>
+Run `docker images | grep f1-stream` to confirm the image was built.
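+Run `docker run --rm -d -p 18000:8000 --name f1-test viktorbarzin/f1-stream:v2.0.0 && sleep 2 && curl -sf http://localhost:18000/health; docker stop f1-test` to verify the container starts and the health endpoint responds. The `-f` flag makes curl fail on an HTTP error response (4xx/5xx), and the `;` stops the test container even if the check fails.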
+  </verify>
+  <done>
+Docker image `viktorbarzin/f1-stream:v2.0.0` is built, pushed to Docker Hub, and serves a 200 response on GET /health when run locally.
+  </done>
+</task>
+
+</tasks>
+
+<verification>
+1. `docker images | grep f1-stream` shows the v2.0.0 tag
+2. Running the image locally and curling /health returns `{"status":"ok"}`
+3. The old Go files have been removed from `stacks/f1-stream/files/`
+4. Only Dockerfile, redeploy.sh, and backend/ remain in the files directory
+</verification>
+
+<success_criteria>
+- Docker image viktorbarzin/f1-stream:v2.0.0 exists on Docker Hub
+- The image runs a FastAPI server on port 8000 with a working /health endpoint
+- Old Go application files are cleaned up
+</success_criteria>
+
+<output>
+After completion, create `.planning/phases/01-infrastructure-and-deployment/01-01-SUMMARY.md`
+</output>
--- a/.planning/phases/01-infrastructure-and-deployment/01-02-PLAN.md
+++ b/.planning/phases/01-infrastructure-and-deployment/01-02-PLAN.md
@ -0,0 +1,235 @@
+---
+phase: 01-infrastructure-and-deployment
+plan: 02
+type: execute
+wave: 2
+depends_on:
+  - "01-01"
+files_modified:
+  - stacks/f1-stream/main.tf
+  - .woodpecker/f1-stream.yml
+autonomous: true
+requirements:
+  - DEPL-01
+  - DEPL-02
+
+must_haves:
+  truths:
+    - "A request to https://f1.viktorbarzin.me/health returns HTTP 200 with JSON {status: ok}"
+    - "The Terragrunt stack applies cleanly with no errors"
+    - "A file written to /data inside the pod survives a pod restart"
+    - "Woodpecker CI pipeline triggers on push for the f1-stream directory"
+  artifacts:
+    - path: "stacks/f1-stream/main.tf"
+      provides: "Kubernetes deployment, service, ingress, TLS for f1-stream"
+      contains: "viktorbarzin/f1-stream:v2.0.0"
+    - path: ".woodpecker/f1-stream.yml"
+      provides: "CI pipeline for f1-stream service"
+      contains: "f1-stream"
+  key_links:
+    - from: "stacks/f1-stream/main.tf"
+      to: "Docker Hub viktorbarzin/f1-stream:v2.0.0"
+      via: "kubernetes_deployment image reference"
+      pattern: "viktorbarzin/f1-stream:v2.0.0"
+    - from: "stacks/f1-stream/main.tf"
+      to: "NFS /mnt/main/f1-stream"
+      via: "inline NFS volume mount"
+      pattern: "/mnt/main/f1-stream"
+    - from: "stacks/f1-stream/main.tf"
+      to: "modules/kubernetes/ingress_factory"
+      via: "ingress module call"
+      pattern: "ingress_factory"
+---
+
+<objective>
+Update the Terraform stack to deploy the new Python/FastAPI container, verify NFS mount persistence, and add a Woodpecker CI pipeline. This completes Phase 1 by making the service live on the cluster and reachable at its public URL.
+
+Purpose: The service must be running on the Kubernetes cluster, reachable at f1.viktorbarzin.me, with NFS storage mounted and CI/CD in place -- ready for application development in Phase 2.
+Output: Live deployment at f1.viktorbarzin.me, NFS-backed persistent storage, Woodpecker CI pipeline.
+</objective>
+
+<execution_context>
+@/Users/viktorbarzin/.claude/get-shit-done/workflows/execute-plan.md
+@/Users/viktorbarzin/.claude/get-shit-done/templates/summary.md
+</execution_context>
+
+<context>
+@.planning/PROJECT.md
+@.planning/ROADMAP.md
+@.planning/STATE.md
+@.planning/phases/01-infrastructure-and-deployment/01-01-SUMMARY.md
+
+# Key reference files:
+@stacks/f1-stream/main.tf
+@stacks/f1-stream/terragrunt.hcl
+@.woodpecker/build-cli.yml
+@.woodpecker/default.yml
+</context>
+
+<tasks>
+
+<task type="auto">
+  <name>Task 1: Update Terraform deployment for Python/FastAPI and verify NFS mount</name>
+  <files>stacks/f1-stream/main.tf</files>
+  <action>
+Modify `stacks/f1-stream/main.tf` to update the deployment for the new Python/FastAPI application:
+
+1. **Change the container image** from `viktorbarzin/f1-stream:v1.3.1` to `viktorbarzin/f1-stream:v2.0.0`
+
+2. **Change the container port** from 8080 to 8000 (FastAPI/uvicorn default)
+
+3. **Update the service target_port** from 8080 to 8000
+
+4. **Remove old Go-specific environment variables** that are no longer needed:
+   - Remove `WEBAUTHN_RPID`
+   - Remove `WEBAUTHN_ORIGIN`
+   - Remove `WEBAUTHN_DISPLAY_NAME`
+   - Remove `HEADLESS_EXTRACT_ENABLED`
+   - Remove `TURN_URL`
+   - Remove `TURN_SHARED_SECRET`
+   - Remove `TURN_INTERNAL_URL`
+
+5. **Remove unused variables** from the top of the file:
+   - Remove `variable "coturn_turn_secret"` (was for WebRTC/TURN)
+   - Remove `variable "public_ip"` (was for TURN URL)
+   - Keep `variable "tls_secret_name"` and `variable "nfs_server"` (still needed)
+
+6. **Keep the NFS volume mount** exactly as-is -- it already follows the inline NFS pattern:
+   ```hcl
+   volume {
+     name = "data"
+     nfs {
+       server = var.nfs_server
+       path   = "/mnt/main/f1-stream"
+     }
+   }
+   ```
+   The volume_mount at `/data` stays the same.
+
+7. **Update resource limits** for Python:
+   ```hcl
+   resources {
+     limits = {
+       cpu    = "500m"
+       memory = "256Mi"
+     }
+     requests = {
+       cpu    = "50m"
+       memory = "64Mi"
+     }
+   }
+   ```
+   Python/FastAPI with uvicorn needs less CPU than Go+Chromium but similar memory.
+
+8. **Keep everything else unchanged**: namespace, service, tls_secret module, ingress module.
+
+After editing, apply the Terraform stack:
+```bash
+cd stacks/f1-stream && terragrunt apply --non-interactive
+```
+
+Wait for the deployment to roll out:
+```bash
+kubectl --kubeconfig $(pwd)/config -n f1-stream rollout status deployment/f1-stream --timeout=120s
+```
+
+Verify the pod is running:
+```bash
+kubectl --kubeconfig $(pwd)/config -n f1-stream get pods
+```
+
+Verify the health endpoint responds through the public URL:
+```bash
+curl -s https://f1.viktorbarzin.me/health
+```
+
+Verify NFS mount persistence by writing a test file, restarting the pod, and reading it back:
+```bash
+POD=$(kubectl --kubeconfig $(pwd)/config -n f1-stream get pods -l app=f1-stream -o jsonpath='{.items[0].metadata.name}')
+kubectl --kubeconfig $(pwd)/config -n f1-stream exec $POD -- sh -c 'echo "nfs-test-$(date +%s)" > /data/test-file.txt && cat /data/test-file.txt'
+kubectl --kubeconfig $(pwd)/config -n f1-stream rollout restart deployment/f1-stream
+kubectl --kubeconfig $(pwd)/config -n f1-stream rollout status deployment/f1-stream --timeout=120s
+NEW_POD=$(kubectl --kubeconfig $(pwd)/config -n f1-stream get pods -l app=f1-stream -o jsonpath='{.items[0].metadata.name}')
+kubectl --kubeconfig $(pwd)/config -n f1-stream exec $NEW_POD -- cat /data/test-file.txt
+```
+The test file should contain the same content after the pod restart.
+  </action>
+  <verify>
+1. `terragrunt apply` exits with 0 (no errors)
+2. `kubectl get pods -n f1-stream` shows 1/1 Running
+3. `curl -s https://f1.viktorbarzin.me/health` returns `{"status":"ok"}`
+4. NFS persistence test passes (file survives pod restart)
+  </verify>
+  <done>
+The f1-stream deployment is running on the cluster with the new Python/FastAPI image, reachable at https://f1.viktorbarzin.me/health, and the NFS volume at /data persists data across pod restarts.
+  </done>
+</task>
+
+<task type="auto">
+  <name>Task 2: Create Woodpecker CI pipeline for f1-stream</name>
+  <files>.woodpecker/f1-stream.yml</files>
+  <action>
+Create `.woodpecker/f1-stream.yml` following the pattern from `build-cli.yml`:
+
+```yaml
+when:
+  event: push
+  path:
+    include:
+      - "stacks/f1-stream/files/**"
+
+clone:
+  git:
+    image: woodpeckerci/plugin-git
+    settings:
+      attempts: 5
+      backoff: 10s
+
+steps:
+  - name: build-image
+    image: woodpeckerci/plugin-docker-buildx
+    settings:
+      username: "viktorbarzin"
+      password:
+        from_secret: dockerhub-pat
+      repo: viktorbarzin/f1-stream
+      dockerfile: stacks/f1-stream/files/Dockerfile
+      context: stacks/f1-stream/files
+      auto_tag: true
+```
+
+Key differences from the default pipeline:
+- **Path filter**: Only triggers when files under `stacks/f1-stream/files/` change (the application code)
+- **Builds and pushes the Docker image** using the same `woodpeckerci/plugin-docker-buildx` pattern as build-cli.yml
+- **Docker context** points to the `stacks/f1-stream/files/` directory where the Dockerfile lives
+- Does NOT run Terragrunt apply (that is done manually or by the default pipeline for the platform stack)
+  </action>
+  <verify>
+Verify the YAML is valid: `python3 -c "import yaml; yaml.safe_load(open('.woodpecker/f1-stream.yml')); print('YAML OK')"`
+Verify the file exists and references f1-stream correctly.
+  </verify>
+  <done>
+Woodpecker CI pipeline file exists at `.woodpecker/f1-stream.yml`, configured to build and push the Docker image when files under `stacks/f1-stream/files/` change.
+  </done>
+</task>
+
+</tasks>
+
+<verification>
+1. `curl -s https://f1.viktorbarzin.me/health` returns `{"status":"ok"}`
+2. `cd stacks/f1-stream && terragrunt plan --non-interactive` shows no changes (stack is clean)
+3. NFS test file written before pod restart is readable after pod restart
+4. `.woodpecker/f1-stream.yml` exists and is valid YAML
+5. `kubectl --kubeconfig $(pwd)/config -n f1-stream get pods` shows 1/1 Running
+</verification>
+
+<success_criteria>
+- The service is live at https://f1.viktorbarzin.me and responds with 200 on /health
+- Terragrunt stack applies cleanly with no manual cluster intervention
+- NFS volume mount at /data persists data across pod restarts
+- Woodpecker CI pipeline exists for automated image builds
+</success_criteria>
+
+<output>
+After completion, create `.planning/phases/01-infrastructure-and-deployment/01-02-SUMMARY.md`
+</output>
--- a/.planning/quick/1-fix-broken-demo-streams-and-improve-heal/1-PLAN.md
+++ b/.planning/quick/1-fix-broken-demo-streams-and-improve-heal/1-PLAN.md
@ -0,0 +1,47 @@
+# Quick Task 1: Fix Broken Demo Streams and Improve Health Checking
+
+## Objective
+
+Replace the broken Akamai live test stream with a working test stream: its master playlist returns 200, but its variant playlists return 404. Improve the health checker to validate variant playlists so that broken streams are caught before they are displayed to users. Then rebuild and deploy the updated image.
+
+## Context
+
+- The F1 streaming site at f1.viktorbarzin.me has 3 demo streams
+- Akamai live test stream (`cph-p2p-msl.akamaized.net/hls/live/2000341/test/master.m3u8`) has a working master playlist but all variant playlists return 404
+- Current health check only validates the master playlist URL (checks for `#EXTM3U`), missing the broken variants
+- When hls.js tries to load the variant through the proxy, it gets 502 errors
+- The other 2 streams (Big Buck Bunny, Apple Bipbop) work correctly end-to-end
+- Confirmed working replacement: Tears of Steel (`demo.unified-streaming.com/k8s/features/stable/video/tears-of-steel/tears-of-steel.ism/.m3u8`) - all variants return 200
+
+## Tasks
+
+### Task 1: Replace broken Akamai stream URL in demo extractor
+
+**files:** `stacks/f1-stream/files/backend/extractors/demo.py`
+**action:** Replace the Akamai live test stream URL with Tears of Steel. Update the title, quality, and any other metadata.
+**verify:** Run the demo extractor's URL through curl to confirm master and variant playlists both return 200.
+**done:** Demo extractor returns 3 working stream URLs, none of which have broken variants.
+
+Replace:
+- URL: `https://cph-p2p-msl.akamaized.net/hls/live/2000341/test/master.m3u8`
+- Title: "Akamai Live Test Stream"
+- Quality: "" (empty)
+
+With:
+- URL: `https://demo.unified-streaming.com/k8s/features/stable/video/tears-of-steel/tears-of-steel.ism/.m3u8`
+- Title: "Tears of Steel (Test Stream)"
+- Quality: "1080p"
+
+### Task 2: Improve health checker to validate variant playlists
+
+**files:** `stacks/f1-stream/files/backend/health.py`
+**action:** After the existing health check passes (master playlist has `#EXTM3U`), if the playlist is a master playlist (contains `#EXT-X-STREAM-INF:`), extract the first variant URI, resolve it against the master playlist URL if it is relative, and do a HEAD/GET check on it. Mark the stream as unhealthy if the variant returns non-200.
+**verify:** A stream with a broken variant (like the old Akamai one) would be marked `is_live=False`.
+**done:** Health checker validates at least one variant playlist when the stream is a master playlist.
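
The variant check described in Task 2 might look roughly like the sketch below. It is not the contents of `health.py`: the function names `first_variant_url` and `variant_is_reachable` are hypothetical, only the standard library is used, and the network check uses a plain GET rather than HEAD.

```python
from typing import Optional
from urllib.parse import urljoin
import urllib.request


def first_variant_url(master_url: str, playlist_text: str) -> Optional[str]:
    """Return the absolute URL of the first variant in a master playlist, or None."""
    lines = [line.strip() for line in playlist_text.splitlines()]
    for i, line in enumerate(lines):
        if line.startswith("#EXT-X-STREAM-INF:"):
            # The variant URI is the next non-empty line that is not a tag.
            for candidate in lines[i + 1:]:
                if candidate and not candidate.startswith("#"):
                    # Relative URIs are resolved against the master playlist URL.
                    return urljoin(master_url, candidate)
            return None
    return None  # No variants declared, so this is not a master playlist.


def variant_is_reachable(url: str, timeout: float = 5.0) -> bool:
    """GET the variant playlist and report whether it answered with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError (e.g. 404) and timeouts
        return False


sample_master = (
    "#EXTM3U\n"
    "#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360\n"
    "low/index.m3u8\n"
    "#EXT-X-STREAM-INF:BANDWIDTH=2000000\n"
    "high/index.m3u8\n"
)
print(first_variant_url("https://example.com/hls/master.m3u8", sample_master))
```

A health checker could call `first_variant_url` after the `#EXTM3U` check and mark the stream as not live when `variant_is_reachable` returns `False`; a media playlist with no `#EXT-X-STREAM-INF:` tag yields `None` and needs no extra request.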
+
+### Task 3: Rebuild Docker image and deploy
+
+**files:** `stacks/f1-stream/main.tf`
+**action:** Build new Docker image with tag v5.1.0, push to registry, update Terraform deployment image tag, apply the stack.
+**verify:** `curl https://f1.viktorbarzin.me/streams` returns 3 streams all with `is_live: true`. Visit f1.viktorbarzin.me/watch in browser and confirm all 3 streams play.
+**done:** All 3 demo streams are playable in the browser at f1.viktorbarzin.me/watch.
--- a/.planning/quick/resource-audit-live-metrics.md
+++ b/.planning/quick/resource-audit-live-metrics.md
@ -0,0 +1,614 @@
+# Kubernetes Cluster Resource Audit - Live Metrics
+
+**Collected**: 2026-03-01
+**Cluster**: 5 nodes (k8s-master + k8s-node1-4), Kubernetes v1.34.2
+
+---
+
+## EXECUTIVE SUMMARY
+
+### Critical Issues
+
+#### OOMKilled Pods
+| Namespace | Pod | Status |
+|-----------|-----|--------|
+| dbaas | mysql-cluster-0 | OOMKilled (last state) |
+
+#### CrashLoopBackOff / ImagePullBackOff Pods
+| Namespace | Pod | Status |
+|-----------|-----|--------|
+| vpa | vpa-admission-certgen-kdvqj | ImagePullBackOff |
+
+#### Pods with NO Resource Limits (unbounded)
+These pods have `<none>` for CPU and/or memory limits -- they can consume unlimited node resources:
+
+| Namespace | Pod | Container | CPU Limit | Mem Limit |
+|-----------|-----|-----------|-----------|-----------|
+| calico-apiserver | calico-apiserver-*-bq6zp | calico-apiserver | <none> | <none> |
+| calico-apiserver | calico-apiserver-*-q794h | calico-apiserver | <none> | <none> |
+| calico-system | calico-kube-controllers-* | calico-kube-controllers | <none> | <none> |
+| calico-system | calico-node-* (5 pods) | calico-node | <none> | <none> |
+| calico-system | calico-typha-*-9wr7z | calico-typha | <none> | <none> |
+| calico-system | calico-typha-*-hw8wt | calico-typha | <none> | <none> |
+| calico-system | calico-typha-*-z69vx | calico-typha | <none> | <none> |
+| calico-system | csi-node-driver-* (5 pods) | calico-csi, csi-node-driver-registrar | <none> | <none> |
+| kube-system | etcd-k8s-master | etcd | <none> | <none> |
+| kube-system | kube-apiserver-k8s-master | kube-apiserver | <none> | <none> |
+| kube-system | kube-controller-manager-k8s-master | kube-controller-manager | <none> | <none> |
+| kube-system | kube-proxy-* (5 pods) | kube-proxy | <none> | <none> |
+| kube-system | kube-scheduler-k8s-master | kube-scheduler | <none> | <none> |
+| kyverno | kyverno-admission-controller-* (2 pods) | kyverno | <none> (CPU) | 768Mi |
+| kyverno | kyverno-background-controller-* | controller | <none> (CPU) | 128Mi |
+| kyverno | kyverno-cleanup-controller-* | controller | <none> (CPU) | 128Mi |
+| kyverno | kyverno-reports-controller-* | controller | <none> (CPU) | 128Mi |
+| metallb-system | controller-* | controller | <none> | <none> |
+| metallb-system | speaker-dn9bk | speaker | <none> | <none> |
+| metallb-system | speaker-mnpsl | speaker | <none> | <none> |
+| metallb-system | speaker-pl8dz | speaker | <none> | <none> |
+| nvidia | nvidia-driver-daemonset-x2r6b | nvidia-driver-ctr | <none> | <none> |
+
+**Note**: kube-system and calico-system pods without limits are standard for control-plane components. The NVIDIA driver daemonset is also expected. MetalLB pods without limits should be monitored.
+
+#### Pods Near or Exceeding Memory Limits (>75% utilization)
+
+| Namespace | Pod | Current Usage | Memory Limit | % Used |
+|-----------|-----|--------------|--------------|--------|
+| dbaas | mysql-cluster-0 | 1845Mi | 2Gi (sidecar:512Mi + mysql:2Gi) | ~90% of mysql container |
+| dbaas | mysql-cluster-2 | 1212Mi | 2Gi (sidecar:512Mi + mysql:2Gi) | ~59% of mysql container |
+| dbaas | mysql-cluster-1 | 1083Mi | 2Gi (sidecar:512Mi + mysql:2Gi) | ~53% of mysql container |
+| dashy | dashy-* | 1048Mi | 4Gi | 26% but NOTE: 490m CPU near 500m limit (98%) |
+| onlyoffice | onlyoffice-document-server-* | 1007Mi | 4Gi | 25% |
+| stirling-pdf | stirling-pdf-* | 902Mi | 4Gi | 22% |
+| trading-bot | trading-bot-workers-* | 1901Mi | 2Gi (sentiment-analyzer) | ~93% of largest container |
+| authentik | goauthentik-server-*-x68p7 | 593Mi | 1Gi | 58% |
+| authentik | goauthentik-server-*-4bjll | 583Mi | 1Gi | 57% |
+| authentik | goauthentik-server-*-z68g8 | 548Mi | 1Gi | 54% |
+| authentik | goauthentik-worker-*-klk6z | 551Mi | 1Gi | 54% |
+| servarr | flaresolverr-* | 148Mi | 256Mi | 58% |
+| speedtest | speedtest-* | 147Mi | ~1.2Gi | 12% |
+| cnpg-system | cnpg-cloudnative-pg-* | 72Mi | 256Mi | 28% |
+| mailserver | mailserver-* | 183Mi | 256Mi+256Mi | 36% of combined limit |
+| vpa | vpa-recommender-* | 74Mi | 512Mi | 14% (but the 500Mi request is nearly equal to the limit) |
+
+#### Pods with CPU Near Limit (potential throttling)
+
+| Namespace | Pod | Current CPU | CPU Limit | % Used |
+|-----------|-----|------------|-----------|--------|
+| dashy | dashy-* | 490m | 500m | **98%** -- actively throttling |
+| stirling-pdf | stirling-pdf-* | 299m | 300m | **99.7%** -- actively throttling |
+| frigate | frigate-* | 860m | 8000m | 11% |
+| crowdsec | crowdsec-agent-rkvf2 | 13m | 500m | 3% (but req=limit=500m) |
+| redis | redis-node-0 | 44m | 500m (redis) + 200m (sentinel) | 6% |
+| redis | redis-node-1 | 43m | 1260m (redis) + 140m (sentinel) | 3% |
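
The "% Used" figures in the two tables above are usage divided by limit. A minimal sketch of that arithmetic follows; the helper names are hypothetical, and the parsers handle only the `m`, `Ki`, `Mi`, and `Gi` suffixes that appear in this report, not the full Kubernetes quantity syntax.

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '490m' or '2' (cores) to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


# Binary suffixes expressed in MiB.
_MEMORY_FACTORS_MIB = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}


def parse_memory_mib(quantity: str) -> float:
    """Convert a memory quantity such as '1845Mi' or '2Gi' to MiB."""
    quantity = quantity.strip()
    for suffix, factor in _MEMORY_FACTORS_MIB.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported memory quantity: {quantity!r}")


def percent_of_limit(used: float, limit: float) -> float:
    """Return usage as a percentage of the limit, rounded to one decimal."""
    return round(100 * used / limit, 1)


# mysql-cluster-0 memory against the 2Gi mysql container limit.
print(percent_of_limit(parse_memory_mib("1845Mi"), parse_memory_mib("2Gi")))
# dashy CPU against its 500m limit.
print(percent_of_limit(parse_cpu_millicores("490m"), parse_cpu_millicores("500m")))
```

For multi-container pods the result depends on which limit is used as the denominator (a single container's limit or the sum), so the choice should be stated next to each percentage.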
+
+---
+
+## NODE-LEVEL RESOURCE USAGE
+
+| Node | CPU (cores) | CPU % | Memory | Memory % |
+|------|-------------|-------|--------|----------|
+| k8s-master | 805m | 10% | 5132Mi | 65% |
+| k8s-node1 | 1002m | 6% | 9192Mi | 57% |
+| k8s-node2 | 894m | 11% | 11517Mi | 48% |
+| k8s-node3 | 781m | 9% | 13103Mi | 54% |
+| k8s-node4 | 1333m | 16% | 13122Mi | 54% |
+| **TOTAL** | **4815m** | **~10%** | **52066Mi** | **~55%** |
+
+**Observations**:
+- Memory is the tighter resource (~55% used cluster-wide); CPU is abundant (~10% used)
+- k8s-master at 65% memory -- highest, but still has headroom
+- k8s-node3 and k8s-node4 carry the most memory workloads (~13Gi each)
+
+---
+
+## POD RESOURCE USAGE BY NAMESPACE (sorted by total memory)
+
+### Top 20 Memory Consumers
+
+| Rank | Namespace/Pod | CPU | Memory | Mem Limit |
+|------|--------------|-----|--------|-----------|
+| 1 | frigate/frigate | 860m | 3835Mi | 16Gi |
+| 2 | kube-system/kube-apiserver | 376m | 2531Mi | <none> |
+| 3 | monitoring/prometheus-server | 36m | 1912Mi | 4Gi |
+| 4 | trading-bot/trading-bot-workers | 7m | 1901Mi | 2Gi (largest) |
+| 5 | dbaas/mysql-cluster-0 | 62m | 1845Mi | 2Gi (mysql) |
+| 6 | monitoring/loki-0 | 95m | 1335Mi | ~2.9Gi |
+| 7 | immich/immich-machine-learning | 8m | 1215Mi | 16Gi |
+| 8 | dbaas/mysql-cluster-2 | 32m | 1212Mi | 2Gi (mysql) |
+| 9 | nvidia/nvidia-driver-daemonset | 0m | 1168Mi | <none> |
+| 10 | dbaas/mysql-cluster-1 | 40m | 1083Mi | 2Gi (mysql) |
+| 11 | dashy/dashy | 490m | 1048Mi | 4Gi |
+| 12 | onlyoffice/onlyoffice-document-server | 3m | 1007Mi | 4Gi |
+| 13 | stirling-pdf/stirling-pdf | 299m | 902Mi | 4Gi |
+| 14 | tandoor/tandoor | 1m | 754Mi | ~3.1Gi |
+| 15 | paperless-ngx/paperless-ngx | 4m | 691Mi | ~3.7Gi |
+| 16 | linkwarden/linkwarden | 8m | 682Mi | ~3.3Gi |
+| 17 | ollama/ollama-ui | 2m | 658Mi | ~5.8Gi |
+| 18 | whisper/whisper | 1m | 628Mi | ~5.8Gi |
+| 19 | realestate-crawler/celery | 2m | 608Mi | 2Gi |
+| 20 | authentik/goauthentik-server (x3) | ~17m each | ~575Mi each | 1Gi |
+
+### Top 10 CPU Consumers
+
+| Rank | Namespace/Pod | CPU | CPU Limit |
+|------|--------------|-----|-----------|
+| 1 | frigate/frigate | 860m | 8000m |
+| 2 | dashy/dashy | 490m | 500m |
+| 3 | kube-system/kube-apiserver | 376m | <none> |
+| 4 | stirling-pdf/stirling-pdf | 299m | 300m |
+| 5 | kube-system/etcd | 216m | <none> |
+| 6 | monitoring/loki-0 | 95m | 504m |
+| 7 | authentik/goauthentik-worker-c5zfs | 81m | 2000m |
+| 8 | authentik/goauthentik-worker-b5wzk | 62m | 2000m |
+| 9 | dbaas/mysql-cluster-0 | 62m | 2000m |
+| 10 | calico-system/calico-node-wllsb | 49m | <none> |
+
+---
+
+## DETAILED NAMESPACE BREAKDOWN
+
+### actualbudget
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| actualbudget-anca | 1m | 42Mi | 25m/250m | 64Mi/256Mi |
+| actualbudget-emo | 1m | 40Mi | 25m/250m | 64Mi/256Mi |
+| actualbudget-http-api-anca | 1m | 26Mi | 25m/250m | 64Mi/256Mi |
+| actualbudget-http-api-emo | 0m | 26Mi | 25m/250m | 64Mi/256Mi |
+| actualbudget-http-api-viktor | 1m | 29Mi | 25m/250m | 64Mi/256Mi |
+| actualbudget-viktor | 1m | 56Mi | 25m/250m | 64Mi/256Mi |
+**Quota**: 150m/4000m CPU used, 384Mi/4Gi mem used, 6/30 pods
+
+### affine
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| affine | 4m | 174Mi | 35m/700m | ~237Mi/~1.9Gi |
+**Quota**: 35m/2000m CPU, ~237Mi/2Gi mem, 1/20 pods
+
+### aiostreams
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| aiostreams | 1m | 215Mi | 50m/500m | 256Mi/768Mi |
+
+### atuin
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| atuin | 1m | 2Mi | 50m/500m | 64Mi/256Mi |
+
+### audiobookshelf
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| audiobookshelf | 1m | 55Mi | 15m/150m | ~100Mi/400Mi |
+
+### authentik
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| ak-outpost-embedded | 6m | 18Mi | 50m/500m | 64Mi/512Mi |
+| goauthentik-server (x3) | 14-21m | 548-593Mi | 100m/2000m | 512Mi/1Gi |
+| goauthentik-worker (x3) | 40-81m | 420-551Mi | 50-100m/1000-2000m | 384Mi-600Mi/1-1.6Gi |
+| pgbouncer (x3) | 1-2m | 2Mi | 15-50m/150-500m | ~100Mi/512-800Mi |
+**Quota**: 680m/16000m CPU, ~3.3Gi/16Gi mem, 10/50 pods
+
+### calibre
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| annas-archive-stacks | 1m | 60Mi | 25m/250m | 64Mi/256Mi |
+| calibre-web-automated | 1m | 196Mi | 23m/460m | ~640Mi/~2.6Gi |
+
+### changedetection
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| changedetection (2 containers) | 6m | 111Mi | 25m+25m/250m+250m | 64Mi+64Mi/256Mi+256Mi |
+
+### cloudflared
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| cloudflared (x3) | 3-9m | 31-59Mi | 50m/500m | 64Mi/512Mi |
+
+### crowdsec
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| crowdsec-agent (x3) | 3-13m | 43-48Mi | 500m/500m | 250Mi/250Mi |
+| crowdsec-lapi (x3) | 1m | 30-34Mi | 23m/23m | ~121Mi/~121Mi |
+| crowdsec-web | 2m | 46Mi | 50m/500m | 64Mi/512Mi |
+**Note**: crowdsec-agent has CPU req=limit=500m (Guaranteed QoS). Same for memory at 250Mi.
+
+### dashy
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| dashy | **490m** | 1048Mi | 15m/**500m** | 512Mi/4Gi |
+**WARNING**: CPU at 98% of limit -- actively being throttled!
+
+### dawarich
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| dawarich | 1m | 438Mi | 15m/150m | ~600Mi/~2.4Gi |
+
+### dbaas
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| mysql-cluster-0 | 62m | 1845Mi | 50m+250m/500m+2000m | 64Mi+1Gi/512Mi+2Gi |
+| mysql-cluster-1 | 40m | 1083Mi | 50m+250m/500m+2000m | 64Mi+1Gi/512Mi+2Gi |
+| mysql-cluster-2 | 32m | 1212Mi | 50m+250m/500m+2000m | 64Mi+1Gi/512Mi+2Gi |
+| pg-cluster-1 | 22m | 335Mi | 250m/2000m | 512Mi/4Gi |
+| pg-cluster-2 | 11m | 155Mi | 250m/2000m | 512Mi/4Gi |
+| pgadmin | 1m | 265Mi | 50m/500m | 64Mi/512Mi |
+| phpmyadmin | 1m | 46Mi | 50m/500m | 64Mi/512Mi |
+**WARNING**: mysql-cluster-0 was OOMKilled previously. Currently at 1845Mi with 2Gi limit on mysql container (~90%).
+**Quota**: 1500m/8000m CPU, 4416Mi/12Gi mem, 7/30 pods
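
The "used" side of the CPU quota line appears to be the sum of container CPU requests in the namespace. That reading is an assumption inferred from the table, not something the report states. A short sketch that recomputes it from the "CPU Req/Lim" cells above, with a hypothetical helper name:

```python
def cpu_request_millicores(req_lim_cell: str) -> int:
    """Sum the CPU requests in a 'Req/Lim' cell such as '50m+250m/500m+2000m'."""
    requests_part = req_lim_cell.split("/")[0]
    return sum(int(part.rstrip("m")) for part in requests_part.split("+"))


# "CPU Req/Lim" cells copied from the dbaas table.
dbaas_cpu_cells = [
    "50m+250m/500m+2000m",  # mysql-cluster-0
    "50m+250m/500m+2000m",  # mysql-cluster-1
    "50m+250m/500m+2000m",  # mysql-cluster-2
    "250m/2000m",           # pg-cluster-1
    "250m/2000m",           # pg-cluster-2
    "50m/500m",             # pgadmin
    "50m/500m",             # phpmyadmin
]

total_requested = sum(cpu_request_millicores(cell) for cell in dbaas_cpu_cells)
print(f"{total_requested}m")  # 1500m, matching the quota line
```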
+
+### echo
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| echo (x5) | 0-1m | 19-30Mi | 15-25m/150-250m | 64Mi-100Mi/256-400Mi |
+
+### forgejo
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| forgejo | 1m | 170Mi | 15m/500m | ~215Mi/~1.7Gi |
+
+### freedify
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| music-emo | 2m | 68Mi | 100m/500m | 256Mi/512Mi |
+| music-viktor | 2m | 57Mi | 100m/500m | 256Mi/512Mi |
+
+### frigate
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| frigate | 860m | 3835Mi | 800m/8000m | 2Gi/16Gi |
+**Note**: Highest memory consumer in the cluster. GPU tier (2-gpu).
+
+### headscale
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| headscale (2 containers) | 1m | 65Mi | 50m+25m/200m+100m | 64Mi+32Mi/256Mi+128Mi |
+
+### homepage
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| homepage | 1m | 86Mi | 15m/150m | ~121Mi/~484Mi |
+
+### immich
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| immich-frame | 1m | 30Mi | 15m/150m | ~105Mi/~838Mi |
+| immich-machine-learning | 8m | 1215Mi | 15m/150m | 2Gi/16Gi |
+| immich-postgresql | 1m | 268Mi | 15m/150m | ~990Mi/~7.9Gi |
+| immich-server | 3m | 404Mi | 800m/8000m | ~990Mi/~7.9Gi |
+**Quota**: 845m/8000m CPU, ~4.1Gi/8Gi mem, 4/40 pods. Note: mem at ~51% of quota.
+
+### kms
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| kms | 0m | 0Mi | 15m/15m | ~100Mi/1Gi |
+| kms-web-page | 0m | 10Mi | 500m/500m | 512Mi/512Mi |
+**Note**: kms-web-page has req=limit (Guaranteed QoS) at 500m CPU and 512Mi, but uses 0m/10Mi.
+
+### linkwarden
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| linkwarden | 8m | 682Mi | 15m/150m | ~826Mi/~3.3Gi |
+
+### mailserver
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| mailserver (2 containers) | 9m | 183Mi | 25m+25m/250m+250m | 64Mi+64Mi/256Mi+256Mi |
+| roundcubemail | 1m | 44Mi | 25m/250m | 64Mi/256Mi |
+
+### meshcentral
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| meshcentral | 1m | 127Mi | 15m/300m | ~283Mi/~850Mi |
+
+### monitoring
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| alloy (x3, DaemonSet) | 44-47m | 182-201Mi | 63m+11m/252m+550m | ~422Mi+50Mi/~845Mi+512Mi |
+| caretta (x4, DaemonSet) | 2-4m | 250-267Mi | 15m/225m | ~422Mi/~2.5Gi |
+| goflow2 | 11m | 28Mi | 15m/60m | ~100Mi/400Mi |
+| grafana (x3) | 18m | 232-235Mi | 11m+11m+35m/110m+110m+350m | multi-container |
+| idrac-redfish-exporter | 3m | 9Mi | 15m/150m | ~100Mi/800Mi |
+| loki-0 (2 containers) | 95m | 1335Mi | 126m+11m/504m+110m | ~1.9Gi+~121Mi/~2.9Gi+~968Mi |
+| node-exporter (x5) | 1m | 9-24Mi | 15m/150m | ~100Mi/800Mi |
+| prometheus-alertmanager | 2m | 24Mi | 15m/150m | ~100Mi/800Mi |
+| prometheus-kube-state-metrics | 3m | 33Mi | 15m/150m | ~100Mi/800Mi |
+| prometheus-pushgateway | 1m | 18Mi | 15m/150m | ~100Mi/800Mi |
+| prometheus-server (2 containers) | 36m | 1912Mi | 11m+93m/110m+930m | 50Mi+512Mi/400Mi+4Gi |
+| proxmox-exporter | 1m | 41Mi | 23m/230m | ~100Mi/800Mi |
+| snmp-exporter | 2m | 14Mi | 15m/150m | ~100Mi/800Mi |
+| sysctl-inotify (x5) | 0m | 0Mi | 15m/15m | ~100Mi/~100Mi |
+**Quota**: 1177m/16000m CPU, ~9Gi/16Gi mem, 32/100 pods
+
+### mysql-operator
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| mysql-operator | 4m | 254Mi | 23m/230m | ~309Mi/~1.2Gi |
+
+### n8n
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| n8n | 2m | 425Mi | 15m/150m | ~524Mi/~2.1Gi |
+
+### netbox
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| netbox | 1m | 480Mi | 50m/2000m | 512Mi/4Gi |
+
+### nextcloud
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| nextcloud (2 containers) | 9m | 234Mi | 100m+11m/16000m+110m | ~1.3Gi+~121Mi/~8Gi+~484Mi |
+| whiteboard | 1m | 62Mi | 25m/250m | 64Mi/256Mi |
+**Quota**: 136m/4000m CPU, ~1.5Gi/8Gi mem, 2/10 pods
+
+### nvidia
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| gpu-feature-discovery | 1m | 76Mi | 100m+100m/1+1 | 256Mi+256Mi/2Gi+2Gi |
+| gpu-operator | 14m | 63Mi | 200m/500m | 100Mi/350Mi |
+| gpu-pod-exporter | 2m | 50Mi | 50m/200m | 128Mi/256Mi |
+| nvidia-container-toolkit | 1m | 27Mi | 100m/1000m | 256Mi/2Gi |
+| nvidia-dcgm-exporter | 17m | 538Mi | 100m/1000m | 256Mi/2Gi |
+| nvidia-device-plugin | 1m | 47Mi | 100m+100m/1+1 | 256Mi+256Mi/2Gi+2Gi |
+| nvidia-driver-daemonset | 0m | 1168Mi | <none> | <none> |
+| nvidia-exporter | 1m | 138Mi | 15m/150m | ~121Mi/~968Mi |
+| nfd-gc | 1m | 9Mi | 15m/1500m | ~100Mi/800Mi |
+| nfd-master | 1m | 27Mi | 100m/4000m | 128Mi/4Gi |
+| nfd-worker (x5) | 1m | 14-18Mi | 15m/3000m | ~100Mi/800Mi |
+| nvidia-operator-validator | 0m | 1Mi | 100m/1000m | 256Mi/2Gi |
+
+### ollama
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| ollama | 1m | 11Mi | 500m/4000m | 4Gi/12Gi |
+| ollama-ui | 2m | 658Mi | 15m/150m | ~729Mi/~5.8Gi |
+
+**Note**: ollama pod at only 11Mi but reserves 4Gi -- GPU workload likely using VRAM instead.
+
+### onlyoffice
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| onlyoffice-document-server | 3m | 1007Mi | 250m/8000m | 512Mi/4Gi |
+
+**Quota**: 250m/4000m CPU, 512Mi/4Gi mem, 1/10 pods
+
+### openclaw
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| openclaw (2 containers) | 2m | 447Mi | 100m+25m/2000m+500m | 512Mi+64Mi/2Gi+256Mi |
+
+### osm-routing
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| osrm-bicycle | 0m | 366Mi | 15m/250m | ~454Mi/~909Mi |
+| osrm-foot | 0m | 359Mi | 15m/150m | ~454Mi/~1.8Gi |
+
+### paperless-ngx
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| paperless-ngx | 4m | 691Mi | 49m/980m | ~933Mi/~3.7Gi |
+
+### realestate-crawler
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| realestate-crawler-api (x2) | 2m | 133-134Mi | 15m/600m | ~194Mi/~1.6Gi |
+| realestate-crawler-celery | 2m | 608Mi | 100m/2000m | 512Mi/2Gi |
+| realestate-crawler-celery-beat | 0m | 107Mi | 15m/300m | ~175Mi/~699Mi |
+| realestate-crawler-ui (x2) | 0m | 7-8Mi | 15-25m/150-250m | 64-100Mi/256-400Mi |
+
+### redis
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| redis-node-0 (redis+sentinel) | 44m | 47Mi | 50m+50m/500m+200m | 64Mi+64Mi/256Mi+128Mi |
+| redis-node-1 (redis+sentinel) | 43m | 25Mi | 126m+35m/1260m+140m | ~50Mi+~50Mi/200Mi+100Mi |
+
+### resume
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| printer | 3m | 109Mi | 15m/300m | 1Gi/4Gi |
+| resume | 1m | 116Mi | 15m/300m | ~215Mi/~645Mi |
+
+### rybbit
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| rybbit | 2m | 185Mi | 15m/150m | ~215Mi/~860Mi |
+| rybbit-client | 1m | 89Mi | 25m/250m | 64Mi/256Mi |
+
+**Note**: rybbit-client at 89Mi with 256Mi limit (35%).
+
+### servarr
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| flaresolverr | 1m | 148Mi | 25m/250m | 64Mi/256Mi |
+| listenarr | 2m | 383Mi | 15m/600m | ~640Mi/~2.6Gi |
+| prowlarr | 1m | 149Mi | 15m/150m | ~260Mi/~1Gi |
+| qbittorrent | 1m | 29Mi | 25m/250m | 64Mi/256Mi |
+
+**WARNING**: flaresolverr at 148Mi / 256Mi = 58% of mem limit.
+
+### speedtest
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| speedtest | 1m | 147Mi | 200m/2000m | ~309Mi/~1.2Gi |
+
+### stirling-pdf
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| stirling-pdf | **299m** | 902Mi | 15m/**300m** | 1Gi/4Gi |
+
+**WARNING**: CPU at 99.7% of limit -- actively being throttled!
+
+### tandoor
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| tandoor | 1m | 754Mi | 15m/150m | ~776Mi/~3.1Gi |
+
+### technitium
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| technitium | 1m | 184Mi | 100m/500m | 128Mi/512Mi |
+| technitium-secondary | 9m | 123Mi | 100m/500m | 128Mi/512Mi |
+
+### trading-bot
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| trading-bot-frontend (2 containers) | 2m | 174Mi | 10m+50m/200m+1000m | 32Mi+128Mi/128Mi+512Mi |
+| trading-bot-workers (6 containers) | 7m | 1901Mi | 10m+100m+10m+10m+10m+10m/500m+2000m+500m+500m+500m+500m | 64Mi*5+512Mi/256Mi*5+2Gi |
+
+**WARNING**: trading-bot-workers uses 1901Mi across its 6 containers. The sentiment-analyzer container has a 2Gi limit; if it accounts for most of that usage, it may be close to an OOM kill.
+
+### traefik
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| auth-proxy (x2) | 1m | 7Mi | 5m/50m | 16Mi/32Mi |
+| bot-block-proxy (x2) | 1m | 7Mi | 5m/50m | 16Mi/32Mi |
+| traefik (x3) | 4-14m | 81-120Mi | 100m/500m | 128Mi/512Mi |
+
+### uptime-kuma
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| uptime-kuma | 23m | 163Mi | 49m/196m | ~237Mi/~947Mi |
+
+### vpa
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| goldilocks-controller | 7m | 30Mi | 49m/980m | ~105Mi/~209Mi |
+| goldilocks-dashboard | 1m | 8Mi | 15m/300m | ~105Mi/~209Mi |
+| vpa-admission-certgen | N/A | N/A | 50m/500m | 64Mi/512Mi |
+| vpa-admission-controller | 3m | 48Mi | 50m/500m | 200Mi/512Mi |
+| vpa-recommender | 13m | 74Mi | 50m/500m | 500Mi/512Mi |
+| vpa-updater | 2m | 68Mi | 50m/500m | 500Mi/512Mi |
+
+**WARNING**: vpa-admission-certgen in ImagePullBackOff.
+
+### whisper
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| piper | 0m | 32Mi | 100m/1000m | 256Mi/2Gi |
+| whisper | 1m | 628Mi | 15m/150m | ~729Mi/~5.8Gi |
+
+### wireguard
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| wireguard (2 containers) | 1m | 2Mi | 50m+50m/500m+500m | 64Mi+64Mi/512Mi+512Mi |
+
+### woodpecker
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| woodpecker-agent-0 | 1m | 17Mi | 15m/150m | ~100Mi/400Mi |
+| woodpecker-agent-1 | 1m | 28Mi | 25m/250m | 64Mi/256Mi |
+| woodpecker-server-0 | 4m | 32Mi | 25m/250m | 64Mi/256Mi |
+
+### website
+| Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----|----------|----------|-------------|-------------|
+| blog (x3, 2 containers each) | 0-1m | 17-19Mi | 11m+11m/22m+110m | ~50Mi+~50Mi/512Mi+200Mi |
+
+### Other Small Namespaces
+| Namespace | Pod | CPU Used | Mem Used | CPU Req/Lim | Mem Req/Lim |
+|-----------|-----|----------|----------|-------------|-------------|
+| city-guesser | city-guesser | 1m | 23Mi | 250m/500m | 50Mi/512Mi |
+| coturn | coturn | 1m | 7Mi | 15m/150m | ~100Mi/400Mi |
+| cyberchef | cyberchef | 0m | 8Mi | 15m/150m | ~100Mi/400Mi |
+| diun | diun | 1m | 24Mi | 15m/150m | ~100Mi/400Mi |
+| excalidraw | excalidraw | 0m | 2Mi | 15m/150m | ~100Mi/400Mi |
+| f1-stream | f1-stream | 7m | 53Mi | 50m/500m | 64Mi/256Mi |
+| freshrss | freshrss | 1m | 56Mi | 25m/250m | 64Mi/256Mi |
+| hackmd | hackmd | 2m | 82Mi | 15m/150m | ~138Mi/~552Mi |
+| health | health | 2m | 101Mi | 100m/1000m | 256Mi/1Gi |
+| isponsorblocktv | isponsorblocktv-vermont | 1m | 42Mi | 15m/150m | ~100Mi/400Mi |
+| jsoncrack | jsoncrack | 0m | 7Mi | 15m/150m | ~100Mi/400Mi |
+| k8s-portal | k8s-portal | 0m | 14Mi | 25m/250m | 64Mi/256Mi |
+| navidrome | navidrome | 1m | 62Mi | 15m/150m | ~156Mi/~623Mi |
+| ntfy | ntfy | 1m | 20Mi | 25m/250m | 64Mi/256Mi |
+| owntracks | owntracks | 1m | 1Mi | 15m/150m | ~100Mi/400Mi |
+| plotting-book | plotting-book | 0m | 22Mi | 50m/500m | 128Mi/512Mi |
+| privatebin | privatebin | 1m | 46Mi | 15m/150m | ~100Mi/400Mi |
+| send | send | 0m | 53Mi | 15m/150m | ~100Mi/400Mi |
+| shadowsocks | shadowsocks | 1m | 0Mi | 15m/150m | ~100Mi/400Mi |
+| tor-proxy | tor-proxy | 1m | 61Mi | 15m/150m | ~105Mi/~419Mi |
+| vaultwarden | vaultwarden | 1m | 49Mi | 50m/200m | 64Mi/256Mi |
+| wealthfolio | wealthfolio | 0m | 8Mi | 15m/150m | ~100Mi/400Mi |
+| webhook-handler | webhook-handler | 1m | 8Mi | 15m/30m | ~100Mi/1Gi |
+| xray | xray | 0m | 11Mi | 50m/500m | 64Mi/512Mi |
+
+---
+
+## LIMITRANGE DEFAULTS BY NAMESPACE
+
+| Namespace | Default CPU | Default Mem | Max CPU | Max Mem | Tier |
+|-----------|-------------|-------------|---------|---------|------|
+| **GPU tier (2-gpu)** | | | | | |
+| ebook2audiobook | 1 | 2Gi | 8 | 16Gi | 2-gpu |
+| frigate | 1 | 2Gi | 8 | 16Gi | 2-gpu |
+| immich | 1 | 2Gi | 8 | 16Gi | 2-gpu |
+| nvidia | 1 | 2Gi | 8 | 16Gi | 2-gpu |
+| ollama | 1 | 2Gi | 8 | 16Gi | 2-gpu |
+| whisper | 1 | 2Gi | 8 | 16Gi | 2-gpu |
+| **Core tier (0-core)** | | | | | |
+| cloudflared | 500m | 512Mi | 4 | 8Gi | 0-core |
+| headscale | 500m | 512Mi | 4 | 8Gi | 0-core |
+| technitium | 500m | 512Mi | 4 | 8Gi | 0-core |
+| traefik | 500m | 512Mi | 4 | 8Gi | 0-core |
+| wireguard | 500m | 512Mi | 4 | 8Gi | 0-core |
+| xray | 500m | 512Mi | 4 | 8Gi | 0-core |
+| **Cluster tier (1-cluster)** | | | | | |
+| authentik | 500m | 512Mi | 2 | 4Gi | 1-cluster |
+| cnpg-system | 500m | 512Mi | 2 | 4Gi | 1-cluster |
+| crowdsec | 500m | 512Mi | 2 | 4Gi | 1-cluster |
+| dbaas | 500m | 512Mi | 2 | 4Gi | 1-cluster |
+| metrics-server | 500m | 512Mi | 2 | 4Gi | 1-cluster |
+| monitoring | 500m | 512Mi | 2 | 4Gi | 1-cluster |
+| poison-fountain | 500m | 512Mi | 2 | 4Gi | 1-cluster |
+| redis | 500m | 512Mi | 2 | 4Gi | 1-cluster |
+| tuya-bridge | 500m | 512Mi | 2 | 4Gi | 1-cluster |
+| uptime-kuma | 500m | 512Mi | 2 | 4Gi | 1-cluster |
+| vpa | 500m | 512Mi | 2 | 4Gi | 1-cluster |
+| **Edge tier (3-edge)** | | | | | |
+| Most app namespaces | 250m | 256Mi | 2 | 4Gi | 3-edge |
+| **Aux tier (4-aux)** | | | | | |
+| Some app namespaces | 250m | 256Mi | 2 | 4Gi | 4-aux |
+| **Custom LimitRanges** | | | | | |
+| nextcloud | 250m | 256Mi | 16 | 8Gi | Custom |
+| onlyoffice | 250m | 256Mi | 8 | 8Gi | Custom |
+| **No tier** | | | | | |
+| aiostreams | 250m | 256Mi | 1 | 2Gi | None |
+| default | 250m | 256Mi | 1 | 2Gi | None |
+| descheduler | 250m | 256Mi | 1 | 2Gi | None |
+| gadget | 250m | 256Mi | 1 | 2Gi | None |
+| kured | 250m | 256Mi | 1 | 2Gi | None |
+| local-path-storage | 250m | 256Mi | 1 | 2Gi | None |
+| mysql-operator | 250m | 256Mi | 1 | 2Gi | None |
+| reverse-proxy | 250m | 256Mi | 1 | 2Gi | None |
+| tigera-operator | 250m | 256Mi | 1 | 2Gi | None |
+
+---
+
+## RESOURCEQUOTA UTILIZATION (top consumers)
+
+| Namespace | CPU Req Used/Hard | Mem Req Used/Hard | Pods Used/Hard | % Mem Req |
+|-----------|-------------------|-------------------|----------------|-----------|
+| monitoring | 1177m/16000m | ~9Gi/16Gi | 32/100 | ~56% |
+| authentik | 680m/16000m | ~3.3Gi/16Gi | 10/50 | ~21% |
+| crowdsec | 1619m/8000m | ~1.1Gi/8Gi | 7/30 | ~14% |
+| dbaas | 1500m/8000m | 4416Mi/12Gi | 7/30 | ~36% |
+| immich | 845m/8000m | ~4.1Gi/8Gi | 4/40 | ~51% |
+| ollama | 515m/8000m | ~4.7Gi/8Gi | 2/40 | ~59% |
+| nextcloud | 136m/4000m | ~1.5Gi/8Gi | 2/10 | ~19% |
+| rybbit | 140m/2000m | ~791Mi/2Gi | 3/20 | ~39% |
+
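The `% Mem Req` column above is the used memory request divided by the hard quota. A minimal sketch of that arithmetic, assuming the binary suffixes (`Mi`, `Gi`) shown in the table; the helper names `mem_to_mib` and `mem_request_pct` are hypothetical and not part of the audit tooling:

```python
import re

# Binary suffixes as used in the table, expressed in MiB.
_MEM_UNITS_MIB = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024, "Ti": 1024 * 1024}


def mem_to_mib(quantity: str) -> float:
    """Convert a quantity such as '4416Mi' or '~9Gi' to MiB (a leading '~' is ignored)."""
    match = re.fullmatch(r"~?(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti)", quantity.strip())
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    value, unit = match.groups()
    return float(value) * _MEM_UNITS_MIB[unit]


def mem_request_pct(used: str, hard: str) -> float:
    """Percentage of the hard memory-request quota that is already used."""
    return 100 * mem_to_mib(used) / mem_to_mib(hard)


# Three rows copied from the table above.
rows = {
    "monitoring": ("~9Gi", "16Gi"),
    "dbaas": ("4416Mi", "12Gi"),
    "rybbit": ("~791Mi", "2Gi"),
}
for namespace, (used, hard) in rows.items():
    print(f"{namespace}: ~{mem_request_pct(used, hard):.0f}%")
```

For these three rows the rounded results agree with the table (56%, 36%, 39%).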
+---
+
+## ACTION ITEMS
+
+### Immediate (potential service impact)
+1. **dashy** -- CPU throttled at 98% (490m/500m). Increase CPU limit or investigate high CPU usage.
+2. **stirling-pdf** -- CPU throttled at 99.7% (299m/300m). Increase CPU limit.
+3. **dbaas/mysql-cluster-0** -- Previously OOMKilled. Currently at ~1845Mi with 2Gi limit on mysql container (~90%). Monitor closely or increase limit.
+4. **vpa/vpa-admission-certgen** -- ImagePullBackOff. Fix image reference.
+5. **trading-bot-workers** -- 1901Mi across 6 containers; sentiment-analyzer has a 2Gi limit. Check per-container usage to verify it is not OOMing.
+
+### Medium Priority (resource waste or risk)
+6. **kms/kms-web-page** -- Guaranteed QoS at 500m CPU / 512Mi, but only uses 0m/10Mi. Massive overprovisioning.
+7. **ollama/ollama** -- Requests 4Gi memory but uses 11Mi (the model is likely held in GPU VRAM). If it does not need system memory, reduce the request.
+8. **resume/printer** -- Requests 1Gi memory but uses 109Mi. Consider reducing.
+9. **nvidia-driver-daemonset** -- No limits set, using 1168Mi. Standard for driver but worth noting.
+10. **servarr/flaresolverr** -- At 58% of its memory limit (148Mi/256Mi). Not critical yet, but worth monitoring for growth toward the limit.
+
+### Low Priority (optimization opportunities)
+11. Multiple pods in the monitoring namespace have generous limits but low actual usage (node-exporters at 9-24Mi with 800Mi limits).
+12. crowdsec-agent pods have Guaranteed QoS (req=limit) at 500m/250Mi but use only 3-13m CPU and 43-48Mi memory.
+13. Many edge-tier pods using <10% of their memory limits -- VPA recommendations could help right-size.
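The "Immediate" items above all come from comparing usage against the configured limit. A minimal sketch of that check, using figures quoted in the action items; the 90% threshold and the names `Sample`, `cpu_to_millicores`, and `over_threshold` are illustrative assumptions, not the audit's actual method:

```python
from dataclasses import dataclass


def cpu_to_millicores(quantity: str) -> float:
    """Convert a CPU quantity: '490m' -> 490.0, '2' -> 2000.0."""
    q = quantity.strip()
    return float(q[:-1]) if q.endswith("m") else float(q) * 1000


@dataclass
class Sample:
    name: str
    used: float   # usage, in the same unit as `limit`
    limit: float  # configured limit

    @property
    def pct(self) -> float:
        return 100 * self.used / self.limit


def over_threshold(samples, threshold_pct=90.0):
    """Return the samples whose usage is at or above threshold_pct of the limit."""
    return [s for s in samples if s.pct >= threshold_pct]


samples = [
    Sample("dashy (cpu)", cpu_to_millicores("490m"), cpu_to_millicores("500m")),
    Sample("stirling-pdf (cpu)", cpu_to_millicores("299m"), cpu_to_millicores("300m")),
    Sample("mysql-cluster-0 (mem, Mi)", 1845, 2048),
    Sample("flaresolverr (mem, Mi)", 148, 256),
]
for s in over_threshold(samples):
    print(f"{s.name}: {s.pct:.1f}% of limit")
```

With a 90% threshold this flags dashy, stirling-pdf, and mysql-cluster-0, while flaresolverr at about 58% stays below it.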
--- a/.planning/quick/resource-audit-terraform-definitions.md
+++ b/.planning/quick/resource-audit-terraform-definitions.md
@ -0,0 +1,273 @@
+# Terraform Container Resource Audit
+
+Generated: 2026-03-01
+
+## Tier Defaults (Kyverno LimitRange)
+
+For reference, containers WITHOUT explicit `resources {}` blocks receive these defaults from Kyverno-generated LimitRanges:
+
+| Tier | Default CPU | Default Mem | Request CPU | Request Mem | Max CPU | Max Mem |
+|------|-------------|-------------|-------------|-------------|---------|---------|
+| 0-core | 500m | 512Mi | 50m | 64Mi | 4 | 8Gi |
+| 1-cluster | 500m | 512Mi | 50m | 64Mi | 2 | 4Gi |
+| 2-gpu | 1 | 2Gi | 100m | 256Mi | 8 | 16Gi |
+| 3-edge | 250m | 256Mi | 25m | 64Mi | 2 | 4Gi |
+| 4-aux | 250m | 256Mi | 25m | 64Mi | 2 | 4Gi |
+
+Namespaces with custom LimitRange (opt-out): `nextcloud`, `onlyoffice`
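To make the fallback concrete, here is a small sketch of how the tier table above could be applied to a container. It is a simplified model: a resource (`cpu` or `memory`) that the container does not mention at all receives the tier's default request and limit, and anything the container sets explicitly is left untouched. Partially specified resources (only a request, or only a limit) are deliberately not modeled. `TIER_DEFAULTS` and `effective_resources` are hypothetical names.

```python
# Default request/limit per tier, copied from the table above.
TIER_DEFAULTS = {
    "0-core":    {"cpu": {"request": "50m",  "limit": "500m"}, "memory": {"request": "64Mi",  "limit": "512Mi"}},
    "1-cluster": {"cpu": {"request": "50m",  "limit": "500m"}, "memory": {"request": "64Mi",  "limit": "512Mi"}},
    "2-gpu":     {"cpu": {"request": "100m", "limit": "1"},    "memory": {"request": "256Mi", "limit": "2Gi"}},
    "3-edge":    {"cpu": {"request": "25m",  "limit": "250m"}, "memory": {"request": "64Mi",  "limit": "256Mi"}},
    "4-aux":     {"cpu": {"request": "25m",  "limit": "250m"}, "memory": {"request": "64Mi",  "limit": "256Mi"}},
}


def effective_resources(tier, explicit=None):
    """Merge a container's explicit resources with its tier defaults.

    `explicit` maps a resource name to {"request": ..., "limit": ...}.
    Only resources that are missing entirely are filled from the tier.
    """
    result = {name: dict(values) for name, values in (explicit or {}).items()}
    for name, defaults in TIER_DEFAULTS[tier].items():
        if name not in result:
            result[name] = dict(defaults)
    return result


# A container in a 4-aux namespace that sets only a GPU limit.
gpu_only = {"nvidia.com/gpu": {"limit": "1"}}
print(effective_resources("4-aux", gpu_only))
```

Under this model, a container in a 4-aux namespace that sets only `nvidia.com/gpu` ends up with a 250m CPU limit and a 256Mi memory limit.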
+
+---
+
+## Section 1: Containers WITHOUT Explicit Resources (Relying on LimitRange Defaults)
+
+These are the highest-risk containers -- they receive LimitRange defaults which may be too low or too high.
+
+| Stack | Namespace | Deployment/Resource | Container | Tier | Default CPU Lim | Default Mem Lim | Risk Notes |
+|-------|-----------|-------------------|-----------|------|-----------------|-----------------|------------|
+| blog | website | blog | nginx-exporter | 4-aux | 250m | 256Mi | Sidecar; likely fine |
+| cyberchef | cyberchef | cyberchef | cyberchef | 4-aux | 250m | 256Mi | |
+| echo | echo | echo | echo | 3-edge | 250m | 256Mi | 5 replicas, no resources |
+| networking-toolbox | networking-toolbox | networking-toolbox | networking-toolbox | 4-aux | 250m | 256Mi | 3 replicas |
+| shadowsocks | shadowsocks | shadowsocks | shadowsocks | 3-edge | 250m | 256Mi | |
+| tor-proxy | tor-proxy | tor-proxy | tor-proxy | 4-aux | 250m | 256Mi | |
+| tuya-bridge | tuya-bridge | tuya-bridge | tuya-bridge | 1-cluster | 500m | 512Mi | 3 replicas in cluster tier |
+| audiobookshelf | audiobookshelf | audiobookshelf | audiobookshelf | 4-aux | 250m | 256Mi | May need more for transcoding |
+| changedetection | changedetection | changedetection | sockpuppetbrowser | 4-aux | 250m | 256Mi | Chromium browser; likely needs more |
+| changedetection | changedetection | changedetection | changedetection | 4-aux | 250m | 256Mi | |
+| diun | diun | diun | diun | 4-aux | 250m | 256Mi | |
+| excalidraw | excalidraw | excalidraw | excalidraw | 4-aux | 250m | 256Mi | |
+| freshrss | freshrss | freshrss | freshrss | 4-aux | 250m | 256Mi | |
+| isponsorblocktv | isponsorblocktv | isponsorblocktv-vermont | isponsorblocktv-vermont | 3-edge | 250m | 256Mi | |
+| matrix | matrix | matrix | matrix | 4-aux | 250m | 256Mi | 0 replicas (disabled) |
+| navidrome | navidrome | navidrome | navidrome | 4-aux | 250m | 256Mi | Music streaming |
+| ntfy | ntfy | ntfy | ntfy | 4-aux | 250m | 256Mi | |
+| owntracks | owntracks | owntracks | owntracks | 4-aux | 250m | 256Mi | |
+| privatebin | privatebin | privatebin | privatebin | 3-edge | 250m | 256Mi | |
+| wealthfolio | wealthfolio | wealthfolio | wealthfolio | 4-aux | 250m | 256Mi | |
+| whisper | whisper | whisper | whisper | 2-gpu | 1 | 2Gi | No GPU resource claim; GPU tier |
+| whisper | whisper | piper | piper | 2-gpu | 1 | 2Gi | No GPU resource claim; GPU tier |
+| send | send | send | send | 4-aux | 250m | 256Mi | |
+| n8n | n8n | n8n | n8n | 4-aux | 250m | 256Mi | Workflow automation; may need more |
+| linkwarden | linkwarden | linkwarden | linkwarden | 4-aux | 250m | 256Mi | Next.js app; may OOM |
+| dawarich | dawarich | dawarich | dawarich | 3-edge | 250m | 256Mi | Rails app; may OOM |
+| hackmd | hackmd | hackmd | codimd | 3-edge | 250m | 256Mi | Node.js; may need more |
+| tandoor | tandoor | tandoor | recipes | 4-aux | 250m | 256Mi | Django app |
+| grampsweb | grampsweb | grampsweb | grampsweb | 4-aux | 250m | 256Mi | Flask app |
+| grampsweb | grampsweb | grampsweb | grampsweb-celery | 4-aux | 250m | 256Mi | Celery worker |
+| affine | affine | affine | migration (init) | 4-aux | 250m | 256Mi | Init container; runs prisma migrate |
+| actualbudget (factory) | actualbudget | actualbudget-{name} | actualbudget | 3-edge | 250m | 256Mi | 3 instances (viktor, anca, emo) |
+| actualbudget (factory) | actualbudget | actualbudget-http-api-{name} | actualbudget | 3-edge | 250m | 256Mi | Conditional (budget_encryption_password) |
+| actualbudget (factory) | actualbudget | bank-sync-{name} (CronJob) | bank-sync | 3-edge | 250m | 256Mi | Curl container |
+| osm_routing | osm-routing | osrm-foot | osrm-foot | 4-aux | 250m | 256Mi | OSRM needs ~1GB RAM for routing data |
+| osm_routing | osm-routing | otp | otp | 4-aux | 250m | 256Mi | 0 replicas (disabled); OTP needs 2Gi+ |
+| servarr/prowlarr | servarr | prowlarr | prowlarr | 4-aux | 250m | 256Mi | |
+| servarr/qbittorrent | servarr | qbittorrent | qbittorrent | 4-aux | 250m | 256Mi | |
+| servarr/flaresolverr | servarr | flaresolverr | flaresolverr | 4-aux | 250m | 256Mi | Chromium-based; likely needs more |
+| real-estate-crawler | realestate-crawler | realestate-crawler-ui | realestate-crawler-ui | 4-aux | 250m | 256Mi | 2 replicas |
+| real-estate-crawler | realestate-crawler | realestate-crawler-celery | celery-worker | 4-aux | 250m | 256Mi | |
+| nextcloud | nextcloud | whiteboard | whiteboard | custom (3-edge) | 250m | 256Mi | Custom LimitRange: max 16 CPU/8Gi |
+| nextcloud | nextcloud | nextcloud-backup (CronJob) | backup | custom (3-edge) | 250m | 256Mi | rsync container |
+| calibre | calibre | annas-archive-stacks | annas-archive-stacks | 3-edge | 250m | 256Mi | |
+| ollama | ollama | ollama-ui | ollama-ui | 2-gpu | 1 | 2Gi | Open WebUI; needs significant mem |
+| immich | immich | immich-server | immich-server | 2-gpu | 1 | 2Gi | Photo server; needs resources |
+| immich | immich | immich-postgresql | immich-postgresql | 2-gpu | 1 | 2Gi | PostgreSQL; needs resources |
+| immich | immich | postgresql-backup (CronJob) | postgresql-backup | 2-gpu | 1 | 2Gi | |
+| rybbit | rybbit | rybbit | rybbit | 4-aux | 250m | 256Mi | Node.js backend |
+| rybbit | rybbit | rybbit-client | rybbit-client | 4-aux | 250m | 256Mi | |
+| poison-fountain | poison-fountain | poison-fetcher (CronJob) | fetcher | 1-cluster | 500m | 512Mi | curl container |
+| platform/dbaas | dbaas | mysql-backup (CronJob) | mysql-backup | 1-cluster | 500m | 512Mi | |
+| platform/dbaas | dbaas | phpmyadmin | phpmyadmin | 1-cluster | 500m | 512Mi | |
+| platform/dbaas | dbaas | pgadmin | pgadmin | 1-cluster | 500m | 512Mi | |
+| platform/dbaas | dbaas | postgresql-backup (CronJob) | postgresql-backup | 1-cluster | 500m | 512Mi | |
+| platform/xray | xray | xray | xray | 0-core | 500m | 512Mi | |
+| platform/wireguard | wireguard | wireguard | sysctl-setup (init) | 0-core | 500m | 512Mi | |
+| platform/wireguard | wireguard | wireguard | wireguard | 0-core | 500m | 512Mi | |
+| platform/wireguard | wireguard | wireguard | prometheus-exporter | 0-core | 500m | 512Mi | |
+| platform/cloudflared | cloudflared | cloudflared | cloudflared | 0-core | 500m | 512Mi | |
+| platform/mailserver | mailserver | mailserver | docker-mailserver | 0-core | 500m | 512Mi | Mail server needs more RAM |
+| platform/mailserver | mailserver | dovecot-exporter | dovecot-exporter | 0-core | 500m | 512Mi | |
+| platform/crowdsec | crowdsec | crowdsec-web | crowdsec-web | 1-cluster | 500m | 512Mi | |
+| platform/crowdsec | crowdsec | blocklist-import (CronJob) | blocklist-import | 1-cluster | 500m | 512Mi | |
+| platform/k8s-portal | k8s-portal | k8s-portal | portal | 0-core | 500m | 512Mi | |
+| platform/monitoring | monitoring | monitor-prometheus (CronJob) | monitor-prometheus | opted-out | N/A | N/A | No LimitRange in monitoring ns |
+| platform/redis | redis | redis-backup (CronJob) | redis-backup | 1-cluster | 500m | 512Mi | |
+| platform/infra-maint | kube-system | backup-etcd (CronJob) | backup-etcd | N/A | N/A | N/A | kube-system; no Kyverno LimitRange |
+| platform/infra-maint | kube-system | backup-purge (CronJob) | backup-purge | N/A | N/A | N/A | |
+| platform/infra-maint | kube-system | cleanup-failed (CronJob) | cleanup | N/A | N/A | N/A | |
+
+---
+
+## Section 2: Containers WITH Explicit Resources
+
+| Stack | Namespace | Deployment/Resource | Container | CPU Req | CPU Lim | Mem Req | Mem Lim | Tier | Notes |
+|-------|-----------|-------------------|-----------|---------|---------|---------|---------|------|-------|
+| blog | website | blog | blog | 250m | 500m | 50Mi | 512Mi | 4-aux | |
+| city-guesser | city-guesser | city-guesser | city-guesser | 250m | 500m | 50Mi | 512Mi | 4-aux | |
+| coturn | coturn | coturn | coturn | 100m | 1 | 128Mi | 512Mi | 3-edge | |
+| kms | kms | kms-web-page | kms-web-page | 500m | 500m | 512Mi | 512Mi | 4-aux | Req==Lim, high for nginx |
+| kms | kms | kms (windows) | windows-kms | 1 | 1 | 50Mi | 512Mi | 4-aux | 1 CPU req seems high |
+| travel_blog | travel-blog | travel-blog | travel-blog | 250m | 500m | 50Mi | 512Mi | 4-aux | |
+| webhook_handler | webhook-handler | webhook-handler | webhook-handler | 250m | 500m | 50Mi | 512Mi | 4-aux | |
+| freedify (factory) | freedify | music-{name} | freedify | 100m | 500m | 256Mi | 512Mi | 4-aux | Parameterized; 2 instances |
+| health | health | health | health | 100m | 1 | 256Mi | 1Gi | 4-aux | |
+| plotting-book | plotting-book | plotting-book | plotting-book | 50m | 500m | 128Mi | 512Mi | 4-aux | |
+| frigate | frigate | frigate | frigate | -- | GPU:1 | -- | -- | 2-gpu | Only nvidia.com/gpu limit |
+| ebook2audiobook | ebook2audiobook | ebook2audiobook | ebook2audiobook | -- | GPU:1 | -- | -- | 2-gpu | Only nvidia.com/gpu limit |
+| ebook2audiobook | ebook2audiobook | audiblez | audiblez | -- | GPU:1 | -- | -- | 2-gpu | Only nvidia.com/gpu; 0 replicas |
+| ebook2audiobook | ebook2audiobook | audiblez-web | audiblez-web | -- | GPU:1 | -- | -- | 2-gpu | Only nvidia.com/gpu limit |
+| ytdlp | ytdlp | ytdlp | ytdlp | 25m | 500m | 128Mi | 512Mi | 4-aux | |
+| ytdlp | ytdlp | yt-highlights | yt-highlights | -- | GPU:1 | -- | -- | 4-aux | GPU workload in aux-tier ns |
+| real-estate-crawler | realestate-crawler | realestate-crawler-api | realestate-crawler-api | 50m | 2000m | 128Mi | 1Gi | 4-aux | |
+| real-estate-crawler | realestate-crawler | realestate-crawler-celery-beat | celery-beat | 10m | 200m | 64Mi | 256Mi | 4-aux | |
+| affine | affine | affine | affine | 100m | 2 | 512Mi | 4Gi | 4-aux | |
+| atuin | atuin | atuin | atuin | 50m | 500m | 64Mi | 256Mi | 4-aux | |
+| osm_routing | osm-routing | osrm-bicycle | osrm-bicycle | 15m | 250m | 512Mi | 1Gi | 4-aux | |
+| paperless-ngx | paperless-ngx | paperless-ngx | paperless-ngx | 100m | 2 | 256Mi | 1Gi | 3-edge | |
+| stirling-pdf | stirling-pdf | stirling-pdf | stirling-pdf | 100m | 2 | 256Mi | 1Gi | 4-aux | |
+| netbox | netbox | netbox | netbox | 25m | 1 | 64Mi | 512Mi | 4-aux | |
+| speedtest | speedtest | speedtest | speedtest | 25m | 500m | 64Mi | 512Mi | 4-aux | |
+| meshcentral | meshcentral | meshcentral | meshcentral | 15m | 500m | 64Mi | 384Mi | 4-aux | |
+| forgejo | forgejo | forgejo | forgejo | 15m | 500m | 64Mi | 512Mi | 3-edge | |
+| dashy | dashy | dashy | dashy | 15m | 500m | 64Mi | 512Mi | 4-aux | |
+| url | url | shlink | shlink | 25m | -- | 128Mi | 512Mi | 4-aux | No CPU limit |
+| url | url | shlink-web | shlink-web | 250m | 500m | 50Mi | 512Mi | 4-aux | |
+| f1-stream | f1-stream | f1-stream | f1-stream | 50m | 500m | 64Mi | 256Mi | 4-aux | |
+| calibre | calibre | calibre-web-automated | calibre-web-automated | 50m | 1 | 256Mi | 1Gi | 3-edge | |
+| poison-fountain | poison-fountain | poison-fountain | poison-fountain | 10m | 100m | 32Mi | 128Mi | 1-cluster | |
+| ollama | ollama | ollama | ollama | 500m | 4 | 4Gi | 12Gi + GPU:1 | 2-gpu | |
+| onlyoffice | onlyoffice | onlyoffice-document-server | onlyoffice-document-server | 250m | 8 | 512Mi | 4Gi | 3-edge | Custom LimitRange |
+| openclaw | openclaw | openclaw | openclaw | 100m | 2 | 512Mi | 2Gi | 4-aux | |
+| openclaw | openclaw | openclaw | modelrelay (sidecar) | 25m | 500m | 64Mi | 256Mi | 4-aux | |
+| openclaw | openclaw | cluster-healthcheck (CronJob) | healthcheck | 50m | -- | 64Mi | 128Mi | 4-aux | No CPU limit |
+| resume | resume | printer | printer | 50m | 1 | 128Mi | 512Mi | 4-aux | Chromium |
+| resume | resume | resume | resume | 25m | 500m | 128Mi | 384Mi | 4-aux | |
+| rybbit | rybbit | clickhouse | clickhouse | 100m | 2 | 512Mi | 4Gi | 4-aux | |
+| immich | immich | immich-machine-learning | immich-machine-learning | -- | GPU:1 | -- | -- | 2-gpu | Only nvidia.com/gpu limit |
+| trading-bot | trading-bot | trading-bot-frontend | dashboard | 10m | 200m | 32Mi | 128Mi | 3-edge | |
+| trading-bot | trading-bot | trading-bot-frontend | api-gateway | 50m | 1000m | 128Mi | 512Mi | 3-edge | |
+| trading-bot | trading-bot | trading-bot-workers | news-fetcher | 10m | 500m | 64Mi | 256Mi | 3-edge | |
+| trading-bot | trading-bot | trading-bot-workers | sentiment-analyzer | 100m | 2000m | 512Mi | 2Gi | 3-edge | |
+| trading-bot | trading-bot | trading-bot-workers | signal-generator | 10m | 500m | 64Mi | 256Mi | 3-edge | |
+| trading-bot | trading-bot | trading-bot-workers | trade-executor | 10m | 500m | 64Mi | 256Mi | 3-edge | |
+| trading-bot | trading-bot | trading-bot-workers | learning-engine | 10m | 500m | 64Mi | 256Mi | 3-edge | |
+| trading-bot | trading-bot | trading-bot-workers | market-data | 10m | 500m | 64Mi | 256Mi | 3-edge | |
+| platform/technitium | technitium | technitium | technitium | YES | YES | YES | YES | 0-core | Has resources block |
+| platform/vaultwarden | vaultwarden | vaultwarden | vaultwarden | YES | YES | YES | YES | 0-core | Has resources block |
+| platform/uptime-kuma | uptime-kuma | uptime-kuma | uptime-kuma | YES | YES | YES | YES | 0-core | Has resources block |
+| platform/headscale | headscale | headscale | headscale | YES | YES | YES | YES | 0-core | Has resources block |
+| platform/headscale | headscale | headscale | headscale-ui | YES | YES | YES | YES | 0-core | Has resources block |
+| platform/traefik | traefik | traefik-default-backend | nginx | YES | YES | YES | YES | 0-core | Has resources block |
+| platform/traefik | traefik | traefik-local-backend | nginx | YES | YES | YES | YES | 0-core | Has resources block |
+| platform/nvidia | nvidia | nvidia-exporter | nvidia-exporter | YES | YES | YES | YES | 2-gpu | Has resources block |
+| platform/nvidia | nvidia | nvidia-power-exporter | exporter | YES | YES | YES | YES | 2-gpu | Has resources block |
+| platform/monitoring | monitoring | goflow2 | goflow2 | YES | YES | YES | YES | 1-cluster | Has resources block |
+
+---
+
+## Section 3: Helm Chart Deployments (Resources via values.yaml)
+
+These services are deployed via Helm charts. Resource configuration is in the chart's values files, not directly visible in main.tf.
+
+| Stack | Namespace | Chart | Values File | Tier | Notes |
+|-------|-----------|-------|-------------|------|-------|
+| homepage | homepage | jameswynn/homepage | values.yaml | 4-aux | Check values for resources |
+| k8s-dashboard | kubernetes-dashboard | kubernetes-dashboard v7.12.0 | -- | 1-cluster | No custom values for resources |
+| reloader | reloader | stakater/reloader | -- | 4-aux | No custom values |
+| descheduler | descheduler | descheduler | values.yaml | -- | No tier label |
+| woodpecker | woodpecker | woodpecker v3.5.1 | values.yaml | 3-edge | Custom quota; check values |
+| nextcloud | nextcloud | nextcloud/nextcloud v8.8.1 | chart_values.yaml | 3-edge | Custom LimitRange/Quota |
+| platform/traefik | traefik | traefik | chart values | 0-core | |
+| platform/metallb | metallb | metallb | -- | 0-core | |
+| platform/redis | redis | bitnami/redis | chart values | 1-cluster | |
+| platform/monitoring | monitoring | prometheus, grafana, loki | various | 1-cluster | Opted out of Kyverno quota |
+| platform/kyverno | kyverno | kyverno | chart values | 1-cluster | |
+| platform/cnpg | cnpg | cnpg-operator | -- | 1-cluster | |
+| platform/metrics-server | metrics-server | metrics-server | -- | 1-cluster | |
+| platform/vpa | vpa | fairwinds/vpa | -- | 1-cluster | |
+| platform/crowdsec | crowdsec | crowdsec | chart values | 1-cluster | |
+| platform/nvidia | nvidia | nvidia gpu-operator | chart values | 2-gpu | Opted out of Kyverno quota |
+| platform/authentik | authentik | authentik | chart values | 0-core | Custom quota |
+| platform/dbaas | dbaas | mysql-operator/innodbcluster | chart values | 1-cluster | Custom quota |
+
+---
+
+## Section 4: High-Risk Findings Summary
+
+### OOM-Kill Risk (containers likely needing more than 256Mi default)
+
+| Container | Namespace | Tier Default Mem | Why It's Risky |
+|-----------|-----------|-----------------|----------------|
+| sockpuppetbrowser | changedetection | 256Mi | Headless Chromium browser |
+| flaresolverr | servarr | 256Mi | Chromium-based solver |
+| osrm-foot | osm-routing | 256Mi | OSRM loads routing graph into memory (~500MB+) |
+| navidrome | navidrome | 256Mi | Music library indexing |
+| linkwarden | linkwarden | 256Mi | Next.js app with screenshot capture |
+| n8n | n8n | 256Mi | Workflow automation with many nodes |
+| dawarich | dawarich | 256Mi | Rails app |
+| hackmd (codimd) | hackmd | 256Mi | Node.js collaborative editor |
+| ollama-ui | ollama | 2Gi | Open WebUI; may be fine in GPU tier |
+| immich-server | immich | 2Gi | Photo processing server |
+| immich-postgresql | immich | 2Gi | PostgreSQL with pgvector |
+| docker-mailserver | mailserver | 512Mi | ClamAV, SpamAssassin, etc. |
+| audiobookshelf | audiobookshelf | 256Mi | Media server with transcoding |
+
+### GPU Containers with Only nvidia.com/gpu Limit (no CPU/Mem specified)
+
+These containers set only an `nvidia.com/gpu` limit, so their CPU and memory fall back to the LimitRange defaults:
+
+| Container | Namespace | Tier | Gets Default |
+|-----------|-----------|------|-------------|
+| frigate | frigate | 2-gpu | 1 CPU / 2Gi |
+| ebook2audiobook | ebook2audiobook | 2-gpu | 1 CPU / 2Gi |
+| audiblez | ebook2audiobook | 2-gpu | 1 CPU / 2Gi |
+| audiblez-web | ebook2audiobook | 2-gpu | 1 CPU / 2Gi |
+| yt-highlights | ytdlp | 4-aux | 250m / 256Mi (!) |
+| immich-machine-learning | immich | 2-gpu | 1 CPU / 2Gi |
+
+**Note**: `yt-highlights` is in the `ytdlp` namespace (4-aux tier) but runs on a GPU node. Its default memory limit of 256Mi is very low for a Whisper ASR model.
+
+### Containers with No Resources in Core/Cluster Tier (higher defaults but still worth checking)
+
+| Container | Namespace | Tier | Default |
+|-----------|-----------|------|---------|
+| xray | xray | 0-core | 500m / 512Mi |
+| wireguard | wireguard | 0-core | 500m / 512Mi |
+| wireguard prometheus-exporter | wireguard | 0-core | 500m / 512Mi |
+| cloudflared | cloudflared | 0-core | 500m / 512Mi |
+| docker-mailserver | mailserver | 0-core | 500m / 512Mi |
+| dovecot-exporter | mailserver | 0-core | 500m / 512Mi |
+| k8s-portal | k8s-portal | 0-core | 500m / 512Mi |
+| tuya-bridge | tuya-bridge | 1-cluster | 500m / 512Mi |
+| phpmyadmin | dbaas | 1-cluster | 500m / 512Mi |
+| pgadmin | dbaas | 1-cluster | 500m / 512Mi |
+| crowdsec-web | crowdsec | 1-cluster | 500m / 512Mi |
+
+---
+
+## Section 5: Statistics
+
+### Totals
+
+- **Total unique containers audited**: ~120+
+- **Containers WITH explicit resources**: ~55
+- **Containers WITHOUT explicit resources**: ~65
+- **Helm-managed (resources in values)**: ~18 charts
+
+### By Tier (containers without resources)
+
+| Tier | Count | Risk Level |
+|------|-------|------------|
+| 0-core | 7 | Medium (512Mi default is usually OK) |
+| 1-cluster | 7 | Medium |
+| 2-gpu | 5 | Low (2Gi default is generous) |
+| 3-edge | 8 | High (256Mi can OOM Node/Rails/Java apps) |
+| 4-aux | 25+ | High (256Mi is tight for many services) |
+| monitoring (opted-out) | 1 | Low (no LimitRange at all) |
+| kube-system | 3 | Low (no Kyverno) |
+
+### Recommendations
+
+1. **Immediate action**: Add explicit resources to `sockpuppetbrowser`, `flaresolverr`, `osrm-foot`, `docker-mailserver`, `immich-server`, `immich-postgresql`, `linkwarden`, `n8n`
+2. **GPU containers**: Add explicit CPU/Mem alongside nvidia.com/gpu for `frigate`, `ebook2audiobook`, `audiblez`, `audiblez-web`, `immich-machine-learning`, `yt-highlights`
+3. **Review**: `kms-web-page` has 500m/512Mi request==limit for nginx (wasteful)
+4. **CronJobs**: Most CronJob containers lack resources -- acceptable for short-lived jobs, but they still add to ResourceQuota consumption
--- a/.planning/quick/resource-audit-vpa-recommendations.md
+++ b/.planning/quick/resource-audit-vpa-recommendations.md
--- a/.planning/quick/resource-plan.md
+++ b/.planning/quick/resource-plan.md
@ -0,0 +1,285 @@
+# Resource Right-Sizing Plan
+
+## Methodology
+- **Conservative**: limits = max(VPA upper bound * 2, current live usage * 2, minimum sane value)
+- **Requests**: VPA target or current usage, whichever is higher
+- **Floor values**: 10m CPU req, 25m CPU lim, 32Mi mem req, 64Mi mem lim (nothing goes below these)
+- **GPU containers**: keep nvidia.com/gpu, add CPU/mem based on VPA data
+- **Ollama special case**: remove CPU/mem limits entirely (keep only GPU + minimal requests)
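The rule above can be sketched as a small function. This is a minimal illustration, assuming all values are already normalised to millicores and Mi; the names `Sizing` and `right_size` are hypothetical and not part of the plan. The per-container values listed in the waves below were chosen by hand, so they will not always match what this formula produces.

```python
from dataclasses import dataclass

# Floor values from the methodology (CPU in millicores, memory in Mi).
FLOOR_CPU_REQ_M, FLOOR_CPU_LIM_M = 10, 25
FLOOR_MEM_REQ_MI, FLOOR_MEM_LIM_MI = 32, 64


@dataclass(frozen=True)
class Sizing:
    cpu_req_m: int
    cpu_lim_m: int
    mem_req_mi: int
    mem_lim_mi: int


def right_size(vpa_target, vpa_upper, live):
    """Apply the conservative rule to (cpu_m, mem_mi) tuples.

    requests = max(VPA target, live usage, floor)
    limits   = max(2 * VPA upper bound, 2 * live usage, floor)
    """
    (t_cpu, t_mem), (u_cpu, u_mem), (l_cpu, l_mem) = vpa_target, vpa_upper, live
    return Sizing(
        cpu_req_m=max(t_cpu, l_cpu, FLOOR_CPU_REQ_M),
        cpu_lim_m=max(2 * u_cpu, 2 * l_cpu, FLOOR_CPU_LIM_M),
        mem_req_mi=max(t_mem, l_mem, FLOOR_MEM_REQ_MI),
        mem_lim_mi=max(2 * u_mem, 2 * l_mem, FLOOR_MEM_LIM_MI),
    )
```

Keeping the floors as named constants makes it easy to see when a hand-picked value deliberately goes below them.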
+
+## Wave 1: CRITICAL FIXES (actively broken)
+
+### dashy — CPU throttled at 98% (490m/500m), mem needs 2.36Gi
+- File: stacks/dashy/main.tf
+- VPA target: 15m CPU, 2.36Gi mem | Upper: 15m CPU, 3.23Gi mem
+- Live: 490m CPU, 1048Mi mem
+- **New**: req 50m/512Mi, lim 2/4Gi
+
+### stirling-pdf — CPU throttled at 99.7% (299m/300m)
+- File: stacks/stirling-pdf/main.tf
+- VPA target: 29m CPU, 1.41Gi mem | Upper: 29m CPU, 1.41Gi mem
+- Live: 299m CPU, 902Mi mem
+- **New**: req 100m/512Mi, lim 2/2Gi
+
+### MySQL cluster — OOMKilled, 1845Mi with 2Gi limit
+- File: stacks/platform/modules/dbaas/main.tf
+- Already bumped to 3Gi in previous session, but pods show 512Mi (legacy VPA override)
+- VPA target: 2.77Gi | Upper: 6.90Gi
+- **New**: top-level resources: req 250m/2Gi, lim 2/4Gi; podSpec.containers mysql: same
+
+### traefik auth-proxy & bot-block-proxy — VPA says need 100Mi, limit is 32Mi
+- File: stacks/platform/modules/traefik/main.tf
+- **New**: req 5m/32Mi, lim 50m/128Mi
+
+## Wave 2: STANDALONE STACKS — containers without explicit resources
+
+### affine — over-provisioned (2 CPU / 4Gi, uses 4m/174Mi)
+- VPA upper: 63m/307Mi
+- **New**: req 25m/128Mi, lim 250m/512Mi
+
+### aiostreams — mem at 215Mi with 768Mi limit, VPA says 641Mi target
+- **New**: req 25m/256Mi, lim 500m/1Gi
+
+### audiobookshelf — no resources, 55Mi usage
+- VPA upper: 15m/170Mi
+- **New**: req 15m/64Mi, lim 250m/512Mi
+
+### changedetection — sockpuppetbrowser (Chromium) + changedetection
+- changedetection: VPA 15m/100Mi | **New**: req 15m/64Mi, lim 250m/256Mi
+- sockpuppetbrowser: Chromium needs more | **New**: req 25m/128Mi, lim 500m/512Mi
+
+### cyberchef — tiny (8Mi), no resources
+- **New**: req 10m/32Mi, lim 100m/128Mi
+
+### dawarich — Rails app at 438Mi
+- VPA upper: 15m/838Mi
+- **New**: req 15m/256Mi, lim 250m/1Gi
+
+### diun — tiny (24Mi)
+- **New**: req 10m/32Mi, lim 100m/128Mi
+
+### echo — 5 replicas, tiny (19-30Mi each)
+- **New**: req 10m/32Mi, lim 100m/128Mi
+
+### excalidraw — tiny (2Mi)
+- **New**: req 10m/16Mi, lim 100m/64Mi
+
+### flaresolverr — Chromium at 148Mi/256Mi (58%)
+- VPA upper: 15m/348Mi
+- **New**: req 25m/128Mi, lim 500m/512Mi
+
+### freshrss — 56Mi
+- VPA upper: 15m/167Mi
+- **New**: req 15m/64Mi, lim 250m/256Mi
+
+### hackmd — Node.js at 82Mi
+- VPA upper: 15m/256Mi
+- **New**: req 15m/64Mi, lim 250m/512Mi
+
+### isponsorblocktv — 42Mi
+- **New**: req 10m/32Mi, lim 150m/256Mi
+
+### linkwarden — Next.js at 682Mi
+- VPA upper: 15m/1.04Gi
+- **New**: req 25m/256Mi, lim 500m/1.5Gi
+
+### n8n — workflow automation at 425Mi
+- VPA upper: 15m/766Mi
+- **New**: req 25m/256Mi, lim 500m/1Gi
+
+### navidrome — music at 62Mi
+- VPA upper: 15m/179Mi
+- **New**: req 15m/64Mi, lim 250m/384Mi
+
+### ntfy — 20Mi
+- **New**: req 10m/32Mi, lim 100m/128Mi
+
+### owntracks — tiny (1Mi)
+- **New**: req 10m/16Mi, lim 100m/64Mi
+
+### privatebin — 46Mi
+- **New**: req 10m/32Mi, lim 150m/256Mi
+
+### send — 53Mi
+- **New**: req 10m/32Mi, lim 150m/256Mi
+
+### shadowsocks — tiny (0Mi)
+- **New**: req 10m/16Mi, lim 100m/64Mi
+
+### tandoor — Django at 754Mi
+- VPA upper: 15m/1.14Gi
+- **New**: req 25m/256Mi, lim 250m/1.5Gi
+
+### tor-proxy — 61Mi
+- VPA upper: 15m/167Mi
+- **New**: req 10m/64Mi, lim 150m/256Mi
+
+### wealthfolio — tiny (8Mi)
+- **New**: req 10m/32Mi, lim 100m/128Mi
+
+### networking-toolbox — tiny, 3 replicas
+- **New**: req 10m/32Mi, lim 100m/128Mi
+
+### tuya-bridge — IoT bridge, 3 replicas
+- VPA upper: 15m/100Mi
+- **New**: req 10m/32Mi, lim 150m/256Mi
+
+### rybbit — Node.js backend at 185Mi
+- **New**: req 25m/128Mi, lim 250m/512Mi
+### rybbit-client — 89Mi
+- **New**: req 10m/64Mi, lim 150m/256Mi
+
+## Wave 3: PLATFORM MODULES — containers without explicit resources
+
+### mailserver — docker-mailserver at 183Mi (needs more for ClamAV)
+- VPA upper: 15m/317Mi
+- **New**: req 25m/128Mi, lim 500m/512Mi
+### dovecot-exporter
+- **New**: req 10m/16Mi, lim 100m/64Mi
+
+### cloudflared — 31-59Mi each, 3 replicas
+- VPA upper: 15m/110Mi
+- **New**: req 15m/32Mi, lim 200m/256Mi
+
+### pgadmin — 265Mi
+- VPA upper: 15m/413Mi
+- **New**: req 25m/128Mi, lim 500m/512Mi
+
+### phpmyadmin — 46Mi
+- VPA upper: 15m/100Mi
+- **New**: req 15m/32Mi, lim 250m/256Mi
+
+### crowdsec-web — 46Mi
+- **New**: req 15m/32Mi, lim 250m/256Mi
+
+### xray — 11Mi
+- **New**: req 10m/32Mi, lim 100m/128Mi
+
+### wireguard — tiny (2Mi)
+- **New**: req 10m/16Mi, lim 100m/128Mi
+### wireguard prometheus-exporter
+- **New**: req 10m/16Mi, lim 50m/64Mi
+
+### k8s-portal — 14Mi
+- **New**: req 10m/32Mi, lim 100m/128Mi
+
+## Wave 4: GPU CONTAINERS — add CPU/mem to GPU-only containers
+
+### ollama — SPECIAL: remove limits, keep minimal requests + GPU
+- **New**: req 100m/256Mi, lim nvidia.com/gpu=1 ONLY (no CPU/mem limits)
+
+### frigate — highest mem (3835Mi), CPU (860m)
+- VPA upper: 1.8 CPU, 6.65Gi mem
+- **New**: req 500m/2Gi, lim 4/8Gi + GPU:1
+
+### immich-machine-learning — 1215Mi
+- VPA upper: 15m/2.90Gi
+- **New**: req 100m/1Gi, lim 2/4Gi + GPU:1
+
+### immich-server — no resources, 404Mi, VPA 920m CPU
+- **New**: req 100m/256Mi, lim 2/2Gi
+
+### immich-postgresql — no resources, 268Mi
+- **New**: req 50m/256Mi, lim 1/1Gi
+
+### ollama-ui — 658Mi, no resources
+- VPA upper: 15m/969Mi
+- **New**: req 25m/256Mi, lim 500m/1.5Gi
+
+### whisper — 628Mi, no resources
+- VPA upper: 15m/969Mi
+- **New**: req 25m/256Mi, lim 500m/1.5Gi
+
+### piper — 32Mi
+- **New**: req 25m/64Mi, lim 250m/512Mi
+
+## Wave 5: RIGHT-SIZE OVER-PROVISIONED
+
+### kms-web-page — uses 0m/10Mi but has 500m/512Mi Guaranteed QoS
+- **New**: req 10m/16Mi, lim 50m/64Mi
+
+### kms (windows) — uses 0m/0Mi but has 1/512Mi
+- **New**: req 10m/32Mi, lim 100m/128Mi
+
+### city-guesser — uses 1m/23Mi but has 250m/500m CPU req/lim
+- **New**: req 10m/32Mi, lim 100m/256Mi
+
+### blog — uses 0m/17Mi but has 250m/500m
+- **New**: req 10m/32Mi, lim 100m/256Mi
+
+### travel-blog — uses 0m/9Mi, has 250m/500m
+- **New**: req 10m/32Mi, lim 100m/256Mi
+
+### webhook-handler — uses 1m/8Mi, has 250m/500m
+- **New**: req 10m/32Mi, lim 100m/256Mi
+
+### coturn — uses 1m/7Mi, has 100m/1 CPU
+- **New**: req 10m/32Mi, lim 100m/128Mi
+
+### health — uses 2m/101Mi, has 100m/1
+- **New**: req 15m/64Mi, lim 250m/256Mi
+
+### plotting-book — uses 0m/22Mi, has 50m/500m
+- **New**: req 10m/32Mi, lim 100m/256Mi
+
+### resume/printer — uses 3m/109Mi, VPA says 1.29Gi mem (Chromium!)
+- **New**: req 25m/128Mi, lim 500m/1.5Gi (Chromium headless)
+
+### resume — uses 1m/116Mi, has 25m/500m
+- **New**: req 15m/64Mi, lim 250m/384Mi
+
+### openclaw/modelrelay — uses low, VPA upper 1.22Gi mem
+- **New**: req 25m/64Mi, lim 500m/512Mi
+
+### atuin — uses 1m/2Mi
+- **New**: req 10m/16Mi, lim 100m/128Mi
+
+### vaultwarden — uses 1m/49Mi
+- **New**: req 10m/32Mi, lim 100m/256Mi
+
+### f1-stream — uses 7m/53Mi
+- **New**: req 25m/64Mi, lim 250m/256Mi
+
+### speedtest — uses 1m/147Mi, has 25m/500m
+- VPA upper: 418m CPU (spikes during tests!)
+- **New**: req 25m/128Mi, lim 1/512Mi
+
+### netbox — uses 1m/480Mi
+- VPA upper: 383m CPU, 605Mi mem
+- **New**: req 25m/256Mi, lim 500m/1Gi
+
+### meshcentral — uses 1m/127Mi
+- VPA upper: 15m/367Mi
+- **New**: req 15m/64Mi, lim 250m/512Mi
+
+### forgejo — uses 1m/170Mi
+- VPA upper: 15m/284Mi
+- **New**: req 15m/64Mi, lim 250m/512Mi
+
+### calibre-web-automated — uses 1m/196Mi
+- VPA upper: 63m/829Mi
+- **New**: req 25m/256Mi, lim 500m/1Gi
+
+### paperless-ngx — uses 4m/691Mi, VPA upper 1.70Gi
+- **New**: req 50m/512Mi, lim 1/2Gi
+
+### realestate-crawler-api — uses 2m/133Mi, has 50m/2000m CPU lim
+- **New**: req 15m/64Mi, lim 250m/512Mi
+
+### realestate-crawler-celery-beat — uses 0m/107Mi
+- **New**: req 10m/64Mi, lim 100m/256Mi
+
+### osrm-bicycle — uses 0m/366Mi
+- VPA upper: 15m/679Mi
+- **New**: req 15m/256Mi, lim 100m/1Gi
+
+### osrm-foot — no resources, uses 0m/359Mi
+- VPA upper similar to bicycle
+- **New**: req 15m/256Mi, lim 100m/1Gi
+
+### freedify — uses 2m/57-68Mi, has 100m/500m
+- **New**: req 15m/64Mi, lim 250m/256Mi
+
+### onlyoffice — uses 3m/1007Mi, has 250m/8 CPU (177x waste on CPU)
+- Keep memory at 4Gi (needs it), reduce CPU
+- **New**: req 100m/512Mi, lim 2/4Gi
--- a/.planning/research/ARCHITECTURE.md
+++ b/.planning/research/ARCHITECTURE.md
@ -0,0 +1,434 @@
+# Architecture Research
+
+**Domain:** Live stream aggregation and proxy service (F1 streaming)
+**Researched:** 2026-02-23
+**Confidence:** MEDIUM — HLS spec and proxy mechanics are HIGH confidence from RFC 8216 and Apple docs; extractor patterns are MEDIUM confidence from yt-dlp/streamlink analysis; system composition for this specific use-case is inferred from domain knowledge.
+
+---
+
+## Standard Architecture
+
+### System Overview
+
+```
+┌───────────────────────────────────────────────────────────────────┐
+│                        CLIENT LAYER                               │
+│  ┌──────────────────────────────────────────────────────────┐     │
+│  │   Svelte Frontend (schedule view, stream picker, player)  │     │
+│  └────────────────────────────┬─────────────────────────────┘     │
+└───────────────────────────────│───────────────────────────────────┘
+                                │ HTTP/REST
+┌───────────────────────────────▼───────────────────────────────────┐
+│                        API LAYER                                   │
+│  ┌──────────────────────────────────────────────────────────┐     │
+│  │   Backend API (schedule, streams, health state)           │     │
+│  └────────┬──────────────────┬──────────────────────────────┘     │
+└───────────│──────────────────│────────────────────────────────────┘
+            │                  │
+            ▼                  ▼
+┌───────────────────┐  ┌──────────────────────────────────────────┐
+│   SCHEDULE        │  │         EXTRACTION LAYER                  │
+│   SUBSYSTEM       │  │  ┌───────────┐  ┌───────────┐            │
+│                   │  │  │ Extractor │  │ Extractor │  ...        │
+│ Jolpica/OpenF1    │  │  │ Site A    │  │ Site B    │            │
+│ API client        │  │  └─────┬─────┘  └─────┬─────┘            │
+│                   │  │        │               │                   │
+│ Cron: refresh     │  │  ┌─────▼───────────────▼──────────────┐   │
+│ schedule          │  │  │   Extractor Registry / Dispatcher   │   │
+└───────────────────┘  │  └─────────────────────┬──────────────┘   │
+                        │                        │                   │
+                        │  ┌─────────────────────▼──────────────┐   │
+                        │  │   Stream Health Checker             │   │
+                        │  │   (HEAD/partial GET on .m3u8 URLs)  │   │
+                        │  └─────────────────────────────────────┘   │
+                        └──────────────────────────────────────────┘
+                                          │
+                                          ▼ valid stream URLs
+                        ┌──────────────────────────────────────────┐
+                        │         PROXY LAYER                       │
+                        │                                           │
+                        │  Master Playlist Rewriter                 │
+                        │  ┌────────────────────────────────────┐   │
+                        │  │ GET /proxy?url=<encoded-m3u8>       │   │
+                        │  │  → fetch upstream m3u8              │   │
+                        │  │  → rewrite all URIs to proxy paths  │   │
+                        │  │  → return modified playlist         │   │
+                        │  └────────────────────────────────────┘   │
+                        │                                           │
+                        │  Segment Relay                            │
+                        │  ┌────────────────────────────────────┐   │
+                        │  │ GET /relay?url=<encoded-segment>    │   │
+                        │  │  → upstream fetch with headers      │   │
+                        │  │  → pipe response to client          │   │
+                        │  └────────────────────────────────────┘   │
+                        └──────────────────────────────────────────┘
+                                          │
+                                          ▼ piped bytes
+                        ┌──────────────────────────────────────────┐
+                        │         STORAGE / CACHE                   │
+                        │  ┌─────────────────┐  ┌───────────────┐  │
+                        │  │ In-memory cache  │  │   NFS mount   │  │
+                        │  │ (stream links,   │  │ (schedule     │  │
+                        │  │  health status)  │  │  snapshots,   │  │
+                        │  └─────────────────┘  │  config)      │  │
+                        │                        └───────────────┘  │
+                        └──────────────────────────────────────────┘
+```
+
+---
+
+### Component Responsibilities
+
+| Component | Responsibility | Typical Implementation |
+|-----------|----------------|------------------------|
+| **Svelte Frontend** | Schedule display, stream picker UI, embedded HLS player | SvelteKit app; hls.js or Video.js for player |
+| **Backend API** | Serves schedule, current stream list, health status to frontend | Python (FastAPI) or Node.js; REST endpoints |
+| **Schedule Subsystem** | Polls Jolpica/OpenF1 API, normalises session data, stores locally | Async background task with cron interval |
+| **Extractor Registry** | Maps site hostnames to extractor implementations; dispatches extraction | Plain dict/map of site-key → extractor class |
+| **Per-Site Extractor** | Performs HTTP requests with session cookies/CSRF, parses HTML/JS, follows redirect chains, returns raw stream URL | Python class per site; uses `httpx`/`requests` + `BeautifulSoup`/`regex` |
+| **Stream Health Checker** | Verifies extracted URLs are live (partial GET on m3u8, checks HTTP 200 + content-type) | Background poller; marks streams up/down in cache |
+| **Proxy / Playlist Rewriter** | Fetches upstream m3u8, rewrites all embedded URIs to go back through the service (`/proxy` for nested playlists, `/relay` for segments), returns modified playlist | Stateless HTTP handler; no buffering of media data |
+| **Segment Relay** | Fetches upstream `.ts` or fMP4 (`.m4s`) segments and pipes bytes to client; forwards necessary headers | Streaming HTTP proxy (not buffered); forwards Range, Content-Type |
+| **In-Memory Cache** | Stores current stream states and health, avoids redundant extraction on every client request | Python dict with TTL, or Redis (existing cluster Redis) |
+| **NFS Storage** | Persists schedule snapshots, extractor configuration, optional diagnostics | NFS at `10.0.10.15` via existing pattern |
+
+---
+
+## Recommended Project Structure
+
+```
+f1-streams/
+├── backend/
+│   ├── api/
+│   │   ├── routes/
+│   │   │   ├── schedule.py     # GET /schedule
+│   │   │   ├── streams.py      # GET /streams, POST /streams/refresh
+│   │   │   └── proxy.py        # GET /proxy, GET /relay
+│   │   └── main.py             # FastAPI app, lifespan hooks
+│   ├── extractors/
+│   │   ├── base.py             # Extractor ABC: extract() -> list[StreamURL]
+│   │   ├── registry.py         # Map site-key -> extractor class
+│   │   ├── site_a.py           # Site-A specific extractor
+│   │   └── site_b.py           # Site-B specific extractor
+│   ├── schedule/
+│   │   ├── client.py           # Jolpica/OpenF1 API client
+│   │   ├── models.py           # Session, Race pydantic models
+│   │   └── poller.py           # Background cron task
+│   ├── health/
+│   │   └── checker.py          # Stream liveness verification
+│   ├── proxy/
+│   │   ├── playlist.py         # m3u8 fetch + URI rewriting
+│   │   └── relay.py            # Segment pipe-through handler
+│   ├── cache.py                # In-memory store with TTL
+│   └── config.py               # Site list, polling intervals, NFS paths
+├── frontend/
+│   ├── src/
+│   │   ├── routes/
+│   │   │   ├── +page.svelte    # Schedule home
+│   │   │   └── watch/
+│   │   │       └── +page.svelte # Stream picker + player
+│   │   ├── lib/
+│   │   │   ├── api.ts           # Backend API client
+│   │   │   ├── player.ts        # hls.js wrapper
+│   │   │   └── schedule.ts      # Session time formatting
+│   │   └── app.html
+│   ├── static/
+│   └── package.json
+├── stacks/
+│   └── f1-streams/
+│       ├── main.tf
+│       └── terragrunt.hcl
+└── Dockerfile                   # Multi-stage: backend + frontend
+```
+
+### Structure Rationale
+
+- **backend/extractors/**: One file per site; base class enforces interface. Adding a new site = add one file + register it. No change to core.
+- **backend/proxy/**: Isolated from extraction. Proxy only knows about URLs — it does not care how they were found.
+- **backend/schedule/**: Completely independent subsystem. Can fail without breaking stream delivery.
+- **backend/health/**: Decoupled checker; stores results in cache, consulted by API on `/streams` requests.
+- **frontend/**: Standard SvelteKit layout. Minimal — schedule + player, nothing else.
+- **stacks/f1-streams/**: Single Terragrunt stack following existing pattern in repo.
+
+---
+
+## Architectural Patterns
+
+### Pattern 1: Extractor Plugin Interface
+
+**What:** Each site extractor implements a fixed interface (`extract(session_hint) -> list[StreamURL]`). The registry maps site keys to extractor classes. The dispatcher iterates the registry, calls each extractor, aggregates results.
+
+**When to use:** Always — the number of sites will grow and their anti-scraping measures change independently. Isolation prevents one broken extractor from affecting others.
+
+**Trade-offs:** Slightly more boilerplate per site; but each extractor is testable in isolation and replaceable without touching shared code.
+
+**Example:**
+```python
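+from __future__ import annotations
+from abc import ABC, abstractmethod
+from dataclasses import dataclass
+
+@dataclass
+class StreamURL:  # placeholder model; the real fields are not specified here
+    url: str
+    site_key: str
+
+@dataclass
+class SessionHint:  # placeholder: which session the caller is looking for
+    session_type: str
+
+class BaseExtractor(ABC):
+    site_key: str  # e.g. "siteA"
+
+    @abstractmethod
+    async def extract(self, hint: SessionHint | None = None) -> list[StreamURL]:
+        """Return list of live stream URLs found on this site."""
+        ...
+
+class SiteAExtractor(BaseExtractor):
+    site_key = "siteA"
+
+    async def extract(self, hint: SessionHint | None = None) -> list[StreamURL]:
+        # 1. GET page, parse CSRF token from HTML
+        # 2. POST with token to get obfuscated JSON
+        # 3. Decode JS-obfuscated URL
+        # 4. Follow redirects to final .m3u8
+        raise NotImplementedError("site-specific steps go here")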
+```
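The isolation property described for this pattern (one broken extractor must not affect the others) can be sketched on the dispatcher side. This is an illustrative sketch only: `run_extractors` is a hypothetical name, and it assumes each extractor exposes `site_key` and an async `extract()` as in the interface above.

```python
import asyncio


async def run_extractors(extractors):
    """Call every extractor concurrently; a failure in one does not stop the rest."""
    results = await asyncio.gather(
        *(ex.extract() for ex in extractors),
        return_exceptions=True,  # exceptions come back as values instead of being raised
    )
    streams, failures = [], {}
    for ex, result in zip(extractors, results):
        if isinstance(result, BaseException):
            failures[ex.site_key] = result  # record the failure and move on
        else:
            streams.extend(result)
    return streams, failures
```

Returning the failures alongside the streams lets the caller log or surface which sites broke without losing the results from the sites that worked.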
+
+### Pattern 2: Playlist Rewriting Proxy
+
+**What:** The proxy layer fetches the upstream m3u8 and rewrites every URL inside it to point back through the service: master → variant pointers go through `/proxy?url=<base64-encoded-original>` so that the variant playlist is rewritten in turn, and variant → segment pointers go through `/relay?url=<base64-encoded-original>`. The client never contacts upstream directly.
+
+**When to use:** Always when proxying HLS: the player follows the URLs in the playlist, and if those point to the origin CDN, the proxy is bypassed for segment delivery.
+
+**Trade-offs:** Adds ~1 hop latency per segment request. For a private service with 1-5 users, this is negligible. Benefit: hides origin, enables header injection (e.g., `Referer`), unified player experience.
+
+**Example:**
+```python
+import base64
+import urllib.parse
+
+def rewrite_playlist(m3u8_text: str, base_url: str, proxy_base: str) -> str:
+    """Rewrite all URI lines in an m3u8 to go through the proxy endpoints."""
+    lines = []
+    for line in m3u8_text.splitlines():
+        if line and not line.startswith("#"):
+            # Resolve the (possibly relative) URI against the playlist URL.
+            absolute = urllib.parse.urljoin(base_url, line.strip())
+            # b64encode works on bytes and returns bytes; use the URL-safe alphabet.
+            token = base64.urlsafe_b64encode(absolute.encode()).decode()
+            # Nested playlists must be rewritten too, so they go back through /proxy.
+            is_playlist = urllib.parse.urlsplit(absolute).path.endswith(".m3u8")
+            endpoint = "proxy" if is_playlist else "relay"
+            lines.append(f"{proxy_base}/{endpoint}?url={token}")
+        else:
+            # Tags and blank lines pass through; URI="..." attributes are not handled here.
+            lines.append(line)
+    return "\n".join(lines)
+```
+
+### Pattern 3: Background Polling with In-Memory Cache
+
+**What:** Extraction and health checking run as background tasks on a schedule (e.g., every 2 minutes). Results are stored in a shared in-memory dict with timestamps. The API layer reads from cache and returns immediately — no per-request extraction.
+
+**When to use:** Always — on-demand extraction per client request would be slow (2-10s per site) and would hammer the source sites.
+
+**Trade-offs:** Cache staleness window (default 2 min). Acceptable for live sports: streams stay stable once live.
+
+**Example:**
+```python
+# cache.py
+_stream_cache: dict[str, CachedResult] = {}
+
+async def get_streams() -> list[StreamURL]:
+    if cache_is_fresh():
+        return _stream_cache["streams"].data
+    # else trigger background refresh
+    ...
+```
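The freshness check left open in the snippet above could look roughly like this. It is a sketch under stated assumptions: `TTLCache` is a hypothetical helper, the 120-second TTL simply mirrors the 2-minute polling interval, and the clock is injectable only to make the behaviour easy to test.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class CachedResult:
    data: Any
    stored_at: float


@dataclass
class TTLCache:
    ttl_seconds: float = 120.0  # assumption: matches the 2-minute polling interval
    clock: Callable[[], float] = time.monotonic  # injectable for tests
    _entries: dict = field(default_factory=dict)

    def put(self, key, data):
        """Store data under key, stamped with the current clock value."""
        self._entries[key] = CachedResult(data, self.clock())

    def get(self, key):
        """Return (data, is_fresh); data is None if nothing was ever stored."""
        entry = self._entries.get(key)
        if entry is None:
            return None, False
        return entry.data, (self.clock() - entry.stored_at) < self.ttl_seconds
```

Returning stale data together with a freshness flag lets the API answer immediately while the background task decides whether a refresh is due.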
+
+---
+
+## Data Flow
+
+### Stream Discovery Flow (background)
+
+```
+[Cron trigger: every 2 min]
+        ↓
+[Extractor Registry]
+        ↓ (fan-out, concurrent)
+[SiteA Extractor]   [SiteB Extractor]   [SiteN Extractor]
+        ↓
+[Raw stream URLs: list of .m3u8 candidates]
+        ↓
+[Health Checker: partial GET each URL]
+        ↓ (filter: only HTTP 200 + an HLS content-type such as application/vnd.apple.mpegurl)
+[Validated stream URLs]
+        ↓
+[Cache: store with timestamp + site metadata]
+```
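The filter step in the flow above can be expressed as a pure function over the result of a partial GET. This is a sketch, not the service's actual checker: the function name is hypothetical, accepting 206 (for servers that honour a Range request) is an assumption, and `application/x-mpegurl` is included only because it is a commonly seen non-standard type.

```python
HLS_CONTENT_TYPES = {
    "application/vnd.apple.mpegurl",  # listed in RFC 8216
    "audio/mpegurl",                  # listed in RFC 8216
    "application/x-mpegurl",          # assumption: common non-standard variant
}


def looks_like_live_playlist(status: int, content_type: str, body_prefix: bytes) -> bool:
    """Decide from a partial GET whether a candidate URL is a usable HLS playlist."""
    if status not in (200, 206):  # 206 if the server honoured a Range request
        return False
    # Drop parameters such as "; charset=utf-8" and compare case-insensitively.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in HLS_CONTENT_TYPES:
        return False
    # An HLS playlist starts with the #EXTM3U tag.
    return body_prefix.lstrip().startswith(b"#EXTM3U")
```

Checking the first bytes of the body as well as the headers guards against sites that return an HTML error page with a 200 status.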
+
+### Client Playback Flow (per request)
+
+```
+[User opens /watch in browser]
+        ↓
+[Frontend GET /api/streams]
+        ↓
+[Backend reads cache → returns stream list (site, quality, label)]
+        ↓
+[User picks a stream]
+        ↓
+[Player requests: GET /proxy?url=<m3u8-url>]
+        ↓
+[Backend: fetch upstream m3u8, rewrite URIs → return modified m3u8]
+        ↓
+[Player follows variant playlist: GET /proxy?url=<variant-m3u8>]
+        ↓
+[Backend: rewrite segment URIs]
+        ↓
+[Player fetches segments: GET /relay?url=<segment>]
+        ↓
+[Backend: upstream fetch, pipe bytes → client]
+        ↓
+[Video plays in browser]
+```
+
+### Schedule Flow
+
+```
+[Cron: daily or on-demand]
+        ↓
+[Schedule Client: GET Jolpica API /ergast/f1/current.json]
+        ↓
+[Parse: races, session types, UTC timestamps]
+        ↓
+[Normalise: map to internal Session model]
+        ↓
+[Store: NFS JSON file + in-memory cache]
+        ↓
+[Frontend GET /api/schedule → displays session list]
+```
+
+### Key Data Flows
+
+1. **Extraction → Cache → API → Frontend**: All stream data originates from extractors, flows through the cache as the single source of truth, and is served read-only to the frontend. No frontend-triggered extraction.
+2. **Client → Proxy → Upstream CDN**: For segments, the proxy is a pure pass-through relay. It does not store them; bytes from upstream go directly to the client socket.
+3. **Schedule API → NFS**: Schedule data is written to NFS on refresh so the pod can serve it immediately on restart without waiting for the next API poll.
+
+---
+
+## Component Boundaries
+
+| Component | Owns | Does Not Own |
+|-----------|------|--------------|
+| Extractor (per site) | How to get stream URL from that site | Health checking, caching, proxying |
+| Health Checker | Liveness state of each URL | How the URL was found |
+| Proxy / Relay | Rewriting m3u8 URIs, piping bytes | Authentication with upstream (that's extractor's job) |
+| Schedule Subsystem | F1 session calendar data | Stream availability for a given session |
+| Backend API | Serving current state to frontend | Fetching or refreshing state |
+| Frontend | User interaction, player | Any backend logic |
+
+---
+
+## Suggested Build Order (Phase Dependencies)
+
+The dependencies flow strictly top to bottom: each phase depends only on the phases above it being stable:
+
+```
+Phase 1: Schedule Subsystem
+    ↓ (F1 data available)
+Phase 2: Extractor Framework + First Site Extractor
+    ↓ (raw URLs available)
+Phase 3: Health Checker
+    ↓ (validated URLs available)
+Phase 4: Proxy / Relay Layer
+    ↓ (streams playable through service)
+Phase 5: Frontend (schedule + player)
+    ↓ (end-to-end usable)
+Phase 6: Additional Site Extractors
+    ↓ (stream coverage widened)
+Phase 7: K8s Deployment (Terraform/Terragrunt stack)
+```
+
+**Rationale:**
+- Schedule first: gives a testable data source with zero anti-scraping complexity.
+- Extractor framework before specific sites: the base class and registry must exist before any site can plug in.
+- Health checker before proxy: no point proxying dead streams; the checker filters the list fed to the proxy.
+- Proxy before frontend: the frontend player needs a working `/proxy` endpoint to function.
+- Frontend last of core: all backend components are independently testable via curl/httpie before a UI exists.
+- Additional extractors after core is working: adding more sites is low-risk incremental work once the pattern is proven.
+- Deployment last: deploy once the service works end-to-end locally; avoids debugging infra and app simultaneously.
+
+---
+
+## Anti-Patterns
+
+### Anti-Pattern 1: On-Demand Extraction Per Client Request
+
+**What people do:** Trigger extraction when the user clicks "show streams" in the browser.
+
+**Why it's wrong:** Extraction takes 2-10 seconds per site (HTTP round trips, JS parsing, redirect following). With multiple sites queried one after another, this is 10-30 seconds of wall time. Source sites may rate-limit aggressive bursts. Multiple concurrent users would multiply the load.
+
+**Do this instead:** Run extraction on a background schedule. Cache results. The API returns immediately from cache. The user sees streams in <100ms.
+
+### Anti-Pattern 2: Single Extractor Handles All Sites
+
+**What people do:** One big function with `if site == "A": ... elif site == "B": ...` branches.
+
+**Why it's wrong:** Sites change their obfuscation methods independently. A change to Site A's extraction logic can accidentally break Site B. Testing is impossible in isolation. Adding Site C requires modifying a shared file.
+
+**Do this instead:** One class per site, implementing a common interface. Changes to Site A's extractor never touch Site B's code.
+
+### Anti-Pattern 3: Buffering Segments in Memory Before Sending
+
+**What people do:** Download the entire `.ts` segment to memory, then serve it to the client.
+
+**Why it's wrong:** HLS segments can be 2-10 MB each. With multiple concurrent viewers, memory pressure grows quickly. It also introduces unnecessary latency (the client waits for the full download before the first byte).
+
+**Do this instead:** Pipe bytes from the upstream response directly to the client socket as they arrive (chunked transfer). The client starts receiving immediately, and memory stays flat.
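The difference between buffering and piping comes down to reading the upstream body in bounded pieces. The sketch below uses a plain file-like object to stand in for the upstream response; `relay_chunks` and the 64 KiB chunk size are illustrative assumptions. In the real service the same loop would presumably wrap a streaming HTTP response from whichever client library is chosen.

```python
from typing import BinaryIO, Iterator


def relay_chunks(upstream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield the upstream body piece by piece instead of reading it all at once."""
    while True:
        chunk = upstream.read(chunk_size)
        if not chunk:  # an empty read means the upstream body is finished
            return
        yield chunk
```

Because at most one chunk is held at a time, memory use per viewer is bounded by `chunk_size` rather than by the segment size.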
+
+### Anti-Pattern 4: Hardcoding Site URLs and Tokens in Extractor Logic
+
+**What people do:** Hardcode `BASE_URL = "https://site-a.example.com"` and referer/cookie values inside the extractor file.
+
+**Why it's wrong:** Sites change domains and anti-scraping parameters frequently. When a site moves, you have to find and edit code rather than config.
+
+**Do this instead:** Each extractor reads its config (base URL, required headers, any known static tokens) from a config object injected at construction. The registry passes config to extractors at instantiation.
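Config injection at construction might be sketched as follows. All names here (`ExtractorConfig`, `ConfiguredExtractor`, `build_extractors`) are hypothetical, and the config shape (a base URL plus a headers dict) is an assumption based on the items listed above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExtractorConfig:
    base_url: str
    headers: dict = field(default_factory=dict)


class ConfiguredExtractor:
    """Hypothetical extractor that takes everything site-specific from config."""

    def __init__(self, config: ExtractorConfig):
        self.config = config

    def page_url(self, path: str) -> str:
        # Build URLs from the injected base URL rather than a hardcoded constant.
        return f"{self.config.base_url.rstrip('/')}/{path.lstrip('/')}"


def build_extractors(site_configs: dict, classes: dict) -> dict:
    """Instantiate one extractor per configured site, injecting its config."""
    return {
        key: classes[key](ExtractorConfig(**raw))
        for key, raw in site_configs.items()
        if key in classes  # skip config entries that have no extractor class
    }
```

With this shape, a site moving to a new domain becomes a one-line change in the config source rather than a code edit.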
+
+---
+
+## Integration Points
+
+### External Services
+
+| Service | Integration Pattern | Notes |
+|---------|---------------------|-------|
+| Jolpica F1 API (`api.jolpi.ca/ergast/f1/`) | REST GET, poll daily | No API key required; backwards-compatible Ergast endpoints; schedule data available |
+| OpenF1 API (`api.openf1.org/`) | REST GET, poll as needed | No API key; 3 req/s rate limit; 2023+ data only; useful for session status (live/upcoming) |
+| Upstream streaming sites (Site A, B, N) | HTTP GET/POST with session cookies, CSRF tokens | Per-site; no shared pattern; treated as black boxes by the framework |
+| Upstream CDN (HLS segments) | HTTP GET with Range support | Proxy relays bytes; must forward `Referer` and sometimes `Origin` headers or CDN rejects |
+
+### Internal Boundaries
+
+| Boundary | Communication | Notes |
+|----------|---------------|-------|
+| Extractor → Dispatcher → Cache | Return value, then one write by the dispatcher | Extractors do not call the cache directly; the dispatcher aggregates results, then writes once |
+| API → Cache | Direct read | Synchronous, O(1) |
+| API → Proxy | Not direct — frontend calls `/proxy` endpoint, which is part of the same backend process | Can be split into separate service later if needed |
+| Proxy → Upstream CDN | Outbound HTTP | Must preserve session headers; upstream CDN may check Referer/Origin |
+| Schedule Poller → NFS | File write (JSON) | On pod restart, reads NFS before first API poll |
+
+---
+
+## Scaling Considerations
+
+This is a single-user or small-group private service. Scaling is not a primary concern, but here are the natural pressure points:
+
+| Scale | Architecture Adjustments |
+|-------|--------------------------|
+| 1-5 concurrent viewers | Single backend pod, in-memory cache, direct pipe relay — fully sufficient |
+| 10-20 concurrent viewers | Same architecture; segment relay becomes the bandwidth bottleneck (each viewer streams independently) — add HLS caching proxy (nginx) in front of relay |
+| 50+ concurrent viewers | Segment relay load increases linearly; consider a CDN or caching layer for segments; extraction/health remain unchanged |
+
+### Scaling Priorities
+
+1. **First bottleneck:** Outbound bandwidth on segment relay. Each viewer pulls full bitrate independently through the service. At private-use scale this is negligible (1-5 viewers).
+2. **Second bottleneck:** In-memory cache consistency if multiple pods are deployed (each pod holds its own in-process cache). Solved by using the existing cluster Redis instead of an in-process dict, but unnecessary until horizontal scaling.
+
+---
+
+## Sources
+
+- HLS specification: RFC 8216 (IETF) — playlist structure, master/media playlist relationship, segment mechanics (HIGH confidence)
+- HLS proxy pattern: Apple Developer Documentation (conceptual), corroborated by yt-dlp extractor framework analysis (MEDIUM confidence)
+- yt-dlp plugin architecture: github.com/yt-dlp/yt-dlp README + docs (MEDIUM confidence)
+- OpenF1 API: openf1.org official page — endpoints, rate limits, data coverage (HIGH confidence)
+- Jolpica F1 API: github.com/jolpica/jolpica-f1 — Ergast compatibility, availability (MEDIUM confidence)
+- System composition for this domain: inference from domain patterns, corroborated by extractor tool analysis (MEDIUM confidence)
+
+---
+
+*Architecture research for: Live stream aggregation and proxy service (F1)*
+*Researched: 2026-02-23*
--- a/.planning/research/FEATURES.md
+++ b/.planning/research/FEATURES.md
@ -0,0 +1,215 @@
+# Feature Research
+
+**Domain:** Live Stream Aggregation / Sports Stream Proxy Service
+**Researched:** 2026-02-23
+**Confidence:** MEDIUM
+
+---
+
+## Feature Landscape
+
+### Table Stakes (Users Expect These)
+
+Features users assume exist. Missing these = product feels incomplete.
+
+| Feature | Why Expected | Complexity | Notes |
+|---------|--------------|------------|-------|
+| Race schedule view | Users need to know when sessions are live without external lookup | LOW | Pull from OpenF1 API (`/sessions` endpoint). Session types: FP1, FP2, FP3, Quali, Sprint, Sprint Quali, Race. Confidence: HIGH (OpenF1 API confirmed). |
+| Live session indicator | Users need to distinguish live vs upcoming vs finished sessions at a glance | LOW | Visual status badge (LIVE / UPCOMING / FINISHED) based on session start time + duration. No polling needed at schedule level. |
+| Stream picker | Multiple stream sources per session — user picks which one to watch | LOW | List available extracted stream links with source label. Core UX of the whole product. |
+| Embedded video player | Users won't navigate to external players for each stream | MEDIUM | HLS.js in Svelte for in-page playback. Must handle m3u8 sources natively. Confidence: HIGH (HLS.js is the standard client-side HLS library). |
+| Stream health indicator | Users don't want to click a dead stream and stare at a spinner | MEDIUM | Backend health-check each extracted URL before displaying. Simple HEAD or short-lived GET on the m3u8 playlist. Mark dead streams visually. |
+| CORS-transparent stream proxy | Browsers block cross-origin HLS requests; streams can't play directly from scraped origins | HIGH | Proxy all m3u8 manifests + .ts/.m4s segments through your own backend. Rewrite manifest URLs to point to your proxy. This is architecturally mandatory, not optional. Confidence: HIGH (HLS-Proxy documentation confirms this). |
+| All F1 session types covered | Users specifically want FP, Quali, Sprint, Race, and pre/post content — not just race day | MEDIUM | Scraper scheduler must run for every session type on the F1 calendar. OpenF1 `/sessions` endpoint returns `session_type` field. |
+| Session countdown timer | For upcoming sessions, users want to know time-until-start without mental math | LOW | Client-side countdown from schedule data already fetched. Zero backend cost. |
+| Stream auto-refresh / re-extraction | Stream links expire (tokens, redirect chains rotate) — stale links silently fail | HIGH | Periodic re-extraction (e.g., every 5-10 min during a live session). Depends on extractor infrastructure. |
+| Multiple quality options (if available) | Users on slow connections need lower bitrate; users on fast connections want max quality | MEDIUM | Expose quality variants from multi-variant HLS playlists if source provides them. Let user pick or default to auto (hls.js handles ABR natively). |
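+
+The live session indicator and the countdown timer both reduce to comparisons against the session window. A minimal sketch of that logic, assuming timezone-aware datetimes; the function names are hypothetical, and in the real service the countdown would run client-side:
+
```python
from datetime import datetime, timedelta, timezone


def session_status(start: datetime, duration: timedelta, now: datetime) -> str:
    """Return UPCOMING, LIVE, or FINISHED for a session window."""
    if now < start:
        return "UPCOMING"
    if now < start + duration:
        return "LIVE"
    return "FINISHED"


def countdown(start: datetime, now: datetime) -> timedelta:
    """Time until the session starts; zero once it has begun."""
    return max(start - now, timedelta(0))
```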
+
+---
+
+### Differentiators (Competitive Advantage)
+
+Features that set the product apart. Not required, but valuable.
+
+| Feature | Value Proposition | Complexity | Notes |
+|---------|-------------------|------------|-------|
+| Automatic stream extraction at session start | Zero manual effort — streams appear when the session goes live | HIGH | Cron/scheduler tied to F1 calendar. Triggers extractors N minutes before session start. Eliminates "is there a stream yet?" manual checking. |
+| Per-site extractor isolation | Bypassing CSRF/JS obfuscation cleanly per site without shared code that breaks globally | HIGH | Each extractor is a self-contained module. One site's changes don't break others. Confidence: MEDIUM (pattern from streamlink plugin system). |
+| Session timeline: pre/post shows + press conferences | Competitors (scrapers, IPTV playlists) cover race only; full weekend coverage is rare | MEDIUM | Requires scheduling extractors for non-race events. OpenF1 does not cover pre/post shows — need site-specific session detection. |
+| Stream source labeling | Shows which site/feed each stream came from — users learn which sources are reliable | LOW | Store source metadata with each extracted URL. Display in picker. |
+| Fallback stream ordering | Automatically surfaces known-good streams first when multiple sources exist | MEDIUM | Health-check result + historical success rate drives ordering. Depends on: stream health checking + a minimal persistence layer to store success history. |
+| Proxy-cached segment prefetch | Reduces buffering by prefetching upcoming .ts segments into local cache | HIGH | Node-HLS-Proxy pattern: maintain per-stream segment cache up to N segments ahead. High implementation cost for marginal UX gain at private scale. |
+| Session notes / source reputation | Lightweight annotations (e.g., "this source often drops at lap 40") | LOW | Simple static config or admin-editable markdown. No database needed at MVP. |
+| Race weekend overview page | One page showing all sessions for a Grand Prix weekend — not just next session | LOW | Group sessions by event/round from schedule API. Pure frontend feature once schedule data is available. |
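+
+Fallback stream ordering can be expressed as a two-part sort key. The sketch below assumes a hypothetical `Stream` record with a latest health-check result and a stored historical success rate; neither the type nor its fields come from the project.
+
```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stream:
    source: str          # site label shown in the picker
    healthy: bool        # result of the latest health check
    success_rate: float  # historical fraction of successful checks, 0.0 to 1.0


def order_streams(streams: List[Stream]) -> List[Stream]:
    """Healthy streams first; within each group, higher historical success first."""
    return sorted(streams, key=lambda s: (not s.healthy, -s.success_rate))
```
+
+If the service stays fully stateless, setting `success_rate` to a constant degrades this to health-check-only ordering.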
+
+---
+
+### Anti-Features (Commonly Requested, Often Problematic)
+
+Features to explicitly NOT build.
+
+| Feature | Why Requested | Why Problematic | Alternative |
+|---------|---------------|-----------------|-------------|
+| DVR / stream recording | Users want to rewatch if they miss something | Massive storage cost, legal exposure, complexity (recording live HLS streams, serving VOD). Out of scope by design. | Live viewing only. Accept the constraint. |
+| Chat / comments | Social viewing experience | Scope creep. You're building a stream aggregator, not a community platform. Auth, moderation, and DB schema all follow. | None — explicitly out of scope. |
+| User accounts / watchlists | "Remember my preferred stream source" | Requires auth layer, session storage, DB. Contradicts the "no auth, private URL" design decision. | Persist last-used quality/source in browser localStorage. Zero backend cost. |
+| Stream transcoding / re-encoding | Normalize quality across sources | Enormous CPU cost, latency, and complexity. An FFmpeg transcoding pipeline per stream is overkill for a private service. | Pass-through proxy only. Let hls.js handle ABR on the client. |
+| Headless browser extraction | Universal extractor that handles any site's JS obfuscation | Puppeteer/Playwright adds 200-400 MB RAM per session, slow cold starts, flaky in containers, and complex cluster scheduling. Per-site custom extractors are faster and more reliable. | Custom per-site extractors (Go/Python HTTP + regex/DOM parser). |
+| Mobile app | Access on phone | Web app with responsive Svelte layout is sufficient. Native app is weeks of work for a private tool. | Responsive web design. PWA if needed. |
+| Discovery / search for new stream sites | Auto-find new sources | Scraping discovery is an unsolved problem and a rabbit hole. You have a fixed list of sites. | User-provided site list. Extractor per site. |
+| Telemetry overlay / timing data | F1 fans love live timing alongside streams | Different product category (timing dashboard vs stream aggregator). OpenF1 has timing data but integrating it is a separate project. | Link to existing timing tools (e.g., openf1.org). |
+| DRM stream support | Some quality sources use Widevine/FairPlay | DRM circumvention is legally distinct from re-streaming. Avoid. | Non-DRM HLS sources only. |
+
+---
+
+## Feature Dependencies
+
+```
+Race Schedule View
+    └──requires──> F1 Schedule API Integration (OpenF1 or Jolpica, the Ergast-compatible successor)
+                       └──enables──> Session Countdown Timer
+                       └──enables──> Automatic Extraction Trigger
+
+Stream Picker
+    └──requires──> CORS-Transparent Stream Proxy (browser cannot directly fetch cross-origin m3u8)
+    └──requires──> Stream Health Indicator (to filter dead streams before display)
+                       └──requires──> Stream Health Checker (backend periodic HEAD/GET)
+
+Embedded Video Player
+    └──requires──> CORS-Transparent Stream Proxy (proxied URLs served from same origin)
+    └──requires──> Stream Picker (to know which URL to play)
+
+Stream Auto-Refresh
+    └──requires──> Per-Site Extractor (to re-run extraction)
+    └──requires──> Session-live detection (know when to run vs stop)
+
+Fallback Stream Ordering
+    └──requires──> Stream Health Indicator
+    └──enhances──> Stream Picker (surfaces best streams first)
+
+Multiple Quality Options
+    └──requires──> CORS-Transparent Stream Proxy (proxy must rewrite variant playlist URLs too)
+    └──enhances──> Embedded Video Player (user control or ABR)
+
+Proxy-Cached Segment Prefetch
+    └──requires──> CORS-Transparent Stream Proxy (must be same proxy layer)
+    └──conflicts──> Minimal resource footprint (high memory cost)
+
+Session Timeline (pre/post/press conf)
+    └──requires──> F1 Schedule API Integration (for race events)
+    └──requires──> Per-Site Session Detection (API doesn't include pre/post show timing)
+```
+
+### Dependency Notes
+
+- **Stream Picker requires CORS proxy:** Browsers enforce the same-origin policy. A scraped m3u8 URL from `site.com` cannot be fetched by a Svelte app on `f1.viktorbarzin.me` unless the upstream sends permissive CORS headers, which cannot be assumed for scraped sources. Every user-facing stream URL must therefore route through the proxy backend. This is a hard architectural dependency, not an option.
+- **Stream health checker enables stream picker quality:** Without health checking, the picker shows dead links. Health checking must run before streams are displayed and periodically during live sessions.
+- **Automatic extraction trigger depends on schedule:** The scheduler must know when sessions start. Schedule API integration is therefore the first thing to build — everything else gates on it.
+- **Multiple quality options conflict with simple proxy:** If the source provides a multi-variant HLS playlist, the proxy must rewrite ALL variant URLs (not just the master manifest). Adds complexity to the proxy rewriting layer.
+- **Fallback ordering conflicts with stateless proxy:** Tracking success history requires at least a lightweight persistence layer (e.g., Redis or SQLite). If staying fully stateless, fall back to health-check-only ordering.
+
+---
+
+## MVP Definition
+
+### Launch With (v1)
+
+Minimum viable product — what's needed to validate the concept.
+
+- [ ] **F1 Schedule view** — Show upcoming/live sessions for the current season. Single page, no navigation needed.
+- [ ] **CORS-transparent HLS proxy** — Proxy m3u8 manifests + segment URLs through the backend. Without this, nothing plays in the browser.
+- [ ] **Per-site stream extractor(s)** — At least one working extractor for at least one reliable source site. Proves the extraction pipeline end-to-end.
+- [ ] **Stream health checker** — Validate extracted URLs before showing. Dead streams must not surface to users.
+- [ ] **Stream picker** — List available working streams for the current session. User clicks, player loads.
+- [ ] **Embedded HLS player** — HLS.js in Svelte. Plays proxied m3u8 URL in-page.
+- [ ] **Session countdown** — Time-until-start for upcoming sessions. Pure frontend, zero cost.
+- [ ] **Live session indicator** — Visual LIVE/UPCOMING/FINISHED badge. Core navigational signal.
+
+### Add After Validation (v1.x)
+
+Features to add once core pipeline is working and streams actually play reliably.
+
+- [ ] **Stream auto-refresh** — Re-run extractors every 5-10 min during live sessions. Trigger: user reports dead stream or health check fails on previously-valid URL.
+- [ ] **Fallback stream ordering** — Sort by health-check recency and past reliability. Trigger: multiple sources available per session.
+- [ ] **Source labeling in picker** — Show site name with each stream link. Low effort, high trust signal for users.
+- [ ] **Race weekend overview** — All sessions grouped per Grand Prix. Trigger: users navigating between sessions in a weekend.
+- [ ] **Additional extractors** — Expand site coverage once first extractor is stable. Each adds incremental reliability.
+
+### Future Consideration (v2+)
+
+Features to defer until product-market fit is established.
+
+- [ ] **Pre/post show + press conference coverage** — Complex site-specific session detection. Defer until core race coverage is solid.
+- [ ] **Multiple quality options** — Source sites may or may not provide multi-variant playlists, and rewriting variant URLs in the proxy is non-trivial. First validate whether sources actually offer quality tiers.
+- [ ] **Proxy segment prefetch/cache** — High memory cost. Only valuable if buffering is a real user complaint at private scale.
+- [ ] **Session reputation annotations** — Nice UX polish. Not needed at launch.
+
+---
+
+## Feature Prioritization Matrix
+
+| Feature | User Value | Implementation Cost | Priority |
+|---------|------------|---------------------|----------|
+| F1 Schedule view | HIGH | LOW | P1 |
+| CORS-transparent HLS proxy | HIGH | HIGH | P1 (architectural blocker) |
+| Per-site stream extractor | HIGH | HIGH | P1 (core value) |
+| Embedded HLS player | HIGH | LOW | P1 |
+| Stream health checker | HIGH | MEDIUM | P1 |
+| Stream picker | HIGH | LOW | P1 |
+| Session countdown timer | MEDIUM | LOW | P1 |
+| Live session indicator | HIGH | LOW | P1 |
+| Stream auto-refresh | HIGH | MEDIUM | P2 |
+| Source labeling | MEDIUM | LOW | P2 |
+| Fallback stream ordering | MEDIUM | MEDIUM | P2 |
+| Race weekend overview page | MEDIUM | LOW | P2 |
+| Additional extractors | HIGH | MEDIUM | P2 |
+| Multiple quality options | MEDIUM | HIGH | P3 |
+| Pre/post show coverage | MEDIUM | HIGH | P3 |
+| Proxy segment prefetch | LOW | HIGH | P3 |
+| Session reputation annotations | LOW | LOW | P3 |
+
+**Priority key:**
+- P1: Must have for launch
+- P2: Should have, add when possible
+- P3: Nice to have, future consideration
+
+---
+
+## Competitor Feature Analysis
+
+Reference products surveyed: RaceControl (unofficial F1TV client), f1viewer (TUI F1TV client), streamlink (stream extraction CLI), HLS-Proxy (node HLS proxy), Threadfin (M3U proxy), ErsatzTV (self-hosted IPTV).
+
+| Feature | RaceControl (F1TV client) | Streamlink (CLI extractor) | HLS-Proxy (node) | Our Approach |
+|---------|--------------------------|---------------------------|-----------------|--------------|
+| Session schedule | F1TV API (official, auth required) | None (site-specific) | None | OpenF1/Jolpica (free, unauthenticated) |
+| Stream extraction | Official F1TV API | Plugin-per-site Python | N/A | Custom per-site extractors (Go/Python HTTP) |
+| Stream quality selection | Multi-variant picker + Chromecast | CLI flag `--default-stream` | Pass-through | HLS.js ABR + manual picker |
+| Multi-stream view | Yes (layout builder, experimental sync) | Multiple instances | N/A | Single stream (MVP), multi optional later |
+| Health checking | None visible | None | None | Active periodic health checks (our differentiator) |
+| Stream proxy | No (plays direct from F1TV CDN) | No (piped to local player) | Yes (manifest + segment rewrite) | Yes (mandatory for browser CORS) |
+| CORS handling | N/A (desktop app) | N/A (local) | Yes (adds permissive CORS headers) | Yes (same-origin proxy) |
+| Auto-extraction at session start | Via F1TV live schedule | None | None | Yes (scheduler + extractor trigger) |
+| Embedded browser player | No (external VLC/mpv) | No (external player) | N/A | Yes (HLS.js in Svelte) |
+| No auth required | No (F1TV subscription) | Varies by source | None | Yes (private URL, no auth layer) |
+
+**Key insight:** Existing tools either require official F1TV credentials (RaceControl, f1viewer) or extract streams to local players (streamlink). None combine automated extraction from unofficial sources + browser-native proxied playback + schedule integration in a single web service. That combination is the product's core novelty.
+
+---
+
+## Sources
+
+- OpenF1 API documentation: https://openf1.org/ — MEDIUM confidence (marketing page, limited technical detail on session endpoints)
+- HLS-Proxy (warren-bank/HLS-Proxy) README — HIGH confidence for proxy architecture requirements (CORS, manifest rewriting, segment caching)
+- HLS.js README (video-dev/hls.js) — HIGH confidence for client-side HLS capabilities (ABR modes, quality switching, error recovery)
+- Streamlink documentation: https://streamlink.github.io/ — HIGH confidence for extraction patterns and plugin architecture
+- yt-dlp README — HIGH confidence for extractor-per-site pattern and format selection
+- RaceControl (robvdpol/RaceControl) README — MEDIUM confidence for F1 streaming UX expectations
+- f1viewer (SoMuchForSubtlety/f1viewer) README — MEDIUM confidence for F1 session coverage expectations
+- Threadfin README — MEDIUM confidence for IPTV/HLS proxy feature patterns
+- Telly README — LOW confidence (Plex-specific, limited relevance)
+- Eyevinn/hls-proxy README — HIGH confidence for HLS manifest manipulation patterns
+
+---
+
+*Feature research for: F1 Live Stream Aggregation Service*
+*Researched: 2026-02-23*
--- a/.planning/research/PITFALLS.md
+++ b/.planning/research/PITFALLS.md
@ -0,0 +1,291 @@
+# Pitfalls Research
+
+**Domain:** Live stream aggregation and proxy service (F1-focused)
+**Researched:** 2026-02-23
+**Confidence:** MEDIUM — findings synthesized from yt-dlp/streamlink source analysis (HIGH), nginx proxy documentation (HIGH), HLS RFC/spec analysis (HIGH), OpenF1 API docs (HIGH), and web searches that returned sparse results (LOW where noted)
+
+---
+
+## Critical Pitfalls
+
+### Pitfall 1: Treating JavaScript-Rendered Tokens as Static
+
+**What goes wrong:**
+Stream URLs on sports streaming sites are not present in raw HTML. They are computed client-side by obfuscated JavaScript — the page HTML contains an encrypted or encoded config blob, and the actual HLS URL is assembled by executing that JS. A scraper that fetches the raw HTML page and runs a regex over it finds nothing.
+
+**Why it happens:**
+Developers assume "the URL must be somewhere in the page source." They inspect the page in DevTools, see the URL in the network tab, and try to replicate what they observe — but miss that the URL was produced by JS execution, not served in the initial response.
+
+**How to avoid:**
+- For each target site: trace the actual network request that fetches the m3u8 in browser DevTools (Network tab, filter by `.m3u8`). Identify the API endpoint that the JS calls to get the signed URL. Replicate *that* API call (often a JSON endpoint), not the page fetch.
+- If the token is computed entirely client-side (e.g., via CryptoJS with a hardcoded key), implement the same algorithm in your extractor. Do not run a headless browser; reverse-engineer the JS algorithm instead.
+- Document which sites require JS execution vs. which expose a clean API endpoint. Sites often have a backend API that the JavaScript calls; scraping that API is faster and more stable than re-implementing the JS.
+
+**Warning signs:**
+- Extractor returns empty results on a page you can watch in the browser
+- Network tab shows the m3u8 URL appearing only after JavaScript fires an XHR/fetch call
+- Page HTML contains a large base64 blob or heavily obfuscated JS variable (e.g., `var _0x1a2b = [...]`)
+
+**Phase to address:**
+Extractor design phase (before writing a single extractor). Establish upfront: for each target site, determine if raw HTTP fetch is sufficient or if API reverse-engineering is required.
+
+---
+
+### Pitfall 2: m3u8 Segment URLs Break When Proxied Through a Different Domain
+
+**What goes wrong:**
+When you fetch an m3u8 playlist and serve it through your proxy, the segment URLs inside the playlist may be absolute URLs pointing to the original CDN. The browser (or HLS.js) follows those segment URLs directly, bypassing your proxy entirely. This means you cannot control access, cannot inject headers, and CORS blocks the segments if the CDN doesn't allow cross-origin requests from your frontend domain.
+
+**Why it happens:**
+HLS playlists can contain either absolute URLs (`https://cdn.example.com/seg001.ts`) or relative paths (`seg001.ts`). Most streaming CDNs use absolute URLs with signed tokens. Proxying only the m3u8 is insufficient — every segment URL must also be rewritten to route through your relay.
+
+**How to avoid:**
+- When serving the m3u8 through your proxy, **rewrite all segment URLs** to point to your relay endpoint before sending the playlist to the client. Example: replace `https://cdn.site.com/segment001.ts?token=xyz` with `https://your-relay.domain/proxy/segment?url=<encoded-original-url>`.
+- Your relay endpoint then fetches the original segment and streams it to the client.
+- Handle multi-level playlists: master playlists (variant streams) reference child playlists which reference segments — rewrite at each level.
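+
+The rewriting step can be sketched as below. It is a simplified illustration, not a complete HLS parser: `RELAY_PREFIX` is a hypothetical route modeled on the example URL above, and only plain URI lines and `URI="..."` tag attributes are handled.
+
```python
import re
from urllib.parse import quote, urljoin

# Hypothetical relay route; the real endpoint name is a project decision.
RELAY_PREFIX = "/proxy/segment?url="
_URI_ATTR = re.compile(r'URI="([^"]+)"')


def _relay(url: str, base_url: str) -> str:
    """Resolve url against the playlist URL, then wrap it in the relay route."""
    return RELAY_PREFIX + quote(urljoin(base_url, url), safe="")


def rewrite_playlist(text: str, base_url: str) -> str:
    """Rewrite every URI in an m3u8 playlist so it routes through the relay."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            out.append(line)
        elif stripped.startswith("#"):
            # Tags such as EXT-X-KEY, EXT-X-MAP, and EXT-X-MEDIA carry URI="..." attributes.
            out.append(_URI_ATTR.sub(lambda m: f'URI="{_relay(m.group(1), base_url)}"', line))
        else:
            # Any other non-blank line is a URI: a segment or a child playlist.
            out.append(_relay(stripped, base_url))
    return "\n".join(out) + "\n"
```
+
+Because child playlists are rewritten to the same relay route, the relay must apply `rewrite_playlist` again whenever the fetched upstream response is itself a playlist.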
+
+**Warning signs:**
+- Client-side CORS errors in browser console referencing the original CDN domain
+- Network tab shows segment fetches bypassing your proxy after the m3u8 loads
+- Some quality variants play but others don't (partial rewriting)
+
+**Phase to address:**
+Stream relay/proxy phase. Must be designed before the first end-to-end stream test.
+
+---
+
+### Pitfall 3: CDN-Signed Token URLs Expire Mid-Stream
+
+**What goes wrong:**
+Many CDNs sign stream URLs with a short-lived token (often 5–30 minutes). The m3u8 playlist URL itself may be signed, and so may each segment URL. A user who starts watching near the token expiry time will get a working stream for the first few segments, then receive 403 Forbidden errors as the token expires mid-playback.
+
+**Why it happens:**
+Developers test stream extraction, confirm the URL works, and ship it — without accounting for token TTL. The token was valid at extraction time but expires before or during playback. Live streams compound this: the m3u8 playlist updates every few seconds and each update may contain newly signed segment URLs. If your relay cached an old playlist, segments within it are expired.
+
+**How to avoid:**
+- Never cache m3u8 playlist files. Always fetch the live playlist from upstream on each client request. Cache only TS/m4s segments (which have longer or no expiry).
+- When extracting the initial stream URL, record the extraction timestamp and the token TTL (if discoverable from response headers like `Cache-Control: max-age=N`). Re-extract before expiry.
+- Implement a background refresh: when serving a stream, periodically re-run the extractor to get a fresh URL and pivot the relay to the new upstream without interrupting the client.
+- Test expiry by extracting a URL and waiting 30 minutes before playing — a failing test here reveals token TTL issues.
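+
+The "re-extract before expiry" rule can be sketched as a pure function. The helper names, the 20% safety margin, and the 300-second fallback TTL are assumptions for illustration; the header lookup also assumes the `Cache-Control` capitalization shown or a case-insensitive mapping.
+
```python
import re
import time
from typing import Mapping, Optional

_MAX_AGE = re.compile(r"max-age=(\d+)")


def ttl_from_headers(headers: Mapping[str, str]) -> Optional[int]:
    """Return max-age seconds from a Cache-Control header, or None if absent."""
    match = _MAX_AGE.search(headers.get("Cache-Control", ""))
    return int(match.group(1)) if match else None


def needs_refresh(extracted_at: float, ttl_seconds: Optional[int],
                  now: Optional[float] = None,
                  safety_margin: float = 0.2, default_ttl: int = 300) -> bool:
    """True once the URL has used up (1 - safety_margin) of its lifetime."""
    ttl = ttl_seconds if ttl_seconds is not None else default_ttl
    current = time.time() if now is None else now
    return current - extracted_at >= ttl * (1 - safety_margin)
```
+
+A background loop would call `needs_refresh` for each active stream and re-run the extractor when it returns `True`.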
+
+**Warning signs:**
+- Stream plays for exactly N minutes then fails with 403 on segments
+- Extracting a URL works in isolation but fails when embedded in the player after a delay
+- Playlist or segment URLs carry `?token=` or `?sig=` query parameters, which usually indicates signed, time-limited URLs (common behind Cloudflare or custom CDNs)
+
+**Phase to address:**
+Stream relay phase. The relay architecture must include a URL refresh loop from day one.
+
+---
+
+### Pitfall 4: Per-Site Extractor Maintenance Burden Is Dramatically Underestimated
+
+**What goes wrong:**
+Each target site is a custom engineering problem. Sites change their HTML structure, JavaScript obfuscation, API endpoints, or anti-bot measures without notice. A working extractor can break silently overnight. With 5 target sites, you effectively have 5 separate maintenance tracks. Expecting "set and forget" behavior is the most common planning mistake in this domain.
+
+**Why it happens:**
+Developers build an extractor, it works, and they move on. Sites then deploy a CDN update, change their frontend framework, or rotate their obfuscation keys. The failure is silent — no exception is raised, the extractor just returns no URL, and the user sees an empty player.
+
+**How to avoid:**
+- Build a health check system that runs each extractor on a schedule (every 15 minutes during race weekends, every hour otherwise), logs success/failure, and triggers alerts on failure.
+- Design extractors with failure visibility: log exactly which step failed (page fetch, URL parse, API call, etc.) so debugging is fast.
+- Keep extractor logic isolated and testable: each extractor is a module that takes no inputs and returns a stream URL or raises an exception. Run integration tests against live sites on a schedule.
+- Plan 1–2 hours of maintenance per extractor per month as baseline, more during site redesigns.
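+
+Failure visibility can be sketched with a step-tagged exception and a thin runner. `ExtractionError`, `run_extractor`, and the logger name are hypothetical; the point is that every failure names the step that broke.
+
```python
import logging
from typing import Callable, Optional

log = logging.getLogger("extractors")


class ExtractionError(Exception):
    """Raised by an extractor with the name of the step that failed."""

    def __init__(self, step: str, detail: str) -> None:
        super().__init__(f"{step}: {detail}")
        self.step = step


def run_extractor(site: str, extract: Callable[[], str]) -> Optional[str]:
    """Run one extractor; log the failing step and return None instead of raising."""
    try:
        url = extract()
    except ExtractionError as err:
        log.warning("extractor=%s failed at step=%s (%s)", site, err.step, err)
        return None
    except Exception:
        log.exception("extractor=%s crashed outside a named step", site)
        return None
    # Log only the site name on success, never the extracted URL.
    log.info("extractor=%s ok", site)
    return url
```
+
+A scheduled health check can call `run_extractor` for every site and alert on any `None` result.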
+
+**Warning signs:**
+- No automated testing of extractors against live sites
+- Extractor code tightly coupled to specific HTML element IDs or class names (breaks on any frontend change)
+- No alerting when an extractor returns no URL
+
+**Phase to address:**
+Extractor design phase. The monitoring/health-check system must be built alongside the first extractor, not added later.
+
+---
+
+### Pitfall 5: Missing or Incorrect CORS Headers on the Relay Breaks Browser Playback
+
+**What goes wrong:**
+HLS.js in the browser makes cross-origin requests to fetch m3u8 playlists and segment files. If your relay doesn't serve the correct CORS headers, every segment request fails with a CORS error. Even a single missing header (e.g., on `.ts` segment responses but not `.m3u8` responses) breaks the stream.
+
+**Why it happens:**
+Developers test the relay with `curl` or server-to-server calls, where CORS is irrelevant. The relay works in isolation but fails when the browser's HLS.js player makes the requests.
+
+**How to avoid:**
+- Set CORS headers on **all** relay endpoints: `Access-Control-Allow-Origin: *` (or your specific frontend domain), `Access-Control-Allow-Methods: GET, HEAD, OPTIONS`, `Access-Control-Allow-Headers: Range`.
+- The `Range` header is critical: HLS.js often sends range requests for segments. If `Range` is not in `Allow-Headers`, preflight OPTIONS requests fail.
+- Do not use wildcard `*` if you also send `Access-Control-Allow-Credentials: true` — that combination is invalid and browsers reject it.
+- Test from the actual browser environment (or use a CORS testing tool) before calling any relay endpoint "done."
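+
+The header rules above can be collected in one helper applied to every relay response. This is a framework-agnostic sketch with a hypothetical function name; the `Vary: Origin` line is an addition not listed above, included because caches should key on `Origin` when the allowed origin is not a wildcard.
+
```python
from typing import Dict


def cors_headers(origin: str = "*", allow_credentials: bool = False) -> Dict[str, str]:
    """Build CORS response headers for every relay endpoint (playlists and segments)."""
    if allow_credentials and origin == "*":
        # Browsers reject a wildcard origin combined with credentials.
        raise ValueError("wildcard origin cannot be combined with credentials")
    headers = {
        "Access-Control-Allow-Origin": origin,
        "Access-Control-Allow-Methods": "GET, HEAD, OPTIONS",
        "Access-Control-Allow-Headers": "Range",
    }
    if allow_credentials:
        headers["Access-Control-Allow-Credentials"] = "true"
    if origin != "*":
        headers["Vary"] = "Origin"
    return headers
```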
+
+**Warning signs:**
+- Browser console shows `No 'Access-Control-Allow-Origin' header` errors
+- Streams work when loaded directly in a `<video>` src but fail when loaded via HLS.js
+- Preflight OPTIONS requests returning 405 Method Not Allowed
+
+**Phase to address:**
+Stream relay phase, first integration test.
+
+---
+
+### Pitfall 6: IP-Based Banning From Target Streaming Sites
+
+**What goes wrong:**
+Streaming sites detect and ban server IP ranges. Your relay sends requests from a datacenter IP (K8s node) that lacks residential IP characteristics, so an extractor that works from a developer laptop (residential ISP) fails once deployed to the cluster, where the server IP is blocked or fingerprinted differently.
+
+**Why it happens:**
+Streaming sites use IP reputation databases. Cloud provider IP ranges (AWS, GCP, Hetzner, OVH, etc.) are pre-blocked or rate-limited on many streaming platforms. Your home cluster may or may not be in a residential IP range depending on your ISP.
+
+**How to avoid:**
+- Test extractors from the same environment they'll run in (the K8s cluster) before committing to a site as a target. A site that works from your laptop may be blocked from the server.
+- Use realistic HTTP headers: `User-Agent` matching a current browser, `Accept`, `Accept-Language`, `Accept-Encoding` headers that match a real browser session. Missing or mismatched headers are a primary signal.
+- Include `Referer` headers matching the expected source page. Many CDNs check that the referer is the streaming site itself before serving signed URLs.
+- Rotate request patterns if hitting the same site repeatedly: add random delays, avoid predictable polling intervals.
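+
+A sketch of the header and timing advice, using only the standard library. The `USER_AGENT` string is an example value that would need to track a current browser release, and the 30% jitter is an arbitrary assumption.
+
```python
import random
from typing import Dict, Optional

# Assumed example value; keep it in sync with a current browser release.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def browser_headers(referer: str) -> Dict[str, str]:
    """Headers resembling a real browser session, with the expected Referer."""
    return {
        "User-Agent": USER_AGENT,
        "Accept": "*/*",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate",
        "Referer": referer,
    }


def jittered_delay(base_seconds: float, jitter: float = 0.3,
                   rng: Optional[random.Random] = None) -> float:
    """Scale base_seconds by a random factor in [1 - jitter, 1 + jitter]."""
    source = rng if rng is not None else random
    return base_seconds * source.uniform(1 - jitter, 1 + jitter)
```
+
+Sleeping for `jittered_delay(60)` between polls avoids a fixed, easily fingerprinted interval.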
+
+**Warning signs:**
+- Extractor works in local testing but returns 403/429 in deployment
+- Sites return Cloudflare IUAM challenge pages (JS challenge) when scraped from the server IP
+- Response body contains "Access Denied" or "Bot Detected" rather than the expected HTML
+
+**Phase to address:**
+Extractor design phase. Test against the production network before finalizing site targets.
+
+---
+
+## Technical Debt Patterns
+
+Shortcuts that seem reasonable but create long-term problems.
+
+| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
+|----------|-------------------|----------------|-----------------|
+| Hardcode stream URL for the first race | Ship fast | Breaks immediately; no automation | Never — defeats the purpose |
+| Run a regex or `BeautifulSoup` selectors over the entire page HTML | Simple implementation | Breaks on any frontend change; misses JS-rendered content | Only for static HTML pages with predictable structure |
+| Cache m3u8 playlists at the relay | Reduce upstream requests | Serves expired segment URLs; stream breaks mid-playback | Never for live content |
+| Single-threaded sequential extractor polling | Simple code | Can't handle concurrent users fetching different streams | Only in MVP with a single stream |
+| Skip extractor health checks for MVP | Faster to ship | Silent failures, no visibility into broken extractors | Only if you have a single stream and check it manually |
+| Proxy all segments (relay mode) without segment caching | Correct behavior | High bandwidth usage; each viewer multiplies bandwidth | Only at low viewer count (< 5) |
+| Use headless browser for all extractors | Handles all JS | Slow (3–10s per extraction), high memory, complex ops | Fallback for sites that truly cannot be handled otherwise |
+
+---
+
+## Integration Gotchas
+
+Common mistakes when connecting to external services.
+
+| Integration | Common Mistake | Correct Approach |
+|-------------|----------------|------------------|
+| Target streaming sites | Fetch the HTML page and regex for the stream URL | Identify the actual API endpoint the site JS calls; hit that directly |
+| Target streaming sites | Ignore cookies and session state | Maintain a cookie jar per site; some sites require a session cookie from the homepage before serving stream API |
+| Target streaming sites | Send requests without `Referer` header | Always set `Referer` to the page that would normally contain the player |
+| CDN segment URLs | Use the same signed URL for the full stream duration | Re-fetch the m3u8 on each playlist poll to get freshly signed segment URLs |
+| OpenF1 API (race schedule) | Call live-data endpoints during a session on the free tier | Free tier allows only historical data; live data costs €9.90/month — use F1 calendar static JSON for schedule |
+| OpenF1 API | Assume the API is official F1 data | OpenF1 is a third-party fan project, not affiliated with F1; data may lag or be incorrect |
+| Ergast API | Expect stable availability in 2026 | Ergast deprecated their API in late 2024; use OpenF1, the Ergast-compatible Jolpica F1 API, or the unofficial `api.formula1.com` instead |
+| HLS.js player | Load the proxied m3u8 URL directly without error handler | Always attach `hls.on(Hls.Events.ERROR, ...)` with media error recovery; live streams have transient failures |
+| HLS.js player | Assume autoplay works | Browser autoplay policies block unmuted video. Always mute by default or show a play button |
+
+---
+
+## Performance Traps
+
+Patterns that work at small scale but fail as usage grows.
+
+| Trap | Symptoms | Prevention | When It Breaks |
+|------|----------|------------|----------------|
+| Relay proxies all segment bytes | Works for 1 viewer; saturates uplink for 5+ | Serve the rewritten m3u8 but let clients fetch segments directly from CDN (only proxy the playlist) | > 3 concurrent viewers on a typical F1 stream (5–8 Mbps per viewer) |
+| Polling all extractors every minute | Works with 1–2 sites; CPU/memory spike at race time | Poll only during race windows; use event-driven triggers from the schedule | Always — race starts matter, not constant polling |
+| Synchronous extractor execution blocks the API response | First request takes 5–10s while extractor runs | Pre-warm extractors before the race start time; cache last-known working URL | First user to request a stream before pre-warming |
+| No connection pooling to upstream CDNs | High segment fetch latency | Reuse HTTP connections with keep-alive | > 10 segments/second through the relay |
+| Storing stream session state in memory (in-process) | Works on one pod, but state is lost on pod restart and the user's stream breaks | Keep session state in an external store such as the existing cluster Redis | Any Kubernetes pod restart or rolling deployment |
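+
+A minimal sketch of the race-window gating suggested in the polling row above, assuming session start times are already available as timezone-aware `datetime` values. The names `should_poll`, `LEAD`, and `TAIL`, and the window sizes, are illustrative assumptions rather than part of the project:
+
+```python
+from datetime import datetime, timedelta, timezone
+
+# Hypothetical window sizes; tune them to how early streams tend to appear.
+LEAD = timedelta(minutes=30)
+TAIL = timedelta(hours=3)
+
+
+def should_poll(now, session_starts, lead=LEAD, tail=TAIL):
+    """Return True if `now` falls inside the polling window of any session."""
+    return any(start - lead <= now <= start + tail for start in session_starts)
+
+
+# Illustrative start time, not taken from a real calendar.
+race = datetime(2026, 3, 8, 4, 0, tzinfo=timezone.utc)
+print(should_poll(race - timedelta(minutes=10), [race]))  # inside the lead window
+print(should_poll(race - timedelta(hours=6), [race]))     # well before the session
+```
+
+A scheduler job could call a check like this on every tick and return early outside the window, so extractors stay idle between race weekends.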
+
+---
+
+## Security Mistakes
+
+Domain-specific security issues beyond general web security.
+
+| Mistake | Risk | Prevention |
+|---------|------|------------|
+| Exposing the raw upstream CDN URL in API response | Users bypass your relay; sites can track and block the raw URL if scraped | Keep upstream URLs server-side only; serve an opaque relay URL to the client |
+| Open relay endpoint with no auth | Your relay becomes a public proxy for any content, burning bandwidth and attracting abuse | Require at minimum a shared secret or same-origin check; this is private infrastructure |
+| Logging full signed CDN URLs | Signed URLs in logs = anyone with log access can watch the stream | Log only the site name and stream quality, not the signed URL |
+| Storing site credentials (if target site requires login) in source code | Credentials rotate or get revoked; leaked credentials cause account bans | Use environment variables / Kubernetes secrets; never commit credentials |
+| No rate limiting on the relay API | A single misbehaving client can exhaust bandwidth | Add rate limiting per IP on the `/proxy/` endpoints |
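+
+The first and third rows can be sketched together, under stated assumptions: `redact` is a hypothetical log helper that keeps only scheme and host (a variant for when the host is useful for debugging; logging only the site name is stricter), and `opaque_id` is a hypothetical way to derive the identifier handed to clients in place of the upstream URL. `SECRET` and the example URL are placeholders:
+
+```python
+import hashlib
+import hmac
+from urllib.parse import urlsplit
+
+SECRET = b"change-me"  # placeholder; load from a Kubernetes secret in practice
+
+
+def redact(url):
+    """Keep scheme and host for logs; drop path and query, where signatures live."""
+    parts = urlsplit(url)
+    return f"{parts.scheme}://{parts.netloc}/<redacted>"
+
+
+def opaque_id(upstream_url):
+    """Derive a stable, non-reversible ID to expose instead of the raw URL."""
+    digest = hmac.new(SECRET, upstream_url.encode(), hashlib.sha256).hexdigest()
+    return digest[:32]
+
+
+url = "https://cdn.example.com/live/f1/index.m3u8?token=abc123&expires=1760000000"
+print(redact(url))
+print(opaque_id(url))
+```
+
+The server would still need a lookup from the opaque ID back to the upstream URL (for example in Redis), since the HMAC cannot be reversed.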
+
+---
+
+## UX Pitfalls
+
+Common user experience mistakes in this domain.
+
+| Pitfall | User Impact | Better Approach |
+|---------|-------------|-----------------|
+| Showing all streams including dead/offline ones | User clicks a stream, gets a black player; no indication of why | Pre-validate streams before the race and tag each as "live", "offline", or "extracting"; surface status in the stream picker |
+| Player starts with audio on (ignoring the browser autoplay policy) | Browser blocks autoplay; user sees a broken play state | Start muted by default; show a prominent unmute button |
+| No stream quality selector | Users on slow connections buffer constantly | Expose the HLS quality levels via HLS.js API; let users pick |
+| Race schedule shows times in UTC only | Users outside UTC miss sessions | Detect browser timezone and display in local time; let users configure their timezone |
+| Stream picker has no quality or language indicator | User has to try each stream to find the best one | Label streams with: source site, resolution (1080p/720p/480p), language, and status |
+| No loading state feedback during extraction | User sees blank screen for 5–10 seconds during extractor run | Show a "Finding stream..." spinner with a progress indicator |
+
+---
+
+## "Looks Done But Isn't" Checklist
+
+Things that appear complete but are missing critical pieces.
+
+- [ ] **Extractor for site X works:** Verify it works from the production K8s network (not just localhost) and handles the case where the site returns a challenge page
+- [ ] **Stream proxy works:** Verify with HLS.js in a browser, not just with curl — CORS errors only appear in browser context
+- [ ] **m3u8 rewriting works:** Verify that multi-level playlists (master → variant → segments) are all rewritten, not just the top-level m3u8
+- [ ] **Token expiry handled:** Wait 30 minutes after extracting a URL, then try to play it — test that the refresh mechanism kicks in
+- [ ] **Race schedule is accurate:** Verify timezone handling: F1 races take place in different countries, and local session times shift with DST changes mid-season
+- [ ] **Relay is actually private:** Confirm the relay endpoints are not publicly accessible without auth — check via Traefik ingress rules
+- [ ] **Extractor monitoring alerts:** Trigger an extractor failure manually and verify an alert fires before the next race
+
+---
+
+## Recovery Strategies
+
+When pitfalls occur despite prevention, how to recover.
+
+| Pitfall | Recovery Cost | Recovery Steps |
+|---------|---------------|----------------|
+| Extractor breaks because site changed its JS | MEDIUM | Open browser DevTools on the site, re-trace the API call sequence, update the extractor; typical fix time 30–90 minutes per site |
+| CDN-signed URL expires mid-stream | LOW | Restart the stream extraction (background refresh handles this if implemented); user may need to re-click play |
+| Relay bandwidth saturated | MEDIUM | Switch relay strategy: serve rewritten m3u8 but redirect segment fetches directly to CDN (remove relay from segment path) |
+| Site IP-bans the cluster | HIGH | Either accept the site as unavailable or route extractor requests through a residential proxy/VPN exit; may require re-evaluating the site as a target |
+| OpenF1 API unavailable | LOW | Fall back to F1 calendar static JSON; race schedule data changes infrequently so a cached fallback is safe |
+| HLS.js CORS errors in browser | LOW | Add missing `Access-Control-Allow-Origin` and `Access-Control-Allow-Headers: Range` to relay responses; deploy fix |
+
+---
+
+## Pitfall-to-Phase Mapping
+
+How roadmap phases should address these pitfalls.
+
+| Pitfall | Prevention Phase | Verification |
+|---------|------------------|--------------|
+| JS-rendered tokens not found in HTML | Extractor design (Phase 1) | Each extractor spec documents: "token source = API endpoint or JS algorithm" |
+| m3u8 segment URLs bypass proxy | Stream relay (Phase 2) | End-to-end browser test: open Network tab and confirm zero requests go to original CDN domain |
+| CDN token expiry mid-stream | Stream relay (Phase 2) | Play stream for 45 minutes; verify no 403 errors on segments |
+| Extractor maintenance burden | Extractor design + monitoring (Phase 1 + Phase 3) | Health check system alerts fire within 5 minutes of extractor failure |
+| Missing CORS on relay | Stream relay (Phase 2) | Browser-based smoke test with CORS error detection |
+| IP-based banning | Extractor design (Phase 1) | Test all extractors from production network before finalizing site list |
+| Silent extractor failures | Monitoring (Phase 3) | Inject a deliberate failure; verify alert reaches notification channel |
+| Race schedule timezone errors | Schedule integration (Phase 1) | Test with browser timezone set to UTC+11 (Australia) and UTC-5 (Americas) |
+| Open relay as public proxy | Infrastructure (Phase 2) | Verify relay endpoint returns 401/403 without auth from an external network |
+
+---
+
+## Sources
+
+- yt-dlp `common.py` extractor base class source (HIGH confidence — production extractor framework): https://github.com/yt-dlp/yt-dlp/blob/master/yt_dlp/extractor/common.py
+- yt-dlp contribution guidelines and extractor authoring notes (HIGH confidence): https://github.com/yt-dlp/yt-dlp/blob/master/CONTRIBUTING.md
+- RFC 8216 — HTTP Live Streaming specification, segment URL and playlist requirements (HIGH confidence): https://datatracker.ietf.org/doc/html/rfc8216
+- nginx `ngx_http_proxy_module` documentation — proxy buffering, URL rewriting, timeout configuration (HIGH confidence): https://nginx.org/en/docs/http/ngx_http_proxy_module.html
+- OpenF1 API documentation — rate limits, live vs. historical data, usage rights (HIGH confidence): https://openf1.org/
+- HLS.js API documentation — initialization order, error handling, quality level management (HIGH confidence): https://github.com/video-dev/hls.js/blob/master/docs/API.md
+- MDN CORS documentation — credential restrictions, preflight requirements, header rules (HIGH confidence): https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS
+- F1 calendar site (f1calendar.com) — timezone complexity observations, session structure (MEDIUM confidence)
+- Web search findings on streaming site anti-scraping techniques (LOW confidence — search returned sparse results for this specific domain)
+
+---
+*Pitfalls research for: F1 live stream aggregation and proxy service*
+*Researched: 2026-02-23*
--- a/.planning/research/STACK.md
+++ b/.planning/research/STACK.md
@@ -0,0 +1,172 @@
+# Stack Research
+
+**Domain:** Live stream aggregation and proxy service (F1 streams)
+**Researched:** 2026-02-23
+**Confidence:** HIGH (versions verified against PyPI, GitHub releases, and official docs)
+
+---
+
+## Recommended Stack
+
+### Core Technologies
+
+| Technology | Version | Purpose | Why Recommended |
+|------------|---------|---------|-----------------|
+| Python | 3.13 (image: `python:3.13-slim-bookworm`) | Backend runtime | Stable and supported by every library listed below. The JIT in 3.13 is experimental and disabled in default builds, so do not count on it to speed up m3u8 rewriting. `python:3.13-slim-bookworm` is the official minimal production image. `asyncio` covers concurrent stream proxying. |
+| FastAPI | 0.132.0 | HTTP API + stream relay | Async-first ASGI framework with native streaming response support (`StreamingResponse`). Best-in-class for HTTP proxy patterns where you relay chunked data. Built-in OpenAPI docs. Pydantic integration. |
+| yt-dlp | 2026.2.21 | Stream URL extraction | De-facto standard for extracting final HLS URLs from obfuscated sites. Supports 1000+ extractors and handles redirect chains and cookies. It does not run a full browser, so pages that need real JS rendering fall to Playwright. Used as a Python library (`yt_dlp.YoutubeDL`), not just CLI. Updated continuously, which is critical for staying current with sites that change obfuscation. |
+| Playwright (Python) | 1.58.0 | JS-rendered scraping | Required for sites that serve stream links only after JavaScript execution. `playwright.async_api` integrates naturally with FastAPI's async event loop. Use only when yt-dlp extractors don't cover the target site — Playwright is the fallback for custom per-site extractors. Chromium headless. |
+| Svelte 5 + SvelteKit 2 | Svelte 5.53.3 / SvelteKit 2.53.0 | Frontend | Matches user's stated preference. Svelte 5's runes reactivity model is stable and production-ready. SvelteKit 2 provides SSR + SPA routing. Minimal bundle size matters for embedded player page. |
+| hls.js | 1.6.15 | In-browser HLS playback | Native `<video>` does not play HLS in most non-Safari browsers. hls.js is a widely used, production-grade MSE-based HLS player. Handles adaptive bitrate switching. Integrates into Svelte via `onMount`. |
+| FastF1 | 3.8.1 | F1 race schedule data | Wraps jolpica-f1 API (Ergast-compatible) with built-in caching, pandas DataFrames, and Python-native models. Direct replacement for the deprecated Ergast API. Provides race schedule, session times, round numbers. |
+| Redis | 7.x (existing cluster: `redis.redis.svc.cluster.local`) | Extracted URL caching, dedup | Extracted HLS URLs are expensive (JS render + redirect chain). Cache them with TTL (15–30 min). Existing Redis in cluster — zero additional infra. Use `redis-py` (async: `redis.asyncio`). |
+| APScheduler | 3.11.2 | Schedule polling cron | Runs the F1 schedule sync job (daily) and triggers scrape jobs before sessions. AsyncIOScheduler integrates with FastAPI lifespan. Avoids needing a separate Celery worker for simple periodic tasks at this scale. |
+
+### Supporting Libraries
+
+| Library | Version | Purpose | When to Use |
+|---------|---------|---------|-------------|
+| httpx | 0.28.1 | Async HTTP client for scraping | Use for fetching pages and following redirect chains without JS. Preferred over `requests` in async context (requests is sync-only). Supports HTTP/2, connection pooling, cookie jars. |
+| BeautifulSoup4 | 4.14.3 | HTML parsing for link extraction | Parse scraped HTML to find stream link candidates before handing to yt-dlp. Use with `lxml` parser (faster). Only needed for custom extractors; yt-dlp handles most internally. |
+| m3u8 | 6.0.0 | HLS playlist parsing and rewriting | Parse master playlists to rewrite segment URLs through the proxy. Required if you relay streams (rewrite absolute URLs → proxy URLs). |
+| Pydantic | 2.12.5 | Data models and validation | FastAPI already depends on it. Use for `StreamSource`, `RaceEvent`, `ExtractorResult` models. Pydantic v2 is significantly faster than v1 (Rust-backed). |
+| aiosqlite | 0.22.1 | Lightweight persistent store | Store race schedule cache and scrape job state. Single pod → no concurrency issues with SQLite. Use for data that outlives Redis TTL (schedule, extractor configs). |
+| streamlink | 8.2.0 | Secondary stream extractor | Plugin-based extractor as an alternative/complement to yt-dlp. Has different coverage. Use as fallback when yt-dlp lacks an extractor for a specific site. Can also pipe stream bytes directly. |
+| python-multipart | latest | Form data parsing | Required by FastAPI for any form-based endpoints. |
+| uvicorn | latest | ASGI server | FastAPI production server. Use `uvicorn[standard]` (includes uvloop + httptools for higher throughput). |
+| redis-py | 5.x | Redis async client | `redis.asyncio.from_url(...)` — async Redis client. Part of the `redis` package. |
+| Tailwind CSS | 4.2.1 | Frontend styling | v4 is stable. Zero-runtime CSS. Works natively with SvelteKit via Vite plugin. |
+
+### Development Tools
+
+| Tool | Purpose | Notes |
+|------|---------|-------|
+| Vite 8 | Frontend build | SvelteKit 2.53.0 ships with Vite 8. No separate install needed. |
+| pytest + pytest-asyncio | Backend tests | Test extractors and stream proxy logic. `pytest-asyncio` for async route tests. |
+| ruff | Python linting + formatting | Replaces flake8 + black + isort. Single fast binary. |
+| mypy | Type checking | FastAPI + Pydantic 2 provide good type inference. Catches extractor return type bugs early. |
+| Playwright test runner | Extractor integration tests | Use `playwright codegen` to record site interactions for new extractors. |
+
+---
+
+## Installation
+
+```bash
+# Backend (Python 3.13)
+pip install fastapi==0.132.0 "uvicorn[standard]" httpx==0.28.1
+pip install yt-dlp==2026.2.21 playwright==1.58.0 streamlink==8.2.0
+pip install fastf1==3.8.1 apscheduler==3.11.2
+pip install beautifulsoup4==4.14.3 lxml m3u8==6.0.0
+# Pin redis-py to the 5.x series with a version range
+pip install pydantic==2.12.5 aiosqlite==0.22.1 "redis>=5,<6"
+pip install python-multipart
+
+# Install Playwright browser binaries (in Dockerfile)
+playwright install chromium --with-deps
+
+# Frontend (Node/SvelteKit)
+npx sv create frontend
+npm install -D tailwindcss @tailwindcss/vite
+npm install hls.js
+```
+
+---
+
+## Alternatives Considered
+
+| Recommended | Alternative | When to Use Alternative |
+|-------------|-------------|-------------------------|
+| FastAPI + Python | Go + gin/fiber | If throughput > 10k concurrent streams. Go is more efficient at raw TCP proxying, but Python's yt-dlp/playwright ecosystem has no equivalent in Go — would require subprocess shelling out. Python wins for this use case. |
+| yt-dlp (library) | yt-dlp (subprocess) | Never subprocess — library mode gives direct access to extractor info_dict, format selection, and cookie jars without shell overhead. |
+| APScheduler | Celery | Celery requires a separate worker process + broker queue. APScheduler runs in-process. Overkill for 2 periodic jobs (schedule sync + pre-session scrape trigger). Use Celery only if scrape jobs need distributed execution across many workers. |
+| hls.js | Video.js + @videojs/http-streaming | Video.js (v8.23.4) bundles its own HLS engine (VHS, `@videojs/http-streaming`) rather than hls.js, and adds significant bundle overhead. Use hls.js directly in Svelte for minimal footprint and full control. |
+| Redis (existing) | In-memory dict cache | In-memory cache is lost on pod restart (K8s). Redis provides persistence across restarts + shared state if you ever run multiple replicas. Already in cluster at no cost. |
+| SvelteKit | Next.js / Nuxt | User preference is Svelte. SvelteKit's adapter-node works well in Docker/K8s. Bundle size advantage matters for embedded player page load time. |
+| FastF1 | Direct Jolpica API calls | FastF1 adds built-in caching, retry logic, and Python models. Saves implementing Ergast-compatible parsing from scratch. jolpica-f1 API endpoint: `https://api.jolpi.ca/ergast/f1/<season>/races/` |
+| Playwright (Python) | Selenium | Playwright is faster, has async API, and Chromium headless is more stable than Selenium's ChromeDriver setup. Playwright's `page.route()` lets you intercept XHR/fetch calls to capture stream URLs without loading full pages. |
+| aiosqlite | PostgreSQL | PostgreSQL is overkill for a single-node persistent schedule cache. SQLite with aiosqlite is fine for read-heavy, write-rare schedule data as long as only one pod opens the file; SQLite file locking is unreliable on NFS, so avoid concurrent writers there. If you need multi-pod writes, migrate to PostgreSQL. |
+
+---
+
+## What NOT to Use
+
+| Avoid | Why | Use Instead |
+|-------|-----|-------------|
+| `requests` library | Synchronous only; blocks the event loop in async FastAPI handlers, causing stream proxy delays | `httpx` (async client with a broadly `requests`-compatible API) |
+| `youtube-dl` | No tagged release since 2021. yt-dlp is the actively maintained fork with more extractors and frequent updates, and site changes regularly break youtube-dl. | `yt-dlp` |
+| Selenium | Heavy (~500MB browser + driver), poor async support, ChromeDriver version pinning causes constant breakage in K8s | `playwright` (native async, auto-manages browser binaries) |
+| Django | Sync-first ORM, not designed for long-lived streaming HTTP connections. FastAPI handles `StreamingResponse` for chunked relay natively. | `FastAPI` |
+| FFmpeg-based re-encoding | Introduces CPU-intensive transcode step, adds latency, and is unnecessary — HLS segments are already in a playable format. Proxy segments as-is. | Direct HLS segment relay (fetch + stream through) |
+| Tornado | Was the async Python standard before asyncio. Replaced by asyncio-native frameworks. Less ecosystem support. | `FastAPI` + `uvicorn` |
+| Celery for simple scheduling | Requires Redis as broker + a separate worker pod. Two extra moving parts for tasks that can run in-process. | `APScheduler` (AsyncIOScheduler in FastAPI lifespan) |
+| `m3u8` for full stream relay | Segment-by-segment proxying via Python is high-latency (Python overhead per segment fetch). Use HTTP redirect for public segments; only rewrite the playlist if segments require auth cookies. | Redirect to original segments where possible; only proxy if cookies are required |
+
+---
+
+## Stack Patterns by Variant
+
+**If a site requires only cookie passing (no JS rendering):**
+- Use `httpx` with cookie jar from initial login request
+- Extract HLS URL from HTML with BeautifulSoup4 or regex
+- No Playwright needed; much faster startup
+
+**If a site requires JavaScript execution to reveal the stream URL:**
+- Use Playwright async API: `page.route()` to intercept XHR requests containing `.m3u8` URLs
+- Use `page.evaluate()` to extract obfuscated CSRF tokens from JS context
+- Cache the result in Redis with 20-minute TTL
+
+**If the HLS segments require authentication cookies:**
+- Use m3u8 library to rewrite segment URLs → point to proxy endpoint
+- Proxy endpoint fetches segments with stored cookies and streams back bytes
+- Required when segments have signed URLs or cookie gates
+
+**If yt-dlp has an extractor for the site:**
+- Use `yt_dlp.YoutubeDL(opts).extract_info(url, download=False)`; it returns an info dict whose `formats` list contains the direct HLS URLs
+- Much simpler than custom Playwright extractor; always try yt-dlp first
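+
+As a rough illustration of the yt-dlp path, the sketch below picks the highest-resolution HLS entry out of an info dict. It assumes the dict has the shape yt-dlp returns from `extract_info(url, download=False)`, using the `formats`, `protocol`, `height`, and `url` fields; `best_hls_url` is a hypothetical helper, and `sample` is hand-written stand-in data rather than real extractor output:
+
+```python
+def best_hls_url(info):
+    """Return the URL of the highest-resolution HLS format, or None."""
+    hls = [
+        f for f in info.get("formats") or []
+        if str(f.get("protocol", "")).startswith("m3u8") and f.get("url")
+    ]
+    if not hls:
+        return None
+    # Formats without a known height sort last.
+    return max(hls, key=lambda f: f.get("height") or 0)["url"]
+
+
+# Hand-written stand-in for an info dict; not output from a real site.
+sample = {
+    "formats": [
+        {"protocol": "m3u8_native", "height": 720,
+         "url": "https://cdn.example.com/720.m3u8"},
+        {"protocol": "m3u8_native", "height": 1080,
+         "url": "https://cdn.example.com/1080.m3u8"},
+        {"protocol": "https", "height": 1080,
+         "url": "https://cdn.example.com/video.mp4"},
+    ]
+}
+print(best_hls_url(sample))  # https://cdn.example.com/1080.m3u8
+```
+
+Keeping the selection logic as a pure function over the dict makes it testable without network access or a working extractor.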
+
+---
+
+## Version Compatibility
+
+| Package | Compatible With | Notes |
+|---------|-----------------|-------|
+| FastAPI 0.132.0 | Pydantic 2.x | FastAPI added Pydantic v2 support in 0.100. Use Pydantic v2 throughout and do not mix in v1 models. |
+| SvelteKit 2.53.0 | Vite 8.x | SvelteKit 2.53.0 explicitly added Vite 8 support (Feb 2026). |
+| yt-dlp 2026.2.21 | Python 3.9+ | yt-dlp follows CalVer; latest version works with Python 3.13. |
+| Playwright 1.58.0 | Python 3.9+ | Requires `playwright install chromium` in Dockerfile. Chromium 145.0.7632.6. |
+| APScheduler 3.11.2 | Python 3.8+ | Use `AsyncIOScheduler` with `asyncio` event loop. APScheduler 4.x is in beta — stick with 3.x. |
+| FastF1 3.8.1 | Python 3.10+ | Requires Python 3.10+. Uses jolpica-f1 API under the hood (Ergast-compatible). |
+| hls.js 1.6.15 | Modern browsers | MSE required (Chrome, Firefox, Edge). No iOS Safari (use native HLS). |
+| redis-py 5.x | Redis 7.x | `redis.asyncio` for async usage. Cluster in infra already runs Redis 7.x. |
+| Tailwind CSS 4.2.1 | Vite 8 | Tailwind v4 uses Vite plugin (`@tailwindcss/vite`), not PostCSS config. Breaking change from v3. |
+
+---
+
+## Sources
+
+- PyPI: yt-dlp 2026.2.21 — https://pypi.org/project/yt-dlp/ (HIGH confidence, verified Feb 2026)
+- PyPI: FastAPI 0.132.0 — https://pypi.org/project/fastapi/ (HIGH confidence, verified Feb 2026)
+- PyPI: Playwright 1.58.0 — https://pypi.org/project/playwright/ (HIGH confidence, verified Feb 2026)
+- PyPI: httpx 0.28.1 — https://pypi.org/project/httpx/ (HIGH confidence, verified Feb 2026)
+- PyPI: APScheduler 3.11.2 — https://pypi.org/project/APScheduler/ (HIGH confidence, verified Feb 2026)
+- PyPI: FastF1 3.8.1 — https://pypi.org/project/fastf1/ (HIGH confidence, verified Feb 2026)
+- PyPI: Pydantic 2.12.5 — https://pypi.org/project/pydantic/ (HIGH confidence, verified Feb 2026)
+- PyPI: BeautifulSoup4 4.14.3 — https://pypi.org/project/beautifulsoup4/ (HIGH confidence, verified Feb 2026)
+- PyPI: aiohttp 3.13.3 — https://pypi.org/project/aiohttp/ (HIGH confidence, verified Feb 2026)
+- PyPI: aiosqlite 0.22.1 — https://pypi.org/project/aiosqlite/ (HIGH confidence, verified Feb 2026)
+- PyPI: streamlink 8.2.0 — https://pypi.org/project/streamlink/ (HIGH confidence, verified Feb 2026)
+- PyPI: m3u8 6.0.0 — https://pypi.org/project/m3u8/ (HIGH confidence, verified Feb 2026)
+- PyPI: requests 2.32.5 — https://pypi.org/project/requests/ (HIGH confidence, verified Aug 2025)
+- GitHub: svelte 5.53.3 — https://github.com/sveltejs/svelte/releases (HIGH confidence, verified Feb 2026)
+- GitHub: SvelteKit 2.53.0 — https://github.com/sveltejs/kit/releases (HIGH confidence, verified Feb 2026)
+- GitHub: hls.js 1.6.15 — https://github.com/video-dev/hls.js/releases (HIGH confidence, verified)
+- GitHub: Tailwind CSS 4.2.1 — https://github.com/tailwindlabs/tailwindcss/releases (HIGH confidence, verified Feb 2026)
+- GitHub: Traefik v3.6.9 — https://github.com/traefik/traefik/releases (HIGH confidence, verified Feb 2026)
+- GitHub: jolpica-f1 — https://github.com/jolpica/jolpica-f1 (MEDIUM confidence — Ergast API replacement, community-maintained)
+- GitHub: Python Docker images — https://github.com/docker-library/python/blob/master/versions.json (HIGH confidence — python:3.13-slim-bookworm, Python 3.13.12)
+- GitHub: Video.js v8.23.4 — https://github.com/videojs/video.js/releases (HIGH confidence, confirmed hls.js is preferred for direct integration)
+- PyPI: Celery 5.6.2 — https://pypi.org/project/celery/ (HIGH confidence — confirmed as overkill vs APScheduler for this use case)
+
+---
+
+*Stack research for: F1 stream aggregation and proxy service*
+*Researched: 2026-02-23*
--- a/.planning/research/SUMMARY.md
+++ b/.planning/research/SUMMARY.md
@@ -0,0 +1,244 @@
+# Project Research Summary
+
+**Project:** F1 Live Stream Aggregation and Proxy Service
+**Domain:** Live stream aggregation, HLS proxy, sports scheduling
+**Researched:** 2026-02-23
+**Confidence:** MEDIUM (stack HIGH, architecture MEDIUM, features MEDIUM, pitfalls MEDIUM)
+
+## Executive Summary
+
+This project builds a self-hosted web service that aggregates live F1 streams from unofficial streaming sites, proxies them through the service to handle CORS and authentication, and presents them via an embedded HLS player with an F1 race schedule. The recommended approach is a Python/FastAPI backend (async, streaming-capable) paired with a Svelte 5/SvelteKit 2 frontend. The backend has four distinct responsibilities that must be built in dependency order: schedule data retrieval, per-site stream extraction, stream health checking, and HLS proxy/relay. Each component is independently testable and the architecture enforces clean separation so that one broken extractor cannot affect the rest of the system.
+
+The core novelty of this product — and its hardest engineering challenge — is the per-site extractor subsystem. Each target streaming site uses custom anti-scraping measures (JS-rendered tokens, signed CDN URLs, IP-based blocking) that require a custom extractor per site, maintained independently. Existing tools (streamlink, yt-dlp) provide extraction patterns but not out-of-the-box support for private F1 streaming aggregators. The recommended approach treats yt-dlp as a first-pass extractor where it has coverage, and uses httpx + BeautifulSoup for custom extractors, with Playwright as a fallback only when JS execution is strictly required.
+
+The primary risks are: (1) extractor brittleness — sites change without notice and extractors silently fail, requiring a health-check monitoring loop from day one; (2) CDN-signed URL expiry mid-stream, requiring the proxy to never cache m3u8 playlists and to implement background URL refresh; (3) IP-based blocking from the K8s cluster — all extractors must be tested from production network before finalizing site targets. These risks are all addressable through upfront architectural decisions rather than retrofitting.
+
+---
+
+## Key Findings
+
+### Recommended Stack
+
+The backend runs Python 3.13 on FastAPI 0.132.0 with uvicorn, using async throughout. yt-dlp 2026.2.21 is the primary extractor library (used as a Python library, not CLI). Playwright 1.58.0 (async Chromium) is the fallback for JS-rendered pages. httpx handles async HTTP for custom extractors. FastF1 3.8.1 provides the F1 race schedule via the Ergast-compatible jolpica API. APScheduler 3.11.2 runs periodic jobs (schedule refresh, extraction triggers) in-process without a separate worker. The existing cluster Redis is used for URL caching with TTL. SQLite via aiosqlite persists schedule snapshots to survive pod restarts.
+
+The frontend is Svelte 5.53.3 / SvelteKit 2.53.0 (user preference, also well-suited for minimal bundle size). hls.js 1.6.15 handles in-browser HLS playback via MSE. Tailwind CSS 4.2.1 provides styling via the Vite plugin (not PostCSS — breaking change from v3). All infrastructure deploys as a single Terragrunt stack following the existing repo pattern.
+
+**Core technologies:**
+- **Python 3.13 + FastAPI 0.132.0**: Async-first, StreamingResponse for HLS relay, Pydantic models
+- **yt-dlp 2026.2.21**: Primary stream extraction library — 1000+ extractors, Python library mode
+- **Playwright 1.58.0**: JS-rendered page fallback — async API, `page.route()` for XHR interception
+- **httpx 0.28.1**: Async HTTP for custom extractors and redirect chain following
+- **FastF1 3.8.1**: F1 race schedule via jolpica API with built-in caching
+- **APScheduler 3.11.2**: In-process async scheduler — avoids Celery overhead for 2 periodic jobs
+- **hls.js 1.6.15**: Browser HLS playback via MSE — mandatory on non-Safari
+- **Svelte 5 + SvelteKit 2**: Frontend framework (user preference, minimal bundle size)
+- **Redis (existing cluster)**: Extracted URL cache with TTL — no additional infra cost
+- **aiosqlite 0.22.1**: Schedule persistence to NFS — survives pod restarts
+
+**Do not use:** `requests` (sync, blocks event loop), `youtube-dl` (unmaintained), Selenium (poor async/K8s support), FFmpeg re-encoding (unnecessary latency), Celery (overkill for 2 jobs).
+
+### Expected Features
+
+Research confirms this product's novelty: no existing tool combines automated extraction from unofficial sources + browser-native proxied playback + schedule integration in a single web service.
+
+**Must have (table stakes):**
+- F1 schedule view — show all session types (FP, Quali, Sprint, Race) with live/upcoming/finished indicator
+- CORS-transparent HLS proxy — mandatory architectural requirement; streams cannot play in browser without it
+- Per-site stream extractor — at least one working extractor proves the end-to-end pipeline
+- Stream health checker — validates URLs before display; dead streams must not surface
+- Stream picker — list available working streams, user clicks to load player
+- Embedded HLS player — hls.js in Svelte, plays proxied m3u8 in-page
+- Session countdown timer — client-side, zero backend cost
+- Live session indicator — visual LIVE/UPCOMING/FINISHED badge
+
+**Should have (add after MVP validation):**
+- Stream auto-refresh — re-extract every 5-10 min during live sessions
+- Fallback stream ordering — health-check + reliability history drives ordering
+- Source labeling in picker — show site name with each stream
+- Race weekend overview — all sessions grouped per Grand Prix
+- Additional site extractors — expand coverage once first extractor is stable
+
+**Defer (v2+):**
+- Pre/post show and press conference coverage — complex site-specific session detection
+- Multiple quality tiers — only if sources actually provide multi-variant playlists
+- Proxy segment prefetch — high memory cost; only if buffering complaints emerge at scale
+- Session reputation annotations — UX polish, not launch-critical
+
+**Explicit anti-features (do not build):** DVR/recording, chat, user accounts, stream transcoding, DRM support, telemetry overlay.
+
+### Architecture Approach
+
+The system has five clearly bounded layers: (1) Schedule Subsystem — polls jolpica/OpenF1 API, stores to NFS; (2) Extractor Layer — plugin-per-site pattern with a registry dispatcher, concurrent fan-out execution; (3) Health Checker — validates extracted URLs via partial GET, stores liveness state in cache; (4) Proxy/Relay Layer — rewrites m3u8 URIs at all levels (master → variant → segments) through `/relay`; (5) Svelte Frontend — schedule view, stream picker, hls.js player. All state flows from extractors through cache to the API; the frontend never triggers extraction directly.
+
+**Major components:**
+1. **Extractor Registry** — maps site-key to extractor class; fan-out concurrent dispatch; one file per site
+2. **Playlist Rewriter** — fetches upstream m3u8, rewrites all URIs to point through `/relay`; stateless
+3. **Segment Relay** — pipes upstream `.ts`/`.m4s` bytes to client as chunked transfer; no buffering
+4. **Schedule Subsystem** — daily cron via APScheduler, NFS persistence, jolpica API client
+5. **Stream Health Checker** — background poller, HEAD/partial-GET on m3u8 URLs, results in Redis/memory cache
+6. **Backend API (FastAPI)** — serves `/schedule`, `/streams`, `/proxy`, `/relay` endpoints; reads from cache only
+7. **Svelte Frontend** — schedule page, watch page with stream picker and hls.js player
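+
+One possible shape for the Extractor Registry with fan-out dispatch, sketched with the standard library only. `BaseExtractor` is named in this document; `register`, `REGISTRY`, `extract_all`, and the two example sites are hypothetical, and the extractors return canned data instead of scraping anything:
+
+```python
+import asyncio
+
+REGISTRY = {}
+
+
+class BaseExtractor:
+    site_key = "base"
+
+    async def extract(self):
+        """Return a list of stream URLs for this site."""
+        raise NotImplementedError
+
+
+def register(cls):
+    """Class decorator that maps a site key to its extractor class."""
+    REGISTRY[cls.site_key] = cls
+    return cls
+
+
+@register
+class SiteA(BaseExtractor):
+    site_key = "site-a"
+
+    async def extract(self):
+        return ["https://cdn.example.com/a/index.m3u8"]
+
+
+@register
+class SiteB(BaseExtractor):
+    site_key = "site-b"
+
+    async def extract(self):
+        raise RuntimeError("site layout changed")
+
+
+async def extract_all(timeout=15.0):
+    """Run every extractor concurrently; a failure yields [] for that site only."""
+    keys = list(REGISTRY)
+    results = await asyncio.gather(
+        *(asyncio.wait_for(REGISTRY[k]().extract(), timeout) for k in keys),
+        return_exceptions=True,
+    )
+    return {
+        k: ([] if isinstance(r, BaseException) else r)
+        for k, r in zip(keys, results)
+    }
+
+
+print(asyncio.run(extract_all()))
+```
+
+`return_exceptions=True` is what provides the isolation: the failing `site-b` extractor does not cancel `site-a`. A real implementation would log or alert on the captured exception rather than silently mapping it to an empty list.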
+
+**Critical patterns:**
+- Extraction runs on background schedule, never on client request (on-demand extraction = 10-30s wait)
+- One extractor class per site; common `BaseExtractor` interface; isolation prevents cross-site failures
+- Proxy must rewrite m3u8 at every level — master, variant, and segment; partial rewriting breaks streams
+- Segment relay must stream bytes chunked, never buffer entire segment in memory
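+
+The "rewrite at every level" pattern can be sketched as a single function applied to each playlist the relay fetches, because URI lines look the same in master and variant playlists. This is a simplified illustration: `RELAY_PREFIX` is an assumed query shape for the `/relay` endpoint, the upstream URL is embedded directly for brevity (a deployment that keeps upstream URLs server-side would substitute an opaque identifier), and `URI="..."` attributes inside tags such as `#EXT-X-KEY` or `#EXT-X-MAP` are not handled:
+
+```python
+from urllib.parse import quote, urljoin
+
+RELAY_PREFIX = "/relay?url="  # assumed shape; not defined by this document
+
+
+def rewrite_playlist(text, playlist_url):
+    """Point every URI line of an m3u8 at the relay, resolving relative URIs."""
+    out = []
+    for line in text.splitlines():
+        stripped = line.strip()
+        # In an m3u8, non-empty lines that do not start with '#' are URIs.
+        if stripped and not stripped.startswith("#"):
+            absolute = urljoin(playlist_url, stripped)
+            line = RELAY_PREFIX + quote(absolute, safe="")
+        out.append(line)
+    return "\n".join(out)
+
+
+master = "#EXTM3U\n#EXT-X-STREAM-INF:BANDWIDTH=5000000\nhd/index.m3u8"
+print(rewrite_playlist(master, "https://cdn.example.com/live/master.m3u8"))
+```
+
+Calling the same function on the variant playlist, with the variant's own URL as `playlist_url`, rewrites the segment lines, so no level of the hierarchy points back at the CDN.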
+
+### Critical Pitfalls
+
+1. **JS-rendered tokens not in HTML** — Before writing any extractor, trace network traffic in DevTools to find the actual API endpoint the site JS calls. Replicate the API call, not the page fetch. Using Playwright is the last resort; most sites expose a clean JSON API once reverse-engineered.
+
+2. **m3u8 segment URLs bypass the proxy** — Rewrite all URLs in the playlist at every level (master → variant → segment). Verify with browser Network tab that zero requests reach the original CDN domain.
+
+3. **CDN-signed URLs expire mid-stream** — Never cache m3u8 playlists in the relay. Always fetch the live playlist from upstream on each poll. Implement background URL refresh that re-extracts before token TTL expires.
+
+4. **Extractor maintenance burden underestimated** — Sites break extractors without notice. Build health-check monitoring alongside the first extractor, not later. Alert on extractor failure within 5 minutes. Budget 1-2 hours/extractor/month for maintenance.
+
+5. **IP-based blocking from K8s cluster** — Test all extractors from the production cluster network before finalizing site targets. Datacenter IPs are pre-blocked on many streaming platforms. Simulate realistic browser headers (User-Agent, Referer, Accept-Language).
+
+6. **CORS missing on relay endpoints** — Set `Access-Control-Allow-Origin`, `Access-Control-Allow-Methods`, and `Access-Control-Allow-Headers: Range` on all relay responses. If `Range` is missing from `Access-Control-Allow-Headers`, preflight fails for segment requests that send a `Range` header. Test from an actual browser, not curl.
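+
+For pitfall 6, the header set can be kept in one helper so that no relay response omits it. The sketch below is illustrative only: `relay_cors_headers`, `preflight_allows`, the method list, and the origin `https://f1.example` are assumptions, not names from the codebase.
+
+```python
+def relay_cors_headers(allowed_origin: str) -> dict[str, str]:
+    """Headers to attach to every /proxy and /relay response, including preflights."""
+    return {
+        "Access-Control-Allow-Origin": allowed_origin,
+        "Access-Control-Allow-Methods": "GET, HEAD, OPTIONS",  # assumed method list
+        "Access-Control-Allow-Headers": "Range",
+    }
+
+
+def preflight_allows(headers: dict[str, str], requested: str) -> bool:
+    """Check whether a preflight asking for the `requested` headers would pass."""
+    allowed = {h.strip().lower() for h in headers["Access-Control-Allow-Headers"].split(",")}
+    wanted = {h.strip().lower() for h in requested.split(",") if h.strip()}
+    return wanted <= allowed
+
+
+headers = relay_cors_headers("https://f1.example")  # hypothetical frontend origin
+```
+
+A check like `preflight_allows` is only a unit-level guard; the browser test called for above is still needed because curl does not send preflights.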
+
+---
+
+## Implications for Roadmap
+
+The architecture's strict dependency chain dictates phase ordering. No phase can be skipped — each provides inputs required by the next. The recommended build order from ARCHITECTURE.md is confirmed by the pitfall analysis: the schedule subsystem must exist before extractors know when to run; extractors must work before the proxy has URLs to relay; the proxy must exist before the frontend has anything to play.
+
+### Phase 1: Foundation — Schedule, Infrastructure, and Extractor Framework
+
+**Rationale:** Schedule data is the trigger for everything downstream. Building the extractor framework (base class + registry) before writing any site-specific code prevents architectural lock-in. Neither involves much anti-scraping complexity: the schedule uses a public API, and the framework is pure Python scaffolding.
+
+**Delivers:** Working F1 schedule API endpoint, extractor plugin system with registry, Terragrunt deployment stack, NFS mount, development environment
+
+**Addresses features:** F1 schedule view, live/upcoming/finished indicators, session countdown timer (frontend-only, depends on schedule data)
+
+**Avoids pitfalls:** Establishes upfront which extraction approach each target site requires (API endpoint vs JS reverse-engineering vs Playwright); tests extractors from production network before committing to sites; implements timezone-aware schedule storage
+
+**Research flag:** STANDARD — jolpica API is well-documented, Terragrunt stack pattern is established in repo
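+
+The live/upcoming/finished indicator and the timezone-aware storage mentioned above can be sketched as a pure function over aware datetimes. The function name and the `duration` argument are assumptions; the schedule source may only supply start times, in which case the duration would have to be a per-session-type estimate.
+
+```python
+from datetime import datetime, timedelta, timezone
+
+
+def session_status(start: datetime, duration: timedelta, now: datetime) -> str:
+    """Classify a session; both datetimes must carry a timezone."""
+    if start.tzinfo is None or now.tzinfo is None:
+        raise ValueError("naive datetime: store and compare timezone-aware values")
+    if now < start:
+        return "upcoming"
+    if now < start + duration:
+        return "live"
+    return "finished"
+
+
+# Hypothetical session: starts 04:00 UTC and is assumed to run two hours.
+start = datetime(2026, 3, 8, 4, 0, tzinfo=timezone.utc)
+two_hours = timedelta(hours=2)
+# A viewer clock at UTC+11 compares correctly because both values are aware.
+local_now = datetime(2026, 3, 8, 15, 30, tzinfo=timezone(timedelta(hours=11)))
+status = session_status(start, two_hours, local_now)
+```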
+
+---
+
+### Phase 2: Extraction Pipeline — First Working Extractor
+
+**Rationale:** One end-to-end working extractor (raw URL → validated stream URL) proves the extraction architecture before scaling to multiple sites. Health checker must be built alongside first extractor — not after — because silent failures are the primary operational risk.
+
+**Delivers:** First site extractor returning live HLS URLs, stream health checker (HEAD/GET validation), Redis caching with TTL, background polling scheduler
+
+**Addresses features:** Per-site stream extractor, stream health checker, stream auto-refresh (background polling)
+
+**Avoids pitfalls:** Extractor built with full failure visibility (logs which step fails); health-check alerts configured from day one; extractor tested from production K8s network before finalizing
+
+**Research flag:** NEEDS RESEARCH DURING PLANNING — specific target sites unknown; each site requires independent reverse-engineering; Playwright requirement depends on site-specific JS analysis
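+
+The HEAD/GET validation step can be separated from the network call so it is testable on its own. The sketch below assumes the checker has already fetched the status code and the first bytes of the response; `looks_like_playlist` is a hypothetical name. It relies only on the RFC 8216 rule that a playlist begins with `#EXTM3U`.
+
+```python
+def looks_like_playlist(status_code: int, body_prefix: bytes) -> bool:
+    """Return True when a (partial) GET response plausibly carries an HLS playlist."""
+    # 200 for a full response, 206 when the server honoured a Range request.
+    if status_code not in (200, 206):
+        return False
+    # RFC 8216: the first line of every playlist is the #EXTM3U tag.
+    return body_prefix.startswith(b"#EXTM3U")
+
+
+ok = looks_like_playlist(206, b"#EXTM3U\n#EXT-X-VERSION:3\n")
+blocked = looks_like_playlist(403, b"#EXTM3U\n")
+html_error_page = looks_like_playlist(200, b"<html><body>Not found")
+```
+
+This only shows that a playlist was returned. Confirming the stream is actually live would additionally require checking that the media playlist keeps advancing between polls, which this sketch does not attempt.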
+
+---
+
+### Phase 3: Stream Proxy and Relay Layer
+
+**Rationale:** The proxy layer converts raw CDN URLs (which browsers cannot fetch cross-origin) into browser-playable same-origin URLs. This is the architectural blocker for the frontend — no proxy, no browser playback. Must be built before any UI work.
+
+**Delivers:** `/proxy` endpoint (m3u8 fetch + full URI rewrite at all levels), `/relay` endpoint (chunked segment pipe-through), CORS headers on all relay responses, URL refresh loop for token expiry
+
+**Addresses features:** CORS-transparent HLS proxy (mandatory for all browser playback), multiple quality options (variant playlist rewriting), stream picker (proxied URLs safe to expose to frontend)
+
+**Avoids pitfalls:** Rewrites m3u8 at master + variant + segment levels; never caches playlists; streams segments as chunked transfer (no memory buffering); `Access-Control-Allow-Headers` includes `Range`; relay endpoint is not publicly accessible (Traefik auth)
+
+**Research flag:** STANDARD — HLS spec (RFC 8216) and proxy patterns are well-documented; implementation is mechanical once architecture is understood
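+
+The rewrite at every level can be sketched as a line-by-line pass over the playlist text. This is a simplified illustration under stated assumptions: the `?url=` query parameter is hypothetical, and URIs embedded in tag attributes (for example `#EXT-X-KEY` or `#EXT-X-MAP`) are not handled here although a complete rewriter would need to cover them.
+
+```python
+from urllib.parse import quote, urljoin
+
+
+def rewrite_playlist(text: str, base_url: str) -> str:
+    """Point every URI line at /proxy (nested playlists) or /relay (segments)."""
+    out = []
+    next_is_variant = False
+    for line in text.splitlines():
+        stripped = line.strip()
+        if not stripped:
+            out.append(line)
+            continue
+        if stripped.startswith("#"):
+            # In a master playlist the URI after EXT-X-STREAM-INF is another playlist.
+            next_is_variant = stripped.startswith("#EXT-X-STREAM-INF")
+            out.append(line)
+            continue
+        # Resolve relative URIs against the playlist's own URL before encoding.
+        absolute = urljoin(base_url, stripped)
+        endpoint = "/proxy" if next_is_variant else "/relay"
+        out.append(f"{endpoint}?url={quote(absolute, safe='')}")
+        next_is_variant = False
+    return "\n".join(out) + "\n"
+
+
+master = "#EXTM3U\n#EXT-X-STREAM-INF:BANDWIDTH=800000\nlow/index.m3u8\n"
+media = "#EXTM3U\n#EXTINF:6.0,\nseg1.ts\n"
+rewritten_master = rewrite_playlist(master, "https://cdn.example/live/master.m3u8")
+rewritten_media = rewrite_playlist(media, "https://cdn.example/live/low/index.m3u8")
+```
+
+Because the upstream URL is percent-encoded into the query string, no unencoded CDN origin remains in the output, which is the property the browser Network tab check is meant to confirm.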
+
+---
+
+### Phase 4: Frontend — Schedule, Picker, and Player
+
+**Rationale:** All backend components are independently testable via curl before the UI exists. The frontend is the final assembly step, not an intermediate one. Building it last means it integrates against a working backend rather than mocking everything.
+
+**Delivers:** SvelteKit app with schedule view, stream picker, embedded hls.js player, session countdown timer, live/upcoming/finished badges
+
+**Addresses features:** Embedded HLS player, stream picker, session countdown, live session indicator, race weekend overview (grouping sessions by Grand Prix)
+
+**Avoids pitfalls:** hls.js error handler attached from day one; autoplay muted by default; streams display with source label and liveness status; timezone displayed in browser local time
+
+**Research flag:** STANDARD — SvelteKit + hls.js integration is well-documented; component structure is straightforward given small scope
+
+---
+
+### Phase 5: Coverage Expansion and Reliability
+
+**Rationale:** Once the full pipeline is proven end-to-end with one extractor, adding more sites is low-risk incremental work following the established pattern. Stream reliability features (fallback ordering, source labeling) are only meaningful once multiple sources exist.
+
+**Delivers:** Additional site extractors (2-3 more sites), fallback stream ordering by health-check recency, source labels in stream picker, extractor monitoring alerts (notification channel)
+
+**Addresses features:** Additional extractors, fallback stream ordering, source labeling, stream auto-refresh improvements
+
+**Avoids pitfalls:** Each new extractor reverse-engineered independently; health-check alerts tested by deliberate failure injection before each race weekend
+
+**Research flag:** NEEDS RESEARCH DURING PLANNING — each new target site requires individual analysis of extraction approach; cannot be planned generically
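+
+Fallback ordering by health-check recency might be as small as a sort key. The `StreamStatus` record and its fields are hypothetical; the sketch assumes the health checker stores a liveness flag and a timestamp per source.
+
+```python
+from dataclasses import dataclass
+from datetime import datetime, timezone
+
+
+@dataclass
+class StreamStatus:
+    source: str           # label shown in the stream picker
+    is_live: bool         # result of the most recent health check
+    checked_at: datetime  # when that check ran (timezone-aware)
+
+
+def order_streams(streams: list[StreamStatus]) -> list[StreamStatus]:
+    """Live streams first; within each group, most recently checked first."""
+    return sorted(streams, key=lambda s: (not s.is_live, -s.checked_at.timestamp()))
+
+
+t = [datetime(2026, 3, 8, 4, m, tzinfo=timezone.utc) for m in (0, 1, 2)]
+ordered = order_streams([
+    StreamStatus("site-a", True, t[0]),
+    StreamStatus("site-b", False, t[2]),
+    StreamStatus("site-c", True, t[1]),
+])
+```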
+
+---
+
+### Phase Ordering Rationale
+
+- **Schedule first:** Public API, no anti-scraping complexity, required by extraction scheduler. Proves the Terragrunt stack without risking extractor failures.
+- **Extractor framework before site-specific extractors:** Base class and registry must exist first; forces interface design before implementation.
+- **Health checker with first extractor:** Silent failures are the top operational risk; monitoring must not be deferred.
+- **Proxy before frontend:** The frontend's player cannot function without a working `/proxy` endpoint; building UI against a mock wastes time.
+- **Frontend last of core phases:** All backend endpoints are curl-testable; UI is integration, not a prerequisite.
+- **Additional extractors after core works:** Pattern is proven, risk is low, each site is independently scoped.
+
+### Research Flags
+
+Phases needing `/gsd:research-phase` during planning:
+- **Phase 2 (Extraction Pipeline):** Target sites unknown; each requires independent DevTools session to determine extraction approach (API endpoint, JS algorithm, or Playwright). Cannot scope extractors without site-specific analysis.
+- **Phase 5 (Coverage Expansion):** Each new target site is a fresh reverse-engineering problem. Budget per-site research before each extractor is scoped.
+
+Phases with standard patterns (skip research-phase):
+- **Phase 1 (Foundation):** jolpica/OpenF1 API is public and documented. Terragrunt stack follows established repo pattern. Extractor base class is standard Python ABC.
+- **Phase 3 (Proxy/Relay):** HLS spec is RFC 8216. Proxy rewriting pattern is well-documented in HLS-Proxy and yt-dlp literature. CORS mechanics are standard.
+- **Phase 4 (Frontend):** SvelteKit + hls.js integration has clear documentation. Component scope is small.
+
+---
+
+## Confidence Assessment
+
+| Area | Confidence | Notes |
+|------|------------|-------|
+| Stack | HIGH | All versions verified against PyPI and GitHub releases as of 2026-02-23. Version compatibility matrix confirmed. |
+| Features | MEDIUM | Feature list is well-grounded; competitor analysis confirms novelty. OpenF1 API confidence is MEDIUM (third-party fan project, not official F1). |
+| Architecture | MEDIUM | HLS spec and proxy mechanics are HIGH confidence (RFC 8216, Apple docs). System composition for this specific use-case is inferred from domain patterns. |
+| Pitfalls | MEDIUM | yt-dlp/streamlink source analysis and HLS RFC are HIGH; streaming site anti-scraping behavior is LOW (sparse public documentation). |
+
+**Overall confidence:** MEDIUM
+
+### Gaps to Address
+
+- **Target site list not defined:** The research assumes a list of specific streaming sites to target but does not name them. Phase 2 cannot be scoped until specific sites are identified and reverse-engineered in a DevTools session. This is the largest planning gap.
+- **OpenF1 live data cost:** OpenF1's live session data is not available on the free tier; it costs €9.90/month. Research recommends using the F1 calendar static JSON for the schedule. Validate whether the jolpica API provides sufficient real-time session status (live/upcoming) before finalizing the schedule integration approach.
+- **Home ISP IP classification:** Whether the K8s cluster's home ISP IP is treated as residential or datacenter by streaming site IP reputation databases is unknown. Must test each target site from the cluster before committing. Recovery if blocked: residential proxy or VPN exit node.
+- **Multi-variant playlist availability:** The multiple-quality feature depends on source sites providing multi-variant HLS playlists. This cannot be confirmed until specific sites are targeted. Phase 3 proxy rewriting should handle it correctly regardless, but the UX feature may not be usable at launch.
+- **Token TTL per site:** Each site's CDN token TTL is unknown until extractors are built and tested. The background refresh architecture is in place, but the refresh interval must be configured per-site based on observed TTLs.
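+
+Once a site's token TTL has been observed, the per-site refresh time could be derived as below. The function name, the 0.5 safety factor, and the ten-minute TTL are placeholders, not measured values.
+
+```python
+from datetime import datetime, timedelta, timezone
+
+
+def next_refresh_at(extracted_at: datetime, ttl: timedelta, safety: float = 0.5) -> datetime:
+    """Schedule re-extraction at a fraction of the observed TTL, before the token expires."""
+    if not 0 < safety < 1:
+        raise ValueError("safety must be between 0 and 1 (exclusive)")
+    return extracted_at + ttl * safety
+
+
+# Hypothetical observation: a token extracted at 14:00 UTC that lasts ten minutes.
+extracted = datetime(2026, 3, 8, 14, 0, tzinfo=timezone.utc)
+refresh_time = next_refresh_at(extracted, timedelta(minutes=10))
+```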
+
+---
+
+## Sources
+
+### Primary (HIGH confidence)
+- PyPI and GitHub release pages — all stack versions (FastAPI, yt-dlp, Playwright, httpx, APScheduler, FastF1, Pydantic, hls.js, Tailwind CSS, SvelteKit, Svelte)
+- RFC 8216 (IETF) — HLS specification, playlist structure, segment URL mechanics
+- yt-dlp `common.py` + CONTRIBUTING.md — extractor plugin pattern, format selection
+- HLS.js API documentation — initialization, error handling, quality level management
+- MDN CORS documentation — preflight requirements, credential restrictions, header rules
+- OpenF1 API documentation — rate limits, live vs. historical tiers, session endpoints
+
+### Secondary (MEDIUM confidence)
+- jolpica-f1 GitHub README — Ergast-compatible API, availability guarantees (community-maintained)
+- Streamlink plugin documentation — per-site extractor isolation pattern
+- HLS-Proxy (warren-bank) README — CORS proxy architecture requirements
+- RaceControl (robvdpol), f1viewer (SoMuchForSubtlety) READMEs — F1 streaming UX expectations
+
+### Tertiary (LOW confidence)
+- Web searches on streaming site anti-scraping techniques — sparse results; pitfalls inferred from yt-dlp source patterns
+- f1calendar.com — timezone complexity observations; not an authoritative source
+
+---
+
+*Research completed: 2026-02-23*
+*Ready for roadmap: yes*