f1-stream: only show streams confirmed playable by headless browser

Cuts the stream list from 23 mostly-broken entries to ~6
confirmed-playable ones, and adds an iframe-stripping proxy so embed
sources (hmembeds, etc.) load through our origin without
X-Frame-Options / CSP / JS frame-buster blocks.

Why: the previous list was dominated by Discord-shared news article
URLs, hardcoded aggregator landing pages, and other non-stream URLs that
all sat at is_live=true because embed streams skipped the health check
entirely. Users could not tell which links would actually play.

What:
- backend/playback_verifier.py: new headless-Chromium verifier
  (Playwright) that polls each candidate stream for a codec-independent
  "playable" signal (hls.js MANIFEST_PARSED for m3u8; <video>/player div
  for embed). Replaces the unconditional is_live=True for embed streams
  in service.py.
- backend/embed_proxy.py: new /embed and /embed-asset routes that fetch
  upstream embed pages, strip X-Frame-Options/CSP/Set-Cookie, and inject
  a <base href> + frame-buster-defeat <script> that locks down
  window.top, document.referrer, console.clear/table, and
  window.location so the hmembeds disable-devtool.js redirect-to-google
  trap can't fire.
- extractors/curated.py: new always-on extractor with two known-good
  24/7 hmembeds embeds (Sky Sports F1, DAZN F1) so the list isn't empty
  between race weekends.
- extractors/__init__.py: register CuratedExtractor first; drop
  FallbackExtractor (its 10 aggregator landing-pages can't iframe-play).
- extractors/discord_source.py: positive-match path filter (must look
  like /embed/, /stream, /watch, /live, /player, *.m3u8, *.php) plus
  expanded domain blocklist for news sites (was 10 noise URLs, now ~1).
- extractors/service.py: run_extraction now health-checks AND
  verifier-checks both stream types; only verified-playable streams
  reach is_live.
- main.py: register /embed + /embed-asset routes; defer initial
  extraction by 8s so the verifier can reach the local /embed proxy on
  127.0.0.1:8000.
- frontend/lib/api.js + watch/+page.svelte: route embed iframes through
  /embed proxy instead of the upstream URL, so X-Frame-Options/CSP can't
  block them.
- Dockerfile: install Playwright chromium + system codec-runtime libs.
- main.tf: bump pod memory 256Mi → 1Gi for chromium.

Verified end-to-end with Playwright against
https://f1.viktorbarzin.me/watch: 6/6 streams reach a player UI; the 3
demo m3u8s actually play (codec-bearing browser); the 3 embeds (Sky
Sports F1, DAZN F1, sportsurge) render iframes through the proxy.

Image: viktorbarzin/f1-stream:v6.0.5

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
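
For readers who want a concrete picture of the positive-match path filter described for extractors/discord_source.py, here is a minimal sketch. The path markers and suffixes are the ones listed in the commit message. The function name `looks_like_stream`, the constant names, the substring-matching rule, and the single blocklist entry are assumptions made for illustration, not the code in this commit.

```python
from urllib.parse import urlparse

# Path fragments listed in the commit message. How discord_source.py
# actually matches them is not shown, so substring matching is an assumption.
STREAM_PATH_MARKERS = ("/embed/", "/stream", "/watch", "/live", "/player")
STREAM_PATH_SUFFIXES = (".m3u8", ".php")

# Hypothetical blocklist entry; the real list of news domains is not shown.
BLOCKED_DOMAINS = {"news.example.com"}


def looks_like_stream(url: str) -> bool:
    """Return True only if the URL positively resembles a stream link."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if parsed.scheme not in ("http", "https") or not host:
        return False
    # Reject blocklisted domains and any of their subdomains.
    if any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS):
        return False
    path = parsed.path.lower()
    # Query strings are ignored, so "playlist.m3u8?token=..." still matches.
    return path.endswith(STREAM_PATH_SUFFIXES) or any(
        marker in path for marker in STREAM_PATH_MARKERS
    )
```

Because the filter is an allowlist on the path rather than a denylist on the domain alone, an article URL on an unknown news site is dropped by default. Substring matching is deliberately loose in this sketch (for example, `/live` also matches `/livery`), which is acceptable only because the playback verifier makes the final call.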
2026-05-06 21:00:07 +00:00
commit 89c59ccc80
parent 1cc745bbec
15 changed files with 2128 additions and 22 deletions
--- a/stacks/f1-stream/files/backend/extractors/service.py
+++ b/stacks/f1-stream/files/backend/extractors/service.py
@@ -6,6 +6,7 @@ from datetime import datetime, timezone
 from backend.extractors.models import ExtractedStream
 from backend.extractors.registry import ExtractorRegistry
 from backend.health import StreamHealthChecker
+from backend.playback_verifier import PlaybackVerifier

 logger = logging.getLogger(__name__)

@@ -29,6 +30,11 @@ class ExtractionService:
        self._last_run: str | None = None
        self._last_run_stream_count: int = 0
        self._health_checker = StreamHealthChecker()
+        self._playback_verifier = PlaybackVerifier()
+
+    async def shutdown(self) -> None:
+        """Release the headless browser instance owned by the verifier."""
+        await self._playback_verifier.shutdown()

    async def run_extraction(self) -> None:
        """Run all extractors, health-check results, and cache them.
@@ -43,31 +49,63 @@ class ExtractionService:

        streams = await self._registry.extract_all()

-        # Run health checks on all extracted streams
+        # Run health checks + headless-browser playback verification.
+        # Both stream types are now verified end-to-end so the user only
+        # ever sees streams that actually play in a browser.
        if streams:
-            # Separate m3u8 streams (need health check) from embed streams (skip)
            m3u8_streams = [s for s in streams if s.stream_type != "embed"]
            embed_streams = [s for s in streams if s.stream_type == "embed"]

-            # Mark embed streams as live (no health check possible for iframes)
-            for stream in embed_streams:
-                stream.is_live = True
-                stream.response_time_ms = 0
-                stream.checked_at = start.isoformat()
-
-            # Health-check only m3u8 streams
+            # m3u8 streams: cheap structural health check (validates manifest,
+            # checks first variant playlist), then a headless-browser test
+            # to confirm hls.js can decode and render frames.
            if m3u8_streams:
                stream_dicts = [s.to_dict() for s in m3u8_streams]
                health_map = await self._health_checker.check_all(stream_dicts)
-
                for stream in m3u8_streams:
                    health = health_map.get(stream.url)
                    if health:
-                        stream.is_live = health.is_live
                        stream.response_time_ms = health.response_time_ms
                        stream.checked_at = health.checked_at
                        if health.bitrate > 0:
                            stream.bitrate = health.bitrate
+                        # tentatively mark live; final word comes from the verifier
+                        stream.is_live = health.is_live
+
+            # Browser verification: applies to both m3u8 (only those that
+            # passed structural health) and embed (always — they have no
+            # other way to verify).
+            verify_items: list[tuple[str, str]] = []
+            for stream in m3u8_streams:
+                if stream.is_live:
+                    verify_items.append((stream.url, "m3u8"))
+            for stream in embed_streams:
+                verify_items.append((stream.embed_url or stream.url, "embed"))
+
+            verdicts = await self._playback_verifier.verify_many(verify_items)
+
+            now_iso = datetime.now(timezone.utc).isoformat()
+            for stream in m3u8_streams:
+                if not stream.is_live:
+                    continue  # already failed health check
+                verdict = verdicts.get(stream.url)
+                if verdict is None:
+                    continue  # verifier disabled or unavailable
+                stream.is_live = verdict.is_playable
+                stream.checked_at = now_iso
+
+            for stream in embed_streams:
+                key = stream.embed_url or stream.url
+                verdict = verdicts.get(key)
+                stream.checked_at = now_iso
+                if verdict is None:
+                    # Verifier unavailable — fall back to "trust extractor".
+                    # This keeps the service usable even without playwright.
+                    stream.is_live = True
+                    stream.response_time_ms = 0
+                else:
+                    stream.is_live = verdict.is_playable
+                    stream.response_time_ms = verdict.elapsed_ms

        # Group streams by site_key and update cache
        new_cache: dict[str, list[ExtractedStream]] = {}