infra/stacks/f1-stream/files/backend/extractors/curated.py

"""Curated extractor — known-good 24/7 F1 channels via direct embed URLs.

Returns a small, hand-picked list of embed URLs that are reliable enough to
be served as fallback "always-on" streams when the dynamic extractors find
nothing (e.g. between race weekends, when API providers are down).

These are direct embed URLs. The frontend routes them through /embed so the
iframe-stripping proxy bypasses any frame-buster JS in the upstream player.
"""

import logging

from backend.extractors.base import BaseExtractor
from backend.extractors.models import ExtractedStream

logger = logging.getLogger(__name__)


# Curated list. Each entry is a known direct embed URL. These were sourced
# from the timstreams.py ALWAYS_INCLUDE_HASHES list (Sky Sports F1, DAZN F1)
# and are documented as 24/7 channels that play F1 content year-round.
_CURATED_STREAMS = [
    {
        "url": "https://hmembeds.one/embed/888520f36cd94c5da4c71fddc1a5fc9b",
        "title": "Sky Sports F1 (24/7)",
        "quality": "HD",
    },
    {
        "url": "https://hmembeds.one/embed/fc3a54634d0867b0c02ee3223292e7c6",
        "title": "DAZN F1 (24/7)",
        "quality": "HD",
    },
]


class CuratedExtractor(BaseExtractor):
    """Returns curated known-good 24/7 F1 channel embed URLs."""

    @property
    def site_key(self) -> str:
        return "curated"

    @property
    def site_name(self) -> str:
        return "Curated 24/7 Channels"

    async def extract(self) -> list[ExtractedStream]:
        streams = [
            ExtractedStream(
                url=entry["url"],
                site_key=self.site_key,
                site_name=self.site_name,
                quality=entry["quality"],
                title=entry["title"],
                stream_type="embed",
                embed_url=entry["url"],
            )
            for entry in _CURATED_STREAMS
        ]
        logger.info("[curated] Returning %d curated stream(s)", len(streams))
        return streams
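
The module docstring says the frontend routes these embed URLs through `/embed`. A minimal sketch of what such a mapping could look like follows. It assumes the proxy accepts the upstream URL in a `url` query parameter; that parameter name and the helper `proxied_embed_url` are hypothetical and do not appear in this file.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Assumption: the proxy reads the upstream URL from a `url` query parameter.
# The real parameter name is not shown in this file.
PROXY_ROUTE = "/embed"


def proxied_embed_url(upstream_url: str) -> str:
    """Build a same-origin proxy path for an upstream embed URL (hypothetical helper)."""
    # safe="" percent-encodes ':' and '/' as well, so the upstream URL
    # travels as a single query value.
    return f"{PROXY_ROUTE}?url={quote(upstream_url, safe='')}"


if __name__ == "__main__":
    src = "https://hmembeds.one/embed/888520f36cd94c5da4c71fddc1a5fc9b"
    path = proxied_embed_url(src)
    print(path)
    # Round-trip check: decoding the query yields the original URL.
    assert parse_qs(urlsplit(path).query)["url"] == [src]
```

Percent-encoding the whole upstream URL keeps any query string it carries from being mistaken for parameters of the proxy route itself.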
f1-stream: only show streams confirmed playable by headless browser

Cuts the stream list from 23 mostly-broken entries to ~6 confirmed-playable
ones, and adds an iframe-stripping proxy so embed sources (hmembeds, etc.)
load through our origin without X-Frame-Options / CSP / JS frame-buster
blocks.

Why: the previous list was dominated by Discord-shared news article URLs,
hardcoded aggregator landing pages, and other non-stream URLs that all sat at
is_live=true because embed streams skipped the health check entirely. Users
could not tell which links would actually play.

What:
- backend/playback_verifier.py: new headless-Chromium verifier (Playwright)
  that polls each candidate stream for a codec-independent "playable" signal
  (hls.js MANIFEST_PARSED for m3u8; <video>/player div for embed). Replaces
  the unconditional is_live=True for embed streams in service.py.
- backend/embed_proxy.py: new /embed and /embed-asset routes that fetch
  upstream embed pages, strip X-Frame-Options/CSP/Set-Cookie, and inject a
  <base href> + frame-buster-defeat <script> that locks down window.top,
  document.referrer, console.clear/table, and window.location so the hmembeds
  disable-devtool.js redirect-to-google trap can't fire.
- extractors/curated.py: new always-on extractor with two known-good 24/7
  hmembeds embeds (Sky Sports F1, DAZN F1) so the list isn't empty between
  race weekends.
- extractors/__init__.py: register CuratedExtractor first; drop
  FallbackExtractor (its 10 aggregator landing-pages can't iframe-play).
- extractors/discord_source.py: positive-match path filter (must look like
  /embed/, /stream, /watch, /live, /player, .m3u8, .php) plus expanded domain
  blocklist for news sites (was 10 noise URLs, now ~1).
- extractors/service.py: run_extraction now health-checks AND
  verifier-checks both stream types; only verified-playable streams reach
  is_live.
- main.py: register /embed + /embed-asset routes; defer initial extraction by
  8s so the verifier can reach the local /embed proxy on 127.0.0.1:8000.
- frontend/lib/api.js + watch/+page.svelte: route embed iframes through
  /embed proxy instead of the upstream URL, so X-Frame-Options/CSP can't
  block them.
- Dockerfile: install Playwright chromium + system codec-runtime libs.
- main.tf: bump pod memory 256Mi → 1Gi for chromium.

Verified end-to-end with Playwright against https://f1.viktorbarzin.me/watch:
6/6 streams reach a player UI; the 3 demo m3u8s actually play (codec-bearing
browser); the 3 embeds (Sky Sports F1, DAZN F1, sportsurge) render iframes
through the proxy.

Image: viktorbarzin/f1-stream:v6.0.5

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-06 21:00:07 +00:00
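
The commit message describes the embed proxy as stripping X-Frame-Options, CSP, and Set-Cookie from upstream responses and injecting a `<base href>`. The sketch below illustrates those two steps only; it is not the contents of backend/embed_proxy.py, which is not shown here. The function names `strip_frame_headers` and `inject_base_href` are hypothetical, and the frame-buster-defeat script is left out.

```python
import html
import re

# Response headers the commit message says the proxy strips (lowercased).
BLOCKED_HEADERS = {"x-frame-options", "content-security-policy", "set-cookie"}

# Matches <head> or <head ...attrs>, but not <header>.
_HEAD_RE = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def strip_frame_headers(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of `headers` without the blocked ones (case-insensitive)."""
    return {k: v for k, v in headers.items() if k.lower() not in BLOCKED_HEADERS}


def inject_base_href(page: str, upstream_url: str) -> str:
    """Insert a <base href> so relative asset URLs resolve against the upstream page."""
    tag = f'<base href="{html.escape(upstream_url, quote=True)}">'
    if _HEAD_RE.search(page):
        # Place the tag directly after the first opening <head> tag.
        return _HEAD_RE.sub(lambda m: m.group(0) + tag, page, count=1)
    # No <head> found: prepend the tag as a fallback.
    return tag + page


if __name__ == "__main__":
    upstream = {"Content-Type": "text/html", "X-Frame-Options": "DENY"}
    print(strip_frame_headers(upstream))
    print(inject_base_href("<html><head></head></html>", "https://example.com/embed/1"))
```

Matching header names case-insensitively matters because HTTP header names are not case-sensitive, so an upstream server may send `x-frame-options` or `X-Frame-Options`.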
