f1-stream: subreddit extractor finds Reddit '[Watch / Download]' threads

Two fixes for the previously dormant subreddit extractor, plus a pivot of chrome_browser.TARGETS to this weekend's live MotoGP URLs.

1. **Reddit fetch was 403'd by `Accept: application/json`**. Sent from
   the cluster IP, that header trips Reddit's anti-bot fingerprinting
   and Reddit answers 403 with an HTML body. Dropping the explicit
   Accept (httpx then sends its default `*/*`) restores HTTP 200 with
   JSON. Confirmed with a direct httpx request from the f1-stream pod.

2. **Search the right things**. The community uses a stable
   `[Watch / Download] <Series> <Year> - <Round> | <Event>` post pattern
   with selftext links to admin-curated WordPress sites (motomundo.net
   for MotoGP, sister sites for F1 when active). New extractor:
   - Hits /new.json on r/MotorsportsReplays and three smaller
     motorsport subs, and /search.json on r/MotorsportsReplays only.
   - Keeps only posts whose title contains `[watch` or `watch online`,
     or whose flair contains `live` (all case-insensitive).
   - Extracts URLs from the post link and the selftext (regex), then
     keeps only hosts on the `_INTERESTING_HOSTS` allowlist (motomundo,
     freemotorsports, pitsport, rerace, dd12, etc.) so we don't drown
     the verifier in YouTube/Discord/gofile links.
   - Returns each URL as an `embed`-type stream so the chrome-service
     verifier visits it.

3. **chrome_browser.TARGETS pivoted** to the live MotoMundo MotoGP
   French GP iframes (motomundo.top/e/<id> + motomundo.upns.xyz/#<id>)
   while the weekend is on. The previous DD12 NASCAR + Acestrlms F1
   targets were both broken JW Player paths anyway.
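
The filtering steps in item 2 can be sketched in isolation as below. This is a minimal, standalone approximation rather than the extractor itself: the host tuples are abbreviated copies of the ones in the diff, `candidate_urls` is a hypothetical helper name, and the sample post (title, round number, links) is invented for illustration.

```python
import re

# Abbreviated copies of the tuples in the diff; the real lists are longer.
INTERESTING_HOSTS = ("motomundo.net", "motomundo.top", "dd12streams.com")
REJECT_HOSTS = ("discord.gg", "youtube.com", "gofile.io")

# Same pattern as _URL_RE in the diff: stop at whitespace, brackets, quotes.
URL_RE = re.compile(r"https?://[^\s\)\]\>\"']+")


def has_live_marker(post: dict) -> bool:
    """True when the title or flair looks like a watch/live thread."""
    title = (post.get("title") or "").lower()
    flair = (post.get("link_flair_text") or "").lower()
    return "[watch" in title or "watch online" in title or "live" in flair


def is_interesting(url: str) -> bool:
    """Reject known-useless hosts first, then require an allowlisted host."""
    low = url.lower()
    if any(host in low for host in REJECT_HOSTS):
        return False
    return any(host in low for host in INTERESTING_HOSTS)


def candidate_urls(post: dict) -> list[str]:
    """Hypothetical helper: selftext URLs of a marked post that pass the allowlist."""
    if not has_live_marker(post):
        return []
    text = post.get("selftext") or ""
    return [u for u in URL_RE.findall(text) if is_interesting(u)]


# Invented example post, shaped like Reddit's JSON `data` object.
sample_post = {
    "title": "[Watch / Download] MotoGP 2026 - Round 05 | French GP",
    "link_flair_text": "",
    "selftext": (
        "**Watch Online:** [MotoMundo](https://motomundo.net/motogp-french-gp/)\n"
        "Chat: https://discord.gg/example\n"
        "Download: https://gofile.io/d/abc123"
    ),
}

print(candidate_urls(sample_post))
# ['https://motomundo.net/motogp-french-gp/']
```

Only the allowlisted link survives; the Discord and gofile links are dropped before the verifier ever sees them.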

State after deploy:
- /streams: 3 verified live (WRC Rally Portugal, NASCAR 24/7, Premier
  League Darts). Darts is live right now only because a UK match is in
  progress.
- Subreddit extractor surfaces the live MotoMundo URL but the verifier
  marks the WordPress wrapper page playable=False (no top-level <video>
  element; the m3u8 lives in nested iframes). Next iteration: drill the
  verifier into iframe contentDocument and capture from there.
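
As a rough picture of what "drilling into iframes" means, the sketch below lists the `<iframe>` sources of a wrapper page using only the standard library. It is a static stand-in, not the verifier: the real pipeline runs in chrome-service, iframes injected by JavaScript would not appear in raw HTML, and cross-origin frames generally do not expose `contentDocument` to page scripts, so a real implementation may have to visit each frame URL instead. `IframeCollector`, `iframe_sources`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class IframeCollector(HTMLParser):
    """Collect the src attribute of every <iframe> start tag."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_sources(html: str) -> list[str]:
    """Return iframe src values in document order."""
    parser = IframeCollector()
    parser.feed(html)
    return parser.sources


# Invented wrapper page; the two URLs are the TARGETS from this commit.
wrapper_html = """
<article>
  <h1>MotoGP French GP</h1>
  <iframe src="https://motomundo.top/e/9yzn08jk9py4" allowfullscreen></iframe>
  <iframe src="https://motomundo.upns.xyz/#kqasde"></iframe>
</article>
"""

print(iframe_sources(wrapper_html))
```

Each returned URL could then be fed to the verifier as its own embed candidate, which is where the m3u8 capture would actually happen.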

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Viktor Barzin 2026-05-07 18:37:11 +00:00
parent 3148d15d5a
commit 4a636f3fb7
2 changed files with 147 additions and 77 deletions

@@ -51,23 +51,22 @@ class _Target:
 # whose m3u8 is JS-computed. Add freely — each one takes ~12s to scrape.
 # ---------------------------------------------------------------------------
 TARGETS: tuple[_Target, ...] = (
-    # DD12streams' /nas iframe → /new-nas/jwplayer → JW player setup with
-    # an inline m3u8. The HTML is generated server-side and the URL string
-    # is embedded directly, so this would also work over curl — but the
-    # browser path future-proofs against dd12 starting to compute the URL
-    # in JS.
+    # MotoMundo embed pages — the community-curated WordPress site for
+    # MotoGP. Each /e/<id> URL is one of the iframes their "Watch Online"
+    # post lists for the active session (FP/Q/Race). The m3u8 is
+    # JS-computed at load time so a real browser is required to capture
+    # it. Update IDs each weekend to match the current race; subreddit.py
+    # discovers them from the Reddit "[Watch / Download]" thread.
     _Target(
-        label="DD12Streams",
-        title="NASCAR Cup (24/7) — DD12",
-        url="https://dd12streams.com/nas",
-        settle=10,
+        label="MotoMundo",
+        title="MotoGP Live (MotoMundo) — French GP / Le Mans",
+        url="https://motomundo.top/e/9yzn08jk9py4",
+        settle=15,
     ),
-    # Acestrlms aggregator — always-on F1 page that re-frames pooembed.
-    # Captured m3u8 only appears once the embed's JS runs.
     _Target(
-        label="Acestrlms",
-        title="Sky Sports F1 (24/7) — Acestrlms",
-        url="https://acestrlms.pages.dev/f11/",
+        label="MotoMundo",
+        title="MotoGP Live (MotoMundo upns) — French GP / Le Mans",
+        url="https://motomundo.upns.xyz/#kqasde",
         settle=15,
     ),
 )

@@ -1,17 +1,28 @@
-"""Subreddit extractor — pulls live-stream posts from motorsport subreddits.
+"""Subreddit extractor — pulls community-curated live-stream URLs from
+the *MotorsportsReplays* subreddit (and a few siblings).
 
-Uses the public old.reddit.com JSON API (no auth required) to discover
-posts in r/MotorsportsReplays, r/motorsports, r/MotorsportsStreaming etc.
-that are tagged "Live" or whose title matches motorsport stream keywords.
+The community follows a stable pattern: a single mod-curated post titled
+`[Watch / Download] <Series> <Year> - <Round> | <Event>` goes up on or
+near each race weekend with a `**Watch Online:**` link in the selftext,
+pointing at an admin-run WordPress site (motomundo.net for MotoGP, the
+F1 equivalent has rotated over the years). That WordPress page hosts
+iframe embeds whose m3u8 is JS-computed at load time ideal target for
+the chrome-service pipeline downstream.
 
-Each candidate URL is then sent to the chrome-service-driven pipeline
-(via ChromeBrowserExtractor.scrape one-off) so the m3u8 is captured even
-when the link points to an aggregator page rather than a direct playlist.
+This extractor:
+  - Hits Reddit with a real-browser User-Agent (httpx default UA + cluster
+    IP combo gets HTTP 403'd on r/motogp; a Safari UA does not).
+  - Searches for the `[Watch` thread pattern AND scans `/new.json` for
+    any flair set to LIVE.
+  - Pulls selftext URLs and returns each candidate as an `embed`-type
+    ExtractedStream. The verifier already drives chrome-service for embed
+    streams, so the m3u8 capture happens there.
 """
 
 import asyncio
 import logging
 import re
+import urllib.parse
 from typing import NamedTuple
 
 import httpx
@@ -27,8 +38,8 @@ USER_AGENT = (
     "Version/17.4 Safari/605.1.15"
 )
 
-# Subreddits to scan. old.reddit.com serves the public JSON API anonymously
-# without the auth wall the new site bounces requests off.
+# Subreddits to scan. r/MotorsportsReplays is the main signal; the others
+# rarely have stream posts but cost nothing to skim.
 SUBREDDITS: tuple[str, ...] = (
     "MotorsportsReplays",
     "motorsports",
@@ -36,37 +47,77 @@ SUBREDDITS: tuple[str, ...] = (
     "motogp",
 )
 
-# Reject post URLs we already know don't yield playable streams (Discord
-# invite links, social media, paywalled F1TV, our own host).
-_REJECT_HOSTS = {
-    "discord.gg", "discord.com", "twitter.com", "x.com",
-    "youtube.com", "youtu.be", "instagram.com", "tiktok.com",
-    "f1tv.formula1.com", "viktorbarzin.me",
-}
+# Search queries to fire against r/MotorsportsReplays (the ones below
+# capture the consistent mod-post pattern). Encoded into the JSON
+# search endpoint.
+SEARCH_QUERIES: tuple[str, ...] = (
+    "Watch Download F1 2026",
+    "Watch Download MotoGP 2026",
+    "Watch Online F1 2026",
+    "Watch Online MotoGP 2026",
+)
 
-_LIVE_KEYWORDS = re.compile(r"\b(live|stream|fp1|fp2|fp3|qualifying|race|session|grand prix|gp\b|sprint)\b", re.I)
+# Hosts we accept as "interesting" stream-page URLs. These are the
+# admin-curated WordPress / aggregator sites the community links to.
+# motomundo.net hosts MotoGP; new entries can be added freely.
+_INTERESTING_HOSTS = (
+    "motomundo.net",  # MotoGP
+    "motomundo.top",  # MotoMundo embed host
+    "motomundo.upns.xyz",  # MotoMundo embed host (newer)
+    "freemotorsports.com",  # community curated link list
+    "pitsport.xyz",  # in case a Reddit poster links it
+    "rerace.io",  # F1 archives + live (when up)
+    "dd12streams.com",  # live aggregator
+    "f1mundo.net",  # speculative F1 sister to motomundo
+    "f1.live",
+    "f1live",
+    "skystreams",
+    "raceon",
+    "watchf1",
+)
+
+# URLs we actively never try to scrape (auth-walled, social media,
+# direct downloads with no live stream).
+_REJECT_HOSTS = (
+    "discord.gg", "discord.com",
+    "twitter.com", "x.com",
+    "youtube.com", "youtu.be",
+    "instagram.com", "tiktok.com",
+    "f1tv.formula1.com",
+    "viktorbarzin.me",
+    "gofile.io",
+    "mega.nz", "drive.google.com",
+    "1fichier.com", "rapidgator", "uploaded.net",
+    "magnet:",
+)
+
+_URL_RE = re.compile(r"https?://[^\s\)\]\>\"']+")
 
 
-class _RedditPost(NamedTuple):
+class _Candidate(NamedTuple):
     title: str
     url: str
     subreddit: str
     flair: str
 
 
-def _interesting(post: _RedditPost) -> bool:
-    if not post.url:
+def _is_interesting(url: str) -> bool:
+    low = url.lower()
+    if any(host in low for host in _REJECT_HOSTS):
         return False
-    if any(host in post.url for host in _REJECT_HOSTS):
-        return False
-    if (post.flair or "").lower() in {"live", "live stream", "stream"}:
+    return any(host in low for host in _INTERESTING_HOSTS)
+
+
+def _has_live_marker(post: dict) -> bool:
+    title = (post.get("title") or "").lower()
+    flair = (post.get("link_flair_text") or "").lower()
+    if "[watch" in title or "watch online" in title or "live" in flair:
         return True
-    text = f"{post.title} {post.flair or ''}"
-    return bool(_LIVE_KEYWORDS.search(text))
+    return False
 
 
 class SubredditExtractor(BaseExtractor):
-    """Scan motorsport subreddits for live-stream candidate URLs."""
+    """Scan motorsport subreddits for community-curated live-stream URLs."""
 
     @property
     def site_key(self) -> str:
@@ -77,77 +128,97 @@ class SubredditExtractor(BaseExtractor):
         return "Subreddit"
 
     async def extract(self) -> list[ExtractedStream]:
+        # NB: do NOT send `Accept: application/json` — Reddit's anti-bot
+        # fingerprint flags that header from datacenter IPs and returns
+        # HTTP 403 with HTML. Default Accept (`*/*`) gets through fine
+        # and `.json` URLs always return JSON regardless.
         async with httpx.AsyncClient(
             timeout=15.0,
             follow_redirects=True,
-            headers={"User-Agent": USER_AGENT, "Accept": "application/json"},
+            headers={"User-Agent": USER_AGENT},
         ) as client:
-            tasks = [self._fetch(client, sub) for sub in SUBREDDITS]
+            tasks = [self._fetch_new(client, sub) for sub in SUBREDDITS]
+            tasks.extend(self._search(client, q) for q in SEARCH_QUERIES)
             results = await asyncio.gather(*tasks, return_exceptions=True)
 
-        candidates: list[_RedditPost] = []
+        candidates: list[_Candidate] = []
         for r in results:
             if isinstance(r, Exception):
                 logger.debug("[subreddit] fetch failed: %s", r)
                 continue
             candidates.extend(r)
 
-        # Filter to live-stream posts and dedupe by URL.
+        # Dedupe by URL, keep first occurrence.
         seen: set[str] = set()
-        picks: list[_RedditPost] = []
-        for p in candidates:
-            if not _interesting(p):
+        picks: list[_Candidate] = []
+        for c in candidates:
+            if c.url in seen:
                 continue
-            if p.url in seen:
-                continue
-            seen.add(p.url)
-            picks.append(p)
+            seen.add(c.url)
+            picks.append(c)
         logger.info(
-            "[subreddit] %d post(s) across %d sub(s) — %d live-stream candidate(s)",
-            len(candidates), len(SUBREDDITS), len(picks),
+            "[subreddit] scanned %d source(s) — %d unique candidate URL(s)",
+            len(SUBREDDITS) + len(SEARCH_QUERIES), len(picks),
         )
 
-        # Hand off URL discovery to the existing chrome-service pipeline
-        # via ChromeBrowserExtractor — but in lazy form: we register the
-        # discovered URL as an `embed`-type stream so the verifier visits
-        # it, captures the actual m3u8 via JS, and (if successful) marks
-        # is_live=True. The frontend will iframe it for playback.
         return [
             ExtractedStream(
-                url=p.url,
+                url=c.url,
                 site_key=self.site_key,
-                site_name=f"Subreddit r/{p.subreddit}",
+                site_name=f"r/{c.subreddit}",
                 quality="",
-                title=p.title[:100],
+                title=c.title[:100],
                 stream_type="embed",
-                embed_url=p.url,
+                embed_url=c.url,
             )
-            for p in picks
+            for c in picks
         ]
 
-    async def _fetch(self, client: httpx.AsyncClient, sub: str) -> list[_RedditPost]:
-        url = f"https://old.reddit.com/r/{sub}/new.json?limit=25"
+    async def _fetch_new(self, client: httpx.AsyncClient, sub: str) -> list[_Candidate]:
+        return await self._collect(
+            client,
+            f"https://www.reddit.com/r/{sub}/new.json?limit=25",
+            sub,
+        )
+
+    async def _search(self, client: httpx.AsyncClient, query: str) -> list[_Candidate]:
+        q = urllib.parse.quote_plus(query)
+        return await self._collect(
+            client,
+            f"https://www.reddit.com/r/MotorsportsReplays/search.json?q={q}&restrict_sr=on&sort=new&limit=10",
+            "MotorsportsReplays",
+        )
+
+    async def _collect(
+        self, client: httpx.AsyncClient, url: str, sub: str
+    ) -> list[_Candidate]:
         try:
             resp = await client.get(url)
         except Exception as e:
-            logger.debug("[subreddit] r/%s fetch failed: %s", sub, e)
+            logger.debug("[subreddit] fetch %s failed: %s", url, e)
             return []
         if resp.status_code != 200:
-            logger.debug("[subreddit] r/%s HTTP %d", sub, resp.status_code)
+            logger.debug("[subreddit] %s -> HTTP %d", url, resp.status_code)
             return []
         try:
             data = resp.json()
         except Exception:
             return []
-        posts: list[_RedditPost] = []
+        out: list[_Candidate] = []
         for child in (data.get("data", {}) or {}).get("children", []):
             d = child.get("data", {}) or {}
-            posts.append(
-                _RedditPost(
-                    title=d.get("title", "") or "",
-                    url=d.get("url", "") or "",
-                    subreddit=sub,
-                    flair=d.get("link_flair_text", "") or "",
-                )
-            )
-        return posts
+            if not _has_live_marker(d):
+                continue
+            text = (d.get("selftext") or "")
+            title = d.get("title") or ""
+            flair = d.get("link_flair_text") or ""
+            # First, the linked URL itself (if it's a recognised live site).
+            top = d.get("url") or ""
+            if top and _is_interesting(top):
+                out.append(_Candidate(title, top, sub, flair))
+            # Then any URL embedded in the selftext that points at a
+            # community-curated live page.
+            for u in _URL_RE.findall(text):
+                if _is_interesting(u):
+                    out.append(_Candidate(title, u, sub, flair))
+        return out
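
`_is_interesting` matches host names as substrings of the whole URL, so an allowlisted name appearing in a path or query string also passes. A stricter variant, not part of this commit, could compare against the parsed hostname instead. The sketch below assumes that behaviour is wanted; `ALLOWED`, `REJECTED`, `host_matches`, and `is_interesting_strict` are hypothetical names, and the tuples are abbreviated.

```python
from urllib.parse import urlsplit

# Abbreviated, illustrative lists; full domains only.
ALLOWED = ("motomundo.net", "motomundo.top", "dd12streams.com")
REJECTED = ("x.com", "youtube.com")


def host_matches(host: str, domains: tuple[str, ...]) -> bool:
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def is_interesting_strict(url: str) -> bool:
    """Allowlist check on the parsed hostname rather than the raw URL text."""
    host = (urlsplit(url).hostname or "").lower()
    if host_matches(host, REJECTED):
        return False
    return host_matches(host, ALLOWED)


print(is_interesting_strict("https://www.motomundo.net/motogp/"))
# True
print(is_interesting_strict("https://example.org/redirect?to=motomundo.net"))
# False
```

The trade-off is that bare fragments in the current allowlist (`f1live`, `skystreams`, `raceon`, `watchf1`) are not full domains, so they would need either full host names or a separate substring rule to keep working.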