infra/stacks/f1-stream/files/.planning/phases/01-scraper-validation/01-01-PLAN.md
Viktor Barzin c7c7047f1c [ci skip] Flatten module wrappers into stack roots
2026-02-22 15:13:55 +00:00

---
phase: 01-scraper-validation
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - internal/scraper/validate.go
  - internal/scraper/validate_test.go
  - internal/scraper/scraper.go
  - main.go
autonomous: true
requirements:
  - SCRP-01
  - SCRP-02
  - SCRP-03
  - SCRP-04
must_haves:
  truths:
    - "Scraper still discovers F1-related posts using keyword filtering (existing isF1Post behavior unchanged)"
    - "Each newly scraped URL is fetched and inspected for video/player content markers before being saved to scraped_links.json"
    - "URLs without video content markers are discarded and do not appear in scraped_links.json"
    - "Validation uses a configurable timeout (SCRAPER_VALIDATE_TIMEOUT env var, default 10s) that prevents slow sites from blocking the scrape cycle"
  artifacts:
    - path: "internal/scraper/validate.go"
      provides: "URL validation logic with video marker detection"
      contains: "func validateLinks"
    - path: "internal/scraper/validate_test.go"
      provides: "Unit tests for marker detection and content type checks"
      contains: "func TestContainsVideoMarkers"
    - path: "internal/scraper/scraper.go"
      provides: "Updated Scraper struct with validateTimeout field and validation call in scrape()"
      contains: "validateTimeout"
    - path: "main.go"
      provides: "SCRAPER_VALIDATE_TIMEOUT env var configuration"
      contains: "SCRAPER_VALIDATE_TIMEOUT"
  key_links:
    - from: "internal/scraper/scraper.go"
      to: "internal/scraper/validate.go"
      via: "validateLinks call in scrape() between URL extraction and merge"
      pattern: "validateLinks(links"
    - from: "main.go"
      to: "internal/scraper/scraper.go"
      via: "scraper.New() call with validateTimeout parameter"
      pattern: "scraper.New(st.*validateTimeout"
---

Add URL validation to the scraper pipeline so that each extracted URL is proxy-fetched and inspected for video/player content markers before being saved. URLs without video markers are discarded at the source.

Purpose: Eliminate junk links (blog posts, news articles, social media) from scraped results so users only see actual stream pages.

Output: Working validation step integrated into the scraper pipeline, with unit tests and a configurable timeout.

<execution_context> @/Users/viktorbarzin/.claude/get-shit-done/workflows/execute-plan.md @/Users/viktorbarzin/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @.planning/phases/01-scraper-validation/01-RESEARCH.md

@internal/scraper/scraper.go @internal/scraper/reddit.go @internal/models/models.go @main.go

Task 1: Create validate.go with video marker detection

Files: internal/scraper/validate.go

Create `internal/scraper/validate.go` in package `scraper` with the following:
  1. Define videoMarkers string slice (case-insensitive markers checked against lowercased HTML body):

    • HTML5: <video
    • HLS: .m3u8, application/x-mpegurl, application/vnd.apple.mpegurl
    • DASH: .mpd, application/dash+xml
    • Player libraries: hls.js, hls.min.js, dash.js, dash.all.min.js, video.js, video.min.js, videojs, jwplayer, clappr, flowplayer, plyr, shaka-player, mediaelement, fluidplayer
  2. Define videoContentTypes string slice for direct video Content-Type detection:

    • video/, application/x-mpegurl, application/vnd.apple.mpegurl, application/dash+xml
  3. Define validateBodyLimit = 2 * 1024 * 1024 (2MB).

  4. Implement validateLinks(links []models.ScrapedLink, timeout time.Duration) []models.ScrapedLink:

    • Create http.Client with the given timeout and a CheckRedirect function that stops after 3 redirects (return http.ErrUseLastResponse when len(via) >= 3).
    • Iterate over links. For each, call hasVideoContent(client, link.URL). If true, keep the link. If false, log: scraper: discarded %s (no video markers) using truncate(link.URL, 60).
    • Return the filtered slice.
  5. Implement hasVideoContent(client *http.Client, rawURL string) bool:

    • Create GET request with User-Agent set to the existing userAgent constant from reddit.go.
    • Execute request. On error, log and return false. On success, defer resp.Body.Close() so the connection is released.
    • Check status code: if < 200 or >= 400, return false.
    • Check the Content-Type header with isDirectVideoContentType (step 7), which lowercases it and matches it against videoContentTypes -- if it matches, return true (it's a direct video file).
    • If Content-Type is not text/html or application/xhtml, return false (not a video file, not an HTML page to inspect).
    • Read body with io.LimitReader (2MB limit).
    • Return containsVideoMarkers(strings.ToLower(string(body))).
  6. Implement containsVideoMarkers(loweredBody string) bool:

    • Iterate over videoMarkers, return true on first strings.Contains match.
  7. Implement isDirectVideoContentType(ct string) bool:

    • Lowercase ct, iterate videoContentTypes, return true on first strings.Contains match.
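
The two pure matching helpers from steps 1, 2, 6 and 7 can be sketched as below. This is a minimal illustration rather than the final file: the marker and Content-Type lists are copied from the steps above, while `package main` and the `main` function are only there so the snippet runs on its own (the real file would live in package `scraper`).

```go
package main

import (
	"fmt"
	"strings"
)

// videoMarkers lists substrings that suggest a page embeds a video player.
// All entries are lowercase because they are matched against a lowercased body.
var videoMarkers = []string{
	"<video",
	".m3u8", "application/x-mpegurl", "application/vnd.apple.mpegurl",
	".mpd", "application/dash+xml",
	"hls.js", "hls.min.js", "dash.js", "dash.all.min.js",
	"video.js", "video.min.js", "videojs", "jwplayer", "clappr",
	"flowplayer", "plyr", "shaka-player", "mediaelement", "fluidplayer",
}

// videoContentTypes lists Content-Type fragments that indicate a direct video response.
var videoContentTypes = []string{
	"video/", "application/x-mpegurl", "application/vnd.apple.mpegurl", "application/dash+xml",
}

// containsVideoMarkers expects an already-lowercased HTML body.
func containsVideoMarkers(loweredBody string) bool {
	for _, marker := range videoMarkers {
		if strings.Contains(loweredBody, marker) {
			return true
		}
	}
	return false
}

// isDirectVideoContentType lowercases ct itself, so parameters and mixed case are tolerated.
func isDirectVideoContentType(ct string) bool {
	ct = strings.ToLower(ct)
	for _, t := range videoContentTypes {
		if strings.Contains(ct, t) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(containsVideoMarkers(`<script src="/js/hls.min.js"></script>`)) // true
	fmt.Println(containsVideoMarkers(`<p>hello world</p>`))                      // false
	fmt.Println(isDirectVideoContentType("Video/MP4; charset=utf-8"))             // true
}
```

Substring matching keeps the check cheap and dependency-free, at the cost of occasional false positives (for example, a blog post that merely mentions `.m3u8` in its text).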

Imports needed: io, log, net/http, strings, time, and f1-stream/internal/models.

Do NOT use golang.org/x/net/html (reserved for Phase 4). This is detection, not extraction -- string matching is sufficient.

Verify: Run `cd /Users/viktorbarzin/code/infra/modules/kubernetes/f1-stream/files && go build ./...` -- must compile without errors.

Done: validate.go exists with validateLinks, hasVideoContent, containsVideoMarkers, isDirectVideoContentType functions. The file compiles as part of the scraper package.

Task 2: Wire validation into scraper pipeline and add config

Files: internal/scraper/scraper.go, main.go

In internal/scraper/scraper.go:
  1. Add validateTimeout time.Duration field to the Scraper struct.

  2. Update New() signature to accept validateTimeout parameter:

    func New(s *store.Store, interval time.Duration, validateTimeout time.Duration) *Scraper {
        return &Scraper{store: s, interval: interval, validateTimeout: validateTimeout}
    }
    
  3. In scrape() method, add validation step between the scrapeReddit() return and the merge-with-existing logic. Insert AFTER the line log.Printf("scraper: reddit scrape completed in %v, got %d links", ...) and BEFORE the line existing, err := s.store.LoadScrapedLinks():

    // Validate links - only keep those with video content markers
    if len(links) > 0 {
        validated := validateLinks(links, s.validateTimeout)
        log.Printf("scraper: validated %d/%d links as streams", len(validated), len(links))
        links = validated
    }
    

    This preserves SCRP-01: existing keyword filtering in scrapeReddit() via isF1Post() runs first, then validation filters the results.

In main.go:

  1. Add validateTimeout env var read after the existing scrapeInterval line:

    validateTimeout := envDuration("SCRAPER_VALIDATE_TIMEOUT", 10*time.Second)
    
  2. Update the scraper.New() call to pass the new parameter:

    sc := scraper.New(st, scrapeInterval, validateTimeout)
    

Both changes are minimal and follow the existing configuration pattern used for SCRAPE_INTERVAL, PROXY_TIMEOUT, etc.

Verify: Run `cd /Users/viktorbarzin/code/infra/modules/kubernetes/f1-stream/files && go build ./...` -- must compile without errors. Then run `go vet ./...` -- no issues.

Done: Scraper struct has validateTimeout field. New() accepts 3 parameters. scrape() calls validateLinks between extraction and merge. main.go reads SCRAPER_VALIDATE_TIMEOUT env var (default 10s) and passes it to scraper.New().

Task 3: Add unit tests for validation functions

Files: internal/scraper/validate_test.go

Create `internal/scraper/validate_test.go` in package `scraper` with the following test functions:

TestContainsVideoMarkers - table-driven test covering:

  • Positive cases (should return true):

    • <video> tag: <div><video src="stream.mp4"></video></div>
    • HLS manifest ref: var url = "https://cdn.example.com/live.m3u8";
    • DASH manifest ref: <source src="stream.mpd" type="application/dash+xml">
    • HLS.js library: <script src="/js/hls.min.js"></script>
    • Video.js library: <script src="https://cdn.example.com/video.js"></script>
    • JW Player: <div id="jwplayer-container"></div><script>jwplayer("jwplayer-container")</script>
    • Clappr: <script src="clappr.min.js"></script>
    • Flowplayer: <script>flowplayer("#player")</script>
  • Negative cases (should return false):

    • Plain HTML: <html><body><p>Hello world</p></body></html>
    • Reddit link page: <html><body><a href="https://example.com">Click here</a></body></html>
    • Blog post: <html><body><article>F1 race results and analysis...</article></body></html>
    • Empty string: ``

Each test calls containsVideoMarkers(body) and asserts the result. In real usage the input is already lowercased, so the test bodies should be lowercase too.

TestIsDirectVideoContentType - table-driven test covering:

  • Positive: video/mp4, video/webm, application/x-mpegurl, application/vnd.apple.mpegurl, application/dash+xml, video/mp4; charset=utf-8 (with params)
  • Negative: text/html, application/json, image/png, text/plain, empty string

Each test calls isDirectVideoContentType(ct) and asserts the result.

Use the standard testing package only. Follow existing test conventions in the codebase (table-driven tests with t.Run).

Verify: Run `cd /Users/viktorbarzin/code/infra/modules/kubernetes/f1-stream/files && go test ./internal/scraper/ -v -run "TestContainsVideoMarkers|TestIsDirectVideoContentType"` -- all tests pass.

Done: validate_test.go exists with TestContainsVideoMarkers (8+ positive, 4+ negative cases) and TestIsDirectVideoContentType (6+ positive, 5+ negative cases). All tests pass.

Verification:

1. `go build ./...` compiles without errors
2. `go vet ./...` reports no issues
3. `go test ./internal/scraper/ -v` -- all tests pass
4. Verify `validate.go` contains the `videoMarkers` slice with at least 15 markers
5. Verify `scraper.go:scrape()` calls `validateLinks` between `scrapeReddit()` return and `LoadScrapedLinks()`
6. Verify `main.go` reads `SCRAPER_VALIDATE_TIMEOUT` with default `10*time.Second`
7. Verify the existing `isF1Post` keyword filtering in `scrapeReddit()` is untouched (SCRP-01)

<success_criteria>

  • The scraper pipeline compiles and all tests pass
  • validateLinks is called in scrape() after URL extraction but before merge, filtering out URLs without video markers
  • The validation timeout is configurable via SCRAPER_VALIDATE_TIMEOUT env var (default 10s)
  • Existing F1 keyword filtering behavior is preserved unchanged
  • No new external dependencies are introduced (stdlib only)

</success_criteria>

After completion, create `.planning/phases/01-scraper-validation/01-01-SUMMARY.md`