[ci skip] Flatten module wrappers into stack roots
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.
- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure
Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
parent b0499a7f31
commit c7c7047f1c
245 changed files with 11733 additions and 12432 deletions

@@ -0,0 +1,237 @@
---
phase: 01-scraper-validation
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - internal/scraper/validate.go
  - internal/scraper/validate_test.go
  - internal/scraper/scraper.go
  - main.go
autonomous: true
requirements:
  - SCRP-01
  - SCRP-02
  - SCRP-03
  - SCRP-04

must_haves:
  truths:
    - "Scraper still discovers F1-related posts using keyword filtering (existing isF1Post behavior unchanged)"
    - "Each newly scraped URL is fetched and inspected for video/player content markers before being saved to scraped_links.json"
    - "URLs without video content markers are discarded and do not appear in scraped_links.json"
    - "Validation uses a configurable timeout (SCRAPER_VALIDATE_TIMEOUT env var, default 10s) that prevents slow sites from blocking the scrape cycle"
  artifacts:
    - path: "internal/scraper/validate.go"
      provides: "URL validation logic with video marker detection"
      contains: "func validateLinks"
    - path: "internal/scraper/validate_test.go"
      provides: "Unit tests for marker detection and content type checks"
      contains: "func TestContainsVideoMarkers"
    - path: "internal/scraper/scraper.go"
      provides: "Updated Scraper struct with validateTimeout field and validation call in scrape()"
      contains: "validateTimeout"
    - path: "main.go"
      provides: "SCRAPER_VALIDATE_TIMEOUT env var configuration"
      contains: "SCRAPER_VALIDATE_TIMEOUT"
  key_links:
    - from: "internal/scraper/scraper.go"
      to: "internal/scraper/validate.go"
      via: "validateLinks call in scrape() between URL extraction and merge"
      pattern: "validateLinks\\(links"
    - from: "main.go"
      to: "internal/scraper/scraper.go"
      via: "scraper.New() call with validateTimeout parameter"
      pattern: "scraper\\.New\\(st.*validateTimeout"
---

<objective>
Add URL validation to the scraper pipeline so that each extracted URL is proxy-fetched and inspected for video/player content markers before being saved. URLs without video markers are discarded at the source.

Purpose: Eliminate junk links (blog posts, news articles, social media) from scraped results so users only see actual stream pages.
Output: Working validation step integrated into scraper pipeline, with unit tests and configurable timeout.
</objective>

<execution_context>
@/Users/viktorbarzin/.claude/get-shit-done/workflows/execute-plan.md
@/Users/viktorbarzin/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/01-scraper-validation/01-RESEARCH.md

@internal/scraper/scraper.go
@internal/scraper/reddit.go
@internal/models/models.go
@main.go
</context>

<tasks>

<task type="auto">
<name>Task 1: Create validate.go with video marker detection</name>
<files>internal/scraper/validate.go</files>
<action>
Create `internal/scraper/validate.go` in package `scraper` with the following:

1. Define `videoMarkers` string slice (case-insensitive markers checked against lowercased HTML body):
   - HTML5: `<video`
   - HLS: `.m3u8`, `application/x-mpegurl`, `application/vnd.apple.mpegurl`
   - DASH: `.mpd`, `application/dash+xml`
   - Player libraries: `hls.js`, `hls.min.js`, `dash.js`, `dash.all.min.js`, `video.js`, `video.min.js`, `videojs`, `jwplayer`, `clappr`, `flowplayer`, `plyr`, `shaka-player`, `mediaelement`, `fluidplayer`

2. Define `videoContentTypes` string slice for direct video Content-Type detection:
   - `video/`, `application/x-mpegurl`, `application/vnd.apple.mpegurl`, `application/dash+xml`

3. Define `validateBodyLimit = 2 * 1024 * 1024` (2MB).

4. Implement `validateLinks(links []models.ScrapedLink, timeout time.Duration) []models.ScrapedLink`:
   - Create `http.Client` with the given timeout and a `CheckRedirect` function that stops after 3 redirects (return `http.ErrUseLastResponse` when `len(via) >= 3`).
   - Iterate over links. For each, call `hasVideoContent(client, link.URL)`. If true, keep the link. If false, log: `scraper: discarded %s (no video markers)` using `truncate(link.URL, 60)`.
   - Return the filtered slice.

5. Implement `hasVideoContent(client *http.Client, rawURL string) bool`:
   - Create GET request with `User-Agent` set to the existing `userAgent` constant from `reddit.go`.
   - Execute request. On error, log and return false.
   - Check status code: if < 200 or >= 400, return false.
   - Check Content-Type header (lowercased) against `videoContentTypes` -- if match, return true (it's a direct video file).
   - If Content-Type is not `text/html` or `application/xhtml`, return false (not a video file, not an HTML page to inspect).
   - Read body with `io.LimitReader` (2MB limit).
   - Return `containsVideoMarkers(strings.ToLower(string(body)))`.

6. Implement `containsVideoMarkers(loweredBody string) bool`:
   - Iterate over `videoMarkers`, return true on first `strings.Contains` match.

7. Implement `isDirectVideoContentType(ct string) bool`:
   - Lowercase ct, iterate `videoContentTypes`, return true on first `strings.Contains` match.

Imports needed: `io`, `log`, `net/http`, `strings`, `time`, and `f1-stream/internal/models`.

Do NOT use `golang.org/x/net/html` (reserved for Phase 4). This is detection, not extraction -- string matching is sufficient.
</action>
<verify>
Run `cd /Users/viktorbarzin/code/infra/modules/kubernetes/f1-stream/files && go build ./...` -- must compile without errors.
</verify>
<done>
`validate.go` exists with `validateLinks`, `hasVideoContent`, `containsVideoMarkers`, `isDirectVideoContentType` functions. The file compiles as part of the `scraper` package.
</done>
</task>

<task type="auto">
<name>Task 2: Wire validation into scraper pipeline and add config</name>
<files>internal/scraper/scraper.go, main.go</files>
<action>
**In `internal/scraper/scraper.go`:**

1. Add `validateTimeout time.Duration` field to the `Scraper` struct.

2. Update `New()` signature to accept `validateTimeout` parameter:
   ```go
   func New(s *store.Store, interval time.Duration, validateTimeout time.Duration) *Scraper {
   	return &Scraper{store: s, interval: interval, validateTimeout: validateTimeout}
   }
   ```

3. In `scrape()` method, add validation step between the `scrapeReddit()` return and the merge-with-existing logic. Insert AFTER the line `log.Printf("scraper: reddit scrape completed in %v, got %d links", ...)` and BEFORE the line `existing, err := s.store.LoadScrapedLinks()`:

   ```go
   // Validate links - only keep those with video content markers
   if len(links) > 0 {
   	validated := validateLinks(links, s.validateTimeout)
   	log.Printf("scraper: validated %d/%d links as streams", len(validated), len(links))
   	links = validated
   }
   ```

This preserves SCRP-01: existing keyword filtering in `scrapeReddit()` via `isF1Post()` runs first, then validation filters the results.

**In `main.go`:**

1. Add `validateTimeout` env var read after the existing `scrapeInterval` line:
   ```go
   validateTimeout := envDuration("SCRAPER_VALIDATE_TIMEOUT", 10*time.Second)
   ```

2. Update the `scraper.New()` call to pass the new parameter:
   ```go
   sc := scraper.New(st, scrapeInterval, validateTimeout)
   ```

Both changes are minimal and follow the existing configuration pattern used for `SCRAPE_INTERVAL`, `PROXY_TIMEOUT`, etc.
</action>
<verify>
Run `cd /Users/viktorbarzin/code/infra/modules/kubernetes/f1-stream/files && go build ./...` -- must compile without errors. Then run `go vet ./...` -- no issues.
</verify>
<done>
`Scraper` struct has `validateTimeout` field. `New()` accepts 3 parameters. `scrape()` calls `validateLinks` between extraction and merge. `main.go` reads `SCRAPER_VALIDATE_TIMEOUT` env var (default 10s) and passes it to `scraper.New()`.
</done>
</task>

<task type="auto">
<name>Task 3: Add unit tests for validation functions</name>
<files>internal/scraper/validate_test.go</files>
<action>
Create `internal/scraper/validate_test.go` in package `scraper` with the following test functions:

**`TestContainsVideoMarkers`** - table-driven test covering:
- Positive cases (should return true):
  - `<video>` tag: `<div><video src="stream.mp4"></video></div>`
  - HLS manifest ref: `var url = "https://cdn.example.com/live.m3u8";`
  - DASH manifest ref: `<source src="stream.mpd" type="application/dash+xml">`
  - HLS.js library: `<script src="/js/hls.min.js"></script>`
  - Video.js library: `<script src="https://cdn.example.com/video.js"></script>`
  - JW Player: `<div id="jwplayer-container"></div><script>jwplayer("jwplayer-container")</script>`
  - Clappr: `<script src="clappr.min.js"></script>`
  - Flowplayer: `<script>flowplayer("#player")</script>`

- Negative cases (should return false):
  - Plain HTML: `<html><body><p>Hello world</p></body></html>`
  - Reddit link page: `<html><body><a href="https://example.com">Click here</a></body></html>`
  - Blog post: `<html><body><article>F1 race results and analysis...</article></body></html>`
  - Empty string: ``

Each test calls `containsVideoMarkers(body)` (input is already lowercased in real usage, but test with lowercase content to match real behavior) and asserts the result.

**`TestIsDirectVideoContentType`** - table-driven test covering:
- Positive: `video/mp4`, `video/webm`, `application/x-mpegurl`, `application/vnd.apple.mpegurl`, `application/dash+xml`, `video/mp4; charset=utf-8` (with params)
- Negative: `text/html`, `application/json`, `image/png`, `text/plain`, empty string

Each test calls `isDirectVideoContentType(ct)` and asserts the result.

Use standard `testing` package only. Follow existing test conventions in the codebase (table-driven tests with `t.Run`).
</action>
<verify>
Run `cd /Users/viktorbarzin/code/infra/modules/kubernetes/f1-stream/files && go test ./internal/scraper/ -v -run "TestContainsVideoMarkers|TestIsDirectVideoContentType"` -- all tests pass.
</verify>
<done>
`validate_test.go` exists with `TestContainsVideoMarkers` (8+ positive, 4+ negative cases) and `TestIsDirectVideoContentType` (6+ positive, 5+ negative cases). All tests pass.
</done>
</task>

</tasks>

<verification>
1. `go build ./...` compiles without errors
2. `go vet ./...` reports no issues
3. `go test ./internal/scraper/ -v` -- all tests pass
4. Verify `validate.go` contains the `videoMarkers` slice with at least 15 markers
5. Verify `scraper.go:scrape()` calls `validateLinks` between `scrapeReddit()` return and `LoadScrapedLinks()`
6. Verify `main.go` reads `SCRAPER_VALIDATE_TIMEOUT` with default `10*time.Second`
7. Verify the existing `isF1Post` keyword filtering in `scrapeReddit()` is untouched (SCRP-01)
</verification>

<success_criteria>
- The scraper pipeline compiles and all tests pass
- `validateLinks` is called in `scrape()` after URL extraction but before merge, filtering out URLs without video markers
- The validation timeout is configurable via `SCRAPER_VALIDATE_TIMEOUT` env var (default 10s)
- Existing F1 keyword filtering behavior is preserved unchanged
- No new external dependencies are introduced (stdlib only)
</success_criteria>

<output>
After completion, create `.planning/phases/01-scraper-validation/01-01-SUMMARY.md`
</output>
@@ -0,0 +1,107 @@
---
phase: 01-scraper-validation
plan: 01
subsystem: scraper
tags: [go, http, video-detection, content-validation, streaming]

# Dependency graph
requires: []
provides:
  - "URL validation pipeline with video marker detection (validateLinks)"
  - "Configurable validation timeout via SCRAPER_VALIDATE_TIMEOUT env var"
  - "Video content type and HTML marker detection functions"
affects: [02-health-checks, 04-link-extraction]

# Tech tracking
tech-stack:
  added: []
  patterns:
    - "Pipeline filter pattern: scrapeReddit -> validateLinks -> merge"
    - "String-match video detection (no DOM parsing) for Phase 1 speed"
    - "2MB body limit for HTML inspection to prevent memory issues"

key-files:
  created:
    - internal/scraper/validate.go
    - internal/scraper/validate_test.go
  modified:
    - internal/scraper/scraper.go
    - main.go

key-decisions:
  - "String matching over DOM parsing for video detection (DOM reserved for Phase 4)"
  - "2MB body limit to prevent memory issues on large pages"
  - "3 redirect limit to avoid infinite redirect chains"

patterns-established:
  - "Pipeline filter: validate scraped links before merge into store"
  - "Env var config pattern: envDuration for timeout configuration"

requirements-completed: [SCRP-01, SCRP-02, SCRP-03, SCRP-04]

# Metrics
duration: 3min
completed: 2026-02-17
---
# Phase 1 Plan 1: Scraper Validation Summary

**URL validation pipeline with 18 video/player markers filtering scraped links before store merge, configurable via SCRAPER_VALIDATE_TIMEOUT**

## Performance

- **Duration:** 3 min
- **Started:** 2026-02-17T20:49:16Z
- **Completed:** 2026-02-17T20:51:54Z
- **Tasks:** 3
- **Files modified:** 4

## Accomplishments
- Created validate.go with 18 video/player markers covering HTML5, HLS, DASH, and 10+ player libraries
- Wired validateLinks into scrape() pipeline between URL extraction and store merge
- Added SCRAPER_VALIDATE_TIMEOUT env var (default 10s) following existing config patterns
- Added 25 unit tests (10 positive + 4 negative marker tests, 6 positive + 5 negative content type tests)

## Task Commits

Each task was committed atomically:

1. **Task 1: Create validate.go with video marker detection** - `adeb478` (feat)
2. **Task 2: Wire validation into scraper pipeline and add config** - `22d29db` (feat)
3. **Task 3: Add unit tests for validation functions** - `6c5cc02` (test)

## Files Created/Modified
- `internal/scraper/validate.go` - URL validation with video marker detection (validateLinks, hasVideoContent, containsVideoMarkers, isDirectVideoContentType)
- `internal/scraper/validate_test.go` - Table-driven unit tests for marker detection and content type checks (25 cases)
- `internal/scraper/scraper.go` - Added validateTimeout field and validateLinks call in scrape()
- `main.go` - Added SCRAPER_VALIDATE_TIMEOUT env var read (default 10s)

## Decisions Made
- Used string matching (not DOM parsing) for video detection -- DOM parsing reserved for Phase 4 link extraction
- Set 2MB body read limit to prevent memory issues on large streaming pages
- Limited redirects to 3 to avoid infinite redirect chains on sketchy stream sites
- Validation runs sequentially (not concurrent) to avoid overwhelming target sites

## Deviations from Plan

None - plan executed exactly as written.

## Issues Encountered
None.

## User Setup Required

None - no external service configuration required.

## Next Phase Readiness
- Validation pipeline is integrated and tested, ready for health check layer (Phase 2)
- The validateLinks function provides the filtering foundation that health checks will build upon
- No blockers or concerns

## Self-Check: PASSED

All 5 files verified present. All 3 task commits verified in git log.

---
*Phase: 01-scraper-validation*
*Completed: 2026-02-17*
@@ -0,0 +1,599 @@
# Phase 1: Scraper Validation - Research

**Researched:** 2026-02-17
**Domain:** HTTP content fetching and HTML video/player content detection in Go
**Confidence:** HIGH

## Summary

Phase 1 adds a validation step to the existing Reddit scraper pipeline. Currently, the scraper extracts ALL URLs from F1-related Reddit posts and saves them to `scraped_links.json` without verifying whether they point to actual stream pages. The validation step will proxy-fetch each extracted URL (reusing the existing proxy's HTTP client pattern) and inspect the HTML response for video/player content markers before saving.

The implementation is straightforward because the codebase already has all the infrastructure needed: HTTP fetching with timeouts (used in both `internal/scraper/reddit.go` and `internal/proxy/proxy.go`), URL validation, and the scraper pipeline with deduplication. The new code is a validation function inserted between URL extraction and saving, operating on the same `[]models.ScrapedLink` type.

**Primary recommendation:** Add a `validateStreamURL` function in a new file `internal/scraper/validate.go` that uses string-based content matching (not full HTML parsing) to detect video markers, with `golang.org/x/net/html` reserved for Phase 4 (video extraction). Keep it simple: fetch the page, lowercase the body, check for known patterns. This avoids adding a dependency in Phase 1, and Phase 2 can reuse the same validation logic for health checks.

<phase_requirements>
## Phase Requirements

| ID | Description | Research Support |
|----|-------------|-----------------|
| SCRP-01 | Scraper filters Reddit posts by F1 keywords before extracting URLs (existing behavior, preserve) | Existing `isF1Post()` function in `reddit.go` lines 272-285 handles this. No changes needed -- just ensure the validation step is added AFTER URL extraction, not replacing the keyword filter. |
| SCRP-02 | Scraper validates each extracted URL by proxy-fetching it and checking for video/player content markers | New `validateStreamURL()` function fetches URL with configurable timeout, reads response body, checks for video content markers (see "Video Content Markers" section below for complete list). Reuse existing HTTP client pattern from `reddit.go:88`. |
| SCRP-03 | URLs that don't look like streams (no video markers detected) are discarded before saving | Filter applied in `scraper.go:scrape()` between URL extraction (line 57) and merge/save (line 60). Only URLs passing validation are included in the `links` slice passed to the merge step. |
| SCRP-04 | Validation has a configurable timeout (default 10s) to avoid blocking on slow sites | Add `SCRAPER_VALIDATE_TIMEOUT` environment variable read in `main.go`, passed to `scraper.New()`. Use `context.WithTimeout` on per-URL fetch to enforce deadline. Default 10 seconds. |
</phase_requirements>
## Standard Stack

### Core

| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| `net/http` | stdlib | HTTP client for fetching URLs | Already used throughout codebase (`reddit.go`, `proxy.go`). No external dependency needed. |
| `strings` | stdlib | Case-insensitive string matching for content markers | Already used extensively. `strings.Contains` on lowercased body is the simplest approach for marker detection. |
| `regexp` | stdlib | Pattern matching for HLS/DASH URLs in page source | Already used in `reddit.go` for URL extraction. Needed for matching `.m3u8` and `.mpd` URL patterns in HTML content. |
| `context` | stdlib | Timeout enforcement per URL validation | Already used in scraper (`scraper.go:Run`). `context.WithTimeout` provides per-request deadline. |
| `io` | stdlib | `io.LimitReader` for response body size limiting | Already used in `proxy.go` and `reddit.go` for body size limits. |

### Supporting

| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| `golang.org/x/net/html` | latest | Full HTML DOM parsing | NOT needed for Phase 1. Reserve for Phase 4 (video source extraction). String matching is sufficient for detection. |
| `sync` | stdlib | WaitGroup for parallel validation | If parallel validation is desired. But sequential is simpler and respects rate limits of target sites. |

### Alternatives Considered

| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| String matching on body | `golang.org/x/net/html` DOM parsing | DOM parsing is more accurate but adds a dependency and complexity. For Phase 1 (detection, not extraction), string matching is sufficient. Phase 4 needs DOM parsing for actual source extraction. |
| Sequential URL validation | `sync.WaitGroup` parallel validation | Parallel is faster but risks triggering rate limits on target sites and complicates error handling. Sequential with timeout is simpler and predictable. |
| Custom HTTP client | Reuse proxy's `*http.Client` | The proxy client has redirect limits and timeout already configured. But the scraper should have its own client with validation-specific timeout. Keep them independent. |

**Installation:**
```bash
# No new dependencies needed for Phase 1. All stdlib.
# golang.org/x/net/html deferred to Phase 4.
```

## Architecture Patterns

### Where Validation Fits in the Pipeline

```
Current flow:
  scrapeReddit() -> []models.ScrapedLink -> merge with existing -> save

New flow:
  scrapeReddit() -> []models.ScrapedLink -> validateLinks() -> []models.ScrapedLink -> merge with existing -> save
```

The validation step is a filter function that takes a slice of scraped links and returns only those that pass validation. This keeps the existing pipeline intact and makes the validation step independently testable.

### Recommended File Structure

```
internal/scraper/
  scraper.go    # Orchestrator (existing, add validateTimeout field + call validateLinks)
  reddit.go     # Reddit API scraping (existing, no changes)
  validate.go   # NEW: validateStreamURL(), validateLinks(), content marker definitions
```

### Pattern 1: Validation as a Filter Function

**What:** A pure filter function that takes `[]models.ScrapedLink` and returns the subset that pass validation.
**When to use:** When adding a validation/filter step to an existing pipeline.
**Example:**

```go
// internal/scraper/validate.go

// validateLinks filters links to only those with video content markers.
// Each URL is fetched with the given timeout and inspected for markers.
func validateLinks(links []models.ScrapedLink, timeout time.Duration) []models.ScrapedLink {
	client := &http.Client{Timeout: timeout}
	var valid []models.ScrapedLink
	for _, link := range links {
		if hasVideoContent(client, link.URL) {
			valid = append(valid, link)
		} else {
			log.Printf("scraper: discarded %s (no video markers)", truncate(link.URL, 60))
		}
	}
	return valid
}

// hasVideoContent fetches a URL and checks for video/player content markers.
func hasVideoContent(client *http.Client, rawURL string) bool {
	req, err := http.NewRequest("GET", rawURL, nil)
	if err != nil {
		return false
	}
	req.Header.Set("User-Agent", userAgent) // reuse existing constant

	resp, err := client.Do(req)
	if err != nil {
		log.Printf("scraper: validate fetch error for %s: %v", truncate(rawURL, 60), err)
		return false
	}
	defer resp.Body.Close()

	// Reject error responses before inspecting headers or body.
	if resp.StatusCode < 200 || resp.StatusCode >= 400 {
		return false
	}

	// Only inspect HTML responses
	ct := strings.ToLower(resp.Header.Get("Content-Type"))
	if !strings.Contains(ct, "text/html") && !strings.Contains(ct, "application/xhtml") {
		// Could be a direct video file (.m3u8, .mpd, .mp4) which is valid
		if isDirectVideoContentType(ct) {
			return true
		}
		return false
	}

	body, err := io.ReadAll(io.LimitReader(resp.Body, 2*1024*1024)) // 2MB limit for validation
	if err != nil {
		return false
	}

	return containsVideoMarkers(strings.ToLower(string(body)))
}
```
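
The plan for `validateLinks` also caps redirects at 3 through `CheckRedirect`, which the client in the example above does not set. A minimal sketch of such a client is shown here; `newValidationClient` and `stopsAfter` are hypothetical names used only for illustration and are not taken from the codebase.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newValidationClient is a hypothetical helper (not from the codebase).
// It returns a client whose redirect policy stops following once
// maxRedirects requests have already been made, handing back the last
// response instead of an error.
func newValidationClient(timeout time.Duration, maxRedirects int) *http.Client {
	return &http.Client{
		Timeout: timeout,
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if len(via) >= maxRedirects {
				return http.ErrUseLastResponse
			}
			return nil
		},
	}
}

// stopsAfter reports whether the client's policy refuses to follow a
// redirect when n earlier requests are already in the chain.
func stopsAfter(c *http.Client, n int) bool {
	return c.CheckRedirect(nil, make([]*http.Request, n)) == http.ErrUseLastResponse
}

func main() {
	c := newValidationClient(10*time.Second, 3)
	fmt.Println(stopsAfter(c, 2), stopsAfter(c, 3)) // prints: false true
}
```

Because `http.ErrUseLastResponse` makes the client return the most recent 3xx response with a nil error, a status check of the form `< 200 || >= 400` still passes that response on to the Content-Type inspection.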

### Pattern 2: Configuration Via Struct Field

**What:** Pass validation timeout through the existing `Scraper` struct, configured from `main.go` env vars.
**When to use:** Following existing codebase pattern where all config flows through `main.go` -> constructor.
**Example:**

```go
// internal/scraper/scraper.go
type Scraper struct {
	store           *store.Store
	interval        time.Duration
	validateTimeout time.Duration // NEW
	mu              sync.Mutex
}

func New(s *store.Store, interval time.Duration, validateTimeout time.Duration) *Scraper {
	return &Scraper{store: s, interval: interval, validateTimeout: validateTimeout}
}

// In main.go:
validateTimeout := envDuration("SCRAPER_VALIDATE_TIMEOUT", 10*time.Second)
sc := scraper.New(st, scrapeInterval, validateTimeout)
```
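
`envDuration` is referenced above, but its body is not shown in this document. As a rough sketch of what such a helper could look like, assuming it parses Go duration strings such as `10s` and falls back to the default on missing or invalid input (`envDurationSketch` and `durationOrDefault` are hypothetical names, and the real helper may behave differently):

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// durationOrDefault parses raw with time.ParseDuration and falls back to
// def when raw is empty, malformed, or not positive.
func durationOrDefault(raw string, def time.Duration) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return def
	}
	return d
}

// envDurationSketch is a hypothetical stand-in for the project's
// envDuration helper; the fallback rules here are an assumption.
func envDurationSketch(key string, def time.Duration) time.Duration {
	return durationOrDefault(os.Getenv(key), def)
}

func main() {
	os.Setenv("SCRAPER_VALIDATE_TIMEOUT", "2500ms")
	fmt.Println(envDurationSketch("SCRAPER_VALIDATE_TIMEOUT", 10*time.Second))

	os.Unsetenv("SCRAPER_VALIDATE_TIMEOUT")
	fmt.Println(envDurationSketch("SCRAPER_VALIDATE_TIMEOUT", 10*time.Second))
}
```

Rejecting non-positive values is a design choice of this sketch: a zero `http.Client.Timeout` means no timeout at all, which would defeat the purpose of SCRP-04.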

### Pattern 3: Integration Point in scrape()

**What:** Call validateLinks between scrapeReddit return and merge step.
**When to use:** Minimal change to existing scrape flow.
**Example:**

```go
// In scraper.go:scrape() - between lines 57 and 60
links, err := scrapeReddit()
if err != nil {
	// ... existing error handling
}
log.Printf("scraper: reddit scrape completed in %v, got %d links", time.Since(start).Round(time.Millisecond), len(links))

// NEW: validate links before merging
if len(links) > 0 {
	validated := validateLinks(links, s.validateTimeout)
	log.Printf("scraper: validated %d/%d links as streams", len(validated), len(links))
	links = validated
}

// Continue with existing merge logic...
```

### Anti-Patterns to Avoid

- **Fetching URLs inside the Reddit API loop:** Validation should happen after all URLs are collected from Reddit, not interleaved with Reddit API calls. This keeps the Reddit API calls fast and avoids mixing rate-limit concerns.
- **Using the proxy's HTTP handler for internal validation:** The proxy (`internal/proxy/proxy.go`) is designed as an HTTP handler for client-facing requests with IP-based rate limiting. The scraper should use its own HTTP client without rate limiting since it is a trusted internal caller.
- **Modifying the ScrapedLink model to track validation state:** For Phase 1, validation is a binary filter (pass or discard). Adding validation metadata to the model is premature and adds complexity to the store layer. If needed in Phase 2 for health checking, it can be added then.
- **Full HTML DOM parsing for detection:** Using `golang.org/x/net/html` to parse the full DOM tree just to detect presence of video tags is overkill. String matching on lowercased HTML body is sufficient for detection. DOM parsing is needed in Phase 4 for actual source URL extraction.

## Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| HTTP fetching with timeout | Custom TCP client | `net/http.Client` with `Timeout` field | stdlib handles redirects, TLS, timeouts, connection pooling |
| HTML content inspection | Full DOM parser | `strings.Contains` on lowercased body | Detection (yes/no) does not need structural parsing; string matching is faster and simpler |
| URL scheme validation | Manual string prefix check | `net/url.Parse` + scheme check | Already used in codebase; handles edge cases |
| Concurrent timeout enforcement | Manual goroutine + channel | `context.WithTimeout` + `http.NewRequestWithContext` | stdlib integration; cancels in-flight requests properly |

**Key insight:** Phase 1 is a detection problem (does this page look like a stream?), not an extraction problem (what is the stream URL?). Detection can be done with string matching. Extraction (Phase 4) needs DOM parsing.
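
The `context.WithTimeout` + `http.NewRequestWithContext` row of the table can be sketched as follows. This is an illustration of the per-request deadline approach, not the code that was shipped; `fetchStatus` and `slowSiteTimesOut` are hypothetical names, and the local `httptest` server only stands in for a slow site.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchStatus issues a GET bounded by a per-request deadline and returns
// the response status code.
func fetchStatus(client *http.Client, url string, timeout time.Duration) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// slowSiteTimesOut starts a local server that waits up to 300ms before
// answering and reports whether a 50ms deadline surfaced as
// context.DeadlineExceeded.
func slowSiteTimesOut() bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(300 * time.Millisecond):
		case <-r.Context().Done(): // client gave up; stop early
		}
	}))
	defer srv.Close()

	_, err := fetchStatus(srv.Client(), srv.URL, 50*time.Millisecond)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(slowSiteTimesOut())
}
```

Checking the error with `errors.Is` lets the caller tell a deadline from other fetch failures, which could be useful if timeouts are to be logged separately from discarded links.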

## Video Content Markers

### HIGH confidence markers (any one of these strongly indicates a stream page)

**HTML Tags:**
- `<video` - HTML5 video element
- `<source` with `type="application/x-mpegurl"` or `type="application/dash+xml"` - HLS/DASH sources
- `<iframe` with src containing known player domains

**HLS/DASH Manifest References:**
- `.m3u8` - HLS manifest file extension
- `.mpd` - DASH manifest file extension
- `application/x-mpegurl` - HLS MIME type
- `application/vnd.apple.mpegurl` - Alternative HLS MIME type
- `application/dash+xml` - DASH MIME type

**Player Library References:**
- `hls.js` or `hls.min.js` - HLS.js player library
- `dash.js` or `dash.all.min.js` - DASH.js player library
- `video.js` or `video.min.js` or `videojs` - Video.js player
- `jwplayer` - JW Player
- `clappr` - Clappr player
- `flowplayer` - Flowplayer
- `plyr` - Plyr player
- `shaka-player` or `shaka` - Google Shaka Player
- `mediaelement` - MediaElement.js
- `fluidplayer` - Fluid Player

**Direct Video File Extensions in URLs:**
- `.mp4` - MPEG-4 video
- `.webm` - WebM video
- `.ts` (in context of `.m3u8` references) - MPEG-TS segments

### MEDIUM confidence markers (suggestive but not conclusive)

- `player` (as class, id, or variable name in context of video)
- `stream` (in context of video-related markup)
- `embed` (in context of video players)

### Implementation Strategy

Use a tiered approach:
1. First check for HIGH confidence markers (any single match = valid)
2. Do NOT use MEDIUM confidence markers alone (too many false positives)
3. Direct video content types in HTTP response (`video/mp4`, `application/x-mpegurl`, etc.) are valid without HTML inspection

```go
// Content markers to check (case-insensitive, checked against lowercased body)
var videoMarkers = []string{
	// HTML5 video element
	"<video",
	// HLS markers
	".m3u8",
	"application/x-mpegurl",
	"application/vnd.apple.mpegurl",
	// DASH markers
	".mpd",
	"application/dash+xml",
	// Player libraries
	"hls.js", "hls.min.js",
	"dash.js", "dash.all.min.js",
	"video.js", "video.min.js", "videojs",
	"jwplayer",
	"clappr",
	"flowplayer",
	"plyr",
	"shaka-player",
	"mediaelement",
	"fluidplayer",
}

// Direct video content types (check Content-Type header)
var videoContentTypes = []string{
	"video/",
	"application/x-mpegurl",
	"application/vnd.apple.mpegurl",
	"application/dash+xml",
	"application/mpegurl",
}

func containsVideoMarkers(loweredBody string) bool {
	for _, marker := range videoMarkers {
		if strings.Contains(loweredBody, marker) {
			return true
		}
	}
	return false
}

func isDirectVideoContentType(ct string) bool {
	ct = strings.ToLower(ct)
	for _, vct := range videoContentTypes {
		if strings.Contains(ct, vct) {
			return true
		}
	}
	return false
}
```

## Common Pitfalls

### Pitfall 1: Blocking the Scrape Cycle on Slow/Unresponsive URLs

**What goes wrong:** If every one of 50 URLs in a scrape cycle hits the 10s timeout, the validation step takes 500 seconds (8+ minutes). With a 15-minute scrape interval, validation could overlap with the next cycle.
**Why it happens:** Sequential validation with a per-URL timeout has no total budget for the validation step.
**How to avoid:** The per-URL timeout (SCRP-04) handles individual slowness. Additionally, consider logging total validation time. The mutex in `scraper.go:scrape()` already prevents concurrent scrapes, so overlap is safe (the next scrape just waits). With typical scrape volumes (5-20 new URLs per cycle), even the worst case (20 * 10s = 200s) is well within the 15-minute interval.
**Warning signs:** Scrape logs showing validation taking longer than half the scrape interval.

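The budget arithmetic in Pitfall 1 can be sketched as below. This is an illustration only: `worstCaseValidation`, `exceedsHalfInterval`, and both constants are hypothetical names that do not exist in the codebase, and the 10s and 15-minute values are simply the defaults quoted above.

```go
package main

import (
	"fmt"
	"time"
)

// Defaults quoted in the text; the constant names are hypothetical.
const (
	defaultValidateTimeout = 10 * time.Second
	defaultScrapeInterval  = 15 * time.Minute
)

// worstCaseValidation returns the upper bound on sequential validation time,
// assuming every URL runs into the full per-URL timeout.
func worstCaseValidation(urls int, perURL time.Duration) time.Duration {
	return time.Duration(urls) * perURL
}

// exceedsHalfInterval reports the warning sign from Pitfall 1: validation
// took longer than half of the scrape interval.
func exceedsHalfInterval(elapsed, interval time.Duration) bool {
	return elapsed > interval/2
}

func main() {
	for _, n := range []int{20, 50} {
		worst := worstCaseValidation(n, defaultValidateTimeout)
		fmt.Printf("%d URLs: worst case %v, warning=%v\n",
			n, worst, exceedsHalfInterval(worst, defaultScrapeInterval))
	}
}
```

With these defaults, 20 URLs stay under the half-interval threshold (200s against 450s) while 50 URLs cross it (500s), which is the case the warning sign is meant to surface.
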
### Pitfall 2: False Negatives from JavaScript-Rendered Pages

**What goes wrong:** Many streaming sites load their video player via JavaScript. An HTTP fetch gets the initial HTML which may not contain `<video>` tags or player references -- those are injected by JS after page load.
**Why it happens:** HTTP fetching returns raw HTML; no JavaScript execution.
**How to avoid:** This is an accepted limitation per the requirements doc ("Full browser automation (Puppeteer/Playwright)" is Out of Scope). The marker list includes JavaScript library references (e.g., `hls.js`, `video.js`) which ARE present in the raw HTML even before execution. Most streaming sites include their player library in `<script>` tags in the initial HTML. The marker list is designed to catch these references.
**Warning signs:** Known good stream URLs being discarded. Monitor discard rate in logs.

### Pitfall 3: HTTP vs HTTPS URL Handling

**What goes wrong:** The existing proxy only supports HTTPS URLs (line 68 in `proxy.go`), but scraped URLs may be HTTP. If the validator only accepts HTTPS, valid HTTP stream URLs get discarded.
**Why it happens:** The proxy has a stricter security requirement (client-facing) than the scraper (internal validation).
**How to avoid:** The scraper validator should accept both HTTP and HTTPS URLs for fetching. The proxy's HTTPS restriction is appropriate for its purpose but should not be inherited by the validator.
**Warning signs:** All HTTP URLs being discarded.

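A minimal sketch of a scheme check that accepts both schemes, assuming the validator wants to reject anything that is not an absolute `http` or `https` URL. `isFetchableURL` is a hypothetical helper name, not existing code; `url.Parse` is the only stdlib call it relies on.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// isFetchableURL reports whether raw is an absolute http or https URL.
// Hypothetical helper: unlike the proxy, it does not insist on HTTPS.
func isFetchableURL(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil || u.Host == "" {
		return false
	}
	switch strings.ToLower(u.Scheme) {
	case "http", "https":
		return true
	default:
		return false
	}
}

func main() {
	for _, raw := range []string{
		"http://example.com/stream",
		"https://example.com/stream",
		"ftp://example.com/file",
		"/relative/path",
	} {
		fmt.Printf("%-28s fetchable=%v\n", raw, isFetchableURL(raw))
	}
}
```

Checking `u.Host` as well as the scheme rejects opaque URLs such as `javascript:` links, which parse without error but have no host to fetch from.
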
### Pitfall 4: Redirect Chains Leading to Non-Stream Content

**What goes wrong:** A URL redirects through several pages (ads, link shorteners) before reaching the actual stream page. The HTTP client follows redirects (Go default: up to 10), but the intermediate pages may not have video markers.
**Why it happens:** Streaming links from Reddit often go through shorteners or ad-redirect chains.
**How to avoid:** Validation checks the final response after redirects, not the intermediate pages. Lower the redirect limit from Go's default of 10 to 3 (matching the proxy's limit) to cut off long chains. The client timeout covers the entire chain, so slow redirect chains time out as a whole.
**Warning signs:** Timeout errors on URLs that resolve fine in a browser.

### Pitfall 5: Large Response Bodies Causing Memory Pressure

**What goes wrong:** A streaming site returns a huge HTML page (or a direct video file), and the validator reads the entire body into memory.
**Why it happens:** No body size limit on validation fetches.
**How to avoid:** Use `io.LimitReader` (already a pattern in the codebase). 2MB is sufficient for detecting markers in HTML. For direct video content types, the Content-Type header check is sufficient -- no need to read the body.
**Warning signs:** Memory spikes during scrape cycles.

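The body cap from Pitfall 5 can be sketched in isolation as follows. `containsWithinLimit` and `paddedPage` are hypothetical names introduced for this illustration, and the 1024-byte limit in `main` is chosen only to keep the demo small (the plan itself uses 2MB).

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// containsWithinLimit reads at most limit bytes from r and reports whether
// marker occurs in that prefix. Bytes beyond the limit are never read.
func containsWithinLimit(r io.Reader, marker string, limit int64) (bool, error) {
	data, err := io.ReadAll(io.LimitReader(r, limit))
	if err != nil {
		return false, err
	}
	return strings.Contains(string(data), marker), nil
}

// paddedPage builds a fake page with padding bytes ahead of a <video> tag.
func paddedPage(padding int) io.Reader {
	return strings.NewReader(strings.Repeat(" ", padding) + "<video></video>")
}

func main() {
	for _, padding := range []int{10, 2000} {
		found, err := containsWithinLimit(paddedPage(padding), "<video", 1024)
		fmt.Printf("padding=%d found=%v err=%v\n", padding, found, err)
	}
}
```

The trade-off is that a marker sitting past the cap is not seen, so the limit bounds memory at the cost of possible false negatives on very large pages.
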
### Pitfall 6: Changing the `scraper.New()` Signature Breaks Existing Call Site

**What goes wrong:** Adding `validateTimeout` parameter to `New()` changes its signature, breaking the call in `main.go`.
**Why it happens:** Go does not have optional parameters.
**How to avoid:** Update both `New()` in `scraper.go` and the call site in `main.go` simultaneously. This is a simple, predictable change. Alternative: use options pattern, but that is overkill for adding one field.
**Warning signs:** Compilation error -- easily caught.

## Code Examples

### Complete validateLinks Implementation

```go
// internal/scraper/validate.go
package scraper

import (
	"io"
	"log"
	"net/http"
	"strings"
	"time"

	"f1-stream/internal/models"
)

// videoMarkers are case-insensitive strings that indicate video/player content.
// Checked against lowercased HTML body.
var videoMarkers = []string{
	"<video",
	".m3u8",
	"application/x-mpegurl",
	"application/vnd.apple.mpegurl",
	".mpd",
	"application/dash+xml",
	"hls.js", "hls.min.js",
	"dash.js", "dash.all.min.js",
	"video.js", "video.min.js", "videojs",
	"jwplayer",
	"clappr",
	"flowplayer",
	"plyr",
	"shaka-player",
	"mediaelement",
	"fluidplayer",
}

// videoContentTypes are substrings of the Content-Type header that indicate
// direct video content.
var videoContentTypes = []string{
	"video/",
	"application/x-mpegurl",
	"application/vnd.apple.mpegurl",
	"application/dash+xml",
}

const validateBodyLimit = 2 * 1024 * 1024 // 2MB

// validateLinks keeps only the links whose URL serves video directly or whose
// fetched HTML contains a video content marker.
// truncate and userAgent are assumed to be defined elsewhere in package scraper.
func validateLinks(links []models.ScrapedLink, timeout time.Duration) []models.ScrapedLink {
	client := &http.Client{
		// Timeout covers the whole exchange: connect, redirects, and body read.
		Timeout: timeout,
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			// Stop after 3 redirects and hand back the last response as-is.
			if len(via) >= 3 {
				return http.ErrUseLastResponse
			}
			return nil
		},
	}

	var valid []models.ScrapedLink
	for _, link := range links {
		if hasVideoContent(client, link.URL) {
			valid = append(valid, link)
			log.Printf("scraper: validated %s (video markers found)", truncate(link.URL, 60))
		} else {
			log.Printf("scraper: discarded %s (no video markers)", truncate(link.URL, 60))
		}
	}
	return valid
}

func hasVideoContent(client *http.Client, rawURL string) bool {
	req, err := http.NewRequest("GET", rawURL, nil)
	if err != nil {
		return false
	}
	req.Header.Set("User-Agent", userAgent)

	resp, err := client.Do(req)
	if err != nil {
		log.Printf("scraper: validate fetch error for %s: %v", truncate(rawURL, 60), err)
		return false
	}
	defer resp.Body.Close()

	// Reject 1xx, 4xx, and 5xx. A 3xx can still reach this point, for example
	// when the redirect limit above returned http.ErrUseLastResponse; it falls
	// through to the content checks below.
	if resp.StatusCode < 200 || resp.StatusCode >= 400 {
		return false
	}

	ct := strings.ToLower(resp.Header.Get("Content-Type"))

	// Check if response is a direct video content type
	for _, vct := range videoContentTypes {
		if strings.Contains(ct, vct) {
			return true
		}
	}

	// Only inspect HTML responses for markers
	if !strings.Contains(ct, "text/html") && !strings.Contains(ct, "application/xhtml") {
		return false
	}

	body, err := io.ReadAll(io.LimitReader(resp.Body, validateBodyLimit))
	if err != nil {
		return false
	}

	return containsVideoMarkers(strings.ToLower(string(body)))
}

func containsVideoMarkers(loweredBody string) bool {
	for _, marker := range videoMarkers {
		if strings.Contains(loweredBody, marker) {
			return true
		}
	}
	return false
}
```

### Integration in scraper.go

```go
// scraper.go - modified scrape() method
func (s *Scraper) scrape() {
	s.mu.Lock()
	defer s.mu.Unlock()

	start := time.Now()
	log.Println("scraper: starting scrape")
	links, err := scrapeReddit()
	if err != nil {
		log.Printf("scraper: error after %v: %v", time.Since(start).Round(time.Millisecond), err)
		return
	}
	log.Printf("scraper: reddit scrape completed in %v, got %d links", time.Since(start).Round(time.Millisecond), len(links))

	// Validate links - only keep those with video content markers
	if len(links) > 0 {
		validated := validateLinks(links, s.validateTimeout)
		log.Printf("scraper: validated %d/%d links as streams", len(validated), len(links))
		links = validated
	}

	// Rest of existing merge logic unchanged...
}
```

### Configuration in main.go

```go
// main.go additions
validateTimeout := envDuration("SCRAPER_VALIDATE_TIMEOUT", 10*time.Second)
sc := scraper.New(st, scrapeInterval, validateTimeout)
```

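`envDuration` lives in `main.go` and is not shown in this document. A helper of that shape might look like the sketch below, assuming the env var holds a Go duration string such as `10s`. `parseDurationOr` and `envDurationSketch` are hypothetical names, and the real helper may behave differently.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// Default quoted in the plan; the constant name is hypothetical.
const defaultValidateTimeout = 10 * time.Second

// parseDurationOr parses value as a Go duration string such as "10s" or
// "1m30s". It returns fallback when value is empty, malformed, or not positive.
func parseDurationOr(value string, fallback time.Duration) time.Duration {
	if value == "" {
		return fallback
	}
	d, err := time.ParseDuration(value)
	if err != nil || d <= 0 {
		return fallback
	}
	return d
}

// envDurationSketch is a hypothetical stand-in for the project's envDuration.
func envDurationSketch(key string, fallback time.Duration) time.Duration {
	return parseDurationOr(os.Getenv(key), fallback)
}

func main() {
	timeout := envDurationSketch("SCRAPER_VALIDATE_TIMEOUT", defaultValidateTimeout)
	fmt.Println("validate timeout:", timeout)
}
```

Under this assumption a bare number such as `10` falls back to the default, because `time.ParseDuration` requires a unit suffix.
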
### Unit Test for containsVideoMarkers

```go
// internal/scraper/validate_test.go
package scraper

import "testing"

func TestContainsVideoMarkers(t *testing.T) {
	tests := []struct {
		name     string
		body     string
		expected bool
	}{
		{"video tag", `<div><video src="stream.mp4"></video></div>`, true},
		{"hls manifest", `var url = "https://cdn.example.com/live.m3u8";`, true},
		{"dash manifest", `<source src="stream.mpd" type="application/dash+xml">`, true},
		{"hls.js library", `<script src="/js/hls.min.js"></script>`, true},
		{"video.js library", `<script src="https://cdn.example.com/video.js"></script>`, true},
		{"jwplayer", `<div id="jwplayer-container"></div><script>jwplayer("jwplayer-container")</script>`, true},
		{"no markers", `<html><body><p>Hello world</p></body></html>`, false},
		{"reddit link page", `<html><body><a href="https://example.com">Click here</a></body></html>`, false},
		{"blog post", `<html><body><article>F1 race results...</article></body></html>`, false},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			// Bodies are already lowercase, which is what containsVideoMarkers expects.
			result := containsVideoMarkers(tt.body)
			if result != tt.expected {
				t.Errorf("containsVideoMarkers(%q) = %v, want %v", truncate(tt.body, 40), result, tt.expected)
			}
		})
	}
}
```

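A companion check for `isDirectVideoContentType` could follow the same table-driven shape. The sketch below is self-contained so it can run on its own: the function and `videoContentTypes` are copied from the implementation above, while `contentTypeCases`, `failingCases`, and the specific header values are assumptions made for illustration. In `validate_test.go` the same table would sit inside a `t.Run` loop.

```go
package main

import (
	"fmt"
	"strings"
)

// Copied from the plan so the sketch compiles on its own.
var videoContentTypes = []string{
	"video/",
	"application/x-mpegurl",
	"application/vnd.apple.mpegurl",
	"application/dash+xml",
}

func isDirectVideoContentType(ct string) bool {
	ct = strings.ToLower(ct)
	for _, vct := range videoContentTypes {
		if strings.Contains(ct, vct) {
			return true
		}
	}
	return false
}

// contentTypeCases holds illustrative Content-Type values (assumed, not taken
// from the real test file).
var contentTypeCases = []struct {
	name     string
	ct       string
	expected bool
}{
	{"mp4", "video/mp4", true},
	{"webm", "video/webm", true},
	{"hls mixed case", "application/x-mpegURL", true},
	{"hls apple", "application/vnd.apple.mpegurl", true},
	{"dash", "application/dash+xml", true},
	{"video with params", "video/mp4; charset=utf-8", true},
	{"html", "text/html; charset=utf-8", false},
	{"json", "application/json", false},
	{"png", "image/png", false},
	{"plain text", "text/plain", false},
	{"empty", "", false},
}

// failingCases returns the names of cases whose result differs from expected.
func failingCases() []string {
	var failed []string
	for _, tc := range contentTypeCases {
		if isDirectVideoContentType(tc.ct) != tc.expected {
			failed = append(failed, tc.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failing cases:", failingCases())
}
```

The mixed-case HLS entry exercises the `strings.ToLower` call, since servers do not agree on the casing of `application/x-mpegURL`.
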
## State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| Save all URLs from F1 posts | Will validate each URL before saving | Phase 1 (now) | Eliminates junk links at the source |
| No content inspection | String-based marker detection | Phase 1 (now) | Simple, fast, no external dependencies |
| Static marker list | Static marker list (sufficient for now) | - | May need updating as new players emerge; easily extensible |

**Why not use browser automation:**
The REQUIREMENTS.md explicitly marks "Full browser automation (Puppeteer/Playwright)" as Out of Scope. HTTP-based checks with string matching catch the majority of stream pages because player library `<script>` tags are present in raw HTML even before JavaScript execution.

## Open Questions

1. **How many URLs per scrape cycle will need validation?**
   - What we know: The subreddit listing fetches 25 posts, filters by F1 keywords. Typical F1-related posts might yield 5-20 unique URLs per cycle after deduplication and domain filtering.
   - What's unclear: Real-world distribution. Could be 2 URLs or 50.
   - Recommendation: Log validation counts in production. The sequential approach with 10s timeout per URL handles up to ~90 URLs within the 15-minute scrape interval (worst case).

2. **Should already-validated URLs be re-validated on subsequent scrape cycles?**
   - What we know: The current merge step deduplicates by URL -- existing URLs are not re-processed. Validation runs only on newly discovered URLs.
   - What's unclear: Whether a URL that failed validation last cycle should be retried.
   - Recommendation: No retry needed for Phase 1. Failed URLs are simply not saved. If the URL appears again in a future scrape, it will be a "new" URL (not yet in scraped.json) and get validated again naturally.

3. **Will the marker list need frequent updates?**
   - What we know: Major player libraries (HLS.js, Video.js, JW Player) are well-established and their names are stable.
   - What's unclear: Whether niche streaming sites use custom players with no recognizable markers.
   - Recommendation: Start with the current list. Monitor discard rate in logs. Add markers if known-good sites are being discarded. The list is a simple string slice, trivially extensible.

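Monitoring the discard rate, as recommended above, needs only two counters per scrape cycle. A minimal sketch, assuming hypothetical names (`validationStats`, `discardRate`) that are not part of the codebase:

```go
package main

import "fmt"

// validationStats holds the per-cycle counts needed for a discard-rate log line.
type validationStats struct {
	total int // URLs fetched for validation in this cycle
	kept  int // URLs that passed validation
}

// discardRate returns the fraction of URLs discarded, or 0 when nothing
// was validated.
func (s validationStats) discardRate() float64 {
	if s.total == 0 {
		return 0
	}
	return float64(s.total-s.kept) / float64(s.total)
}

func main() {
	s := validationStats{total: 20, kept: 5}
	fmt.Printf("scraper: validated %d/%d links as streams (discard rate %.0f%%)\n",
		s.kept, s.total, s.discardRate()*100)
}
```

A rate that stays near 100% across cycles would suggest the marker list is missing the players used by current stream sites, which is the signal Open Question 3 asks to watch for.
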
## Sources

### Primary (HIGH confidence)
- Codebase analysis: `internal/scraper/scraper.go`, `internal/scraper/reddit.go`, `internal/proxy/proxy.go` - existing patterns for HTTP fetching, URL processing, pipeline structure
- Codebase analysis: `internal/models/models.go` - ScrapedLink type definition
- Codebase analysis: `main.go` - env var pattern, dependency initialization
- Go stdlib docs: `net/http`, `strings`, `io`, `context` packages
- `golang.org/x/net/html` package API (verified via pkg.go.dev) - confirmed `html.Parse`, `Node.Descendants()`, `atom.Video` available. Reserved for Phase 4.

### Secondary (MEDIUM confidence)
- `.planning/REQUIREMENTS.md` - phase requirements, out-of-scope decisions
- `.planning/ROADMAP.md` - phase dependencies (Phase 2 reuses validation logic)
- `.planning/codebase/ARCHITECTURE.md` - data flow, cross-cutting concerns

### Tertiary (LOW confidence)
- Video player library names (hls.js, video.js, jwplayer, etc.) - based on widely known ecosystem knowledge. The specific set of markers may need tuning based on real-world stream sites.

## Metadata

**Confidence breakdown:**
- Standard stack: HIGH - uses only stdlib; all patterns already exist in codebase
- Architecture: HIGH - minimal change to existing pipeline; clear integration point
- Pitfalls: HIGH - pitfalls are well-understood; mitigations use existing codebase patterns
- Video markers: MEDIUM - marker list covers major players but may miss niche sites; easily extensible

**Research date:** 2026-02-17
**Valid until:** 2026-03-17 (stable domain; marker list may need periodic updates)
|
|
@ -0,0 +1,200 @@
|
|||
---
phase: 01-scraper-validation
verified: 2026-02-17T20:58:00Z
status: passed
score: 4/4 must-haves verified
re_verification: false
---

# Phase 1: Scraper Validation Verification Report

**Phase Goal:** Scraped URLs are verified to contain actual video/player content before being stored, eliminating junk links at the source

**Verified:** 2026-02-17T20:58:00Z

**Status:** PASSED

**Re-verification:** No (initial verification)

## Goal Achievement

### Observable Truths

All 4 success criteria from ROADMAP.md verified:

| # | Truth | Status | Evidence |
|---|-------|--------|----------|
| 1 | Scraper still discovers F1-related posts using keyword filtering (existing behavior preserved) | ✓ VERIFIED | `reddit.go:105` calls `isF1Post(post.Title)` before processing posts. Keywords defined in `f1Keywords` slice (lines 29-39). Keyword filtering runs BEFORE validation step. |
| 2 | Each extracted URL is proxy-fetched and inspected for video/player content markers | ✓ VERIFIED | `scraper.go:62` calls `validateLinks(links, s.validateTimeout)` after URL extraction. `validate.go:56-76` implements fetch + marker inspection with 20 video markers (HTML5, HLS, DASH, 10+ player libraries). |
| 3 | URLs without video content markers are discarded and do not appear in scraped.json | ✓ VERIFIED | `validate.go:71-73` logs discarded URLs and excludes them from return slice. Only URLs passing `hasVideoContent()` are kept in `kept` slice. |
| 4 | Validation respects a configurable timeout so slow sites do not block the scrape cycle | ✓ VERIFIED | `main.go:26` reads `SCRAPER_VALIDATE_TIMEOUT` env var (default 10s). `validate.go:58` creates HTTP client with timeout. Passed through `scraper.New()` at `main.go:56`. |

**Score:** 4/4 truths verified

### Required Artifacts

All 4 artifacts from PLAN must_haves verified at all 3 levels (exists, substantive, wired):

| Artifact | Expected | Exists | Substantive | Wired | Status |
|----------|----------|--------|-------------|-------|--------|
| `internal/scraper/validate.go` | URL validation logic with video marker detection | ✓ | ✓ 142 lines, 20 video markers, 4 functions | ✓ | ✓ VERIFIED |
| `internal/scraper/validate_test.go` | Unit tests for marker detection and content type checks | ✓ | ✓ 124 lines, 14 test cases covering positive/negative scenarios | ✓ | ✓ VERIFIED |
| `internal/scraper/scraper.go` | validateTimeout field and validation call in scrape() | ✓ | ✓ validateTimeout field on line 16, validateLinks call on line 62 | ✓ | ✓ VERIFIED |
| `main.go` | SCRAPER_VALIDATE_TIMEOUT env var configuration | ✓ | ✓ Line 26 reads env var, line 56 passes to scraper.New() | ✓ | ✓ VERIFIED |

**Artifact Details:**

**validate.go (142 lines):**
- Contains `validateLinks` function (lines 56-76)
- Contains `hasVideoContent` function (lines 81-119)
- Contains `containsVideoMarkers` function (lines 123-130)
- Contains `isDirectVideoContentType` function (lines 134-142)
- Defines 20 video markers (lines 15-40): HTML5 `<video`, HLS (.m3u8, application/x-mpegurl), DASH (.mpd, application/dash+xml), 15 player libraries (hls.js, video.js, jwplayer, clappr, flowplayer, plyr, shaka-player, mediaelement, fluidplayer, etc.)
- Defines 4 video content types (lines 44-49)
- Sets 2MB body limit (line 52)
- 3 redirect limit (line 60)

**validate_test.go (124 lines):**
- `TestContainsVideoMarkers`: 10 positive cases (video tag, HLS manifest, DASH manifest, HLS.js, Video.js, JW Player, Clappr, Flowplayer, Plyr, Shaka Player) + 4 negative cases (plain HTML, Reddit link page, blog post, empty string)
- `TestIsDirectVideoContentType`: 6 positive cases (video/mp4, video/webm, HLS content types, DASH, video with params) + 5 negative cases (text/html, application/json, image/png, text/plain, empty)
- Total: 14 test cases covering 25 assertion points

**scraper.go:**
- Line 16: `validateTimeout time.Duration` field added to Scraper struct
- Line 20-21: `New()` function updated to accept validateTimeout parameter
- Lines 60-65: Validation step inserted between URL extraction and merge logic

**main.go:**
- Line 26: `validateTimeout := envDuration("SCRAPER_VALIDATE_TIMEOUT", 10*time.Second)`
- Line 56: `sc := scraper.New(st, scrapeInterval, validateTimeout)`

### Key Link Verification

Both key links from PLAN must_haves verified:

| From | To | Via | Status | Details |
|------|----|----|--------|---------|
| `internal/scraper/scraper.go` | `internal/scraper/validate.go` | validateLinks call in scrape() | ✓ WIRED | `scraper.go:62` calls `validateLinks(links, s.validateTimeout)` between URL extraction (`scrapeReddit()` return at line 58) and store merge (`LoadScrapedLinks()` at line 68) |
| `main.go` | `internal/scraper/scraper.go` | scraper.New() with validateTimeout | ✓ WIRED | `main.go:56` calls `scraper.New(st, scrapeInterval, validateTimeout)` where validateTimeout is read from env on line 26 |

**Wiring Verification:**

**Link 1: scraper.go → validate.go**
- Pattern match: `validateLinks\(links` found at `scraper.go:62`
- Context: Call occurs after `scrapeReddit()` (line 53) and before `LoadScrapedLinks()` (line 68)
- Data flow: `links` variable from `scrapeReddit()` → filtered by `validateLinks()` → assigned back to `links` → merged with existing

**Link 2: main.go → scraper.go**
- Pattern match: `scraper\.New\(st.*validateTimeout` found at `main.go:56`
- Context: `validateTimeout` variable read from env on line 26, passed as 3rd parameter to `scraper.New()`
- Parameter flow: `envDuration("SCRAPER_VALIDATE_TIMEOUT", 10*time.Second)` → `validateTimeout` variable → `scraper.New()` parameter → `Scraper.validateTimeout` field → `validateLinks()` call

### Requirements Coverage

All 4 requirements from PLAN frontmatter verified against REQUIREMENTS.md:

| Requirement | Description | Status | Evidence |
|-------------|-------------|--------|----------|
| **SCRP-01** | Scraper filters Reddit posts by F1 keywords before extracting URLs (existing behavior, preserve) | ✓ SATISFIED | `reddit.go:105` calls `isF1Post(post.Title)` before processing. Keywords defined in `f1Keywords` (lines 29-39). This runs BEFORE validation, preserving existing behavior. |
| **SCRP-02** | Scraper validates each extracted URL by proxy-fetching it and checking for video/player content markers | ✓ SATISFIED | `validate.go:56-76` implements `validateLinks()` which calls `hasVideoContent()` for each URL. `hasVideoContent()` (lines 81-119) performs HTTP GET and checks for video markers. |
| **SCRP-03** | URLs that don't look like streams are discarded before saving | ✓ SATISFIED | `validate.go:71-73` logs and excludes URLs where `hasVideoContent()` returns false. Only kept URLs are returned and merged into store. |
| **SCRP-04** | Validation has a configurable timeout (default 10s) to avoid blocking on slow sites | ✓ SATISFIED | `main.go:26` reads `SCRAPER_VALIDATE_TIMEOUT` with default 10s. `validate.go:58` creates HTTP client with timeout. The 3-redirect limit also bounds the time spent on long redirect chains. |

**No orphaned requirements:** All 4 requirements mapped to Phase 1 in REQUIREMENTS.md are accounted for in the PLAN and satisfied by the implementation.

### Anti-Patterns Found

No anti-patterns detected:

| Category | Checked | Found |
|----------|---------|-------|
| TODO/FIXME/PLACEHOLDER comments | ✓ | 0 |
| Placeholder strings | ✓ | 0 |
| Empty implementations (return null/empty) | ✓ | 0 |
| Console-only implementations | ✓ | 0 |

**Files checked:**
- `internal/scraper/validate.go` (142 lines)
- `internal/scraper/validate_test.go` (124 lines)
- `internal/scraper/scraper.go` (modified section lines 16, 20-21, 60-65)
- `main.go` (modified lines 26, 56)

All modified code is production-ready with proper error handling, logging, and no stub patterns.

### Human Verification Required

None. All success criteria are programmatically verifiable through code inspection:

- Keyword filtering behavior: Verified by checking `isF1Post()` call placement
- URL validation with HTTP fetch: Verified by code inspection of `validateLinks()` and `hasVideoContent()`
- Discard behavior: Verified by inspecting return logic in `validateLinks()`
- Timeout configuration: Verified by tracing env var read → parameter passing → HTTP client creation

**Note:** While the *functionality* of video marker detection would ideally be tested against real stream URLs, the *implementation* of the requirement (that validation logic exists, is wired correctly, and has appropriate markers) is fully verified.

## Verification Summary

### All Must-Haves VERIFIED

**Truths (4/4):**
1. ✓ F1 keyword filtering preserved (SCRP-01)
2. ✓ URLs proxy-fetched and inspected for video markers (SCRP-02)
3. ✓ Non-stream URLs discarded (SCRP-03)
4. ✓ Configurable timeout prevents blocking (SCRP-04)

**Artifacts (4/4):**
1. ✓ `validate.go` exists, substantive (142 lines, 20 markers, 4 functions), wired
2. ✓ `validate_test.go` exists, substantive (124 lines, 14 test cases), wired
3. ✓ `scraper.go` exists, substantive (validateTimeout field + call), wired
4. ✓ `main.go` exists, substantive (env var read + pass to New()), wired

**Key Links (2/2):**
1. ✓ scraper.go → validate.go via validateLinks call
2. ✓ main.go → scraper.go via scraper.New() with validateTimeout

**Requirements (4/4):**
1. ✓ SCRP-01 satisfied
2. ✓ SCRP-02 satisfied
3. ✓ SCRP-03 satisfied
4. ✓ SCRP-04 satisfied

**Anti-Patterns:** None found

**Human Verification:** Not required

### Implementation Quality

**Strengths:**
- 20 comprehensive video markers covering HTML5, HLS, DASH, and 15 player libraries
- Proper error handling throughout (HTTP errors, read errors, invalid URLs)
- Conservative resource limits (2MB body read limit, 3 redirect limit)
- Comprehensive unit test coverage (14 test cases, 25 assertion points)
- Clean integration preserving existing F1 keyword filtering (SCRP-01)
- Follows existing codebase patterns (envDuration config, logging style, truncate utility)

**Commits:**
All 3 tasks committed atomically:
1. `adeb478` - Create validate.go with video marker detection
2. `22d29db` - Wire validation into scraper pipeline and add config
3. `6c5cc02` - Add unit tests for validation functions

**No deviations from plan:** Implementation matches PLAN tasks exactly.

## Conclusion

**Phase 1 goal ACHIEVED.**

All success criteria from ROADMAP.md are satisfied:
1. ✓ Keyword filtering preserved
2. ✓ URLs validated with video marker detection
3. ✓ Non-stream URLs discarded
4. ✓ Configurable timeout prevents blocking

All artifacts exist, are substantive, and are wired correctly. All key links verified. All requirements satisfied. No anti-patterns found. No gaps requiring remediation.

**Ready to proceed to Phase 2.**

---

*Verified: 2026-02-17T20:58:00Z*
*Verifier: Claude (gsd-verifier)*