infra/stacks/f1-stream/files/.planning/codebase/CONCERNS.md
Viktor Barzin c7c7047f1c [ci skip] Flatten module wrappers into stack roots
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.

- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure

Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
2026-02-22 15:13:55 +00:00

18 KiB

Codebase Concerns

Analysis Date: 2026-02-17

Tech Debt

File-based JSON storage as primary data persistence:

  • Issue: All data (users, streams, sessions, scraped links) are stored as JSON files on disk with file-level locking. This is a fundamental scalability constraint.
  • Files: internal/store/store.go, internal/store/streams.go, internal/store/sessions.go, internal/store/users.go, internal/store/scraped.go
  • Impact:
    • Non-atomic multi-file operations (e.g., DeleteStream reads all streams, filters, writes back). Race conditions possible if two deletes happen simultaneously.
    • Entire file loaded into memory for any operation, even reads. With thousands of streams/sessions, this becomes slow and memory-inefficient.
    • Sessions file grows unbounded until manual cleanup (CleanExpiredSessions runs hourly). Could cause memory/disk pressure.
    • No transaction support, no rollback capability on failure.
  • Fix approach: Migrate to a proper database (SQLite for simplicity, PostgreSQL for production). Keep JSON file for backup/export purposes only.

In-memory WebAuthn ceremony session storage with no cleanup guarantee:

  • Issue: Registration and login ceremony session data stored in Auth.regSessions and Auth.loginSessions maps. Cleanup relies on goroutines that may not execute if server crashes.
  • Files: internal/auth/auth.go (lines 27-29, 107-117, 230-239)
  • Impact:
    • Memory leak on server restarts: orphaned sessions never cleaned up.
    • No recovery mechanism if goroutine misses cleanup window.
    • Session hijacking if an attacker can predict/guess the cleanup timing.
  • Fix approach: Either move ceremony sessions to persistent store or use a time.AfterFunc with guaranteed cleanup (still risky). Better: use signed JWTs for ceremony state instead of server-side storage.

Scraper loads entire scraped links list into memory on every scrape:

  • Issue: Scraper.scrape() loads all existing links, filters and deduplicates them, then rewrites entire file.
  • Files: internal/scraper/scraper.go (lines 46-92)
  • Impact: With thousands of links, each 15-minute scrape cycle causes a large memory spike and full file rewrite. Inefficient deduplication logic (O(n) map lookups on every new link).
  • Fix approach: With database migration, use INSERT OR IGNORE / upsert patterns. For now, batch process links in chunks and use database indexes for deduplication.

No input validation on URL lengths beyond basic checks:

  • Issue: URL length limited to 2048 chars in two places (internal/server/server.go line 153, internal/proxy/proxy.go line 72), but no validation of URL structure beyond "starts with http/https" and HTTPS-only in proxy.
  • Files: internal/server/server.go (lines 146-160), internal/proxy/proxy.go (lines 54-80)
  • Impact: Malformed URLs could bypass checks and cause unexpected behavior in downstream systems. User submission streams could contain typos/malware links.
  • Fix approach: Use a proper URL parsing library with validation. Whitelist domains for stream submissions. Consider regex validation for known stream site patterns.

Hardcoded default streams in main.go:

  • Issue: Default stream URLs are hardcoded and point to external streaming sites that may become unavailable, redirect, or change terms of service.
  • Files: main.go (lines 100-123)
  • Impact: If any of these URLs break, users get broken default content. Sites could shut down or get legal takedown notices. Application appears to endorse/support these sites.
  • Fix approach: Move to configuration file. Make seeding optional. Add stream validation/health checks before serving. Consider removing entirely if this is a liability concern.

Proxy strips CSP headers without replacement:

  • Issue: internal/proxy/proxy.go deliberately strips X-Frame-Options and CSP headers (line 123) to allow iframe-based proxying. No security headers added back.
  • Files: internal/proxy/proxy.go (lines 121-125)
  • Impact: Proxied content loses all origin security protections. Could allow downstream attacks to run XSS, clickjacking, etc. in the proxy context. Injected <base> tag doesn't prevent all attacks.
  • Fix approach: Add back a strict CSP policy scoped to the proxy origin. Implement iframe sandbox attributes. Add additional security headers (X-Content-Type-Options: nosniff, etc.).

Security Considerations

Authentication ceremony session fixation vulnerability:

  • Risk: Username used as session key for WebAuthn ceremonies (Auth.BeginRegistration, Auth.BeginLogin). Attacker could start ceremony for victim's account, then victim continues from attacker's session state.
  • Files: internal/auth/auth.go (lines 107-108, 230-231)
  • Current mitigation: None. Ceremony session stored in-memory and deleted after 5 minutes, but no CSRF token or state validation.
  • Recommendations: Use cryptographically random state tokens for ceremony sessions instead of username. Store state in secure HTTP-only cookies or database. Validate state on finish.

Rate limiting per-IP but no account lockout for failed authentication:

  • Risk: Brute force attacks on specific usernames are possible. Attacker can try many passwords (using different IPs) against a single account without consequence.
  • Files: internal/proxy/proxy.go implements rate limiting (per-IP token bucket), but no equivalent exists for auth endpoints (internal/auth/auth.go).
  • Current mitigation: WebAuthn makes guessing harder (passkeys), but early attack surface (BeginLogin endpoint) has no protection. Leaked user list could enable targeted attacks.
  • Recommendations: Add per-username failure tracking. Lock account after N failed attempts. Add exponential backoff. Require captcha after threshold.

CORS Origin validation incomplete:

  • Risk: OriginCheck middleware in internal/server/middleware.go (lines 71-93) only checks on non-GET requests. GET requests can still trigger state-changing operations (e.g., visiting a crafted link that proxies through the app).
  • Files: internal/server/middleware.go (lines 74)
  • Current mitigation: Proxy request uses query param, but no SameSite cookie attribute on proxy endpoint (only on session cookie).
  • Recommendations: Require Origin header on all mutation requests. Consider using POST for scrape trigger. Add X-CSRF-Token validation.

Admin user initialization has race condition:

  • Risk: First user to register becomes admin if ADMIN_USERNAME not set. Two concurrent registration requests could both see 0 users and both become admin.
  • Files: internal/auth/auth.go (lines 83-91)
  • Current mitigation: Relies on file-level locking in store operations, but store operations are done after the check (line 121), not atomic.
  • Recommendations: Move first-user-is-admin logic into CreateUser transaction, or seed admin during initialization phase before accepting requests.

Session token stored in http-only cookie but not marked Secure in non-HTTPS:

  • Risk: Cookie marked Secure: r.TLS != nil (line 187, 300). In development or non-HTTPS deployments, session token sent over plaintext HTTP.
  • Files: internal/auth/auth.go (lines 187, 300)
  • Current mitigation: None for non-HTTPS. Relies on deployment to enforce HTTPS.
  • Recommendations: Always set Secure=true. Force HTTPS in production via HSTS header. Log warning if TLS is nil.

Proxy does not validate Content-Type before injecting <base> tag:

  • Risk: Non-HTML responses (PDFs, images, binaries) could be corrupted by injecting <base> tag. Base64 encoded binary data could break.
  • Files: internal/proxy/proxy.go (lines 104-119)
  • Current mitigation: 5MB body size limit, but no content-type validation.
  • Recommendations: Check Content-Type header before modification. Skip injection for non-HTML types. Use proper HTML parsing (e.g., golang.org/x/net/html) instead of string manipulation.

Performance Bottlenecks

Scraper Reddit parsing with inefficient comment recursion:

  • Problem: walkComments in internal/scraper/reddit.go (lines 245-260) recursively walks comment trees using JSON unmarshaling in each recursion level. Could cause O(n^2) behavior on deep comment threads.
  • Files: internal/scraper/reddit.go (lines 245-260, 132-142)
  • Cause: Each comment reply is unmarshaled separately. For a thread with 1000 nested replies, this could create 1000 unmarshaling operations.
  • Improvement path: Pre-flatten comment tree or use iterative traversal instead of recursion. Cache unmarshaled comments during initial fetch.

O(n) lookups on every store operation:

  • Problem: All store methods (GetUserByName, GetUserByID, FindStream by ID) iterate through entire in-memory list.
  • Files: internal/store/users.go (lines 21-49), internal/store/streams.go (lines 12-52)
  • Cause: File-based storage forces full-file loads. Even with caching, no indexing.
  • Improvement path: With database migration, use indexed lookups. For now, maintain in-process cache with invalidation on updates.

Rate limiter token bucket not garbage collected properly:

  • Problem: Buckets for old IPs are deleted every 10 minutes (bucketCleanup), but inactive users' buckets accumulate until cleanup cycle.
  • Files: internal/proxy/proxy.go (lines 170-181)
  • Cause: Cleanup is reactive, not triggered on write. High-traffic scenarios could have thousands of stale buckets in memory.
  • Improvement path: Use sync.Map for lock-free reads. Implement heap-based cleanup timer per bucket instead of global interval.

Entire streams/sessions list rewritten on every add/delete:

  • Problem: Adding one stream requires reading all streams, appending, and rewriting entire file. Deleting a session does the same.
  • Files: internal/store/streams.go (lines 54-78, 80-103), internal/store/sessions.go (lines 22-44, 61-81)
  • Cause: Atomic write pattern (writeJSON uses temp-file-then-rename), but forces full serialization.
  • Improvement path: Migrate to database with transaction support. Implement write-ahead logging if staying with files.

Fragile Areas

Proxy string-based HTML manipulation is fragile:

  • Files: internal/proxy/proxy.go (lines 107-119)
  • Why fragile: Uses string.Index to find <head> and <html> tags with string.ToLower comparisons. Cases like <HEAD> would be missed. Malformed HTML (missing closing tags, nested structures) could place <base> tag in wrong location.
  • Safe modification: Use golang.org/x/net/html parser. Insert <base> into head node properly. Handle edge cases (no head, multiple heads, xhtml).
  • Test coverage: No tests for proxy HTML injection logic. Edge cases untested.

Auth ceremony cleanup relies on goroutines:

  • Files: internal/auth/auth.go (lines 112-117, 234-239)
  • Why fragile: If goroutine is blocked or delayed, cleanup doesn't happen. No guarantee cleanup runs at correct time. Server crash loses all in-flight ceremonies.
  • Safe modification: Use context deadlines instead of sleep timers. Implement cleanup on FinishRegistration/FinishLogin regardless of goroutine. Store ceremonies in database with TTL.
  • Test coverage: No tests for ceremony timeout behavior. Hard to test goroutine cleanup timing.

DeleteStream and related operations use string.Contains for error classification:

  • Files: internal/server/server.go (lines 196-203)
  • Why fragile: Error messages must contain specific strings ("not authorized", "not found") for proper HTTP status mapping. Changing error text breaks error handling.
  • Safe modification: Use error types (custom errors or error wrapping with errors.Is/As). Map error types to status codes centrally.
  • Test coverage: No tests for error status code mapping.

Scraper is single-threaded with mutex but TriggerScrape starts new goroutine:

  • Files: internal/scraper/scraper.go (lines 42-44, 46-92)
  • Why fragile: Calling TriggerScrape while scrape() is running (locked) will queue a second scrape. If scrapes take >15 minutes, queue grows. No bounds on concurrent scrapes.
  • Safe modification: Use atomic flag to prevent concurrent scrapes. Queue only one pending scrape. Timeout long-running scrapes.
  • Test coverage: No tests for concurrent scrape behavior or queue limits.

Admin check depends on user count atomicity:

  • Files: internal/auth/auth.go (lines 83-91)
  • Why fragile: Check user count, then create user are separate operations. Two concurrent registrations both see count=0, both get admin. Later operation fails due to username uniqueness check, but by then both claimed to be admin.
  • Safe modification: Move atomicity into CreateUser. Use database transaction.
  • Test coverage: No concurrency tests for admin initialization.

Scaling Limits

All data files live on single filesystem:

  • Current capacity: Depends on disk size. Assuming 1GB available, JSON files with generous spacing could hold ~100k streams, users, or sessions before performance degrades.
  • Limit: At 10k active users with 5 sessions each (50k sessions), sessions.json alone is >50MB uncompressed. Each read loads entire file.
  • Scaling path: Migrate to database. Use SQLite for single-node, PostgreSQL for distributed. Implement sharding for sessions by user_id.

In-memory rate limit buckets per IP:

  • Current capacity: ~100k unique IPs can be tracked before memory pressure (each bucket ~48 bytes).
  • Limit: Behind a proxy/load balancer, all traffic appears from proxy IP, making per-IP limiting useless. Map grows indefinitely per proxy.
  • Scaling path: Move rate limiting to reverse proxy/load balancer layer (nginx, Envoy). Or, extract real IP from X-Forwarded-For more carefully (currently does this, but assumes trust).

Scraper single-threaded, only hits one subreddit:

  • Current capacity: 25 posts per run * 15-min interval = 100 posts/hour. Each post processes comments once. Total throughput ~1000-5000 URLs/hour depending on post depth.
  • Limit: If stream demand increases or multiple subreddits need scraping, single scraper becomes bottleneck. No parallelism.
  • Scaling path: Implement scraper pool. Scrape multiple subreddits in parallel. Move scraper to separate service. Implement distributed job queue.

WebAuthn session storage grows unbounded until server restart:

  • Current capacity: Each ceremony session is ~1KB. 1000 concurrent registrations = 1MB. 100k in-flight = 100MB.
  • Limit: Memory exhaustion if registrations are started but not finished (or attacker starts many ceremonies).
  • Scaling path: Use database for ceremony sessions. Implement hard timeout (e.g., 5 min) enforced by scheduled cleanup task. Set max concurrent ceremonies.

Dependencies at Risk

go-webauthn/webauthn v0.15.0:

  • Risk: Security library. May have vulnerabilities. Check for updates regularly.
  • Impact: Passkey authentication could be compromised if library has bugs.
  • Migration plan: Keep updated. Monitor GitHub releases. Test updates before deploying.

Hardcoded subreddit URL (reddit.com API):

  • Risk: Reddit API could change, add authentication requirements, or shut down /r/motorsportsstreams2 community.
  • Impact: Scraper stops working entirely. No fallback stream sources.
  • Migration plan: Implement abstraction for stream sources. Support multiple scraper backends (Reddit, Discord, Twitter, etc.). Add health checks for scraper endpoints.

Test Coverage Gaps

No tests for HTTP error handling:

  • What's not tested: Error status code mapping, error response formatting, error logging.
  • Files: internal/server/server.go (all handlers), internal/auth/auth.go (all endpoints)
  • Risk: Error responses could be inconsistent or leaky (exposing internal details). Status codes could be wrong.
  • Priority: High

No tests for concurrent store operations:

  • What's not tested: Race conditions in add/delete/update. Concurrent reads while write in progress.
  • Files: internal/store/streams.go, internal/store/sessions.go, internal/store/users.go
  • Risk: Data corruption or loss under load. Auth bypass if race condition allows duplicate users.
  • Priority: High

No tests for WebAuthn ceremony timeouts:

  • What's not tested: Behavior when ceremony session expires. Cleanup of orphaned sessions.
  • Files: internal/auth/auth.go (BeginRegistration, FinishRegistration, BeginLogin, FinishLogin)
  • Risk: Session fixation, orphaned memory, unexpected behavior on retry.
  • Priority: Medium

No tests for proxy HTML injection:

  • What's not tested: Edge cases (malformed HTML, no head tag, nested structures). Security implications (XSS prevention, CSP).
  • Files: internal/proxy/proxy.go (ServeHTTP)
  • Risk: Injected tags could be placed incorrectly. Proxied content could break. Security headers could be ineffective.
  • Priority: Medium

No tests for rate limiter token bucket algorithm:

  • What's not tested: Burst capacity behavior, refill rate, edge cases (high request volume, time skew).
  • Files: internal/proxy/proxy.go (allowRequest, cleanBuckets)
  • Risk: Rate limiting could be too strict or too lenient. Cleanup could fail to run.
  • Priority: Medium

No tests for admin initialization logic:

  • What's not tested: First user gets admin flag. Edge cases with concurrent registrations. Behavior when ADMIN_USERNAME is set.
  • Files: internal/auth/auth.go (BeginRegistration, lines 83-91)
  • Risk: Non-admin user gets admin flag (privilege escalation). Two admins created unexpectedly.
  • Priority: High

No integration tests for full auth flow:

  • What's not tested: Complete registration + login + logout cycle. Error recovery. Session expiration.
  • Files: All of internal/auth/auth.go and internal/server/server.go auth endpoints.
  • Risk: Subtle bugs in ceremony sequencing. Auth logic could break without being detected.
  • Priority: High

No tests for scraper Reddit parsing:

  • What's not tested: Comment tree recursion. URL extraction. F1 keyword matching. Deduplication logic.
  • Files: internal/scraper/reddit.go
  • Risk: Scraper could miss streams, extract bad URLs, or fail on unexpected Reddit response format.
  • Priority: Medium

Concerns audit: 2026-02-17