Compare commits

...

42 commits

Author SHA1 Message Date
Viktor Barzin
afd78f8d3e kms: replace inline ConfigMap nginx with custom Hugo image
The kms-web-page deployment now pulls
forgejo.viktorbarzin.me/viktor/kms-website:${var.image_tag} (source
in the new Forgejo repo viktor/kms-website). The ConfigMap-mounted
index.html is gone — the new site is a Hugo build with full GVLK
catalog for every Microsoft KMS-eligible Windows + Office edition,
copy-to-clipboard, dark/light themes.

The container image tag is managed by CI (kubectl set image), so
add lifecycle ignore_changes on container[0].image alongside the
existing dns_config (Kyverno) ignore.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:35 +00:00
Viktor Barzin
4518aff71c f1-stream: Stremio addon extractor — TvVoo + StremVerse Sky F1 / DAZN F1
5 parallel research agents surveyed Stremio addons, F1 TV / Sky / DAZN
official APIs, IPTV M3U lists, and free-to-air broadcasters. The clean
finding: two community Stremio addons already index Sky Sports F1 +
DAZN F1 via their public HTTP APIs — no Stremio client required, just
GET /stream/<type>/<id>.json on the addon's hosted instance.

New `stremio.py` extractor pulls from:
- **TvVoo** (`https://tvvoo.hayd.uk/manifest.json`) — wraps Vavoo IPTV.
  Lists Sky Sports F1 UK + Sky Sports F1 HD + Sky Sport F1 IT + Sky
  Sport F1 HD DE + DAZN F1 ES. Returns 2 IP-bound m3u8 URLs per
  channel. Source: github.com/qwertyuiop8899/tvvoo. Vavoo's CDN SSL
  certs are currently expired so most clients fail verification today
  — addon framework is right but delivery is degraded.
- **StremVerse** (`https://stremverse.onrender.com/manifest.json`) —
  Returns 11+ streams per id (`stremevent_591` = F1, `stremevent_866`
  = MotoGP). Mix of DRM-walled DASH, JW-broken-chain JWT URLs, and
  HuggingFace-Space proxies that 404 without a per-instance api_password.

The extractor surfaces 15 candidate URLs per run; verifier filters to
the playable subset. Today that subset is 0 (Vavoo cert expiry + JW
chain + proxy auth), but the wiring is correct: as the addons fix
delivery or rotate to fresh URLs, candidates will start passing.

Other agent findings worth noting (not coded but documented):
- F1 TV Pro live = Widevine DASH; impossible without a CDM. VOD is
  clean HLS but only post-session.
- Sky Go / DAZN / Viaplay / Canal+ = all Widevine + geo-fenced + active
  DMCA enforcement. Pursuing not feasible.
- ServusTV AT (free F1 race weekends) = clean public HLS at
  rbmn-live.akamaized.net/hls/live/2002825/geoSTVATweb/master.m3u8 but
  geo-fenced; needs an Austrian-IP egress proxy/VPN.
- iptv-org/iptv has an F1 Channel (Pluto TV IE) at
  jmp2.uk/plu-6661739641af6400080cd8f1.m3u8 — 24/7 free, BG works,
  but only historic races + shoulder programming. Worth adding as a
  curated entry later.
- boxboxbox.* (community-favourite F1 race-weekend domain) is dead
  across all known TLDs as of today.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:35 +00:00
Viktor Barzin
d832a33039 [woodpecker] Bump WOODPECKER_FORGE_TIMEOUT 3s → 30s
The default forge-API timeout is 3 seconds. The config-loader makes
4-6 sequential calls per pipeline trigger (probing for .woodpecker dir
then each .woodpecker.{yaml,yml} variant), and Forgejo responses on
this cluster spike to 1-2s under load — easy to trip the cumulative
3s deadline. Result: 'could not load config from forge: context
deadline exceeded' on virtually every pipeline trigger.

This was the actual root cause of the 'Woodpecker forge-API bug'
that v3.13 → v3.14 was supposed to fix — turns out v3.14 didn't
change the timeout default, and the v3.13 successes I saw earlier
were warm-cache flukes.
2026-05-07 23:29:35 +00:00
Viktor Barzin
afafc9928f [docs] Onboarding runbook for new Forgejo repos in Woodpecker 2026-05-07 23:29:35 +00:00
Viktor Barzin
5b255cf6f2 state(vault): update encrypted state 2026-05-07 23:29:35 +00:00
Viktor Barzin
108bef7b1a f1-stream: subreddit extractor scans r/motorsportsstreams2 (active sub)
User asked specifically for r/motorsportstreams. Reddit banned that sub
years ago; the active 12.5k-subscriber successor is r/motorsportsstreams2.
Added it to SUBREDDITS plus r/f1streams (709 subs, public).

Also extended:
- SEARCH_QUERIES with three Sky Sports F1 / live-stream phrases that
  catch the `[F1 STREAM]` post pattern the community uses on race
  weekends (titles like "[F1 STREAM] Bahrain GP - Live Race | No Buffer
  | Mobile Friendly" linking to boxboxbox.pro/stream-1).
- _INTERESTING_HOSTS allowlist with boxboxbox.{pro,live,lol},
  pitsport.live, ppv.to, streamed.pk, acestrlms/aceztrims, and the
  Super Formula direct CDNs (racelive.jp, cdn.sfgo.jp) — all observed
  in last-50-posts on r/motorsportsstreams2.

Where this leaves us, honestly:
- The r/motorsportsstreams2 megathread "Where to watch every F1 race"
  recommends EXACTLY the four sites we already pull from: pitsport.xyz,
  streamed.pk, ppv.to, acestrlms. The community has the same broken JW
  Player chain we have for Sky Sports F1 24/7 streams. There is no
  free-and-working alternative they know about.
- boxboxbox.pro (the most-promoted F1 stream domain in race-weekend
  posts) is currently NXDOMAIN; .live is parked, .lol unreachable. The
  domain rotates after takedowns; Reddit posts will surface fresh ones
  when posters share them.
- For F1 specifically: extractor surfaces 2 motomundo.net candidates
  (MotoGP wrappers) and lights up to ~6+ during F1 race weekends as
  posters share fresh boxboxbox/equivalent URLs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:35 +00:00
Viktor Barzin
e110b40a4a monitoring(wealth): monthly contrib-vs-mkt as line chart, not bars
User asked for two lines instead of side-by-side bars at monthly
granularity. Converts panel 25 from barchart to timeseries:

  * type: barchart -> timeseries
  * format: table -> time_series, SELECT month::timestamp AS time
  * drawStyle line, lineWidth 2, fillOpacity 0, showPoints auto
  * Same blue (contributions) / green (market gain) colour overrides

Where the green line rises above the blue line is the visual cue that
the market out-earned new contributions for that month -- the trend
the user wants to track.

Diff is small (15 ins / 28 del) because the bar-chart-only fields
(barRadius, barWidth, groupWidth, stacking, xField, xTickLabelRotation)
are dropped.
2026-05-07 23:29:35 +00:00
Viktor Barzin
84fd752747 monitoring(wealth): monthly contributions vs market gain bar chart
Goal stated by user: see when monthly market gain starts to exceed
monthly contributions, i.e. the inflection point where the market is
out-earning savings rather than the other way around.

New panel id=25 between the annual decomposition (13) and per-account
ROI (14): bar chart with two side-by-side bars per month --
contributions (blue) and market gain (green). Same calculation as
panel 13 but month-grain instead of year-grain. Months where the
green bar dwarfs the blue one are visible at a glance.

SQL: same endpoints CTE pattern as panel 13, with date_trunc('month',
valuation_date) as the grouping key. Uses max_complete cutoff so
partial-today doesn't skew the latest month.

Layout: panels at y >= 75 shifted down by 11 (chart height). New
chart at y=75; panel 14 (per-account ROI) -> y=86; panel 10
(activity log) -> y=96.

Spot check (recent months from PG):
  2025-07: contrib +£5,601    market +£42,295   <- big market month
  2025-09: contrib +£1,501    market +£24,206
  2026-02: contrib +£35,501   market +£41,382
  2026-03: contrib +£5,501    market -£38,483   <- correction
  2026-04: contrib +£73,267   market +£21,448
2026-05-07 23:29:34 +00:00
Viktor Barzin
f1d69b0a7a [wealthfolio] Flip wealthfolio-sync CronJob image to Forgejo
The CronJob has been broken since registry-private lost the
wealthfolio-sync image (last successful run 36+ days ago). The image
is built from /home/wizard/code/broker-sync (the brokerage data sync —
Trading 212, Schwab, Fidelity, IMAP-CSV → wealthfolio).

Set up: viktor/broker-sync repo on Forgejo with .woodpecker/build.yml
that pushes to forgejo.viktorbarzin.me/viktor/wealthfolio-sync. Until
Woodpecker recognises the new repo's webhook, the image was bootstrapped
via 'docker pull viktorbarzin/broker-sync:latest && docker tag … &&
docker push forgejo.viktorbarzin.me/viktor/wealthfolio-sync:latest' so
the CronJob unblocks immediately.
2026-05-07 23:29:34 +00:00
Viktor Barzin
d942a21d93 [woodpecker] Bump server + agent v3.13.0 → v3.14.0
Fixes the 'could not load config from forge: context deadline exceeded'
issue that blocked every Forgejo-triggered pipeline during the
forgejo-registry-consolidation cutover. Helm chart 3.5.1 stays
(no 3.6 yet); only the image tag overrides change.
2026-05-07 23:29:34 +00:00
Viktor Barzin
8c73a0243a [forgejo] Phase 4 final decommission: drop registry-private container + port 5050
Image migration completed (forgejo-migrate-orphan-images.sh ran +
all in-scope images now under forgejo.viktorbarzin.me/viktor/) and
the cluster cutover landed in commit 3148d15d. registry-private is
no longer needed.

* infra/modules/docker-registry/docker-compose.yml — registry-private
  service block removed; nginx 5050 port mapping dropped.
* infra/modules/docker-registry/nginx_registry.conf — upstream
  private block + port 5050 server block removed.
* infra/.woodpecker/build-ci-image.yml — drop the dual-push to
  registry.viktorbarzin.me:5050; only push to Forgejo. Verify-
  integrity step removed (the every-15min forgejo-integrity-probe
  in monitoring covers it). Break-glass tarball step still runs but
  pulls from Forgejo (the only registry left).

The registry-config-sync.yml pipeline will pick this commit up and
sync the new compose+nginx to the VM. Manual final step on the VM:
  ssh root@10.0.20.10 'cd /opt/registry && docker compose up -d --remove-orphans'
to actually destroy the registry-private container — compose does
NOT do orphan removal on a normal up -d.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:34 +00:00
Viktor Barzin
59885c21d0 [claude-memory] Restore truncated main.tf — apply Phase 3 image flip on full file
The Phase 3 commit 3148d15d ran into a disk-full ENOSPC during edit
of stacks/claude-memory/main.tf, and the file was committed truncated
at line 286 mid-string ('Cor instead of 'Core Platform' / closing
braces). terraform validate failed with 'Unterminated template string'.
Restoring the trailing 2 lines + re-applying the
viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/
claude-memory-mcp:17 cutover that Phase 3 was meant to do.
2026-05-07 23:29:34 +00:00
Viktor Barzin
3f3e5fc954 chrome-service: open NP for Traefik → noVNC sidecar (port 6080)
Existing NetworkPolicy only admitted port 3000 (Playwright WS) from
labelled client namespaces, blocking Traefik's traffic to the noVNC
sidecar on port 6080. The chrome.viktorbarzin.me ingress would hang
forever — page never loads, eventually times out.

Adds a second ingress rule allowing TCP/6080 from the traefik
namespace only. Authentik forward-auth still gates external access
at the Traefik layer.

Also reconciles the noVNC image to the new Forgejo registry path
(:v4 unchanged) — already declared in TF, just live-state drift from
the Phase 3 registry consolidation.

Updates the architecture doc; the previous text still described the
old nginx static health stub that noVNC replaced.
2026-05-07 23:29:34 +00:00
Viktor Barzin
56fbd281c9 [forgejo] Restore registry-private temporarily until image migration completes
The Phase 4 docker-compose + nginx changes I landed earlier dropped
the registry-private container's port-5050 listener BEFORE migrating
the existing images to Forgejo. The registry-config-sync pipeline
applied the new nginx config, breaking pulls from registry-private —
which is the source of every image we still need to copy to Forgejo.

Restore registry-private + the 5050 listener until the migration
script has finished. Subsequent commit will drop them once images
are confirmed in Forgejo.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:34 +00:00
Viktor Barzin
a91bbe189e f1-stream: subreddit extractor finds Reddit '[Watch / Download]' threads
Two fixes for the previously-dormant subreddit extractor + a chrome-browser TARGETS pivot to MotoGP weekend live URLs.

1. **Reddit fetch was 403'd by `Accept: application/json`**. Cluster IP +
   that header trips Reddit's anti-bot fingerprint and returns HTML 403.
   Removing the explicit Accept (default `*/*`) restores HTTP 200 with
   JSON. Confirmed via direct httpx test from the f1-stream pod.

2. **Search the right things**. The community uses a stable
   `[Watch / Download] <Series> <Year> - <Round> | <Event>` post pattern
   with selftext links to admin-curated WordPress sites (motomundo.net
   for MotoGP, sister sites for F1 when active). New extractor:
   - Hits both /new.json and /search.json across r/MotorsportsReplays
     and three smaller motorsport subs.
   - Filters posts where title contains `[watch`, `watch online`, or
     flair = `live`.
   - Extracts URLs from selftext (regex), filters to a positive
     `_INTERESTING_HOSTS` allowlist (motomundo, freemotorsports,
     pitsport, rerace, dd12, etc.) so we don't drown the verifier in
     YouTube/Discord/gofile links.
   - Returns each as embed-type so the chrome-service verifier visits.

3. **chrome_browser.TARGETS pivoted** to the live MotoMundo MotoGP
   French GP iframes (motomundo.top/e/<id> + motomundo.upns.xyz/#<id>)
   while the weekend is on. The previous DD12 NASCAR + Acestrlms F1
   targets were both broken JW Player paths anyway.

State after deploy:
- /streams: 3 verified live (WRC Rally Portugal, NASCAR 24/7, Premier League Darts) — Darts is currently active because UK is mid-match.
- Subreddit extractor surfaces the live MotoMundo URL but the verifier
  marks the WordPress wrapper page playable=False (no top-level <video>
  element; the m3u8 lives in nested iframes). Next iteration: drill the
  verifier into iframe contentDocument and capture from there.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:34 +00:00
Viktor Barzin
4ec40ea804 [forgejo] Phases 3+4+5: cutover, decommission, docs sweep
End of forgejo-registry-consolidation. After Phase 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.

Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
  fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
  — image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
  (viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
  from Forgejo. build-ci-image.yml dual-pushes still until next
  build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.

Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
  registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
  Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
  hosts.toml entries for registry.viktorbarzin.me +
  10.0.20.10:5050. (Existing nodes already had the file removed
  manually by `setup-forgejo-containerd-mirror.sh` rollout — the
  cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
  service block removed; nginx 5050 port mapping dropped. Pull-
  through caches for upstream registries (5000/5010/5020/5030/5040)
  stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
  `private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf: registry_
  integrity_probe + registry_probe_credentials resources stripped.
  forgejo_integrity_probe is the only manifest probe now.

Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
  through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
  diagram now reflect Forgejo. Pre-migration root-cause sentence
  preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
  row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
  to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
  alert annotation simplified now that only one registry is in
  scope.

Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to
   match the new template AND `docker compose up -d --remove-orphans`
   to actually stop the registry-private container. Memory id=1078
   confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
   on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
   registry.viktorbarzin.me:5050 from the `repo:` list — at that
   point the post-push integrity check at line 33-107 also needs
   to be repointed at Forgejo or removed (the per-build verify is
   redundant with the every-15min Forgejo probe).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:34 +00:00
Viktor Barzin
e86efd107a [forgejo] Migration script: exclude empty repos, all-images full mode
Updated to handle the actual situation: wealthfolio-sync and
fire-planner have registry repos but no tags (broken/abandoned
deployments). Skip those with a SKIP marker. Migrate everything
else as a stop-gap until Woodpecker pipelines start producing
Forgejo images on their own.

The image list now covers all private images currently in scope.
2026-05-07 23:29:34 +00:00
Viktor Barzin
874f80ecbe [woodpecker] Persist hostAliases patch via null_resource (chart doesn't expose it)
Helm chart 3.5.1 has no `server.hostAliases` field, so the YAML
addition I made earlier was a no-op. Apply via kubectl patch in a
null_resource keyed on helm revision so it re-asserts on every
chart upgrade. Same pattern as the CoreDNS replicas/affinity patch
in stacks/technitium/.

Without this, every helm upgrade on woodpecker reverts the
hostAliases fix and the Forgejo pipeline triggers start failing
with context-deadline-exceeded again.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:34 +00:00
Viktor Barzin
ff19d86557 [woodpecker] Pin forgejo.viktorbarzin.me to in-cluster Traefik LB
Pipeline triggers from Forgejo were failing with "could not load
config from forge: context deadline exceeded" — Woodpecker's
forge-API fetch path was round-tripping through Cloudflare via the
public IP, hitting 30s deadline timeouts on cold connections. The
in-cluster path via the Traefik LB (10.0.20.200) is consistently
sub-100ms.

Same trick we use for the containerd hosts.toml redirect on each
node — Traefik serves the *.viktorbarzin.me wildcard cert so SNI
verification still passes. OAuth callbacks still use the public
hostname (correct, those come from the user's browser).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:34 +00:00
Viktor Barzin
a0b70482fe [forgejo] Bump webhook DELIVER_TIMEOUT 5s -> 30s
Forgejo→Woodpecker webhooks were timing out on first request after
pod restart. The default 5s deadline is too tight for the cold
Cloudflare-tunnel TLS handshake (observed 6-8s). 30s comfortably
covers retries.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:34 +00:00
Viktor Barzin
83496f6e0c [forgejo] Allow webhook delivery to ci.viktorbarzin.me + *.viktorbarzin.me
The Forgejo→Woodpecker webhook (so Woodpecker fires on each push to
viktor/<repo>) was being blocked by the existing ALLOWED_HOST_LIST
of *.svc.cluster.local — ci.viktorbarzin.me resolves to the public IP
because Cloudflare proxying wasn't covering that path. Without this
fix, no Woodpecker pipeline run was triggered on push, the dual-push
bake would never start, and Forgejo's package catalog stays empty.

Add ci.viktorbarzin.me explicitly + *.viktorbarzin.me as a future-
proofing wildcard. The list still excludes arbitrary external hosts,
so this is not a security regression — just unblocking the webhook
to our own CI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:34 +00:00
Viktor Barzin
76d2d0e536 [forgejo] Add chrome-service-novnc:v4 to orphan-image migrator 2026-05-07 23:29:34 +00:00
Viktor Barzin
413ceec35c [forgejo] securityContext.fsGroup=1000 so /data is writable to forgejo
Phase 0 enabled packages but the pod crashloops on
`mkdir /data/tmp: permission denied` — Forgejo loads the chunked
upload path (default /data/tmp/package-upload) before s6-overlay
gets a chance to chown /data. fsGroup tells kubelet to recursively
chown the volume to GID 1000 on mount, which fixes it.

Pre-23-day Forgejo deployed with packages off so this code path
never ran.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:34 +00:00
Viktor Barzin
3fb05825d8 [forgejo] Drop the FORGEJO__packages__CHUNKED_UPLOAD_PATH override
Setting it to /data/tmp/package-upload triggers a CrashLoopBackOff
because /data is the volume mount root and is owned by root, not
the forgejo user (uid 1000) — Forgejo can't `mkdir /data/tmp`.

The default value resolves under the AppDataPath (a subdir Forgejo
itself owns) which works fine. Keep the ENABLED=true override; v11
ships packages on but explicit is safer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:34 +00:00
Viktor Barzin
d67e8ddaf8 f1-stream: add chrome-browser, subreddit, dd12 extractors; fix streamed.pk
User asked to broaden the source pipeline so f1-stream can find F1 (and
adjacent motorsport) streams from Sky Sports / DAZN / Reddit / etc.,
using the in-cluster chrome-service headed browser where needed. Four
changes:

1. **streamed.py**: BASE_URL streamed.su → streamed.pk. The .su domain
   stopped serving the API host in 2026 (only the marketing page is
   left); .pk hosts the JSON API now. Adds 3 events/round (currently
   all routed through embedsports.top — see #2 caveat).

2. **chrome_browser.py** (new): generic chrome-service-driven extractor.
   Connects to the existing chrome-service WS (CHROME_WS_URL +
   CHROME_WS_TOKEN env), navigates a list of TARGETS, captures any HLS
   playlist URL the page fetches at runtime, returns one ExtractedStream
   per discovery. Uses the same stealth init script as the verifier so
   anti-bot checks don't trip the page. Handles iframes (DD12-style
   /nas → /new-nas/jwplayer) and probes child-frame <video>/source
   elements after settle. Caveat: most aggregator sites (pooembed,
   embedsports, hmembeds, even DD12's JW Player path) use a broken
   runtime decoder that produces no m3u8 in our environment, so the
   TARGETS list is currently 0-yielding; the framework is the
   contribution and concrete sites can be added as they're discovered.

3. **subreddit.py** (new): scans r/MotorsportsReplays, r/motorsports,
   r/formula1, r/motogp via the public old.reddit.com JSON API for
   posts whose flair/title indicates a live stream. Discovered URLs
   are returned as embed-type streams; the verifier visits each via
   chrome-service to confirm playability. Note: Reddit currently HTTP
   403's our cluster outbound IP for anonymous JSON requests; the
   extractor returns 0 in that state and logs a debug message. Will
   work from any IP Reddit isn't blocking.

4. **dd12.py** (new): inline-HTML scraper for DD12Streams. The site
   embeds `playerInstance.setup({file: "..."})` directly in HTML — no
   JS decoder needed. Currently surfaces NASCAR Cup Series 24/7 (clean
   BunnyCDN-hosted HLS at w9329432hnf3h34.b-cdn.net/pdfs/master.m3u8);
   add new `(path, label, title)` tuples to CHANNELS as DD12 expands.

Result: /streams now shows 2 verified live streams (Rally TV via
pitsport + DD12 NASCAR Cup 24/7). When the next F1 weekend (Canadian
GP, May 22-24) goes live, pitsport will surface F1 sessions
automatically via the existing pushembdz path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:34 +00:00
Viktor Barzin
a3024d1f51 [docs] Forgejo registry image-rebuild runbook
Companion to forgejo-registry-breakglass.md but for the more common
case: the Forgejo registry is healthy as a whole, but one image's
manifest/blob references are broken (orphan child, half-pushed
upload, retention-vs-pull race). The
RegistryManifestIntegrityFailure alert annotation already points
here.

Mirrors registry-rebuild-image.md (the registry-private equivalent)
in structure: confirm via probe + curl, delete broken version
through Forgejo API, rebuild via Woodpecker manual run, force
consumers to re-pull, verify integrity recovery.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:33 +00:00
Viktor Barzin
fbb41eff9d [ci] Phase 1: infra-ci dual-push + break-glass tarball
Adds Forgejo as a second push target on the build-ci-image pipeline
and saves the just-pushed image as a gzipped tarball on the registry
VM disk (/opt/registry/data/private/_breakglass/) so we can recover
infra-ci with `ctr images import` if both registries are down.

* Dual-push: registry.viktorbarzin.me:5050/infra-ci AND
  forgejo.viktorbarzin.me/viktor/infra-ci, in the same
  woodpeckerci/plugin-docker-buildx step. Same image bytes; the
  Forgejo integrity probe (every 15min) catches any divergence.
* Break-glass step: SSHes to 10.0.20.10, docker pulls + saves +
  gzips, keeps last 5 tarballs (latest symlink). Failure-tolerant
  so a transient registry blip doesn't fail the build pipeline.
* Runbook docs/runbooks/forgejo-registry-breakglass.md documents
  the recovery flow (when to use, scp+ctr import, node cordon,
  underlying-issue fix).

Tarball mirrors to Synology automatically through the existing
daily offsite-sync-backup job — no new sync wiring needed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:33 +00:00
Viktor Barzin
70ea1cf6fd [forgejo] Tolerate missing Vault keys during Phase 0 bootstrap
Wrap the three new Vault key reads in try(...) so the first apply
succeeds even when forgejo_pull_token / forgejo_cleanup_token /
secret/ci/global haven't been populated yet. Without this, CI
auto-apply blocks on the very push that introduces the references —
chicken-and-egg with the runbook order (which is: apply Forgejo bumps,
then create users + PATs, then apply the rest).

Empty tokens are intentionally visible-broken (auth fails, probe
reports auth failure, cleanup CronJob errors) — that's the signal
to run the bootstrap runbook. Subsequent apply picks up the real
values.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:33 +00:00
Viktor Barzin
f793a5f50b [forgejo] Phase 0 of registry consolidation: prepare Forgejo OCI registry
Stage 1 of moving private images off the registry:2 container at
registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption
3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk —
pods still pull from the existing registry until Phase 3.

What changes:
* Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi).
  Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive,
  v11 default-on).
* ingress_factory: max_body_size variable was declared but never wired
  in after the nginx→Traefik migration. Now creates a per-ingress
  Buffering middleware when set; default null = no limit (preserves
  existing behavior). Forgejo ingress sets max_body_size=5g to allow
  multi-GB layer pushes.
* Cluster-wide registry-credentials Secret: 4th auths entry for
  forgejo.viktorbarzin.me, populated from Vault secret/viktor/
  forgejo_pull_token (cluster-puller PAT, read:package). Existing
  Kyverno ClusterPolicy syncs cluster-wide — no policy edits.
* Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster
  Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls).
  Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh
  for existing nodes.
* Forgejo retention CronJob (0 4 * * *): keeps newest 10 versions per
  package + always :latest. First 7 days dry-run (DRY_RUN=true);
  flip the local in cleanup.tf after log review.
* Forgejo integrity probe CronJob (*/15): same algorithm as the
  existing registry-integrity-probe. Existing Prometheus alerts
  (RegistryManifestIntegrityFailure et al) made instance-aware so
  they cover both registries during the bake.
* Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/.

Operational note — the apply order is non-trivial because the new
Vault keys (forgejo_pull_token, forgejo_cleanup_token,
secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the
kyverno + monitoring + forgejo stacks. The setup runbook documents
the bootstrap sequence.

Phase 1 (per-project dual-push pipelines) follows in subsequent
commits. Bake clock starts when the last project goes dual-push.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:33 +00:00
Viktor Barzin
00614a3302 f1-stream: drop broken curated, dedupe streams, accept all pitsport categories
User feedback: every stream on /watch shows ads but the player fails
to load. Three causes, three fixes:

1. CuratedExtractor's two hmembeds 24/7 channels (Sky F1, DAZN F1)
   sat at the top of the list and ALWAYS failed: they load the
   upstream's ad overlay then JW Player throws error 102630 (empty
   playlist; the obfuscated decoder produces no fileURL in our
   environment). Disabled the registration in extractors/__init__.py
   until/unless we find a working bypass — leaving the existing
   `CURATED_BYPASS = {"curated"}` shim in service.py so the swap is
   reversible.

2. Pitsport surfaces every WRC stage / MotoGP session as its own
   /watch UUID, but they all resolve to the same upstream m3u8 URL
   (e.g. RallyTV one master.m3u8 across all 22 Rally de Portugal
   stages). Added URL-keyed dedupe in service.run_extraction so the
   /streams response shows one row per actual stream.

3. The pitsport category filter was still narrowed to motorsport.
   Pitsport.xyz only lists curated sports broadcasts (WRC, MotoGP,
   IndyCar, NASCAR, Premier League Darts, Premier League football…),
   so the site's own selection is the right filter. Replaced the
   hand-maintained MOTORSPORT_KEYWORDS list with `bool(category or
   title)` — anything pitsport returns goes through. Streams that
   aren't actually live get filtered out downstream when the embed
   API returns an empty manifest.

Frontend: hls.js `lowLatencyMode` was on by default but RallyTV (and
most non-LL-HLS providers) don't ship the LL-HLS extensions, which
broke playback in real browsers. Default to `lowLatencyMode: false`.

Result: /streams is now 1 verified live entry (Rally TV WRC stage
currently airing); was 24 with the top 2 always broken + 22 dupes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:33 +00:00
Viktor Barzin
18d96712c7 f1-stream: pitsport extractor — broaden categories + new safeStream payload
The previous extractor only surfaced Formula 1/2/3 and never returned
anything outside race weekends. Two fixes:

1. Broadened category filter from {formula 1/2/3} to a motorsport set
   (MotoGP/Moto2/Moto3, WRC/WEC/IndyCar/NASCAR + the F1 series).
   Replaces the NON_F1_KEYWORDS exclusion list with a positive-match
   MOTORSPORT_KEYWORDS set; removes the F1-specific filter on title
   keywords. Old `_is_f1_*` aliases retained as compat shims.

2. Updated `_parse_stream_config` for the current pushembdz.store embed
   payload — Next.js now serves `safeStream` (just title + method) and
   the actual stream URL is fetched at runtime from
   `pushembdz.store/api/stream/<slug>`. Extractor now hits that endpoint
   when the inline link is missing. Treats `method=jwp` as HLS and
   accepts URLs ending in `.css` (pushembdz disguises some HLS playlists
   with a `.css` extension).

End-to-end result: /streams went from 2 (curated, broken JW decoder) to
24 streams marked `is_live=True`. The verifier confirms each via
`manifest_parsed_codec_missing_in_verifier` (Playwright Chromium has no
H.264 — manifest fetch alone is the codec-independent positive signal).
Currently surfaces Rally de Portugal SS1–SS22 (WRC); MotoGP starts
appearing once the French GP weekend goes live tomorrow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:33 +00:00
Viktor Barzin
8146d05191 chrome-service: replace static health stub with noVNC view
The static nginx stub at chrome.viktorbarzin.me wasn't useful for
debugging anti-bot interactions. Swap it for a live noVNC HTML5 view
of the headed Chromium session: x11vnc taps Xvfb's :99 over localhost
TCP (added `-listen tcp -ac` to Xvfb), websockify wraps it as a WS
endpoint, and noVNC's vendored web client serves it on :6080.

The ingress chain is unchanged — chrome.viktorbarzin.me stays
Authentik-gated, dns_type=proxied, port 3000 (the Playwright WS) stays
internal-only behind the NetworkPolicy + token. Custom image
`registry.viktorbarzin.me/chrome-service-novnc:v4` (ubuntu:24.04 +
x11vnc + websockify + novnc apt packages) needs imagePullSecrets, so
also added registry-credentials reference to the deployment spec.

x11vnc flags: `-noshm -noxdamage -nopw -shared -forever`. SHM is
disabled because each container has its own /dev/shm so the X server
can't grant access; XDAMAGE isn't compiled into the noble Xvfb. The
sidecar entrypoint waits up to 30s for both Xvfb (:6099) and x11vnc
(:5900) to bind before exec'ing websockify.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:33 +00:00
Viktor Barzin
f18cd1d314 chrome-service: in-cluster headed Chromium pool for f1-stream verifier
The f1-stream verifier's in-process headless Chromium kept tripping
hmembeds' disable-devtool.js Performance detector (CDP latency on
console.log vs console.table) and getting redirected to google.com.

This adds a single-replica chrome-service stack running Playwright
launch-server under Xvfb so callers can connect via WS+token to a
shared headed browser. f1-stream's _ensure_browser now prefers
chromium.connect(CHROME_WS_URL/CHROME_WS_TOKEN) and adds a vendored
stealth init script (webdriver/plugins/languages/Permissions/WebGL
spoofs + querySelector hijack to disarm disable-devtool-auto) on
every new context. Falls back to in-process headless if the env
vars aren't set.

Encrypted PVC for profile + npm cache, NetworkPolicy to TCP/3000
gated by client-namespace label, 6h tar.gz backup CronJob to NFS,
Authentik-gated nginx sidecar at chrome.viktorbarzin.me for human
liveness checks. Image pinned to playwright:v1.48.0-noble in
lockstep with the Python client's playwright==1.48.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:32 +00:00
Viktor Barzin
41655096c7 openclaw: realtime usage dashboard via Prometheus exporter sidecar
Stdlib-only Python exporter ($1) reads ~/.openclaw/agents/*/sessions/*.jsonl
(assistant messages with usage) plus auth-profiles.json (OAuth expiry,
Plus-tier label) and exposes Prometheus text format on :9099/metrics.
Container is python:3.12-slim; pod template gets prometheus.io/scrape
annotations so the existing kubernetes-pods job picks it up — no
ServiceMonitor needed.

Metrics exported:
  openclaw_codex_messages_total{provider,model,session_kind}    counter
  openclaw_codex_input/output/cache_read/cache_write_tokens_total
  openclaw_codex_message_errors_total{reason}
  openclaw_codex_active_sessions{kind}                          gauge
  openclaw_codex_oauth_expiry_seconds{provider,account,plan}    gauge
  openclaw_codex_last_run_timestamp                             gauge

Grafana dashboard "OpenClaw — Codex Usage" (Applications folder, 30s
refresh): messages/5h vs Plus rate-card, % of 1,200 floor, tokens/5h,
cache hit %, OAuth expiry days, active sessions, last-turn age, errors,
plus per-model timeseries + bar gauge + error table.

Plus rate-card thresholds in the gauge are conservative (1,200/5h floor;
real cap is dynamic 1,200–7,000). Re-baseline if throttling shows up
below 80%.
2026-05-07 23:29:32 +00:00
Viktor Barzin
115ca184ff openclaw: switch primary to ChatGPT Plus OAuth (openai-codex/gpt-5.4-mini)
Bumps image 2026.2.26 → 2026.5.4 (openai-codex provider plugin landed in
2026.4.21+). Auth profile is OAuth via the device-pairing flow against the
Codex backend (account ancaelena98@gmail.com); token persists in
/home/node/.openclaw/agents/main/agent/auth-state.json on NFS so it survives
pod restarts. Plus tier accepts gpt-5.4-mini (1,200–7,000 local msgs/5h);
gpt-5-mini and gpt-5.1-codex-mini both return errors on Plus, so we pin
gpt-5.4-mini explicitly. doctor --fix auto-promotes the highest-tier model
(gpt-5-pro) after model discovery, so the container command pins the mini
back as default after doctor runs but before gateway start.
2026-05-07 23:29:32 +00:00
Viktor Barzin
574cdf08d2 f1-stream: drop demo + landing-page extractors, add fetch-proxy injection
Per user feedback: the demo Big Buck Bunny / Apple test streams aren't
useful in an F1-streams app. Removed DemoExtractor entirely. Tightened
the discord-extractor path filter from "any stream-shaped path" to
"direct embed/player path only" — the previous filter still let
sportsurge `/event/...` landing pages through, which the verifier
mistook for playable because they render player-class divs without a
real player.

Embed proxy now also rewrites window.fetch + XMLHttpRequest.open inside
the upstream HTML so that cross-origin XHRs (e.g. the hmembeds
`/sec/<JWT>` token-binding endpoint) go through our /embed-asset relay.
This avoids the CORS reject that fired when the player JS tried to call
hghndasw.gbgdhdffhf.shop/sec/... from an `f1.viktorbarzin.me` origin.

The verifier now requires a `<video>` element to mark embed streams
playable (not just a player-class div). Curated streams bypass the
verifier — hmembeds aggressively detects headless Chromium (devtool
trap, console-clear timing, automation flags) and won't progress past
JW Player init in our pod, but the user's real browser should clear
those checks. We can't honestly headless-verify hmembeds, so we trust
the curator instead of falsely rejecting them.

Image: viktorbarzin/f1-stream:v6.1.1
2026-05-07 23:29:32 +00:00
Viktor Barzin
f90d79ed4e f1-stream: only show streams confirmed playable by headless browser
Cuts the stream list from 23 mostly-broken entries to ~6 confirmed-playable
ones, and adds an iframe-stripping proxy so embed sources (hmembeds, etc.)
load through our origin without X-Frame-Options / CSP / JS frame-buster
blocks.

Why: the previous list was dominated by Discord-shared news article URLs,
hardcoded aggregator landing pages, and other non-stream URLs that all sat
at is_live=true because embed streams skipped the health check entirely.
Users could not tell which links would actually play.

What:
- backend/playback_verifier.py: new headless-Chromium verifier (Playwright)
  that polls each candidate stream for a codec-independent "playable" signal
  (hls.js MANIFEST_PARSED for m3u8; <video>/player div for embed). Replaces
  the unconditional is_live=True for embed streams in service.py.
- backend/embed_proxy.py: new /embed and /embed-asset routes that fetch
  upstream embed pages, strip X-Frame-Options/CSP/Set-Cookie, and inject a
  <base href> + frame-buster-defeat <script> that locks down window.top,
  document.referrer, console.clear/table, and window.location so the
  hmembeds disable-devtool.js redirect-to-google trap can't fire.
- extractors/curated.py: new always-on extractor with two known-good 24/7
  hmembeds embeds (Sky Sports F1, DAZN F1) so the list isn't empty between
  race weekends.
- extractors/__init__.py: register CuratedExtractor first; drop
  FallbackExtractor (its 10 aggregator landing-pages can't iframe-play).
- extractors/discord_source.py: positive-match path filter (must look like
  /embed/, /stream, /watch, /live, /player, *.m3u8, *.php) plus expanded
  domain blocklist for news sites — was 10 noise URLs, now ~1.
- extractors/service.py: run_extraction now health-checks AND verifier-
  checks both stream types; only verified-playable streams reach is_live.
- main.py: register /embed + /embed-asset routes; defer initial extraction
  by 8s so the verifier can reach the local /embed proxy on 127.0.0.1:8000.
- frontend/lib/api.js + watch/+page.svelte: route embed iframes through
  /embed proxy instead of the upstream URL, so X-Frame-Options/CSP can't
  block them.
- Dockerfile: install Playwright chromium + system codec-runtime libs.
- main.tf: bump pod memory 256Mi → 1Gi for chromium.

Verified end-to-end with Playwright against
https://f1.viktorbarzin.me/watch — 6/6 streams reach a player UI; the 3
demo m3u8s actually play (codec-bearing browser); the 3 embeds (Sky
Sports F1, DAZN F1, sportsurge) render iframes through the proxy.

Image: viktorbarzin/f1-stream:v6.0.5

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:31 +00:00
Viktor Barzin
8b180f7662 openclaw: switch primary model to qwen3-coder-480b (qwen3.5-397b dead on NIM)
NVIDIA retired nim/qwen/qwen3.5-397b-a17b — modelrelay shows consistent
TIMEOUTs over 24h+ of pings, and nim/nvidia/llama-3.1-nemotron-ultra-253b-v1
returns 404. With both gone the openclaw failover never reached
mistral-large-3 in time, so every message hung until the 120s embedded-run
timeout. Promote qwen3-coder-480b-a35b-instruct (already in models list, UP
~1-2s, 256k ctx) to primary; drop the dead nemotron-ultra fallback.
2026-05-07 23:29:31 +00:00
Viktor Barzin
f006b48566 monitoring(wealth): delta panels to 2x4 grid (rows = type, cols = window)
Better visual grouping: instead of 8 paired panels in a single row at
w=3 (cramped, hard to scan), arrange as a 2x4 grid at w=6. Top row
("all" — wealth change incl new money), bottom row ("mkt" — pure
market gain). Columns are timeframes 1d / 7d / 30d / 90d.

Reading vertically: same window, two interpretations side by side.
Reading horizontally: same metric across timeframes.

Layout shift: delta row goes from y=4 (4 wide) to y=4..11 (8 high).
All chart/log panels with y >= 8 shift down by another 4 rows
(net-worth chart 8->12, activity log 81->85, etc.).
2026-05-07 23:29:31 +00:00
Viktor Barzin
0f107aeacb monitoring(wealth): pair every delta panel with market-only twin
User feedback: net-worth delta panels (1d/7d/30d/90d) confused
because +£174k over 90d looked too big against the £271k cumulative
unrealised gain. Decomposition showed the 90d delta was £114k of new
money in (contributions) + £60k of actual market gain.

So now the delta row shows BOTH:
  Δ Nd (all)  — net-worth change incl new money (the original number)
  Δ Nd (mkt)  — pure market gain, contributions stripped out

Pattern for "(mkt)" panels: same now_snap / past_snap CTEs but
selecting both total_value and net_contribution, then computing
(nw_delta - contrib_delta) = market_gain over window.

Layout: 8 panels at w=3 each on the y=4 row, paired by window
(all next to mkt for each timeframe), so you can see "wealth
change vs investment performance" at a glance.

Verified live (90d): all=+£174,612, mkt=+£60,343, contrib=+£114,268.
2026-05-07 23:29:31 +00:00
Viktor Barzin
87069ae5c3 monitoring(wealth): add delta row (1d / 7d / 30d / 90d net-worth changes)
New row at y=4 with 4 stat panels showing net-worth change over the
trailing windows. Each uses the latest-per-account stitching pattern
(skew-resilient against partial-day syncs) and computes:

  delta = SUM(latest per account) - SUM(latest per account at or
                                       before max_complete - N)

Where max_complete is the most recent date all accounts have a row.
For each window: 1d, 7d, 30d, 90d.

Verified live values: +£8,575 / +£22,696 / +£144,633 / +£174,612.

All panels at y >= 4 shifted down by 4 rows to make room (Net worth
chart 4->8, Per-account stacked 24->28, Activity log 77->81, etc.).

Note: this commit also reformats the dashboard JSON from compact-
object form to indented form (json.dump indent=2 side effect from the
Python patch script). No semantic changes outside the new panels and
y-shifts.
2026-05-07 23:29:31 +00:00
Viktor Barzin
da7a11eb3b fix: strip conditional headers in bot-block-proxy to fix CalDAV sync
nginx's not_modified_filter evaluated If-Match headers forwarded by
Traefik's forwardAuth, returning 412 and breaking CalDAV VTODO updates
from macOS/iOS Reminders. Switch to OpenResty and clear conditional
headers with Lua before proxy processing.
2026-05-07 23:29:31 +00:00
75 changed files with 9635 additions and 2415 deletions

View file

@ -30,7 +30,7 @@ Violations cause state drift, which causes future applies to break or silently r
- **New service**: Use `setup-project` skill for full workflow
- **Ingress**: `ingress_factory` module. Auth: `protected = true`. Anti-AI: on by default. **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`.
- **Docker images**: Always build for `linux/amd64`. Use 8-char git SHA tags — `:latest` causes stale pull-through cache.
- **Private registry**: `registry.viktorbarzin.me` (htpasswd auth, credentials in Vault `secret/viktor`). Use `image: registry.viktorbarzin.me/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the secret to all namespaces. Build & push from registry VM (`10.0.20.10`). Containerd `hosts.toml` redirects pulls to LAN IP directly. Web UI at `docker.viktorbarzin.me` (Authentik-protected). Engine pinned to `registry:2.8.3` (see post-mortem 2026-04-19); on-VM configs deploy via `.woodpecker/registry-config-sync.yml`; integrity probed every 15m by `registry-integrity-probe` CronJob in `monitoring` ns — the HTTP API is the authoritative integrity check, NOT `/blobs/*/data` presence (revision-link absence is the real failure mode).
- **Private registry**: `forgejo.viktorbarzin.me/viktor/<name>` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. Containerd `hosts.toml` on every node redirects to in-cluster Traefik LB `10.0.20.200` to avoid hairpin NAT. Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest`; integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07.
- **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts.
- **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi.
- **Node OS disk tuning** (in `stacks/infra/main.tf`): kubelet `imageGCHighThresholdPercent=70` (was 85), `imageGCLowThresholdPercent=60` (was 80), ext4 `commit=60` in fstab (was default 5s), journald `SystemMaxUse=200M` + `MaxRetentionSec=3day`.

View file

@ -45,7 +45,8 @@
| nextcloud | File sync/share | nextcloud |
| calibre | E-book management (may be merged into ebooks stack) | calibre |
| onlyoffice | Document editing | onlyoffice |
| f1-stream | F1 streaming | f1-stream |
| f1-stream | F1 streaming (uses chrome-service for hmembeds verifier) | f1-stream |
| chrome-service | Headed Chromium WebSocket pool (`ws://chrome-service.chrome-service.svc:3000/<token>`) for sibling services driving anti-bot embeds | chrome-service |
| rybbit | Analytics | rybbit |
| isponsorblocktv | SponsorBlock for TV | isponsorblocktv |
| actualbudget | Budgeting (factory pattern) | actualbudget |

View file

@ -14,104 +14,72 @@ steps:
- name: build-and-push
image: woodpeckerci/plugin-docker-buildx
settings:
repo: registry.viktorbarzin.me:5050/infra-ci
# Phase 4 of forgejo-registry-consolidation 2026-05-07 —
# registry.viktorbarzin.me dropped, Forgejo is the only target.
repo:
- forgejo.viktorbarzin.me/viktor/infra-ci
dockerfile: ci/Dockerfile
context: ci/
tags:
- latest
- "${CI_COMMIT_SHA:0:8}"
platforms: linux/amd64
registry: registry.viktorbarzin.me:5050
logins:
- registry: registry.viktorbarzin.me:5050
- registry: forgejo.viktorbarzin.me
username:
from_secret: registry_user
from_secret: forgejo_user
password:
from_secret: registry_password
from_secret: forgejo_push_token
# Post-push integrity check. Re-resolves the image we just pushed and HEADs
# every blob it references — top-level manifest (index or single), each child
# platform manifest, each config blob, each layer blob. If any returns !=200
# the pipeline fails loudly here so we never ship a broken index downstream.
# Historical context: 2026-04-13 and 2026-04-19 incidents both shipped indexes
# whose platform/attestation children had been GC-orphaned on the registry VM.
- name: verify-integrity
# Post-push integrity check is now redundant with the every-15min
# forgejo-integrity-probe in stacks/monitoring/, which walks
# /v2/_catalog + HEADs every blob across the entire Forgejo registry.
# If a corruption pattern emerges that the periodic probe misses,
# restore a verify step similar to the pre-Phase-4 version (see
# commit 49f4956f) but pointed at forgejo.viktorbarzin.me.
# Break-glass tarball: save the just-pushed infra-ci image to disk on the
# registry VM (10.0.20.10) so we can `docker load` it back into a node
# when Forgejo is unreachable. Pulls from Forgejo (the only registry now).
# Best-effort — failure here doesn't fail the pipeline.
# Recovery procedure: docs/runbooks/forgejo-registry-breakglass.md.
- name: breakglass-tarball
image: alpine:3.20
failure: ignore
environment:
REG_USER:
from_secret: registry_user
REG_PASS:
from_secret: registry_password
REGISTRY_SSH_KEY:
from_secret: registry_ssh_key
FORGEJO_USER:
from_secret: forgejo_user
FORGEJO_PASS:
from_secret: forgejo_push_token
commands:
- apk add --no-cache curl jq
- REG=registry.viktorbarzin.me:5050
- REPO=infra-ci
- apk add --no-cache openssh-client
- mkdir -p ~/.ssh && chmod 700 ~/.ssh
- printf '%s\n' "$REGISTRY_SSH_KEY" > ~/.ssh/id_ed25519
- chmod 600 ~/.ssh/id_ed25519
- ssh-keyscan -t ed25519 10.0.20.10 >> ~/.ssh/known_hosts 2>/dev/null
- SHA=${CI_COMMIT_SHA:0:8}
- AUTH="$REG_USER:$REG_PASS"
- |
set -euo pipefail
ACCEPT='Accept: application/vnd.oci.image.index.v1+json,application/vnd.oci.image.manifest.v1+json,application/vnd.docker.distribution.manifest.list.v2+json,application/vnd.docker.distribution.manifest.v2+json'
fetch_manifest() {
# Prints the body to $2, returns the HTTP code as stdout.
curl -sk -u "$AUTH" -H "$ACCEPT" \
-o "$2" -w '%{http_code}' \
"https://$REG/v2/$REPO/manifests/$1"
}
head_blob() {
curl -sk -u "$AUTH" -o /dev/null -w '%{http_code}' \
-I "https://$REG/v2/$REPO/blobs/$1"
}
verify_single_manifest() {
local ref="$1" tmp=/tmp/m-$$.json
local rc cfg
rc=$(fetch_manifest "$ref" "$tmp")
if [ "$rc" != "200" ]; then
echo "FAIL: manifest $ref returned HTTP $rc"; return 1
fi
cfg=$(jq -r '.config.digest // empty' "$tmp")
if [ -n "$cfg" ]; then
rc=$(head_blob "$cfg")
[ "$rc" = "200" ] || { echo "FAIL: config blob $cfg returned HTTP $rc"; return 1; }
fi
jq -r '.layers[]?.digest' "$tmp" > /tmp/layers-$$.txt
while IFS= read -r layer; do
[ -z "$layer" ] && continue
rc=$(head_blob "$layer")
[ "$rc" = "200" ] || { echo "FAIL: layer blob $layer returned HTTP $rc"; return 1; }
done < /tmp/layers-$$.txt
return 0
}
echo "=== Verifying push integrity for $REPO:$SHA ==="
TOP=/tmp/top-$$.json
rc=$(fetch_manifest "$SHA" "$TOP")
[ "$rc" = "200" ] || { echo "FAIL: top manifest :$SHA returned HTTP $rc"; exit 1; }
MT=$(jq -r '.mediaType // empty' "$TOP")
echo "Top-level media type: ${MT:-<unset>}"
if echo "$MT" | grep -Eq 'manifest\.list|image\.index'; then
jq -r '.manifests[].digest' "$TOP" > /tmp/children-$$.txt
echo "Multi-platform index: $(wc -l </tmp/children-$$.txt) child manifest(s)"
while IFS= read -r d; do
echo "--- child $d ---"
verify_single_manifest "$d" || exit 1
done < /tmp/children-$$.txt
else
echo "Single-platform manifest — verifying directly"
verify_single_manifest "$SHA" || exit 1
fi
echo "=== All manifests + blobs verified. Push integrity intact. ==="
ssh -n -o BatchMode=yes root@10.0.20.10 "
set -e
mkdir -p /opt/registry/data/private/_breakglass
IMAGE=forgejo.viktorbarzin.me/viktor/infra-ci:$SHA
echo \$FORGEJO_PASS | docker login forgejo.viktorbarzin.me -u \$FORGEJO_USER --password-stdin
docker pull \$IMAGE
docker save \$IMAGE | gzip > /opt/registry/data/private/_breakglass/infra-ci-$SHA.tar.gz
ln -sfn infra-ci-$SHA.tar.gz /opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz
ls -t /opt/registry/data/private/_breakglass/infra-ci-*.tar.gz \
| grep -v 'latest' | tail -n +6 | xargs -r rm -v
ls -lh /opt/registry/data/private/_breakglass/
"
- name: slack
image: curlimages/curl
commands:
- |
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"CI image built: registry.viktorbarzin.me:5050/infra-ci:${CI_COMMIT_SHA:0:8}\"}" \
--data "{\"text\":\"CI image built: forgejo.viktorbarzin.me/viktor/infra-ci:${CI_COMMIT_SHA:0:8} (and registry-private mirror)\"}" \
"$SLACK_WEBHOOK" || true
environment:
SLACK_WEBHOOK:

View file

@ -25,7 +25,7 @@ clone:
steps:
- name: apply
image: registry.viktorbarzin.me/infra-ci:latest
image: forgejo.viktorbarzin.me/viktor/infra-ci:latest
pull: true
backend_options:
kubernetes:

View file

@ -14,7 +14,7 @@ clone:
steps:
- name: detect-drift
image: registry.viktorbarzin.me/infra-ci:latest
image: forgejo.viktorbarzin.me/viktor/infra-ci:latest
pull: true
backend_options:
kubernetes:

View file

@ -0,0 +1,136 @@
# chrome-service — In-cluster headed Chromium pool
## Overview
`chrome-service` is a single-replica, persistent-profile, bearer-token-gated
Playwright **launch-server** that exposes a headed Chromium browser over a
WebSocket. Sibling services connect to it instead of running their own
in-process Chromium when the upstream's anti-bot tooling
(`disable-devtool.js` redirect-to-google trap, console-clear timing tricks,
`navigator.webdriver` checks) defeats a headless browser.
Initial caller: `f1-stream`'s `playback_verifier`. Future callers attach
via the WS+token contract documented in `stacks/chrome-service/README.md`.
## Why a separate stack
In-process Chromium inside `f1-stream`:
- Runs **headless** by default (no `Xvfb`/`DISPLAY`).
- Has the `HeadlessChromium/...` UA suffix and `navigator.webdriver === true`.
- Trips `disable-devtool.js`'s **Performance** detector — Playwright's CDP
adds latency to `console.log(largeArray)` vs `console.table(largeArray)`,
which the lib reads as "DevTools is open" and redirects to
`https://www.google.com/`.
`chrome-service` solves this by:
1. Running **headed** under `Xvfb :99` (via `playwright launch-server` with
a JSON config that pins `headless: false`).
2. Living in a long-lived pod so JIT browser launch latency disappears.
3. Allowing a per-context init script
(`stacks/chrome-service/files/stealth.js` ~ 40 lines, vendored from
`puppeteer-extra-plugin-stealth`) to spoof `webdriver`, `chrome.runtime`,
`plugins`, `languages`, `Permissions.query`, WebGL renderer strings, and
to hide the `disable-devtool-auto` script-tag attribute so the lib's
IIFE exits early.
## Wire protocol
```text
ws://chrome-service.chrome-service.svc.cluster.local:3000/<TOKEN>
┌───────────────────────────────┼───────────────────────────────┐
│ caller pod │ chrome-service pod
│ (e.g. f1-stream) │ (single replica)
│ │
│ CHROME_WS_URL ──────────────┘
│ CHROME_WS_TOKEN ─── from `secret/chrome-service.api_bearer_token` (ESO)
│ await chromium.connect(f"{ws}/{token}")
│ await ctx.add_init_script(STEALTH_JS)
│ page.goto("https://upstream.com/embed/...")
└─── ←── pages render under Xvfb, headed Chromium ──── ─────────┘
```
## Image pin
Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
`stacks/chrome-service/main.tf`) and the Python client
(`playwright==1.48.0` in callers' `requirements.txt`) **must match
minor-versions**. Bump in lockstep — Playwright protocol changes between
minors and the client cannot connect to a mismatched server.
The Microsoft image ships only the browser binaries, not the `playwright`
npm SDK; the start command runs `npx -y playwright@1.48.0 launch-server`
which downloads the SDK on first start (cached under `$HOME/.npm` via the
PVC) and reuses it on subsequent restarts.
## Storage
- **`chrome-service-profile-encrypted`** (PVC, 2Gi → 10Gi autoresize,
`proxmox-lvm-encrypted`) — Chromium user-data dir + npm cache.
Encrypted because cookies/localStorage may include third-party auth tokens
for sites callers drive. `HOME=/profile` so npx caches there.
- **`chrome-service-backup-host`** (NFS, RWX) — destination for a 6-hourly
CronJob that `tar -czf /backup/<YYYY_MM_DD_HH>.tar.gz -C /profile .`,
retention 30 days.
## Auth + secrets
- Vault KV `secret/chrome-service.api_bearer_token` — 32-byte URL-safe
random, rotated by hand:
`vault kv put secret/chrome-service api_bearer_token=$(python3 -c 'import secrets; print(secrets.token_urlsafe(32))')`.
- ESO syncs into namespace-local Secret `chrome-service-secrets`
(server pod) and `chrome-service-client-secrets` (each caller pod).
- Reloader (`reloader.stakater.com/auto = "true"`) cascades token rotation
to both server and any annotated caller — no manual rollout.
## Network controls
- **`kubernetes_network_policy_v1.ws_ingress`** — two separate ingress
rules on the same policy:
- **TCP/3000** (Playwright WS): only namespaces labelled
`chrome-service.viktorbarzin.me/client = "true"` (plus an explicit
fallback for `f1-stream` by `kubernetes.io/metadata.name`).
- **TCP/6080** (noVNC HTTP+WS): only the `traefik` namespace, since
the public-facing path is `chrome.viktorbarzin.me` ingress →
Traefik → sidecar. Authentik forward-auth still gates external
access at the Traefik layer.
- **WS port 3000** is internal-only (no ingress, no Cloudflare DNS).
- **noVNC sidecar** (`forgejo.viktorbarzin.me/viktor/chrome-service-novnc`)
exposes a live HTML5 view of the headed Chromium session via
`x11vnc` (connected to Xvfb on `localhost:6099`) bridged to
`websockify` on port 6080. Service `chrome` maps :80 → :6080 and is
exposed via `ingress_factory` at `chrome.viktorbarzin.me`,
Authentik-gated. Both static page and WebSocket upgrade share the
same path — Cloudflare proxy, Cloudflared tunnel, Traefik, and
Authentik forward-auth all preserve `Upgrade: websocket`.
## Adding a new caller
See `stacks/chrome-service/README.md` for the four-step recipe:
1. Label the caller's namespace.
2. Add an `ExternalSecret` pulling `secret/chrome-service`.
3. Inject `CHROME_WS_URL` + `CHROME_WS_TOKEN` env vars.
4. Vendor `stealth.js` and apply via `await context.add_init_script(...)`
after every `new_context()`.
## Limits + risks
- **Anti-bot vs stealth arms race** — when an upstream beats us (DRM
license check, device-fingerprint mismatch, hotlink protection that
whitelists specific parent domains), the verifier returns
`is_playable=False` and the extractor moves on. No user-visible
breakage, just empty stream lists for that source.
- **JWPlayer DRM error 102630** — observed with several hmembeds embeds
even from the headed chrome-service. The license check bails because
the request origin isn't on the embed's allowlist; this is upstream
policy, not an infra defect.
- **Single replica + RWO PVC** — the deployment uses `Recreate` strategy.
Brief outage on rollout, ~30s for browser warmup.
- **No `/metrics` endpoint** — the cluster's generic
`KubePodCrashLooping` rule covers basic alerting. A Prometheus scrape
exporter is day-2 work.

View file

@ -19,7 +19,7 @@ graph LR
I --> J[Pull from DockerHub<br/>or Pull-Through Cache]
K[Pull-Through Cache<br/>10.0.20.10] -.-> J
L[registry.viktorbarzin.me<br/>Private Registry] -.-> J
L[forgejo.viktorbarzin.me<br/>Private Registry on Forgejo] -.-> J
style B fill:#2088ff
style F fill:#4c9e47
@ -33,7 +33,7 @@ graph LR
| GitHub Actions | Cloud | `.github/workflows/build-and-deploy.yml` | Build Docker images, push to DockerHub |
| Woodpecker CI | Self-hosted | `ci.viktorbarzin.me` | Deploy to Kubernetes cluster |
| DockerHub | Cloud | `viktorbarzin/*` | Public image registry |
| Private Registry | Custom | `registry.viktorbarzin.me` | Private images, htpasswd auth |
| Private Registry | Forgejo Packages | `forgejo.viktorbarzin.me/viktor` | Private container images (PAT auth, retention CronJob) — migrated from registry.viktorbarzin.me 2026-05-07 |
| Pull-Through Cache | Custom | `10.0.20.10:5000` (docker.io)<br/>`10.0.20.10:5010` (ghcr.io) | LAN cache for remote registries |
| Kyverno | Cluster | `kyverno` namespace | Auto-sync registry credentials to all namespaces |
| Vault | Cluster | `vault.viktorbarzin.me` | K8s auth for Woodpecker pipelines |
@ -102,7 +102,7 @@ Woodpecker API uses numeric IDs (not owner/name):
1. **Containerd hosts.toml** redirects pulls from docker.io and ghcr.io to pull-through cache at `10.0.20.10`
2. **Pull-through cache** serves cached images from LAN, fetches from upstream on cache miss
3. **Kyverno ClusterPolicy** auto-syncs `registry-credentials` Secret to all namespaces for private registry access
4. **Private registry** (`registry.viktorbarzin.me`) uses htpasswd auth, credentials stored in Vault. Runs `registry:2.8.3` (pinned — floating `registry:2` was the root cause of the 2026-04-13 + 2026-04-19 orphan-index incidents; see `docs/post-mortems/2026-04-19-registry-orphan-index.md`).
4. **Private registry** has been Forgejo's built-in OCI registry at `forgejo.viktorbarzin.me/viktor/<image>` since 2026-05-07. Auth via PAT (Vault `secret/ci/global/forgejo_push_token` for push, `secret/viktor/forgejo_pull_token` for pull). The pre-migration `registry:2.8.3`-based private registry on `registry.viktorbarzin.me:5050` was the root cause of three orphan-index incidents in three weeks (2026-04-13, 2026-04-19, 2026-05-04 — see `docs/post-mortems/2026-04-19-registry-orphan-index.md` and the full migration writeup at `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md`). The five pull-through caches on `10.0.20.10` (ports 5000/5010/5020/5030/5040) stay in place for upstream registries.
5. **Integrity probe** (`registry-integrity-probe` CronJob in `monitoring` ns, every 15m) walks `/v2/_catalog` → tags → indexes → child manifests via HEAD and pushes `registry_manifest_integrity_failures` to Pushgateway; alerts `RegistryManifestIntegrityFailure` / `RegistryIntegrityProbeStale` / `RegistryCatalogInaccessible` page on broken state. Authoritative check (HTTP API, not filesystem).
### Infra Pipelines (Woodpecker-only)

View file

@ -63,7 +63,7 @@ graph TB
| External Monitor Sync | Python 3.12 | `stacks/uptime-kuma/` | CronJob (10min) syncs `[External]` monitors from `cloudflare_proxied_names` |
| dcgm-exporter | Configurable resources | `stacks/monitoring/modules/monitoring/` | NVIDIA GPU metrics collection |
| Email Roundtrip Probe | Python 3.12 | `stacks/mailserver/modules/mailserver/` | E2E email delivery verification via Mailgun API + IMAP |
| Registry Integrity Probe | Alpine 3.20 + curl/jq | `stacks/monitoring/modules/monitoring/main.tf` | CronJob every 15m: walks `/v2/_catalog` on `registry.viktorbarzin.me:5050`, HEADs every tagged manifest + index child; emits `registry_manifest_integrity_*` metrics to Pushgateway. Catches orphan OCI-index state that filesystem scans miss. |
| Forgejo Registry Integrity Probe | Alpine 3.20 + curl/jq | `stacks/monitoring/modules/monitoring/main.tf` | CronJob every 15m: walks `/v2/_catalog` on `forgejo.viktorbarzin.me` (HTTP via in-cluster service), HEADs every tagged manifest + index child; emits `registry_manifest_integrity_*` metrics to Pushgateway. Replaces the legacy `registry-integrity-probe` against `registry.viktorbarzin.me:5050` decommissioned in Phase 4 of forgejo-registry-consolidation 2026-05-07. |
## How It Works

View file

@ -0,0 +1,195 @@
# Forgejo Registry Consolidation — Design
**Date**: 2026-05-07
**Status**: Approved
## Problem
`registry-private` (the `registry:2` container on the docker-registry
VM at `10.0.20.10`) has hit `distribution#3324` corruption three
times in three weeks (2026-04-13, 2026-04-19, 2026-05-04). Each
incident required manual blob recovery and another round of
hardening to `cleanup-tags.sh` and the GC procedure. The integrity
probe catches it within 15 minutes now, but every hit still costs
~1h of cleanup, and we keep tightening the same loose screw.
Root cause is a known race in `distribution`: tag deletes that race
with concurrent garbage collection produce orphan OCI-index children.
Upstream has not patched it; our mitigations (probe, blob
fix-up script, idempotent cleanup) reduce blast radius but don't
remove the failure mode.
Forgejo (deployed for OAuth and personal repos at
`forgejo.viktorbarzin.me`) ships a built-in OCI registry as part of
the Packages feature, default-on in v11. Using it removes
`distribution`-the-engine from the path entirely, replaces it with
Forgejo's own implementation backed by Forgejo's DB+blob store, and
gets us source hosting + image hosting in one resource.
The PVE host RAM upgrade from 142GB to 272GB (memory id=569) means
the cluster can absorb the resource bump Forgejo needs for the
registry workload (1Gi → 1Gi).
## Decision
Move every image currently on `registry.viktorbarzin.me:5050` to
Forgejo's OCI registry at `forgejo.viktorbarzin.me`. Decommission
`registry-private` after a 14-day dual-push bake.
Pull-through caches for upstream registries (DockerHub, GHCR, Quay,
k8s.gcr, Kyverno) stay on the registry VM permanently — Forgejo
won't serve as a pull-through, so the chicken-and-egg of "Forgejo
pulling its own image through itself" never arises.
## Design
### Registry hostname
Image references become `forgejo.viktorbarzin.me/viktor/<image>:<tag>`.
The `viktor/` prefix is the Forgejo owner namespace; all current
private images ship under that single owner.
### Auth
Two service-account users:
| User | Scope | Vault key | Used by |
|---|---|---|---|
| `cluster-puller` | `read:package` | `secret/viktor/forgejo_pull_token` | cluster-wide `registry-credentials` Secret, monitoring probe |
| `ci-pusher` | `write:package` | `secret/ci/global/forgejo_push_token` | Woodpecker pipelines (synced via `vault-woodpecker-sync` CronJob) |
A third PAT (`secret/viktor/forgejo_cleanup_token`, also belongs to
`ci-pusher`) drives the retention CronJob — kept separate from the
push PAT so a leaked CI token doesn't immediately enable mass deletes.
PATs have no expiry. Rotation policy: regenerate via Forgejo Web UI
and `vault kv patch` if a leak is suspected; ESO/sync downstream is
automatic.
### Cluster pull path
`registry-credentials` is a single Secret in `kyverno` ns, cloned
into every namespace by the existing
`sync-registry-credentials` ClusterPolicy. We extend its
`dockerconfigjson` `auths` map with a fourth entry for
`forgejo.viktorbarzin.me`. **No new Secret, no new ClusterPolicy,
no `imagePullSecrets =` line edits across stacks.**
Containerd `hosts.toml` redirects `forgejo.viktorbarzin.me` → in-cluster
Traefik LB at `10.0.20.200`, the same pattern used for
`registry.viktorbarzin.me``10.0.20.10:5050`. Avoids hairpin NAT
through the WAN gateway for in-cluster pulls.
### Push path
Woodpecker pipelines push to BOTH targets during the bake:
```yaml
- name: build-and-push
image: woodpeckerci/plugin-docker-buildx
settings:
repo:
- registry.viktorbarzin.me/<name>
- forgejo.viktorbarzin.me/viktor/<name>
logins:
- registry: registry.viktorbarzin.me
username:
from_secret: registry_user
password:
from_secret: registry_password
- registry: forgejo.viktorbarzin.me
username:
from_secret: forgejo_user
password:
from_secret: forgejo_push_token
```
The `vault-woodpecker-sync` CronJob (every 6h) propagates
`secret/ci/global` keys to every Woodpecker repo as global secrets.
### Retention
Forgejo's per-package "Cleanup Rules" UI is per-user runtime DB
state, not Terraform-driven. Retention runs as a CronJob in the
`forgejo` namespace, schedule `0 4 * * *`, that:
1. Lists all container packages under the `viktor` owner.
2. Groups by package name.
3. Keeps newest 10 versions + always keeps `latest`.
4. DELETEs the rest via `/api/v1/packages/{owner}/{type}/{name}/{version}`.
First 7 days run with `DRY_RUN=true` — script logs what it would
delete but issues no DELETE calls. After log review, flip the
`forgejo_cleanup_dry_run` local in `cleanup.tf` to false.
### Integrity monitoring
Mirror the existing `registry-integrity-probe` CronJob: walk
`/v2/_catalog`, walk every tag, HEAD every manifest + index child,
push `registry_manifest_integrity_*` metrics. Existing
Prometheus alerts fire on the `instance` label, so they cover both
probes automatically once the alert annotations are made
instance-aware (done in this change).
### Source migration
Projects currently living as plain dirs in the local-only monorepo
become standalone Forgejo repos. Two GitHub-hosted private repos
(`beadboard`, `claude-memory-mcp`) move to Forgejo and are archived
on GitHub.
CI standardises on Woodpecker for everything in scope. The two
projects that used GHA (build + Woodpecker-deploy via GHA-hosted
DockerHub push) keep DockerHub for legacy compatibility but their
canonical image source becomes Forgejo.
### Break-glass for infra-ci
`infra-ci` is the Docker image used by all infra Woodpecker
pipelines, including `default.yml` (terragrunt apply). If Forgejo is
unreachable at the moment we need to apply, `infra-ci` is
unreachable, and we can't apply our way out.
Mitigation: dual-push step also `docker save | gzip` the built
infra-ci image to:
- `/opt/registry/data/private/_breakglass/infra-ci-<sha>.tar.gz` on
the registry VM disk (Copy 1)
- `/srv/nfs/forgejo-breakglass/` on the NAS (Copy 2)
A `latest` symlink in each location points at the most recent.
Recovery procedure (`docs/runbooks/forgejo-registry-breakglass.md`):
scp tarball → `docker load``ctr -n k8s.io images import` → fix
Forgejo via that node.
### Cutover style
**Dual-push bake**: pipelines push to both registries for ≥14 days.
Pods continue pulling from `registry.viktorbarzin.me`. After bake:
1. Per-project PR: flip `image=` lines in Terraform stacks. Pod
re-pull naturally on next rollout.
2. Phase 4: stop `registry-private` container, remove its
`auths` entry from the cluster Secret, drop containerd hosts.toml
entry.
## Why not alternatives
| Option | Rejected because |
|---|---|
| Stay on `registry-private` | Three corruption incidents in three weeks; mitigation cost rising |
| Run a fresh registry container alongside (no Forgejo) | Same upstream, same `distribution#3324` failure mode |
| GHCR / DockerHub for all private images | Public-by-default model + push rate limits; loses owner-owned blob storage |
| Harbor | Heavier than Forgejo registry, would need its own DB + ingress, no source-hosting integration |
## Risks
See plan doc § "Risk register" for the full table. Top three:
1. **Forgejo registry hits the same corruption pattern.** Mitigated
by 14-day bake + integrity probe within 15 min.
2. **Forgejo down → infra-ci unreachable → can't apply.** Mitigated
by tarball break-glass on VM + NAS.
3. **Pod re-pulls fail after `image=` flip due to containerd cache
poisoning.** Mitigated by hosts.toml deployment + per-project
`kubectl rollout restart` in Phase 3.

View file

@ -0,0 +1,152 @@
# Forgejo Registry Consolidation — Plan
**Date**: 2026-05-07
**Status**: Approved — execution in progress (Phase 0)
**Design**: `2026-05-07-forgejo-registry-consolidation-design.md`
This is the implementation roadmap for migrating off `registry-private`
onto Forgejo's OCI registry. See the design doc for problem
statement and rationale. Execution spans 5 phases over ≥3 weeks.
## Phase 0 — Prepare Forgejo (1 PR, no cutover risk)
| Task | File / artifact |
|---|---|
| Bump Forgejo memory request+limit 384Mi → 1Gi | `infra/stacks/forgejo/main.tf` |
| Add `FORGEJO__packages__ENABLED=true` and `FORGEJO__packages__CHUNKED_UPLOAD_PATH=/data/tmp/package-upload` env vars (defensive — already default in v11) | `infra/stacks/forgejo/main.tf` |
| Bump Forgejo PVC 5Gi → 15Gi, auto-resize cap 20Gi → 50Gi | `infra/stacks/forgejo/main.tf` |
| Bump ingress `max_body_size = "5g"` (wired into ingress_factory as a Buffering middleware) | `infra/stacks/forgejo/main.tf`, `infra/modules/kubernetes/ingress_factory/main.tf` |
| Create `cluster-puller` (read:package), `ci-pusher` (write:package), and a third `cleanup` PAT on `ci-pusher`; store PATs in Vault | runbook: `docs/runbooks/forgejo-registry-setup.md` |
| Extend `registry-credentials` Secret with 4th `auths` entry for `forgejo.viktorbarzin.me` | `infra/stacks/kyverno/modules/kyverno/registry-credentials.tf` |
| Add containerd `hosts.toml` entry redirecting `forgejo.viktorbarzin.me` → in-cluster Traefik LB `10.0.20.200` | `infra/stacks/infra/main.tf` cloud-init + new `infra/scripts/setup-forgejo-containerd-mirror.sh` for existing nodes |
| Forgejo retention CronJob (`0 4 * * *`, dry-run for first 7 days) | new `infra/stacks/forgejo/cleanup.tf` + `infra/stacks/forgejo/files/cleanup.sh` |
| Forgejo integrity probe CronJob (`*/15 * * * *`) | `infra/stacks/monitoring/modules/monitoring/main.tf` |
| Make existing alerts instance-aware so they cover both registries | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` |
**Smoke test (must pass before declaring Phase 0 done):**
- `docker login forgejo.viktorbarzin.me` succeeds.
- Push a hello-world image to `forgejo.viktorbarzin.me/viktor/smoketest:1` succeeds.
- `crictl pull forgejo.viktorbarzin.me/viktor/smoketest:1` from a k8s
node succeeds, using the auto-synced `registry-credentials` Secret.
- A fresh namespace gets the cloned Secret with 4 `auths` entries.
- Delete the smoketest package via API.
- Forgejo integrity probe completes once and pushes metrics.
## Phase 1 — Source migration (parallel-safe, no production impact)
For each project the recipe is identical:
1. `git init` + push to `forgejo.viktorbarzin.me/viktor/<name>`
register in Woodpecker via OAuth.
2. Add `.woodpecker.yml` based on `payslip-ingest/.woodpecker.yml`.
Push step uses `woodpeckerci/plugin-docker-buildx` with TWO
`repo:` entries (dual-push).
3. Confirm first build pushes to BOTH registries.
Projects (bake clock starts at "all dual-push"):
| Project | Action |
|---|---|
| `claude-agent-service` | Extract from monorepo to Forgejo. New `.woodpecker.yml`. |
| `fire-planner` | Extract from monorepo to Forgejo. New `.woodpecker.yml`. |
| `wealthfolio-sync` | Extract from monorepo to Forgejo. New `.woodpecker.yml`. |
| `hmrc-sync` | Extract from monorepo to Forgejo. New `.woodpecker.yml`. |
| `freedify` | Push from monorepo to Forgejo. New `.woodpecker.yml`. (Upstream is gone.) |
| `payslip-ingest` | Already on Forgejo. Add second `repo:` entry to `.woodpecker.yml`. |
| `job-hunter` | Already on Forgejo. Add second `repo:` entry. |
| `beadboard` | Push to Forgejo. New `.woodpecker.yml`. Disable GHA workflow. **Don't archive GitHub yet** (deferred to Phase 3). |
| `claude-memory-mcp` | Push to Forgejo. New `.woodpecker.yml`. |
| `infra-ci` | Edit `.woodpecker/build-ci-image.yml` to dual-push. ALSO `docker save | gzip` to `/opt/registry/data/private/_breakglass/` on VM AND `/srv/nfs/forgejo-breakglass/` on NAS. Pin a `latest` symlink. |
Break-glass runbook (`docs/runbooks/forgejo-registry-breakglass.md`)
documents the recovery path.
## Phase 2 — Bake (≥14 days)
- No `image=` lines change. Pods still pull from
`registry.viktorbarzin.me`.
- **Daily smoke check**: pull a recent image from Forgejo as
`cluster-puller`, verify integrity (HEAD on manifest + each blob).
- **Bake exit criteria**:
- Zero `RegistryManifestIntegrityFailure` alerts on Forgejo.
- Zero `ContainerNearOOM` for the forgejo pod.
- Retention CronJob has run ≥14 times successfully.
- At least one full Sunday GC cycle has elapsed.
- Switch retention CronJob to `DRY_RUN=false` on day 7, observe
until day 14.
## Phase 3 — Cutover (one PR per project, single session)
Order = lowest blast radius first. Each step:
`image=` flip → `kubectl rollout restart` → verify pull from Forgejo.
1. `payslip-ingest` (`infra/stacks/payslip-ingest/main.tf`)
2. `job-hunter` (`infra/stacks/job-hunter/main.tf`)
3. `claude-agent-service` (`infra/stacks/claude-agent-service/main.tf`)
4. `fire-planner` (`infra/stacks/fire-planner/main.tf`)
5. `wealthfolio-sync` (`infra/stacks/wealthfolio/main.tf`)
6. `freedify` (`infra/stacks/freedify/factory/main.tf`)
7. `chrome-service` (`infra/stacks/chrome-service/main.tf`)
8. `beads-server` / `beadboard` (`infra/stacks/beads-server/main.tf`).
Then `gh repo archive ViktorBarzin/beadboard`.
9. `infra-ci` — flip `image:` references in 4 `.woodpecker/*.yml`
files in the infra repo. Verify next push to master applies cleanly.
10. `claude-memory-mcp` — update `CLAUDE.md` install instruction from
`claude plugins install github:ViktorBarzin/claude-memory-mcp` to
`claude plugins install https://forgejo.viktorbarzin.me/viktor/claude-memory-mcp.git`.
`gh repo archive ViktorBarzin/claude-memory-mcp`.
## Phase 4 — Decommission
| Step | File / location |
|---|---|
| Stop `registry-private` container on VM (10.0.20.10): edit `/opt/registry/docker-compose.yml`, comment out service, `docker compose up -d --remove-orphans`. (Manual SSH — cloud-init won't redeploy on TF apply per memory id=1078.) | live VM |
| Update cloud-init template to match the new compose file | `infra/stacks/infra/main.tf:288` |
| Delete `auths` entries for `registry.viktorbarzin.me` / `:5050` / `10.0.20.10:5050` from the dockerconfigjson | `infra/stacks/kyverno/modules/kyverno/registry-credentials.tf` |
| Drop `registry.viktorbarzin.me` and `10.0.20.10:5050` `hosts.toml` entries on each node + cloud-init template | `infra/stacks/infra/main.tf` cloud-init + ad-hoc script |
| After 1 week of no incidents, delete `/opt/registry/data/private/` blob storage on the VM (~2.6GB freed) | manual SSH |
## Phase 5 — Docs
In the same commit as the Phase 4 closing:
| Doc | Update |
|---|---|
| `docs/runbooks/registry-vm.md` | Note `registry-private` is gone; pull-through caches and break-glass tarballs only |
| `docs/runbooks/registry-rebuild-image.md` | Replaced by NEW `forgejo-registry-rebuild-image.md` |
| `docs/runbooks/forgejo-registry-rebuild-image.md` (NEW) | Forgejo PVC restore procedure |
| `docs/runbooks/forgejo-registry-breakglass.md` (NEW) | infra-ci tarball recovery |
| `docs/architecture/ci-cd.md` | Image registry section flips to Forgejo |
| `docs/architecture/monitoring.md` | Integrity probe target updated |
| `infra/.claude/CLAUDE.md` | Registry references updated |
| `CLAUDE.md` (monorepo root) | claude-memory-mcp install URL updated |
| `infra/.claude/reference/service-catalog.md` | Cross-reference checked |
## Critical files modified
| File | Phase | What |
|---|---|---|
| `infra/stacks/forgejo/main.tf` | 0 | Memory bump, packages env vars, PVC bump, ingress max_body_size |
| `infra/stacks/forgejo/cleanup.tf` (NEW) | 0 | Retention CronJob |
| `infra/stacks/forgejo/files/cleanup.sh` (NEW) | 0 | Retention script (mounted via ConfigMap) |
| `infra/modules/kubernetes/ingress_factory/main.tf` | 0 | Wire `max_body_size` into a Traefik Buffering middleware |
| `infra/stacks/kyverno/modules/kyverno/registry-credentials.tf` | 0 | Add 4th `auths` entry |
| `infra/stacks/infra/main.tf` | 0 + 4 | Containerd hosts.toml block (add Forgejo, later remove registry-private); compose template update |
| `infra/scripts/setup-forgejo-containerd-mirror.sh` (NEW) | 0 | One-shot rollout for existing nodes |
| `infra/stacks/monitoring/modules/monitoring/main.tf` | 0 | Forgejo integrity probe CronJob |
| `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` | 0 | Make alerts instance-aware |
| `infra/stacks/monitoring/main.tf` | 0 | Plumb `forgejo_pull_token` into module |
| `infra/.woodpecker/build-ci-image.yml` | 1 | Dual-push to add Forgejo target + tarball break-glass |
| `<each-project>/.woodpecker.yml` | 1 | Dual-push (NEW for fire-planner, wealthfolio-sync, hmrc-sync, freedify, beadboard, claude-memory-mcp; EDIT for payslip-ingest, job-hunter, claude-agent-service) |
| `infra/.woodpecker/{default,drift-detection,build-cli}.yml` | 3 | Flip `image:` to Forgejo for infra-ci |
| `infra/stacks/{beads-server,chrome-service,claude-agent-service,fire-planner,freedify/factory,job-hunter,payslip-ingest,wealthfolio}/main.tf` | 3 | Flip `image =` to Forgejo |
## Verification
- **Push** (Phase 0/1): `docker push forgejo.viktorbarzin.me/viktor/<name>` visible in Forgejo Web UI under viktor/.
- **Pull** (Phase 0): `crictl pull forgejo.viktorbarzin.me/viktor/smoketest:1` succeeds with auto-synced Secret.
- **Dual-push** (Phase 1): every Woodpecker pipeline run pushes to BOTH endpoints — confirmed via HEAD checks on `<reg>:<sha>` for both.
- **Bake** (Phase 2): existing daily Forgejo `/api/healthz` external monitor stays green; integrity probe stays green; no `ContainerNearOOM` for forgejo pod.
- **Cutover** (Phase 3): `kubectl rollout status deploy/<svc> -n <ns>` succeeds. `kubectl describe pod` shows the image was pulled from `forgejo.viktorbarzin.me`.
- **Decommission** (Phase 4): `docker ps` on registry VM no longer shows `registry-private`. Brand-new namespace gets the Secret with only the Forgejo `auths` entry. Pull still works.

View file

@ -0,0 +1,126 @@
# Runbook: Forgejo registry break-glass — recovering infra-ci
Last updated: 2026-05-07
## When to use this runbook
When **all** of the following are true:
1. Forgejo (`forgejo.viktorbarzin.me`) is unreachable.
2. `registry-private` is also gone (post-Phase 4 of the consolidation),
so you can't fall back to `registry.viktorbarzin.me:5050/infra-ci`.
3. You need to run an infra Woodpecker pipeline (apply, build-cli,
drift-detection, etc.) — but those pipelines pull `infra-ci` and
crash because the registry is down.
If only Forgejo is down but `registry-private` is still alive, the
pipelines work — `image:` references in `infra/.woodpecker/*.yml`
still hit `registry.viktorbarzin.me:5050/infra-ci` until Phase 3
flips them. Skip this runbook entirely.
## What's available
The `build-ci-image.yml` Woodpecker pipeline saves a tarball after
each successful push:
| Location | Path |
|---|---|
| Registry VM disk (10.0.20.10) | `/opt/registry/data/private/_breakglass/infra-ci-<sha>.tar.gz` |
| Registry VM disk (latest symlink) | `/opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz` |
| Synology NAS (offsite copy via daily-backup sync) | `/volume1/Backup/Viki/pve-backup/_forgejo-breakglass/` |
The registry VM keeps the last 5 tarballs. Synology mirrors them
through the existing offsite-sync-backup job (`/usr/local/bin/
offsite-sync-backup`).
## Recovery procedure
The goal is to get a working `infra-ci` image onto a k8s node so
Woodpecker pods can run it. Then run a Woodpecker pipeline that
restores Forgejo from PVC backup or rebuilds it.
### Step 1 — copy the tarball to a node
From your workstation (the registry VM is reachable but Forgejo is
not — the rest of the cluster might be in a similar partial state):
```bash
ssh wizard@10.0.20.103 # any responsive k8s node
sudo mkdir -p /var/breakglass
sudo scp root@10.0.20.10:/opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz \
/var/breakglass/
```
If the registry VM is also down, fall back to Synology:
```bash
sudo scp 192.168.1.13:/volume1/Backup/Viki/pve-backup/_forgejo-breakglass/infra-ci-latest.tar.gz \
/var/breakglass/
```
### Step 2 — load into containerd
`docker load` won't help on a k8s node — it loads into the docker
daemon, which kubelet/containerd doesn't see. Use `ctr`:
```bash
sudo ctr -n k8s.io images import /var/breakglass/infra-ci-latest.tar.gz
sudo ctr -n k8s.io images list | grep infra-ci
```
Confirm the image is tagged with the original repository name
(`registry.viktorbarzin.me:5050/infra-ci:<sha>` — the tarball was
saved with that tag, NOT the Forgejo name).
### Step 3 — pin pods to this node
Add a node selector or taint-toleration to whatever pipeline you
need to run. Simplest: cordon the other nodes briefly so Woodpecker
schedules onto this one.
```bash
for n in $(kubectl get nodes -o name | grep -v $(hostname)); do
kubectl cordon ${n#node/}
done
```
Run the pipeline. After it completes:
```bash
for n in $(kubectl get nodes -o name); do
kubectl uncordon ${n#node/}
done
```
### Step 4 — fix the underlying problem
The pipeline you just ran was meant to restore Forgejo. Common
options:
- **Forgejo PVC corrupt**`docs/runbooks/forgejo-registry-rebuild-image.md`
walks through PVC restore from LVM snapshot or PVE backup.
- **Forgejo OOM-loop** — bump memory request+limit in
`infra/stacks/forgejo/main.tf` and apply.
- **Forgejo unreachable due to network** — check Traefik, MetalLB,
pfSense.
Once Forgejo is back, run `build-ci-image.yml` manually so the
tarball regenerates with the latest commit.
## Why this exists
The 2026-04-19 post-mortem on the registry-orphan-index incident
showed that a single registry going corrupt could block ALL infra
pipelines (because every pipeline pulls `infra-ci` from that
registry). The dual-push to Forgejo + registry-private removes that
single-point-of-failure during the bake. After Phase 4
decommissions registry-private, the tarball is the last line of
defense.
## Why on the registry VM and not in-cluster
The Forgejo pod and registry-private pod both depend on cluster
networking + storage. The registry VM is an independent
non-clustered VM with local storage. If the cluster is in a bad
state, the VM's disk is still readable from any other host on the
LAN.

View file

@ -0,0 +1,128 @@
# Runbook: Rebuild an Image on the Forgejo OCI Registry
Last updated: 2026-05-07
## When to use this
Pipelines pulling from `forgejo.viktorbarzin.me/viktor/<image>` fail with:
- `failed to resolve reference … : not found`
- `manifest unknown`
- HEAD on a manifest/blob digest returns 404
- `forgejo-integrity-probe` CronJob in `monitoring` reports
`registry_manifest_integrity_failures > 0` for
`instance="forgejo.viktorbarzin.me"`
This is the Forgejo equivalent of the registry-private orphan-index
failure mode (`docs/post-mortems/2026-04-19-registry-orphan-index.md`).
Cause is usually package-version delete races with an in-flight pull,
or PVC corruption. Fix is to rebuild the image from source and
re-push, so Forgejo receives a complete, fresh upload.
If the symptom is different (Forgejo unreachable, PVC OOM,
authentication failure), use:
- `docs/runbooks/forgejo-registry-setup.md` for auth + token issues
- `docs/runbooks/forgejo-registry-breakglass.md` if Forgejo + the
cluster are both unreachable
- `docs/runbooks/restore-pvc-from-backup.md` for PVC corruption
## Phase 1 — Confirm the diagnosis
From any host:
```sh
REG=forgejo.viktorbarzin.me
USER=cluster-puller
PASS="$(vault kv get -field=forgejo_pull_token secret/viktor)"
IMAGE=viktor/payslip-ingest
TAG=latest
# 1. Confirm the manifest exists at all.
curl -sk -u "$USER:$PASS" \
-H 'Accept: application/vnd.oci.image.index.v1+json,application/vnd.oci.image.manifest.v1+json' \
"https://$REG/v2/$IMAGE/manifests/$TAG" | jq '.mediaType, .manifests[].digest // .config.digest'
# 2. HEAD each child / config / layer digest. Any non-200 = confirmed.
for d in $(curl -sk -u "$USER:$PASS" -H 'Accept: application/vnd.oci.image.index.v1+json' \
"https://$REG/v2/$IMAGE/manifests/$TAG" | jq -r '.manifests[].digest // empty'); do
code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
-I "https://$REG/v2/$IMAGE/manifests/$d")
echo "$d → $code"
done
```
The probe's last log run is also a fast way to see what's affected:
```sh
kubectl -n monitoring logs \
$(kubectl -n monitoring get pods -l job-name -o name \
| grep forgejo-integrity-probe | head -1)
```
## Phase 2 — Rebuild and re-push
Forgejo lets you delete a specific package version through the API.
Doing this **before** the rebuild ensures the new push doesn't
collide with the half-broken existing entry.
```sh
# Delete the broken version (replace TAG with the actual tag).
curl -X DELETE -H "Authorization: token $(vault kv get -field=forgejo_cleanup_token secret/viktor)" \
"https://$REG/api/v1/packages/viktor/container/$(basename $IMAGE)/$TAG"
```
Rebuild via Woodpecker (manual run if the pipeline isn't triggered
by a code change):
1. Open `https://ci.viktorbarzin.me/repos/<repo>/manual` for the
project.
2. Click **Run pipeline** with `branch=master`.
3. Wait for the build-and-push step to complete.
4. Confirm the new version is visible in Forgejo Web UI under
`viktor/<image>` → Packages → Container.
## Phase 3 — Restart consumers
Pods that already cached the broken digest may continue using it.
Force a fresh pull:
```sh
kubectl rollout restart deploy/<service> -n <ns>
```
If the pod still fails, the new manifest digest may not have
propagated through containerd's cache. Drain + restart containerd on
the affected node:
```sh
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
ssh wizard@<node> sudo systemctl restart containerd
kubectl uncordon <node>
```
## Phase 4 — Verify integrity recovery
The next probe run (every 15 min) will report:
```
registry_manifest_integrity_failures{instance="forgejo.viktorbarzin.me"} 0
```
The `RegistryManifestIntegrityFailure` alert resolves automatically
30 minutes after the metric goes back to 0.
## Why this happens
Forgejo's OCI registry stores blobs in its own DB+filesystem. Unlike
`registry:2` + `distribution`, it doesn't have the
[`distribution#3324`](https://github.com/distribution/distribution/issues/3324)
GC-vs-tag-delete race. But it can still reach a broken state if:
- The retention CronJob deletes a version while a pull is in flight
on the same digest.
- The PVC fills up mid-push (`docs/runbooks/restore-pvc-from-backup.md`).
- A Forgejo upgrade migrates the package schema and a row is dropped.
In all cases the recovery procedure is identical: delete the broken
version through the API, rebuild from source, force consumers to
re-pull.

View file

@ -0,0 +1,163 @@
# Runbook: Forgejo OCI registry — initial setup
Last updated: 2026-05-07
This runbook covers the **one-time** bootstrap of Forgejo's container
registry, executed during Phase 0 of the registry consolidation plan
(`docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md`).
After this runbook is complete, the Forgejo OCI registry at
`forgejo.viktorbarzin.me` accepts pushes from CI and pulls from the
cluster, with retention and integrity monitoring in place.
## Order of operations
The Terraform stacks reference Vault keys that don't exist on a fresh
cluster. Create the keys **before** running `scripts/tg apply`.
1. Apply the resource bumps (memory, PVC, ingress body size,
packages env vars) — these don't depend on the new Vault keys.
2. Create the service-account users + PATs in Forgejo.
3. Push the PATs to Vault.
4. Apply the rest of Phase 0 (registry-credentials extension,
monitoring probe, retention CronJob).
### Step 1 — apply Forgejo deployment bumps
```bash
cd infra/stacks/forgejo
scripts/tg apply
```
Wait for the new pod to come up at the bumped 1Gi memory request and
the resized 15Gi PVC. Verify packages are enabled:
```bash
kubectl exec -n forgejo deploy/forgejo -- forgejo manager flush-queues
kubectl exec -n forgejo deploy/forgejo -- env | grep PACKAGES
```
### Step 2 — create service-account users
`forgejo admin user create` is idempotent only with
`--must-change-password=false`. Re-running it on an existing user
errors out — that's fine; skip on rerun.
```bash
# cluster-puller — read:package PAT for in-cluster pulls.
kubectl exec -n forgejo deploy/forgejo -- \
forgejo admin user create \
--username cluster-puller \
--email cluster-puller@viktorbarzin.me \
--password "$(openssl rand -base64 24)" \
--must-change-password=false
# ci-pusher — write:package PAT for CI dual-push, also reused as the
# cleanup CronJob credential (write:package includes delete).
kubectl exec -n forgejo deploy/forgejo -- \
forgejo admin user create \
--username ci-pusher \
--email ci-pusher@viktorbarzin.me \
--password "$(openssl rand -base64 24)" \
--must-change-password=false
```
The user passwords are throwaway — we only ever auth via PAT. Forgejo
admin can reset them at any time from the Web UI.
### Step 3 — generate the PATs
PATs **must** be generated through the Web UI logged in as the
respective user (the CLI doesn't expose token creation). To log in
without OAuth (registration is disabled for everyone except `viktor`,
the admin), use the per-user temporary password from step 2.
For each of `cluster-puller` and `ci-pusher`:
1. Sign out of `viktor`.
2. Go to `https://forgejo.viktorbarzin.me/user/login` and sign in
with the throwaway password.
3. Settings → Applications → Generate new token.
4. Name: `cluster-pull` / `ci-push`. **Expiration: never.**
5. Scopes:
- `cluster-puller`: `read:package`
- `ci-pusher`: `write:package` (covers read+write+delete)
6. Save the token shown on the next page — it is **not** displayed again.
For the cleanup CronJob, generate a third PAT on `ci-pusher`:
7. Repeat steps 4-6 with name `cleanup`, scope `write:package`.
### Step 4 — push PATs to Vault
```bash
vault login -method=oidc
# Read-only, used by the cluster-wide registry-credentials Secret and
# by the Forgejo integrity probe.
vault kv patch secret/viktor \
forgejo_pull_token=<paste cluster-puller PAT>
# Write+delete, used by the retention CronJob inside Forgejo's
# namespace.
vault kv patch secret/viktor \
forgejo_cleanup_token=<paste ci-pusher cleanup PAT>
# Write, propagated by vault-woodpecker-sync to all Woodpecker repos.
vault kv patch secret/ci/global \
forgejo_user=ci-pusher \
forgejo_push_token=<paste ci-pusher push PAT>
```
### Step 5 — apply the rest of Phase 0
```bash
# Registry credential Secret (now reads forgejo_pull_token).
cd infra/stacks/kyverno && scripts/tg apply
# Monitoring probe + retention CronJob.
cd infra/stacks/monitoring && scripts/tg apply
cd infra/stacks/forgejo && scripts/tg apply
# Containerd hosts.toml on each existing k8s node — VM cloud-init
# only fires on first boot.
infra/scripts/setup-forgejo-containerd-mirror.sh
```
## Verification
```bash
# Login from a workstation with docker.
echo "<ci-pusher PAT>" | docker login forgejo.viktorbarzin.me -u ci-pusher --password-stdin
# Push a smoketest image.
docker pull alpine:3.20
docker tag alpine:3.20 forgejo.viktorbarzin.me/viktor/smoketest:1
docker push forgejo.viktorbarzin.me/viktor/smoketest:1
# Pull from a k8s node.
ssh wizard@<node> sudo crictl pull forgejo.viktorbarzin.me/viktor/smoketest:1
# Confirm the cluster-wide Secret was synced into a fresh namespace.
kubectl create namespace forgejo-smoketest
kubectl get secret -n forgejo-smoketest registry-credentials \
-o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq '.auths | keys'
# Expect: ["10.0.20.10:5050", "forgejo.viktorbarzin.me",
# "registry.viktorbarzin.me", "registry.viktorbarzin.me:5050"]
kubectl delete namespace forgejo-smoketest
# Delete the smoketest package via API.
curl -X DELETE -H "Authorization: token <ci-pusher cleanup PAT>" \
https://forgejo.viktorbarzin.me/api/v1/packages/viktor/container/smoketest/1
```
## When to revisit
- **PAT rotation**: PATs created here have no expiry by design. If a
PAT leaks, regenerate via the Web UI and `vault kv patch` the new
value into the same key — the next `terragrunt apply` will sync it
to all consumers within minutes (Kyverno ClusterPolicy clones the
Secret, vault-woodpecker-sync runs every 6h).
- **New service account**: if a future workload needs different
scopes, add a parallel user/PAT here rather than expanding existing
PAT scope. Principle of least privilege.

View file

@ -1,12 +1,30 @@
# Runbook: Registry VM (docker-registry, 10.0.20.10)
Last updated: 2026-04-19
Last updated: 2026-05-07
The registry VM hosts `registry.viktorbarzin.me` (private Docker
registry, htpasswd-auth, NGINX → registry:2). It is an Ubuntu 24.04
VM on the cluster LAN subnet `10.0.20.0/24`, with a static netplan
config (no DHCP). Because it sits on a subnet that only has pfSense
as its gateway, its DNS must be statically configured.
The registry VM is an Ubuntu 24.04 VM on the cluster LAN subnet
`10.0.20.0/24`, with a static netplan config (no DHCP). Because it
sits on a subnet that only has pfSense as its gateway, its DNS must
be statically configured.
**As of Phase 4 of forgejo-registry-consolidation 2026-05-07** the VM
no longer hosts the private R/W registry. It hosts pull-through
caches only:
| Port | Upstream |
|---|---|
| 5000 | docker.io (Docker Hub) — auth via dockerhub_registry_password |
| 5010 | ghcr.io |
| 5020 | quay.io |
| 5030 | registry.k8s.io |
| 5040 | reg.kyverno.io |
The decommissioned private registry (port 5050) is now hosted on
Forgejo at `forgejo.viktorbarzin.me/viktor/<image>`. See
`docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md` for the
migration. Break-glass tarballs of `infra-ci` are still produced on
each build to `/opt/registry/data/private/_breakglass/` — see
`docs/runbooks/forgejo-registry-breakglass.md`.
## DNS configuration

View file

@ -0,0 +1,73 @@
# Runbook: Onboarding a new Forgejo repo to Woodpecker
Last updated: 2026-05-07
When you create a new repo on `forgejo.viktorbarzin.me`, Woodpecker
does NOT auto-discover it via the cluster's existing OAuth session.
The `forgejo` user inside Woodpecker (Forgejo-OAuth'd) needs to:
1. Open `https://ci.viktorbarzin.me/` in a browser.
2. Log in via Forgejo OAuth (the "Sign in with Forgejo" button).
3. Click "Add Repository" — your new repo should appear.
4. Click the toggle to activate it. Woodpecker will:
- Add a webhook on the Forgejo repo (push, PR, release events).
- Register the repo's `forge_remote_id` in its DB so subsequent
hooks deserialize correctly.
5. Push a commit (or hit "Run pipeline" in Woodpecker UI) — first
build fires.
## Why API-only doesn't work
The webhook URL contains a JWT signed with a per-server key that's
stored in the DB and only accessible at OAuth-flow time. POST'ing
`/api/repos` as the admin (`ViktorBarzin` GitHub user) returns 500
because the lookup queries forge-side OAuth state for THAT user,
which doesn't exist for the Forgejo `viktor` user. We confirmed:
- Direct `POST /api/repos?forge_remote_id=N` → HTTP 500 server-side.
- Generating a JWT with the agent secret → "token is unverifiable"
on hook delivery (the signing key is repo-specific, not the
global agent secret).
There's no admin endpoint that side-steps the OAuth flow.
## Bootstrap when UI access isn't available
If you absolutely need to bootstrap a new image without UI access
(e.g., during an outage), the workaround is:
1. Build locally:
```bash
docker build -t forgejo.viktorbarzin.me/viktor/<name>:<tag> /path/to/source
docker push forgejo.viktorbarzin.me/viktor/<name>:<tag>
```
2. Or pull from another already-built source and retag:
```bash
docker pull viktorbarzin/<name>:<tag> # DockerHub
docker tag viktorbarzin/<name>:<tag> forgejo.viktorbarzin.me/viktor/<name>:<tag>
docker push forgejo.viktorbarzin.me/viktor/<name>:<tag>
```
3. Flip the cluster `image=` reference and restart deployments.
Document the bootstrap in the relevant stack so future maintainers
know the image was put there by hand. After Woodpecker UI onboarding,
the next pipeline run replaces the bootstrap image with a CI-built one.
## Repos onboarded in flight 2026-05-07
These were created during the forgejo-registry-consolidation but the
UI step above hasn't been done yet — their `.woodpecker.yml` /
`.woodpecker/build.yml` exists on Forgejo but no pipeline fires:
- `viktor/broker-sync` — image bootstrapped via DockerHub (see
`infra/stacks/wealthfolio/main.tf` comment).
- `viktor/fire-planner` — image bootstrapped via local docker build.
- `viktor/hmrc-sync`
- `viktor/freedify`
- `viktor/claude-agent-service`
- `viktor/beadboard` — image bootstrapped via local docker build.
- `viktor/claude-memory-mcp`
Walk through each in the Woodpecker UI to enable. Pipelines for
already-onboarded repos (payslip-ingest, job-hunter, infra) fired
correctly after the v3.13 → v3.14 upgrade.

View file

@ -89,35 +89,26 @@ services:
retries: 3
start_period: 10s
registry-private:
image: registry:2.8.3
container_name: registry-private
restart: always
volumes:
- /opt/registry/data/private:/var/lib/registry
- /opt/registry/config-private.yml:/etc/docker/registry/config.yml:ro
- /opt/registry/htpasswd:/auth/htpasswd:ro
networks:
- registry
healthcheck:
# 401 is expected (auth required) — any HTTP response means the registry is healthy
test: ["CMD", "sh", "-c", "wget -qS -O /dev/null http://127.0.0.1:5000/v2/ 2>&1 | grep -q 'HTTP/'"]
interval: 30s
timeout: 10s
retries: 3
start_period: 10s
# registry-private decommissioned in Phase 4 of
# forgejo-registry-consolidation 2026-05-07 — image migration completed,
# cluster flipped to forgejo.viktorbarzin.me/viktor/<image>. The remaining
# five services on this VM are pull-through caches for upstream registries.
# After 1 week of no incidents, `rm -rf /opt/registry/data/private/` on the
# VM frees ~2.6 GB. The tarball break-glass under
# /opt/registry/data/private/_breakglass/ stays — it's how we recover
# infra-ci if Forgejo ever goes fully down.
nginx:
image: nginx:alpine
container_name: registry-nginx
restart: always
# 5050 dropped Phase 4 of forgejo-registry-consolidation 2026-05-07.
ports:
- "5000:5000"
- "5010:5010"
- "5020:5020"
- "5030:5030"
- "5040:5040"
- "5050:5050"
volumes:
- /opt/registry/nginx.conf:/etc/nginx/nginx.conf:ro
- /opt/registry/tls:/etc/nginx/tls:ro
@ -135,8 +126,6 @@ services:
condition: service_healthy
registry-kyverno:
condition: service_healthy
registry-private:
condition: service_healthy
healthcheck:
test: ["CMD", "sh", "-c", "wget -qO- http://127.0.0.1:5000/v2/ >/dev/null 2>&1"]
interval: 30s

View file

@ -33,10 +33,9 @@ http {
keepalive 32;
}
upstream private {
server registry-private:5000;
keepalive 32;
}
# `upstream private` removed in Phase 4 of forgejo-registry-consolidation
# 2026-05-07. The /v2/ private registry is now Forgejo at
# forgejo.viktorbarzin.me/viktor/.
# --- Docker Hub (port 5000) ---
@ -168,37 +167,8 @@ http {
}
}
# --- Private R/W Registry (port 5050, TLS) ---
server {
listen 5050 ssl;
server_name registry.viktorbarzin.me;
ssl_certificate /etc/nginx/tls/fullchain.pem;
ssl_certificate_key /etc/nginx/tls/privkey.pem;
ssl_protocols TLSv1.2 TLSv1.3;
client_max_body_size 0;
proxy_request_buffering off;
proxy_buffering off;
chunked_transfer_encoding on;
location /v2/ {
proxy_pass http://private;
proxy_http_version 1.1;
proxy_set_header Host $http_host;
proxy_set_header Connection "";
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 900;
proxy_send_timeout 900;
}
location / {
return 200 'ok';
add_header Content-Type text/plain;
}
}
# --- Private R/W Registry (port 5050) decommissioned Phase 4 2026-05-07 ---
# The TLS port 5050 server block previously fronted `registry-private`.
# Migrated to Forgejo at forgejo.viktorbarzin.me/viktor/. Both
# docker-compose.yml and this nginx config no longer reference port 5050.
}

View file

@ -40,8 +40,9 @@ variable "ingress_path" {
default = ["/"]
}
variable "max_body_size" {
type = string
default = "50m"
type = string
default = null
description = "Maximum request body size, e.g. '5g'. null = no limit (Traefik default). When set, a per-ingress Buffering middleware is created and attached."
}
variable "extra_annotations" {
default = {}
@ -203,6 +204,17 @@ locals {
"gethomepage.dev/href" = "https://${local.effective_host}"
"gethomepage.dev/icon" = "${replace(var.name, "-", "")}.png"
} : {}
# Parse "5g"/"50m"/"1024k"/"42" into bytes. Traefik's Buffering middleware
# takes maxRequestBodyBytes as an integer. Empty unit = bytes.
body_size_match = var.max_body_size == null ? null : regex("^([0-9]+)([kmgKMG]?)$", var.max_body_size)
body_size_unit_multiplier = var.max_body_size == null ? 0 : (
lower(local.body_size_match[1]) == "g" ? 1073741824 :
lower(local.body_size_match[1]) == "m" ? 1048576 :
lower(local.body_size_match[1]) == "k" ? 1024 :
1
)
max_body_size_bytes = var.max_body_size == null ? 0 : tonumber(local.body_size_match[0]) * local.body_size_unit_multiplier
}
@ -245,6 +257,7 @@ resource "kubernetes_ingress_v1" "proxied-ingress" {
var.protected ? "traefik-authentik-forward-auth@kubernetescrd" : null,
var.allow_local_access_only ? "traefik-local-only@kubernetescrd" : null,
var.custom_content_security_policy != null ? "${var.namespace}-custom-csp-${var.name}@kubernetescrd" : null,
var.max_body_size != null ? "${var.namespace}-buffering-${var.name}@kubernetescrd" : null,
], var.extra_middlewares)))
"traefik.ingress.kubernetes.io/router.entrypoints" = "websecure"
}, local.homepage_defaults, var.extra_annotations,
@ -302,6 +315,27 @@ resource "kubernetes_manifest" "custom_csp" {
}
}
# Buffering middleware - created per service when max_body_size is set.
# Traefik default is unlimited; setting maxRequestBodyBytes enforces a limit
# (e.g. Forgejo container pushes can ship multi-GB layer blobs).
resource "kubernetes_manifest" "buffering" {
count = var.max_body_size != null ? 1 : 0
manifest = {
apiVersion = "traefik.io/v1alpha1"
kind = "Middleware"
metadata = {
name = "buffering-${var.name}"
namespace = var.namespace
}
spec = {
buffering = {
maxRequestBodyBytes = local.max_body_size_bytes
}
}
}
}
# Cloudflare DNS records created automatically when dns_type is set.
# Proxied: CNAME to Cloudflare tunnel. Non-proxied: A + AAAA to public IP.
resource "cloudflare_record" "proxied" {

View file

@ -0,0 +1,76 @@
#!/usr/bin/env bash
# One-shot migration of every private image on registry.viktorbarzin.me to
# Forgejo. Used as a stop-gap when the dual-push CI pipelines aren't
# producing Forgejo images on their own (Forgejo-Woodpecker forge driver
# context-deadline-exceeded issue, see bd code-d3y / 2026-05-07).
#
# Pulls each image from registry.viktorbarzin.me, retags, pushes to
# forgejo.viktorbarzin.me/viktor/<name>:<tag> — preserving the blob bytes
# verbatim so the cluster can flip image= without a rebuild.
#
# Run from any host with docker + network reach to BOTH registries. Auth
# from `docker login` (~/.docker/config.json) — make sure both registries
# are logged in:
# docker login registry.viktorbarzin.me -u viktorbarzin
# docker login forgejo.viktorbarzin.me -u viktor # use viktor PAT, not ci-pusher
#
# (ci-pusher CANNOT push to viktor/<image> — Forgejo container packages
# are scoped to the pushing user. Only viktor's PAT can write to viktor/*.)
#
# After the script, the new image lives at
# forgejo.viktorbarzin.me/viktor/<name>:<tag>
# Phase 3 of the consolidation flips infra/stacks/<svc>/main.tf image=
# to that path.
set -euo pipefail
OLD_REG=registry.viktorbarzin.me
NEW_REG=forgejo.viktorbarzin.me/viktor
# Image list: <name>:<tag>. Generated 2026-05-07 from `grep -rEn 'image\s*=\s*
# "registry\.viktorbarzin\.me'` across infra/stacks/.
#
# Excluded:
# - wealthfolio-sync: registry repo exists but has 0 tags (CronJob has been
# broken for 36+ days, separate decision needed). User to triage before
# migration.
# - fire-planner: registry repo exists but has 0 tags. Dockerfile + CI added
# in this session (commit 8b53d99e); rebuild via Woodpecker before flipping.
IMAGES=(
"chrome-service-novnc:v4"
"chrome-service-novnc:latest"
"payslip-ingest:latest"
"job-hunter:latest"
"claude-agent-service:latest"
"freedify:latest"
"beadboard:latest"
"infra-ci:latest"
)
for img in "${IMAGES[@]}"; do
echo "=== $img ==="
src="$OLD_REG/$img"
dst="$NEW_REG/$img"
if ! docker pull "$src" 2>&1 | tee /tmp/pull-$$ | grep -q 'Status: '; then
if grep -q 'not found' /tmp/pull-$$; then
echo " SKIP — image not present in source registry"
rm -f /tmp/pull-$$
continue
fi
fi
rm -f /tmp/pull-$$
echo " tag → $dst"
docker tag "$src" "$dst"
echo " push $dst"
docker push "$dst" 2>&1 | tail -2
echo " cleanup local copy"
docker rmi "$src" "$dst" 2>&1 | tail -1 || true
done
echo ""
echo "Done. Verify in Forgejo Web UI: https://forgejo.viktorbarzin.me/viktor/-/packages?type=container"
echo "Phase 3 of the plan flips infra/stacks/{wealthfolio,fire-planner}/main.tf image= references."

View file

@ -0,0 +1,59 @@
#!/usr/bin/env bash
# One-shot deployment of the forgejo.viktorbarzin.me containerd hosts.toml
# entry across every k8s node. Cloud-init only fires on VM provision, so
# existing nodes need this manual rollout.
#
# What it does, per node:
# 1. drain (ignore-daemonsets, delete-emptydir-data)
# 2. ssh in: mkdir + write /etc/containerd/certs.d/forgejo.viktorbarzin.me/hosts.toml
# 3. systemctl restart containerd
# 4. uncordon
#
# hosts.toml is documented as hot-reloaded but the post-2026-04-19
# containerd corruption playbook calls for an explicit restart so the
# config is unambiguously in effect. Running drain/uncordon around it
# avoids pulling against an in-flight containerd restart.
#
# Re-run is safe: writes are idempotent.
set -euo pipefail
CERTS_DIR=/etc/containerd/certs.d/forgejo.viktorbarzin.me
HOSTS_TOML='server = "https://forgejo.viktorbarzin.me"
[host."https://10.0.20.200"]
capabilities = ["pull", "resolve"]
'
NODES=$(kubectl get nodes -o name | sed 's|^node/||')
if [[ -z "$NODES" ]]; then
echo "ERROR: no nodes returned from kubectl get nodes" >&2
exit 1
fi
for n in $NODES; do
echo "=== $n ==="
kubectl drain "$n" --ignore-daemonsets --delete-emptydir-data --force --grace-period=60
ssh -o StrictHostKeyChecking=accept-new "wizard@$n" sudo bash <<EOF
set -euo pipefail
mkdir -p "$CERTS_DIR"
cat > "$CERTS_DIR/hosts.toml" <<'TOML'
$HOSTS_TOML
TOML
systemctl restart containerd
EOF
kubectl uncordon "$n"
# Wait for the node to report Ready before moving to the next one.
for i in {1..30}; do
if kubectl get node "$n" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' | grep -q True; then
echo " node Ready"
break
fi
sleep 2
done
done
echo "All nodes updated."

View file

@ -567,7 +567,8 @@ resource "kubernetes_deployment" "beadboard" {
container {
name = "beadboard"
image = "registry.viktorbarzin.me:5050/beadboard:${var.beadboard_image_tag}"
# Phase 3 cutover 2026-05-07 Forgejo registry consolidation.
image = "forgejo.viktorbarzin.me/viktor/beadboard:${var.beadboard_image_tag}"
port {
name = "http"
@ -725,7 +726,8 @@ resource "kubernetes_config_map" "beads_metadata" {
}
locals {
claude_agent_service_image = "registry.viktorbarzin.me/claude-agent-service:${var.claude_agent_service_image_tag}"
# Phase 3 cutover 2026-05-07 Forgejo registry consolidation.
claude_agent_service_image = "forgejo.viktorbarzin.me/viktor/claude-agent-service:${var.claude_agent_service_image_tag}"
beadboard_internal_url = "http://${kubernetes_service.beadboard.metadata[0].name}.${kubernetes_namespace.beads.metadata[0].name}.svc.cluster.local"
beads_script_prelude = <<-EOT

View file

@ -0,0 +1,90 @@
# chrome-service
In-cluster headed Chromium exposed over Playwright's WebSocket protocol.
Sibling services drive it instead of running their own in-process browser
— useful when the upstream tries to detect headless mode (e.g. hmembeds'
`disable-devtool.js` redirect-to-google trap).
## Connect
```python
from playwright.async_api import async_playwright
WS_URL = "ws://chrome-service.chrome-service.svc.cluster.local:3000"
WS_TOKEN = os.environ["CHROME_WS_TOKEN"] # 32-byte URL-safe random
async with async_playwright() as p:
browser = await p.chromium.connect(f"{WS_URL}/{WS_TOKEN}", timeout=15_000)
context = await browser.new_context()
await context.add_init_script(STEALTH_JS) # see files/stealth.js
page = await context.new_page()
...
await browser.close()
```
The token comes from Vault KV `secret/chrome-service.api_bearer_token`,
which ESO syncs into a per-namespace K8s Secret in each caller stack
(see f1-stream's `chrome-service-client-secrets`).
## Add a new caller
1. **Label the caller's namespace** so the chrome-service NetworkPolicy
admits it:
```hcl
resource "kubernetes_namespace" "<ns>" {
metadata {
labels = {
"chrome-service.viktorbarzin.me/client" = "true"
}
}
}
```
2. **Add an ExternalSecret** in the caller stack pulling the token:
```hcl
resource "kubernetes_manifest" "chrome_token" {
manifest = {
apiVersion = "external-secrets.io/v1beta1"
kind = "ExternalSecret"
metadata = { name = "chrome-service-client-secrets", namespace = "<ns>" }
spec = {
refreshInterval = "15m"
secretStoreRef = { name = "vault-kv", kind = "ClusterSecretStore" }
target = { name = "chrome-service-client-secrets" }
dataFrom = [{ extract = { key = "chrome-service" } }]
}
}
}
```
3. **Inject `CHROME_WS_URL` + `CHROME_WS_TOKEN`** into the caller's pod env.
Use `secret_key_ref` for the token; the URL is a plain value.
4. **Vendor `stealth.js`** into the caller (or just paste — it's ~40 lines)
and apply via `await context.add_init_script(STEALTH_JS)` after every
`new_context()`. Without it, hmembeds-class anti-bot still trips.
## Image pin
Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
`main.tf`) and the client (`playwright==1.48.0` in callers' requirements)
must match minor-versions. Bump in lockstep — Playwright protocol changes
between minors.
## Operations
- **Storage**: encrypted PVC at `/profile` for cookies + npm cache. Ephemeral
contexts (`browser.new_context()`) bypass the profile; persistent contexts
share it. Backed up tar+gzip every 6h to `/srv/nfs/chrome-service-backup/`,
30-day retention.
- **Probes**: TCP/3000. Playwright run-server has no HTTP `/health`; a TCP
open is the only liveness signal available without spinning a browser.
- **Health page**: visit `https://chrome.viktorbarzin.me` (Authentik-gated)
to confirm the pod is up. The WS port stays internal-only.
- **Token rotation**: `vault kv put secret/chrome-service api_bearer_token=$(python3 -c 'import secrets; print(secrets.token_urlsafe(32))')`.
Reloader cascades the rotation to both the server pod and any caller
whose secret has the `reloader.stakater.com/auto = "true"` annotation.
## Why headed (Xvfb) instead of headless?
`disable-devtool.js` and similar libraries detect `navigator.webdriver`,
console-clear timing, and the `HeadlessChromium/...` user-agent suffix.
Running headed inside `Xvfb :99` reports as a normal Chromium, and the
stealth init script handles the JS-visible giveaways.

View file

@ -0,0 +1,19 @@
FROM docker.io/library/ubuntu:24.04
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
x11vnc \
novnc \
websockify \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
# noVNC ships /usr/share/novnc/vnc.html; alias to index.html so / works.
RUN ln -sf /usr/share/novnc/vnc.html /usr/share/novnc/index.html
EXPOSE 6080
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
CMD ["/entrypoint.sh"]

View file

@ -0,0 +1,39 @@
#!/usr/bin/env bash
# Connect to the chrome-service container's Xvfb (shared pod network, TCP)
# and serve the noVNC HTML5 client + websockify bridge on :6080.
set -e
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do
if echo > /dev/tcp/127.0.0.1/6099 2>/dev/null; then
echo "Xvfb TCP up after attempt $i"
break
fi
echo "waiting for Xvfb TCP 6099 attempt=$i"
sleep 2
done
# websockify runs as PID 1; x11vnc is a child so its logs land on container stdout
# `-noshm` skips MIT-SHM probes that fail across container boundaries (each
# container has its own /dev/shm); `-noxdamage` skips XDAMAGE which Xvfb
# doesn't expose; `-quiet` keeps the polling chatter out of pod logs.
echo "starting x11vnc -> :5900"
x11vnc -display localhost:99 -nopw -listen 0.0.0.0 -rfbport 5900 \
-forever -shared -noshm -noxdamage -quiet 2>&1 &
X11VNC_PID=$!
for i in 1 2 3 4 5 6 7 8 9 10; do
if echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
echo "x11vnc bound 5900 after attempt $i"
break
fi
echo "waiting for x11vnc :5900 attempt=$i"
sleep 2
done
if ! echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
echo "ERROR: x11vnc did not bind 5900"
exit 1
fi
echo "starting websockify -> :6080"
exec websockify --web=/usr/share/novnc 6080 localhost:5900

View file

@ -0,0 +1,54 @@
// Minimal stealth init script for Playwright-driven Chromium.
// Vendored from puppeteer-extra-plugin-stealth/evasions/* (MIT) — covers:
// webdriver, chrome.runtime, navigator.plugins, navigator.languages,
// Permissions.query, WebGL getParameter (vendor + renderer spoof).
// Run via context.add_init_script() so it executes before any page script.
(() => {
// navigator.webdriver — most common detection, removed entirely.
Object.defineProperty(Navigator.prototype, 'webdriver', { get: () => undefined });
// window.chrome.runtime — many sites check that real Chrome exposes this.
if (!window.chrome) window.chrome = {};
window.chrome.runtime = window.chrome.runtime || {};
// navigator.plugins — headless reports zero; spoof a plausible PDF viewer.
Object.defineProperty(navigator, 'plugins', {
get: () => [{ name: 'Chrome PDF Plugin' }, { name: 'Chrome PDF Viewer' }, { name: 'Native Client' }],
});
// navigator.languages — headless returns empty array.
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
// Permissions.query — headless returns 'denied' for notifications instead of 'default'.
const origQuery = window.navigator.permissions && window.navigator.permissions.query;
if (origQuery) {
window.navigator.permissions.query = (parameters) =>
parameters && parameters.name === 'notifications'
? Promise.resolve({ state: Notification.permission })
: origQuery(parameters);
}
// WebGL getParameter — spoof vendor + renderer strings to a real GPU.
const spoofGl = (proto) => {
if (!proto) return;
const orig = proto.getParameter;
proto.getParameter = function (parameter) {
if (parameter === 37445) return 'Intel Inc.'; // UNMASKED_VENDOR_WEBGL
if (parameter === 37446) return 'Intel Iris OpenGL Engine'; // UNMASKED_RENDERER_WEBGL
return orig.apply(this, arguments);
};
};
spoofGl(window.WebGLRenderingContext && window.WebGLRenderingContext.prototype);
spoofGl(window.WebGL2RenderingContext && window.WebGL2RenderingContext.prototype);
// disable-devtool.js (theajack/disable-devtool) auto-inits via a script
// tag with `disable-devtool-auto`. Its Performance detector trips under
// Playwright (CDP adds console.log latency vs console.table) and the
// redirect URL is hard-coded — for hmembeds that's google.com.
// Hide the auto-init marker so the library's IIFE exits early.
const origQS = Document.prototype.querySelector;
Document.prototype.querySelector = function (sel) {
if (typeof sel === 'string' && sel.indexOf('disable-devtool-auto') !== -1) return null;
return origQS.apply(this, arguments);
};
})();

View file

@ -0,0 +1,504 @@
variable "tls_secret_name" {
type = string
sensitive = true
}
variable "nfs_server" { type = string }
locals {
namespace = "chrome-service"
labels = {
app = "chrome-service"
}
# Pin to the same Playwright minor that the Python client requires.
# If you bump this image, also bump `playwright==X.Y.Z` in the client
# (currently f1-stream) and re-run the connect smoke test.
image = "mcr.microsoft.com/playwright:v1.48.0-noble"
}
# --- Namespace ---
resource "kubernetes_namespace" "chrome_service" {
metadata {
name = local.namespace
labels = {
"istio-injection" = "disabled"
tier = local.tiers.aux
"chrome-service.viktorbarzin.me/server" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# --- Secrets (single-key extract: api_bearer_token) ---
resource "kubernetes_manifest" "external_secret" {
manifest = {
apiVersion = "external-secrets.io/v1beta1"
kind = "ExternalSecret"
metadata = {
name = "chrome-service-secrets"
namespace = local.namespace
}
spec = {
refreshInterval = "15m"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = {
name = "chrome-service-secrets"
}
dataFrom = [{
extract = {
key = "chrome-service"
}
}]
}
}
depends_on = [kubernetes_namespace.chrome_service]
}
# tls-secret for the chrome.viktorbarzin.me ingress is auto-cloned into
# every namespace by Kyverno's `sync-tls-secret` ClusterPolicy no local
# module call needed.
# --- Encrypted profile PVC ---
# Holds Chromium user data: cookies, localStorage, IndexedDB. Sites we
# drive may set auth tokens or session cookies encrypted is correct.
resource "kubernetes_persistent_volume_claim" "profile_encrypted" {
wait_until_bound = false
metadata {
name = "chrome-service-profile-encrypted"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "10Gi"
}
}
spec {
access_modes = ["ReadWriteOnce"]
storage_class_name = "proxmox-lvm-encrypted"
resources {
requests = {
storage = "2Gi"
}
}
}
}
# --- NFS backup target ---
module "nfs_chrome_service_backup_host" {
source = "../../modules/kubernetes/nfs_volume"
name = "chrome-service-backup-host"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
nfs_server = "192.168.1.127"
nfs_path = "/srv/nfs/chrome-service-backup"
}
# --- Deployment ---
resource "kubernetes_deployment" "chrome_service" {
metadata {
name = "chrome-service"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
labels = merge(local.labels, {
tier = local.tiers.aux
})
annotations = {
"reloader.stakater.com/auto" = "true"
}
}
spec {
replicas = 1
strategy {
type = "Recreate"
}
selector {
match_labels = local.labels
}
template {
metadata {
labels = local.labels
}
spec {
# The noVNC sidecar pulls from registry.viktorbarzin.me which needs
# auth. Kyverno's `sync-registry-credentials` ClusterPolicy syncs
# the secret into every namespace.
image_pull_secrets {
name = "registry-credentials"
}
security_context {
run_as_user = 1000
run_as_group = 1000
fs_group = 1000
seccomp_profile {
type = "RuntimeDefault"
}
}
# Fix profile dir ownership (PVC may have root-owned files from prior run).
init_container {
name = "fix-perms"
image = "busybox:1.37"
command = ["sh", "-c", "chown -R 1000:1000 /profile"]
security_context {
run_as_user = 0
}
volume_mount {
name = "profile"
mount_path = "/profile"
}
resources {
requests = { memory = "32Mi" }
limits = { memory = "64Mi" }
}
}
container {
name = "chrome-service"
image = local.image
image_pull_policy = "IfNotPresent"
# `launch-server` (not `run-server`) lets us pin headed mode +
# specific args. `run-server` defaults to headless, which the
# disable-devtool.js Performance detector trips under Playwright
# (CDP adds latency to console.log; lib detects + redirects).
# The Microsoft image ships only the browsers, not the playwright
# npm package itself `npx -y playwright@<ver>` downloads it on
# first start (cached under $HOME/.npm via the PVC) and pins to
# the same minor as the Python client. Bump in lockstep.
command = ["bash", "-c"]
args = [
<<-EOT
set -e
# `-listen tcp` enables localhost:6099 so the noVNC sidecar can
# connect over the pod's shared network namespace (Ubuntu 24.04
# defaults Xvfb to -nolisten tcp).
# `-ac` disables X access control so the noVNC sidecar can
# attach without an MIT-MAGIC-COOKIE; safe because Xvfb only
# listens on localhost (pod's lo).
Xvfb :99 -screen 0 1280x720x24 -listen tcp -ac &
sleep 1
cat > /tmp/launch.json <<JSON
{
"headless": false,
"port": 3000,
"host": "0.0.0.0",
"wsPath": "/$${PW_TOKEN}",
"args": [
"--no-sandbox",
"--disable-blink-features=AutomationControlled",
"--disable-features=IsolateOrigins,site-per-process",
"--autoplay-policy=no-user-gesture-required",
"--disable-dev-shm-usage"
]
}
JSON
exec npx -y playwright@1.48.0 launch-server --browser chromium --config /tmp/launch.json
EOT
]
env {
name = "DISPLAY"
value = ":99"
}
env {
name = "HOME"
value = "/profile"
}
env {
name = "PW_TOKEN"
value_from {
secret_key_ref {
name = "chrome-service-secrets"
key = "api_bearer_token"
}
}
}
port {
name = "ws"
container_port = 3000
protocol = "TCP"
}
# Playwright run-server exposes only the WS endpoint; no /health.
liveness_probe {
tcp_socket { port = 3000 }
initial_delay_seconds = 30
period_seconds = 30
failure_threshold = 3
}
readiness_probe {
tcp_socket { port = 3000 }
initial_delay_seconds = 10
period_seconds = 10
}
startup_probe {
tcp_socket { port = 3000 }
period_seconds = 5
failure_threshold = 24 # up to 2 minutes
}
volume_mount {
name = "profile"
mount_path = "/profile"
}
volume_mount {
name = "dshm"
mount_path = "/dev/shm"
}
resources {
requests = {
cpu = "200m"
memory = "1500Mi"
}
limits = {
memory = "2Gi"
}
}
}
# noVNC sidecar exposes a live HTML5 view of the headed Chromium
# session via x11vnc + websockify, gated by the Authentik-protected
# ingress at chrome.viktorbarzin.me. WS port 3000 (the Playwright
# endpoint) stays internal-only.
container {
name = "novnc"
# Phase 3 cutover 2026-05-07 Forgejo registry consolidation.
image = "forgejo.viktorbarzin.me/viktor/chrome-service-novnc:v4"
image_pull_policy = "IfNotPresent"
port {
name = "http"
container_port = 6080
protocol = "TCP"
}
# x11vnc connects to the chrome-service container's Xvfb over
# localhost TCP (shared pod network). Same uid 1000 as chrome
# container so we can read MIT-MAGIC-COOKIE if Xvfb adds one.
resources {
requests = { cpu = "10m", memory = "32Mi" }
limits = { memory = "96Mi" }
}
}
volume {
name = "profile"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.profile_encrypted.metadata[0].name
}
}
volume {
name = "dshm"
empty_dir {
medium = "Memory"
size_limit = "256Mi"
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
# --- Services ---
# WS endpoint (internal only, gated by NetworkPolicy + token).
resource "kubernetes_service" "chrome_service" {
metadata {
name = "chrome-service"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
labels = local.labels
}
spec {
selector = local.labels
port {
name = "ws"
port = 3000
target_port = 3000
protocol = "TCP"
}
}
}
# noVNC view (Authentik-gated, exposed via ingress).
resource "kubernetes_service" "chrome_novnc" {
metadata {
name = "chrome"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
labels = local.labels
}
spec {
selector = local.labels
port {
name = "http"
port = 80
target_port = 6080
protocol = "TCP"
}
}
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
name = "chrome"
tls_secret_name = var.tls_secret_name
protected = true
# noVNC defaults to /vnc.html auto-redirect / there.
ingress_path = ["/"]
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Chrome Service"
"gethomepage.dev/description" = "Live noVNC view of headed Chromium"
"gethomepage.dev/icon" = "chromium.png"
"gethomepage.dev/group" = "Infrastructure"
}
}
# --- NetworkPolicy: scoped ingress.
# - TCP/3000 (Playwright WS): only from labelled client namespaces.
# - TCP/6080 (noVNC HTTP+WS): only from the traefik namespace, since the
# public-facing path is `chrome.viktorbarzin.me` ingress Traefik
# sidecar. Authentik forward-auth still gates external access at the
# Traefik layer.
# The cluster has no default-deny, so this NP only takes effect inside
# chrome-service ns pods elsewhere remain unaffected.
resource "kubernetes_network_policy_v1" "ws_ingress" {
metadata {
name = "chrome-service-ws-ingress"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
}
spec {
pod_selector {
match_labels = local.labels
}
policy_types = ["Ingress"]
ingress {
from {
namespace_selector {
match_labels = {
"chrome-service.viktorbarzin.me/client" = "true"
}
}
}
# Explicit fallback list admit f1-stream by name in case the label
# is removed by accident. Keep this in sync with the labels above.
from {
namespace_selector {
match_labels = {
"kubernetes.io/metadata.name" = "f1-stream"
}
}
}
ports {
port = "3000"
protocol = "TCP"
}
}
ingress {
from {
namespace_selector {
match_labels = {
"kubernetes.io/metadata.name" = "traefik"
}
}
}
ports {
port = "6080"
protocol = "TCP"
}
}
}
}
# --- Backup CronJob: tar+gzip the profile every 6h, 30-day retention. ---
resource "kubernetes_cron_job_v1" "chrome_service_backup" {
metadata {
name = "chrome-service-backup"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
}
spec {
concurrency_policy = "Replace"
failed_jobs_history_limit = 3
successful_jobs_history_limit = 1
schedule = "47 */6 * * *"
starting_deadline_seconds = 60
job_template {
metadata {}
spec {
backoff_limit = 2
ttl_seconds_after_finished = 300
template {
metadata {}
spec {
# PVC is RWO colocate the backup pod with the chrome-service
# pod so both can mount the volume on the same node.
affinity {
pod_affinity {
required_during_scheduling_ignored_during_execution {
label_selector {
match_labels = local.labels
}
topology_key = "kubernetes.io/hostname"
}
}
}
container {
name = "backup"
image = "docker.io/library/alpine:3.20"
command = ["/bin/sh", "-c", <<-EOT
set -euxo pipefail
ts=$(date +"%Y_%m_%d_%H")
tar -czf /backup/$${ts}.tar.gz -C /profile .
find /backup -maxdepth 1 -type f -name '*.tar.gz' -mtime +30 -delete
echo "Backup complete: $${ts}.tar.gz"
EOT
]
volume_mount {
name = "profile"
mount_path = "/profile"
read_only = true
}
volume_mount {
name = "backup"
mount_path = "/backup"
}
resources {
requests = { cpu = "10m", memory = "32Mi" }
limits = { memory = "64Mi" }
}
}
volume {
name = "profile"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.profile_encrypted.metadata[0].name
}
}
volume {
name = "backup"
persistent_volume_claim {
claim_name = module.nfs_chrome_service_backup_host.claim_name
}
}
restart_policy = "OnFailure"
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

View file

@ -0,0 +1,8 @@
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}

View file

@ -10,7 +10,8 @@ data "vault_kv_secret_v2" "viktor_secrets" {
locals {
namespace = "claude-agent"
image = "registry.viktorbarzin.me/claude-agent-service"
# Phase 3 cutover 2026-05-07 see infra/docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md.
image = "forgejo.viktorbarzin.me/viktor/claude-agent-service"
image_tag = "2fd7670d"
labels = {
app = "claude-agent-service"

View file

@ -175,8 +175,10 @@ resource "kubernetes_deployment" "claude-memory" {
}
}
container {
name = "claude-memory"
image = "viktorbarzin/claude-memory-mcp:17"
name = "claude-memory"
# Phase 3 cutover 2026-05-07 moved off DockerHub to Forgejo as
# part of the registry consolidation. Old: viktorbarzin/claude-memory-mcp:17
image = "forgejo.viktorbarzin.me/viktor/claude-memory-mcp:17"
port {
container_port = 8000

View file

@ -14,9 +14,26 @@ FROM python:3.13-slim-bookworm
WORKDIR /app
# Headless Chromium runtime libs for the playback verifier. Listed inline
# (instead of running `playwright install-deps`) so the image build doesn't
# need root-network apt fetches at runtime.
RUN apt-get update && apt-get install -y --no-install-recommends \
ca-certificates \
libnss3 libnspr4 \
libatk1.0-0 libatk-bridge2.0-0 libcups2 \
libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 \
libxfixes3 libxrandr2 libgbm1 libpango-1.0-0 libcairo2 \
libasound2 libatspi2.0-0 \
fonts-liberation fonts-noto-color-emoji \
&& rm -rf /var/lib/apt/lists/*
COPY backend/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Install the Chromium browser binary used by the verifier. Skip
# --with-deps because we already installed the system libs above.
RUN playwright install chromium
COPY backend/ ./backend/
# Copy built frontend into the image

View file

@ -0,0 +1,359 @@
"""Embed iframe-stripping reverse proxy.
Serves third-party embed pages (e.g. https://hmembeds.one/embed/{hash},
https://pooembed.eu/embed/{slug}) through our origin so we can:
1. Strip X-Frame-Options and Content-Security-Policy: frame-ancestors headers,
so the embed loads in our <iframe> regardless of upstream policy.
2. Inject <base> + a frame-buster-defeat <script> at the top of <head> so
the embed's JS sees `window.top === window` and a plausible
`document.referrer` pointing at the upstream origin.
3. Forward Referer / User-Agent matching the upstream's own pages so
the upstream's hotlink / origin-allowlist checks pass.
Two endpoints:
- GET /embed?url=<base64url> the embed HTML page (rewritten).
- GET /embed-asset?url=<base64url> fallback for any subresource the
upstream blocks based on hotlink protection. Most assets load directly
via the injected <base> tag and bypass our proxy.
"""
import logging
import re
from typing import AsyncGenerator
from urllib.parse import urlparse
import httpx
from fastapi import HTTPException
from backend.m3u8_rewriter import decode_url
logger = logging.getLogger(__name__)
EMBED_TIMEOUT = 20.0
ASSET_TIMEOUT = 30.0
RELAY_CHUNK_SIZE = 65536
USER_AGENT = (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36"
)
# Response headers we never forward (they break frame embedding or leak upstream policy).
STRIP_RESPONSE_HEADERS = {
"x-frame-options",
"content-security-policy",
"content-security-policy-report-only",
"set-cookie",
"report-to",
"nel",
"permissions-policy",
"cross-origin-opener-policy",
"cross-origin-embedder-policy",
"cross-origin-resource-policy",
# let httpx/uvicorn re-set these
"transfer-encoding",
"content-encoding",
"content-length",
"connection",
}
# Inject this <script> at the top of <head> to defeat JS frame-busters.
# - Locks window.top, window.parent, and window.self to the embed window
# itself, so `self !== window.top` checks pass.
# - Forces document.referrer to the upstream origin so allowlist checks
# like `document.referrer.includes("timstreams.net")` keep working.
# - No-ops anything that would call window.parent.location or attempt to
# reload the top frame.
_FRAME_BUSTER_DEFEAT_TEMPLATE = """
<script>(function(){{
try {{
var fakeWindow = window;
Object.defineProperty(window, 'top', {{get: function(){{return fakeWindow;}}, configurable: false}});
Object.defineProperty(window, 'parent', {{get: function(){{return fakeWindow;}}, configurable: false}});
Object.defineProperty(window, 'frameElement', {{get: function(){{return null;}}, configurable: false}});
Object.defineProperty(document, 'referrer', {{get: function(){{return {referrer!r};}}, configurable: false}});
}} catch (e) {{}}
// Defeat the `disable-devtool.js` redirect trap that hmembeds and similar
// embed hosts use. The trap fires `console.clear`/`console.table` in a
// tight loop, then if it thinks DevTools is open, calls
// `window.location = "https://www.google.com"`. We block those redirect
// sinks while leaving normal playback unaffected.
try {{
var noop = function(){{}};
console.clear = noop;
console.table = noop;
console.dir = noop;
var loc = window.location;
Object.defineProperty(window, 'location', {{
get: function(){{ return loc; }},
set: function(v){{ /* swallow assignment */ }},
configurable: false,
}});
var origAssign = loc.assign && loc.assign.bind(loc);
var origReplace = loc.replace && loc.replace.bind(loc);
loc.assign = function(u){{ if (typeof u === 'string' && u.indexOf('google.com') !== -1) return; if (origAssign) origAssign(u); }};
loc.replace = function(u){{ if (typeof u === 'string' && u.indexOf('google.com') !== -1) return; if (origReplace) origReplace(u); }};
}} catch (e) {{}}
// Route all cross-origin fetch/XHR requests through our /embed-asset
// proxy. The hmembeds player calls a token-binding endpoint
// (hghndasw.gbgdhdffhf.shop/sec/<JWT>) that CORS-rejects requests from
// any origin other than hmembeds.one. By rewriting the URL to
// /embed-asset?url=..., the browser fetches our same-origin endpoint
// (no CORS issue), and our backend fetches the upstream with the
// correct Referer/Origin server-side (no CORS issue there either).
try {{
var b64url = function(s) {{
return btoa(unescape(encodeURIComponent(s)))
.replace(/\\+/g, '-').replace(/\\//g, '_').replace(/=+$/, '');
}};
var sameOrigin = function(u) {{
try {{ return (new URL(u, document.baseURI || location.href)).origin === location.origin; }}
catch (_) {{ return true; }}
}};
var toAbsolute = function(u) {{
try {{ return (new URL(u, document.baseURI || location.href)).toString(); }}
catch (_) {{ return u; }}
}};
var proxify = function(u) {{
var abs = toAbsolute(u);
if (sameOrigin(abs)) return u;
// Don't double-proxy.
if (abs.indexOf('/embed-asset?') !== -1 || abs.indexOf('/embed?') !== -1) return u;
return location.origin + '/embed-asset?url=' + b64url(abs);
}};
var _fetch = window.fetch && window.fetch.bind(window);
if (_fetch) {{
window.fetch = function(input, init) {{
try {{
if (typeof input === 'string') {{
return _fetch(proxify(input), init);
}} else if (input && input.url) {{
var newUrl = proxify(input.url);
if (newUrl !== input.url) {{
return _fetch(new Request(newUrl, input), init);
}}
}}
}} catch (e) {{}}
return _fetch(input, init);
}};
}}
var XHR = window.XMLHttpRequest;
if (XHR && XHR.prototype && XHR.prototype.open) {{
var _open = XHR.prototype.open;
XHR.prototype.open = function(method, url) {{
try {{ url = proxify(url); }} catch (e) {{}}
var args = Array.prototype.slice.call(arguments);
args[1] = url;
return _open.apply(this, args);
}};
}}
}} catch (e) {{}}
}})();</script>
"""
def _decode(encoded_url: str) -> str:
try:
return decode_url(encoded_url)
except Exception as e:
raise HTTPException(status_code=400, detail=f"Invalid encoded URL: {e}")
def _filter_headers(upstream_headers: httpx.Headers) -> dict[str, str]:
"""Forward upstream headers minus the ones we strip."""
out: dict[str, str] = {}
for k, v in upstream_headers.items():
if k.lower() in STRIP_RESPONSE_HEADERS:
continue
out[k] = v
# Always allow our domain to embed and load cross-origin
out["Access-Control-Allow-Origin"] = "*"
out["X-Frame-Options-Stripped"] = "by-f1-embed-proxy"
return out
def _make_referer(upstream_url: str) -> str:
"""Build a plausible Referer header — the upstream's own root."""
parsed = urlparse(upstream_url)
return f"{parsed.scheme}://{parsed.netloc}/"
def _make_origin(upstream_url: str) -> str:
parsed = urlparse(upstream_url)
return f"{parsed.scheme}://{parsed.netloc}"
def _inject_into_head(html: str, upstream_url: str) -> str:
"""Inject <base> tag + frame-buster defeat script into the response HTML."""
parsed = urlparse(upstream_url)
base_href = f"{parsed.scheme}://{parsed.netloc}/"
# The frame-buster-defeat script. Use the upstream's own URL as the spoofed referrer.
busted = _FRAME_BUSTER_DEFEAT_TEMPLATE.format(referrer=upstream_url)
base_tag = f'<base href="{base_href}">'
injection = base_tag + busted
# Drop any inline CSP <meta> tags first so they can't override our header strip.
html = re.sub(
r'<meta[^>]+http-equiv=[\'"]?Content-Security-Policy[\'"]?[^>]*>',
"",
html,
flags=re.IGNORECASE,
)
# Strip disable-devtool.js script tags. The library runs detection heuristics
# and redirects on match. Removing it reduces attack surface even with our
# location-setter lockdown — saves redundant work and one fewer thing to
# bypass in case the lockdown misses an edge case.
html = re.sub(
r'<script[^>]+(?:disable-devtool|devtool|disabledevtool)[^<]*</script>',
"",
html,
flags=re.IGNORECASE,
)
html = re.sub(
r'<script[^>]+src=["\'][^"\']*disable-devtool[^"\']*["\'][^>]*></script>',
"",
html,
flags=re.IGNORECASE,
)
# Insert immediately after the opening <head> (case-insensitive).
head_match = re.search(r"<head[^>]*>", html, flags=re.IGNORECASE)
if head_match:
idx = head_match.end()
return html[:idx] + injection + html[idx:]
# No <head> — prepend at the start of the document so the script runs first.
return injection + html
def _looks_blocked_by_anti_bot(content: str) -> bool:
"""Detect Cloudflare-style challenge interstitials in the upstream body."""
sample = content[:4096].lower()
markers = (
"cf-chl-bypass",
"checking your browser",
"just a moment",
"attention required",
"cf-browser-verification",
)
return any(m in sample for m in markers)
async def fetch_embed(encoded_url: str) -> tuple[bytes, dict[str, str], int]:
"""Fetch an upstream embed page, rewrite the HTML, and return the response.
Returns: (body_bytes, headers_dict, status_code).
Raises HTTPException on transport errors.
"""
url = _decode(encoded_url)
logger.info("Embed-proxying: %s", url)
upstream_headers = {
"User-Agent": USER_AGENT,
"Referer": _make_referer(url),
"Origin": _make_origin(url),
"Accept": (
"text/html,application/xhtml+xml,application/xml;q=0.9,"
"image/avif,image/webp,*/*;q=0.8"
),
"Accept-Language": "en-US,en;q=0.9",
}
try:
async with httpx.AsyncClient(
timeout=EMBED_TIMEOUT,
follow_redirects=True,
) as client:
response = await client.get(url, headers=upstream_headers)
except httpx.TimeoutException:
raise HTTPException(status_code=504, detail="Upstream embed timeout")
except httpx.HTTPError as e:
raise HTTPException(status_code=502, detail=f"Upstream embed error: {e}")
status_code = response.status_code
upstream_ct = response.headers.get("content-type", "")
headers_out = _filter_headers(response.headers)
body = response.content
# Detect Cloudflare-style challenge so the frontend can show a clear error.
if "html" in upstream_ct.lower():
text = response.text
if _looks_blocked_by_anti_bot(text):
logger.warning("Upstream returned anti-bot challenge: %s", url)
raise HTTPException(
status_code=502,
detail="Upstream returned anti-bot challenge — proxy cannot bypass",
)
rewritten = _inject_into_head(text, url)
body = rewritten.encode("utf-8")
headers_out["Content-Type"] = "text/html; charset=utf-8"
return body, headers_out, status_code
async def relay_asset(
encoded_url: str, range_header: str | None
) -> tuple[AsyncGenerator[bytes, None], dict[str, str], int]:
"""Relay an upstream subresource (JS/CSS/image/font) as a chunked stream.
Used as a fallback when an upstream blocks hotlinked assets via Referer
or Origin checks. The injected <base> tag handles most of these cases
by letting the browser hit upstream directly the relay is only for
the awkward few that need a proxied origin.
"""
url = _decode(encoded_url)
logger.debug("Embed-asset relay: %s", url)
headers = {
"User-Agent": USER_AGENT,
"Referer": _make_referer(url),
"Origin": _make_origin(url),
"Accept": "*/*",
}
if range_header:
headers["Range"] = range_header
client = httpx.AsyncClient(timeout=ASSET_TIMEOUT, follow_redirects=True)
try:
response = await client.send(
client.build_request("GET", url, headers=headers),
stream=True,
)
except httpx.TimeoutException:
await client.aclose()
raise HTTPException(status_code=504, detail="Upstream asset timeout")
except httpx.HTTPError as e:
await client.aclose()
raise HTTPException(status_code=502, detail=f"Upstream asset error: {e}")
if response.status_code >= 400:
await response.aclose()
await client.aclose()
raise HTTPException(
status_code=502,
detail=f"Upstream asset returned HTTP {response.status_code}",
)
headers_out = _filter_headers(response.headers)
async def _stream() -> AsyncGenerator[bytes, None]:
try:
async for chunk in response.aiter_bytes(chunk_size=RELAY_CHUNK_SIZE):
yield chunk
finally:
await response.aclose()
await client.aclose()
return _stream(), headers_out, response.status_code

View file

@ -12,12 +12,20 @@ Example:
"""
from backend.extractors.aceztrims import AceztrimsExtractor
from backend.extractors.chrome_browser import ChromeBrowserExtractor
from backend.extractors.curated import CuratedExtractor
from backend.extractors.dd12 import DD12Extractor
from backend.extractors.stremio import StremioAddonExtractor
from backend.extractors.subreddit import SubredditExtractor
from backend.extractors.daddylive import DaddyLiveExtractor
from backend.extractors.demo import DemoExtractor
from backend.extractors.discord_source import DiscordExtractor
from backend.extractors.models import ExtractedStream
from backend.extractors.pitsport import PitsportExtractor
from backend.extractors.ppv import PPVExtractor
from backend.extractors.registry import ExtractorRegistry
from backend.extractors.service import ExtractionService
from backend.extractors.streamed import StreamedExtractor
from backend.extractors.timstreams import TimStreamsExtractor
__all__ = [
"ExtractedStream",
@ -36,10 +44,36 @@ def create_registry() -> ExtractorRegistry:
registry = ExtractorRegistry()
# --- Register extractors below ---
registry.register(DemoExtractor())
# CuratedExtractor previously surfaced two hmembeds 24/7 channels (Sky
# Sports F1, DAZN F1) but their JW Player decoder produces an empty
# playlist in our environment (error 102630) regardless of headed mode,
# IP, or fingerprint we tried. The streams loaded the upstream's ad
# overlay but never produced a video element, so they confused users —
# disabled until/unless we find a working bypass.
# registry.register(CuratedExtractor())
registry.register(StreamedExtractor())
# ChromeBrowserExtractor drives the in-cluster chrome-service via the
# CHROME_WS_URL / CHROME_WS_TOKEN env vars to scrape JS-rendered
# pages whose m3u8 is computed at runtime.
registry.register(ChromeBrowserExtractor())
# SubredditExtractor pulls live-stream posts from motorsport subreddits.
# Returns embed-type streams; the verifier will visit each via
# chrome-service to confirm playability.
registry.register(SubredditExtractor())
# DD12Extractor scrapes DD12Streams' per-channel pages for the inline
# JW Player file URL. The site embeds the m3u8 in HTML so curl-based
# parsing is enough — no browser needed.
registry.register(DD12Extractor())
# StremioAddonExtractor calls Stremio addon HTTP APIs (TvVoo, StremVerse)
# which already index Sky F1 / DAZN F1 / Vavoo IPTV channels. No
# Stremio client needed — just /stream/<type>/<id>.json calls.
registry.register(StremioAddonExtractor())
registry.register(DaddyLiveExtractor())
registry.register(AceztrimsExtractor())
registry.register(PitsportExtractor())
registry.register(PPVExtractor())
registry.register(TimStreamsExtractor())
registry.register(DiscordExtractor())
return registry

View file

@ -0,0 +1,243 @@
"""Generic chrome-service-driven extractor.
Drives the in-cluster headed Chromium pool (chrome-service) to load a list
of stream/aggregator pages, captures any HLS playlist URL the page fetches
at runtime, and returns one ExtractedStream per discovered playlist.
Unlike the API-based extractors (pitsport/streamed/ppv) this one handles
sites where the m3u8 is computed by JavaScript at page load time the
URL only exists after the page evaluates an obfuscated decoder, fetches a
token, etc. Curl can't see it; a real browser can.
Add new targets via the `TARGETS` constant below. Each entry is a (label,
title, page_url) tuple. The extractor visits each URL with a stealthed
context, waits for the JS to settle, and yields any captured HLS URL.
"""
import asyncio
import logging
import os
import re
import urllib.parse
from dataclasses import dataclass
from backend.extractors.base import BaseExtractor
from backend.extractors.models import ExtractedStream
logger = logging.getLogger(__name__)
# Best-effort pause between navigation and capture. The decoder usually
# fires within 5s; 12s gives slow JS time to settle without dragging the
# extraction round.
DEFAULT_SETTLE_SECONDS = 12
USER_AGENT = (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) "
"Version/17.4 Safari/605.1.15"
)
@dataclass(frozen=True)
class _Target:
label: str # site_name (homepage label in the UI)
title: str # human-readable stream title
url: str # page to navigate
settle: int = DEFAULT_SETTLE_SECONDS
# ---------------------------------------------------------------------------
# Target list. F1-relevant 24/7 channels and motorsport aggregator pages
# whose m3u8 is JS-computed. Add freely — each one takes ~12s to scrape.
# ---------------------------------------------------------------------------
TARGETS: tuple[_Target, ...] = (
# MotoMundo embed pages — the community-curated WordPress site for
# MotoGP. Each /e/<id> URL is one of the iframes their "Watch Online"
# post lists for the active session (FP/Q/Race). The m3u8 is
# JS-computed at load time so a real browser is required to capture
# it. Update IDs each weekend to match the current race; subreddit.py
# discovers them from the Reddit "[Watch / Download]" thread.
_Target(
label="MotoMundo",
title="MotoGP Live (MotoMundo) — French GP / Le Mans",
url="https://motomundo.top/e/9yzn08jk9py4",
settle=15,
),
_Target(
label="MotoMundo",
title="MotoGP Live (MotoMundo upns) — French GP / Le Mans",
url="https://motomundo.upns.xyz/#kqasde",
settle=15,
),
)
# Heuristic to recognise an HLS playlist URL from network capture. Most CDNs
# use `.m3u8`; some (pushembdz/oe1.ossfeed) disguise the playlist as `.css`
# under a /out/v… or /hls/ path. Filter out obvious junk (.css for actual
# stylesheets, .ts segments — we only want the playlist).
_HLS_URL_RE = re.compile(r"\.m3u8(\?|$)|/out/v[0-9]+/.+\.css(\?|$)|/hls/.+/master\.css(\?|$)")
_SEGMENT_EXT_RE = re.compile(r"\.(ts|m4s|aac|key)(\?|$)")
def _looks_like_hls_playlist(url: str) -> bool:
if _SEGMENT_EXT_RE.search(url):
return False
return bool(_HLS_URL_RE.search(url))
def _resolve_chrome_ws() -> str | None:
base = os.getenv("CHROME_WS_URL")
token = os.getenv("CHROME_WS_TOKEN")
if not base or not token:
return None
return f"{base.rstrip('/')}/{token}"
class ChromeBrowserExtractor(BaseExtractor):
"""Drive chrome-service to capture m3u8 URLs from JS-heavy pages."""
@property
def site_key(self) -> str:
return "chrome-browser"
@property
def site_name(self) -> str:
return "Chrome Browser"
async def extract(self) -> list[ExtractedStream]:
ws_url = _resolve_chrome_ws()
if not ws_url:
logger.warning(
"[chrome-browser] CHROME_WS_URL/TOKEN not set — extractor disabled"
)
return []
try:
from playwright.async_api import async_playwright
except ImportError:
logger.warning("[chrome-browser] playwright not installed — disabled")
return []
# One Playwright instance + one browser connection per extraction
# round. Contexts are cheap; the browser is shared.
async with async_playwright() as p:
try:
browser = await p.chromium.connect(ws_url, timeout=15_000)
except Exception:
logger.exception("[chrome-browser] connect to chrome-service failed")
return []
results: list[ExtractedStream] = []
for target in TARGETS:
try:
stream = await self._scrape(browser, target)
if stream:
results.append(stream)
except Exception:
logger.exception(
"[chrome-browser] failed to scrape %s", target.url
)
try:
await browser.close()
except Exception:
pass
logger.info("[chrome-browser] returned %d stream(s)", len(results))
return results
async def _scrape(self, browser, target: _Target) -> ExtractedStream | None:
ctx = await browser.new_context(
user_agent=USER_AGENT,
viewport={"width": 1280, "height": 720},
bypass_csp=True,
)
# Inject the same stealth script the verifier uses so anti-bot
# checks don't trip the page before its decoder runs.
try:
from backend.stealth import STEALTH_JS
await ctx.add_init_script(STEALTH_JS)
except Exception:
pass
page = await ctx.new_page()
captured: list[str] = []
def on_response(resp):
try:
if _looks_like_hls_playlist(resp.url):
captured.append(resp.url)
except Exception:
pass
page.on("response", on_response)
# Some pages (DD12 variants) load the player in a child iframe;
# frame events catch nested navigations.
page.on(
"framenavigated",
lambda fr: captured.append(fr.url) if _looks_like_hls_playlist(fr.url) else None,
)
try:
await page.goto(target.url, wait_until="domcontentloaded", timeout=20_000)
except Exception as e:
logger.debug("[chrome-browser] %s goto failed: %s", target.url, e)
await ctx.close()
return None
# Let the page's JS settle.
await asyncio.sleep(target.settle)
# Also probe child iframes — `pushembdz`, `pooembed`, `embedsports`
# all live behind one. Collect any HLS URL the iframes loaded.
for fr in page.frames:
if fr is page.main_frame:
continue
try:
# JW Player and Clappr both expose the playing source via
# a <video>/`<source>` element after setup completes.
sources = await fr.evaluate(
"() => Array.from(document.querySelectorAll('video, source')).map(e => e.currentSrc || e.src || '').filter(s => s.includes('.m3u8') || s.includes('.css'))"
)
for s in sources:
if _looks_like_hls_playlist(s):
captured.append(s)
except Exception:
pass
await ctx.close()
# Pick the first plausible URL (any subsequent are usually variant
# playlists referenced from the master). Prefer URLs that look like
# full master playlists.
unique = list(dict.fromkeys(captured))
if not unique:
logger.debug("[chrome-browser] %s yielded no HLS URL", target.url)
return None
# Prefer URLs that look like a master/index playlist over variant
# playlists when both are captured.
master = next(
(u for u in unique if "master" in u.lower() or "index" in u.lower()),
unique[0],
)
# Strip query strings on URLs that include short-lived tokens —
# the verifier and frontend re-resolve them per request.
# (Some CDNs require the query though; only strip when obvious.)
m3u8 = master
# Decode URL-encoded characters so the proxy gets a clean URL.
m3u8 = urllib.parse.unquote(m3u8)
logger.info(
"[chrome-browser] %s -> %s",
target.url, m3u8[:120],
)
return ExtractedStream(
url=m3u8,
site_key=self.site_key,
site_name=target.label,
quality="",
title=target.title,
stream_type="m3u8",
)

View file

@ -0,0 +1,61 @@
"""Curated extractor — known-good 24/7 F1 channels via direct embed URLs.
Returns a small, hand-picked list of embed URLs that are reliable enough to
be served as fallback "always-on" streams when the dynamic extractors find
nothing (e.g. between race weekends, when API providers are down).
These are direct embed URLs. The frontend routes them through /embed so the
iframe-stripping proxy bypasses any frame-buster JS in the upstream player.
"""
import logging
from backend.extractors.base import BaseExtractor
from backend.extractors.models import ExtractedStream
logger = logging.getLogger(__name__)
# Curated list. Each entry is a known direct embed URL. These were sourced
# from the timstreams.py ALWAYS_INCLUDE_HASHES list (Sky Sports F1, DAZN F1)
# and are documented as 24/7 channels that play F1 content year-round.
_CURATED_STREAMS = [
{
"url": "https://hmembeds.one/embed/888520f36cd94c5da4c71fddc1a5fc9b",
"title": "Sky Sports F1 (24/7)",
"quality": "HD",
},
{
"url": "https://hmembeds.one/embed/fc3a54634d0867b0c02ee3223292e7c6",
"title": "DAZN F1 (24/7)",
"quality": "HD",
},
]
class CuratedExtractor(BaseExtractor):
"""Returns curated known-good 24/7 F1 channel embed URLs."""
@property
def site_key(self) -> str:
return "curated"
@property
def site_name(self) -> str:
return "Curated 24/7 Channels"
async def extract(self) -> list[ExtractedStream]:
streams = [
ExtractedStream(
url=entry["url"],
site_key=self.site_key,
site_name=self.site_name,
quality=entry["quality"],
title=entry["title"],
stream_type="embed",
embed_url=entry["url"],
)
for entry in _CURATED_STREAMS
]
logger.info("[curated] Returning %d curated stream(s)", len(streams))
return streams

View file

@ -0,0 +1,111 @@
"""DD12Streams extractor — scrapes inline m3u8 URLs from per-channel pages.
Each DD12 sport page (`/nas`, `/f1`, `/sky`, etc.) renders an iframe to
`/<channel>c1` which 302-redirects to `/new-<channel>/jwplayer`. That
page contains a JW Player setup with the m3u8 URL hard-coded inline:
playerInstance.setup({
file: "https://...b-cdn.net/.../master.m3u8",
...
});
The JW Player runtime fails in our cluster (same fingerprint trap as
hmembeds), but we don't need it — the file URL is in the HTML and any
browser with H.264 codecs can play it directly via hls.js.
Channel discovery: probe a known list. New ones can be added by checking
DD12's own homepage / nav.
"""
import logging
import re
import httpx
from backend.extractors.base import BaseExtractor
from backend.extractors.models import ExtractedStream
logger = logging.getLogger(__name__)
BASE = "https://dd12streams.com"
USER_AGENT = (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) "
"Version/17.4 Safari/605.1.15"
)
# (path, channel_label, title). Add as DD12 surfaces new channels.
CHANNELS = (
("nas", "DD12Streams", "NASCAR Cup Series (24/7) — DD12"),
)
_FILE_URL_RE = re.compile(r"""file\s*:\s*["']([^"']+\.m3u8[^"']*)["']""")
class DD12Extractor(BaseExtractor):
@property
def site_key(self) -> str:
return "dd12"
@property
def site_name(self) -> str:
return "DD12Streams"
async def extract(self) -> list[ExtractedStream]:
results: list[ExtractedStream] = []
async with httpx.AsyncClient(
timeout=15.0,
follow_redirects=True,
headers={"User-Agent": USER_AGENT},
) as client:
for path, label, title in CHANNELS:
try:
page_url = f"{BASE}/{path}"
resp = await client.get(page_url)
if resp.status_code != 200:
continue
iframe_path = self._extract_iframe(resp.text)
if not iframe_path:
continue
iframe_url = (
iframe_path
if iframe_path.startswith("http")
else f"{BASE}{iframe_path}"
)
iframe_resp = await client.get(
iframe_url, headers={"Referer": page_url}
)
if iframe_resp.status_code != 200:
continue
m3u8 = self._find_m3u8(iframe_resp.text)
if not m3u8:
continue
results.append(
ExtractedStream(
url=m3u8,
site_key=self.site_key,
site_name=label,
quality="",
title=title,
stream_type="m3u8",
)
)
except Exception:
logger.debug(
"[dd12] /%s extraction failed", path, exc_info=True
)
logger.info("[dd12] Extracted %d stream(s)", len(results))
return results
@staticmethod
def _extract_iframe(html: str) -> str | None:
m = re.search(
r'<iframe[^>]+id=["\']vplayer["\'][^>]+src=["\']([^"\']+)["\']',
html,
)
return m.group(1) if m else None
@staticmethod
def _find_m3u8(html: str) -> str | None:
m = _FILE_URL_RE.search(html)
return m.group(1) if m else None

View file

@ -0,0 +1,203 @@
"""Discord extractor - monitors Discord channels for F1 stream links.
Reads recent messages from configured Discord channels using a user token,
extracts URLs that look like stream links, and returns them as embed streams.
"""
import logging
import os
import re
import httpx
from backend.extractors.base import BaseExtractor
from backend.extractors.models import ExtractedStream
logger = logging.getLogger(__name__)
DISCORD_API = "https://discord.com/api/v9"
DISCORD_TOKEN = os.getenv("DISCORD_TOKEN", "")
# Comma-separated channel IDs to monitor
DISCORD_CHANNELS = os.getenv("DISCORD_CHANNELS", "").split(",")
# How many messages to fetch per channel
MESSAGE_LIMIT = 50
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
# URL pattern to match stream links (exclude Discord CDN, images, etc.)
URL_PATTERN = re.compile(r"https?://[^\s<>\)\]\"']+", re.IGNORECASE)
# Domains that publish news/articles, not playable streams. Discord users share
# these links during race weekends; they are NOT streams and pollute the list.
EXCLUDED_DOMAINS = {
"discord.com", "discord.gg", "cdn.discordapp.com",
"tenor.com", "giphy.com", "imgur.com",
"youtube.com", "youtu.be", "twitter.com", "x.com",
"reddit.com", "instagram.com", "tiktok.com",
"fmhy.net", "github.com", "freemotorsports.com",
# News / official sites — never playable embeds
"formula1.com", "fia.com", "skysports.com", "motorsport.com",
"driverdb.com", "autosport.com", "the-race.com", "racefans.net",
"wikipedia.org", "fantasy.formula1.com",
}
# A URL is treated as a candidate stream embed only if its path looks like
# a *direct* player/embed page — `/embed/{id}`, `/player/{...}`, `*.m3u8`,
# `*.php` (legacy iframe1.php style). Aggregator landing pages
# (`/event/...`, `/watch?session=...`, etc.) are rejected because they
# show a list of links instead of playing automatically — those produce
# verifier-passing UI without actual playback.
_PATH_KEYWORDS = (
"/embed/", "/player/", ".m3u8", ".php",
)
def _is_stream_url(url: str) -> bool:
"""Heuristic: does this URL look like an actual stream/embed/player link?
Discord users share lots of news links during race weekends. The old
filter only blocked specific domains and let everything else through,
which produced a stream list dominated by formula1.com news articles.
The new filter is positive-match: a URL must contain at least one
stream-shaped path keyword to be included.
"""
from urllib.parse import urlparse
try:
parsed = urlparse(url)
domain = parsed.netloc.lower()
path = parsed.path.lower()
except Exception:
return False
if not domain:
return False
for excluded in EXCLUDED_DOMAINS:
if excluded in domain:
return False
if any(path.endswith(ext) for ext in (".png", ".jpg", ".jpeg", ".gif", ".webp", ".mp4", ".webm", ".svg", ".css", ".js")):
return False
full = path + ("?" + parsed.query if parsed.query else "")
if not any(kw in full for kw in _PATH_KEYWORDS):
return False
return True
class DiscordExtractor(BaseExtractor):
"""Extracts stream links from Discord channel messages.
Monitors configured Discord channels for URLs shared by users,
filters to likely stream links, and returns them as embed streams.
"""
@property
def site_key(self) -> str:
return "discord"
@property
def site_name(self) -> str:
return "Discord Community"
async def extract(self) -> list[ExtractedStream]:
"""Fetch recent messages from Discord channels and extract URLs."""
if not DISCORD_TOKEN:
logger.info("[discord] No DISCORD_TOKEN set, skipping")
return []
channels = [c.strip() for c in DISCORD_CHANNELS if c.strip()]
if not channels:
logger.info("[discord] No DISCORD_CHANNELS configured, skipping")
return []
streams: list[ExtractedStream] = []
seen_urls: set[str] = set()
try:
async with httpx.AsyncClient(
timeout=15.0,
follow_redirects=True,
headers={
"Authorization": DISCORD_TOKEN,
"User-Agent": USER_AGENT,
},
) as client:
for channel_id in channels:
try:
channel_streams = await self._fetch_channel(
client, channel_id, seen_urls
)
streams.extend(channel_streams)
except Exception:
logger.debug(
"[discord] Failed to fetch channel %s",
channel_id,
exc_info=True,
)
except Exception:
logger.exception("[discord] Failed to connect to Discord API")
logger.info("[discord] Extracted %d stream(s) from %d channel(s)", len(streams), len(channels))
return streams
async def _fetch_channel(
self,
client: httpx.AsyncClient,
channel_id: str,
seen_urls: set[str],
) -> list[ExtractedStream]:
"""Fetch messages from a single channel and extract stream URLs."""
resp = await client.get(
f"{DISCORD_API}/channels/{channel_id}/messages",
params={"limit": MESSAGE_LIMIT},
)
if resp.status_code != 200:
logger.warning(
"[discord] Channel %s returned HTTP %d", channel_id, resp.status_code
)
return []
messages = resp.json()
if not isinstance(messages, list):
return []
streams: list[ExtractedStream] = []
for msg in messages:
content = msg.get("content", "")
author = msg.get("author", {}).get("username", "unknown")
# Extract URLs from message content
urls = URL_PATTERN.findall(content)
# Also check embeds
for embed in msg.get("embeds", []):
if embed.get("url"):
urls.append(embed["url"])
for url in urls:
# Clean trailing punctuation
url = url.rstrip(".,;:!?)")
if url in seen_urls:
continue
if not _is_stream_url(url):
continue
seen_urls.add(url)
streams.append(
ExtractedStream(
url=url,
site_key=self.site_key,
site_name=self.site_name,
quality="",
title=f"Shared by {author}",
stream_type="embed",
embed_url=url,
)
)
return streams

View file

@ -0,0 +1,544 @@
"""Pitsport.xyz extractor - fetches F1 streams from the Next.js RSC payload.
Architecture:
- Main page (pitsport.xyz) has a "Live Now" section with event cards containing
category, title, time, imageUrl props and /watch/{UUID} links.
- Schedule page (pitsport.xyz/schedule) lists all events grouped by category
(h2 headings) with /watch/{UUID} links and event titles.
- Watch pages (/watch/{UUID}) embed iframes from pushembdz.store/embed/{EMBED_UUID}.
- Embed pages contain an RSC payload with a stream config: {title, link, method}.
- When method is "player" or "hls", the link field points to a serveplay.site
m3u8 playlist. Otherwise we return the embed URL for iframe playback.
"""
import logging
import re
from dataclasses import dataclass
import httpx
from backend.extractors.base import BaseExtractor
from backend.extractors.models import ExtractedStream
logger = logging.getLogger(__name__)
PITSPORT_BASE = "https://pitsport.xyz"
EMBED_BASE = "https://pushembdz.store"
USER_AGENT = (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36"
)
# Categories to include (case-insensitive match). Broadened beyond F1
# to also surface MotoGP and adjacent motorsports — keeps the f1-stream
# UI useful between race weekends and during the off-season.
MOTORSPORT_CATEGORIES = {
"formula 1", "formula 2", "formula 3",
"motogp", "moto gp", "moto2", "moto3", "motoe",
"world rally championship", "wrc",
"world endurance championship", "wec",
"indycar series", "indycar", "indynxt",
"nascar cup series", "nascar truck series", "nascar o'reilly auto parts series",
"nascar xfinity series", "nascar",
}
# Title keywords that are strong positives even when the category text
# is missing (live-now cards sometimes elide it).
MOTORSPORT_KEYWORDS = {
"formula 1", "formula one", "f1",
"motogp", "moto gp", "moto2", "moto3",
"rally", "wrc",
"indycar", "indy car",
"nascar",
"le mans", "lemans", "wec", "endurance",
}
GP_KEYWORD = "grand prix"
@dataclass
class _PitsportEvent:
"""An event discovered from the Pitsport site."""
category: str
title: str
watch_uuid: str
def _is_motorsport_category(category: str) -> bool:
"""Check if a category string matches an included motorsport series."""
return category.strip().lower() in MOTORSPORT_CATEGORIES
def _is_motorsport_event(category: str, title: str) -> bool:
"""Accept anything pitsport.xyz lists. Pitsport curates sports
broadcasts (WRC, MotoGP, IndyCar, NASCAR, Premier League Darts,
Premier League football, etc.) the site's own selection is the
filter we want. Empty/garbage events still get filtered downstream
when `_resolve_event_streams` produces no playable URL."""
return bool(category or title)
# Aliases kept so older call-sites stay compiling. Both now point at the
# broadened motorsport filter.
_is_f1_category = _is_motorsport_category
_is_f1_event = _is_motorsport_event
def _parse_live_events(html: str) -> list[_PitsportEvent]:
"""Parse live events from the main page RSC payload.
The main page contains event cards with props:
category, title, time, imageUrl
wrapped in <a href="/watch/{UUID}"> links.
"""
events: list[_PitsportEvent] = []
# Match event cards in the RSC payload - they appear as JSON-like structures
# Pattern: href="/watch/UUID" ... category":"...", "title":"..."
# In the RSC payload, the data is in the format:
# ["$","$L2","/watch/UUID",{"href":"/watch/UUID","children":["$","$L10",null,
# {"category":"...","title":"...","time":...,"imageUrl":"..."}]}]
pattern = re.compile(
r'"href":"(/watch/([0-9a-f-]{36}))"[^}]*?"category":"([^"]+)","title":"([^"]+)"',
)
for match in pattern.finditer(html):
_, uuid, category, title = match.groups()
events.append(_PitsportEvent(category=category, title=title, watch_uuid=uuid))
return events
def _parse_schedule_events(html: str) -> list[_PitsportEvent]:
"""Parse events from the schedule page.
The schedule page groups events under category headers (h2 elements).
In the rendered HTML:
<h2 ...>Formula 1</h2>
<div ...>
<a href="/watch/UUID">...</a>
...
</div>
In the RSC payload, similar structure with section divs containing
a category h2 and child event links with titles.
"""
events: list[_PitsportEvent] = []
# Strategy 1: Parse from rendered HTML
# Find category sections: >CategoryName</h2> followed by watch links
# Split HTML at each category header
section_pattern = re.compile(
r'>([^<]+)</h2>\s*<div[^>]*class="flex flex-wrap gap-6">(.*?)(?=</div>\s*</div>\s*(?:<div|</div>|$))',
re.DOTALL,
)
for section_match in section_pattern.finditer(html):
category = section_match.group(1).strip()
section_html = section_match.group(2)
# Find all watch links in this section
link_pattern = re.compile(
r'href="/watch/([0-9a-f-]{36})".*?<h1[^>]*>([^<]+)</h1>',
re.DOTALL,
)
for link_match in link_pattern.finditer(section_html):
uuid = link_match.group(1)
title = link_match.group(2).strip()
events.append(
_PitsportEvent(category=category, title=title, watch_uuid=uuid)
)
# Strategy 2: Parse from RSC payload if rendered HTML didn't yield results
# The RSC payload has patterns like:
# "children":"Formula 1"}] ... "/watch/UUID" ... "title":"EventTitle"
if not events:
events = _parse_schedule_rsc(html)
return events
def _parse_schedule_rsc(html: str) -> list[_PitsportEvent]:
"""Parse events from schedule page RSC payload as fallback.
Extracts category section divs from the RSC JSON structure.
"""
events: list[_PitsportEvent] = []
# Find the RSC payload chunks
rsc_chunks = re.findall(
r'self\.__next_f\.push\(\[1,"(.*?)"\]\)', html, re.DOTALL
)
if not rsc_chunks:
return events
# Concatenate and unescape
full_payload = ""
for chunk in rsc_chunks:
try:
full_payload += chunk.encode().decode("unicode_escape")
except Exception:
full_payload += chunk
# Find category sections in the RSC data
# Pattern: "children":"CategoryName"}],["$","div",...watch links...
# Each section div contains an h2 with the category name and watch links
cat_pattern = re.compile(
r'border-gray-700 pb-2","children":"([^"]+)"\}.*?'
r'(?=border-gray-700 pb-2","children"|$)',
re.DOTALL,
)
for cat_match in cat_pattern.finditer(full_payload):
category = cat_match.group(1)
section_text = cat_match.group(0)
# Find watch UUIDs and titles in this section
# Pattern: "/watch/UUID" ... "title":"EventTitle"
event_pattern = re.compile(
r'/watch/([0-9a-f-]{36}).*?"title":"([^"]+)"',
)
for ev_match in event_pattern.finditer(section_text):
uuid = ev_match.group(1)
title = ev_match.group(2)
events.append(
_PitsportEvent(category=category, title=title, watch_uuid=uuid)
)
return events
def _parse_embed_uuids(html: str) -> list[str]:
"""Extract embed UUIDs from a watch page.
Watch pages contain iframes like:
<iframe src="https://pushembdz.store/embed/{EMBED_UUID}" ...>
And in the RSC payload:
"iframe":"https://pushembdz.store/embed/{EMBED_UUID}"
"""
uuids: list[str] = []
# From rendered HTML
iframe_pattern = re.compile(
r'pushembdz\.store/embed/([0-9a-f-]{36})',
)
for match in iframe_pattern.finditer(html):
uuid = match.group(1)
if uuid not in uuids:
uuids.append(uuid)
return uuids
@dataclass
class _StreamConfig:
"""Stream configuration extracted from an embed page."""
title: str
link: str
method: str
def _parse_stream_config(html: str) -> _StreamConfig | None:
"""Extract stream config from an embed page RSC payload.
The embed page now uses a `safeStream` payload that elides the link:
4:["$","$Ld",null,{"safeStream":{"title":"Rally TV","method":"jwp"},
"error":null,"slug":"..."}]
The actual stream URL is fetched at runtime via
pushembdz.store/api/stream/<slug>. Older payloads used "stream" with
inline title+link+method kept as fallback.
"""
# Current format: safeStream with title + method only (link via API).
pattern_safe = re.compile(
r'\\?"safeStream\\?"\s*:\s*\{'
r'\\?"title\\?"\s*:\s*\\?"([^"\\]+)\\?"\s*,\s*'
r'\\?"method\\?"\s*:\s*\\?"([^"\\]+)\\?"',
)
match = pattern_safe.search(html)
if match:
return _StreamConfig(
title=match.group(1),
link="", # filled in by the caller via the api/stream endpoint
method=match.group(2),
)
# Legacy: escaped RSC payload with inline link.
pattern = re.compile(
r'"stream":\{["\']?\\?"title\\?"["\']?:["\']?\\?"([^"\\]+)\\?"["\']?,'
r'["\']?\\?"link\\?"["\']?:["\']?\\?"([^"\\]+)\\?"["\']?,'
r'["\']?\\?"method\\?"["\']?:["\']?\\?"([^"\\]+)\\?"',
)
match = pattern.search(html)
if match:
return _StreamConfig(title=match.group(1), link=match.group(2), method=match.group(3))
pattern2 = re.compile(
r'\\?"stream\\?":\{\\?"title\\?":\\?"([^\\]+)\\?",'
r'\\?"link\\?":\\?"([^\\]+)\\?",'
r'\\?"method\\?":\\?"([^\\]+)\\?"',
)
match = pattern2.search(html)
if match:
return _StreamConfig(title=match.group(1), link=match.group(2), method=match.group(3))
pattern3 = re.compile(
r'"stream"\s*:\s*\{\s*"title"\s*:\s*"([^"]+)"\s*,'
r'\s*"link"\s*:\s*"([^"]+)"\s*,'
r'\s*"method"\s*:\s*"([^"]+)"',
)
match = pattern3.search(html)
if match:
return _StreamConfig(title=match.group(1), link=match.group(2), method=match.group(3))
return None
def _is_m3u8_method(method: str) -> bool:
"""Check if the stream method indicates a direct HLS stream."""
# `jwp` (current pushembdz format) returns an m3u8 from the api/stream
# endpoint regardless of player UI; treat it as HLS.
return method.lower() in ("player", "hls", "jwp")
def _extract_m3u8_url(link: str) -> str:
"""Convert a serveplay.site player URL to an m3u8 playlist URL.
Input: https://dash.serveplay.site/{channel}/index.html
Output: https://dash.serveplay.site/{channel}/index.html
The index.html IS the m3u8 playlist (served with proper content-type
when fetched with the correct Referer header).
"""
return link
class PitsportExtractor(BaseExtractor):
"""Extracts F1 streams from Pitsport.xyz.
Scrapes the Next.js RSC payload from the main page and schedule page
to find F1 events, then resolves embed UUIDs to stream configurations.
"""
@property
def site_key(self) -> str:
return "pitsport"
@property
def site_name(self) -> str:
return "Pitsport"
async def extract(self) -> list[ExtractedStream]:
"""Fetch F1 events and return stream URLs or embed URLs."""
streams: list[ExtractedStream] = []
try:
async with httpx.AsyncClient(
timeout=20.0,
follow_redirects=True,
headers={"User-Agent": USER_AGENT},
) as client:
# Fetch both pages to get comprehensive event data
events = await self._discover_events(client)
logger.info(
"[pitsport] Found %d F1 event(s) to process", len(events)
)
# Deduplicate by watch UUID
seen_uuids: set[str] = set()
unique_events: list[_PitsportEvent] = []
for ev in events:
if ev.watch_uuid not in seen_uuids:
seen_uuids.add(ev.watch_uuid)
unique_events.append(ev)
# For each event, resolve streams
for event in unique_events:
event_streams = await self._resolve_event_streams(
client, event
)
streams.extend(event_streams)
except Exception:
logger.exception("[pitsport] Failed to extract streams")
logger.info("[pitsport] Extracted %d stream(s)", len(streams))
return streams
async def _discover_events(
self, client: httpx.AsyncClient
) -> list[_PitsportEvent]:
"""Discover F1 events from both main page and schedule page."""
all_events: list[_PitsportEvent] = []
# Fetch main page for live events
try:
resp = await client.get(PITSPORT_BASE)
if resp.status_code == 200:
live_events = _parse_live_events(resp.text)
logger.info(
"[pitsport] Main page: %d live event(s)", len(live_events)
)
for ev in live_events:
if _is_f1_event(ev.category, ev.title):
all_events.append(ev)
else:
logger.warning(
"[pitsport] Main page returned HTTP %d", resp.status_code
)
except Exception:
logger.exception("[pitsport] Failed to fetch main page")
# Fetch schedule page for upcoming events
try:
resp = await client.get(f"{PITSPORT_BASE}/schedule")
if resp.status_code == 200:
schedule_events = _parse_schedule_events(resp.text)
logger.info(
"[pitsport] Schedule page: %d total event(s)",
len(schedule_events),
)
for ev in schedule_events:
if _is_f1_event(ev.category, ev.title):
all_events.append(ev)
else:
logger.warning(
"[pitsport] Schedule page returned HTTP %d",
resp.status_code,
)
except Exception:
logger.exception("[pitsport] Failed to fetch schedule page")
return all_events
async def _resolve_event_streams(
self, client: httpx.AsyncClient, event: _PitsportEvent
) -> list[ExtractedStream]:
"""Resolve an event's watch page to actual stream URLs."""
streams: list[ExtractedStream] = []
try:
# Fetch the watch page to get embed UUIDs
watch_url = f"{PITSPORT_BASE}/watch/{event.watch_uuid}"
resp = await client.get(watch_url)
if resp.status_code != 200:
logger.debug(
"[pitsport] Watch page %s returned HTTP %d",
event.watch_uuid,
resp.status_code,
)
return []
embed_uuids = _parse_embed_uuids(resp.text)
if not embed_uuids:
logger.debug(
"[pitsport] No embed UUIDs found for %s", event.watch_uuid
)
return []
logger.debug(
"[pitsport] Event '%s' has %d embed(s)",
event.title,
len(embed_uuids),
)
# Resolve each embed to a stream config
for i, embed_uuid in enumerate(embed_uuids):
stream = await self._resolve_embed(
client, embed_uuid, event, stream_num=i + 1
)
if stream:
streams.append(stream)
except Exception:
logger.debug(
"[pitsport] Failed to resolve event %s",
event.watch_uuid,
exc_info=True,
)
return streams
async def _resolve_embed(
self,
client: httpx.AsyncClient,
embed_uuid: str,
event: _PitsportEvent,
stream_num: int,
) -> ExtractedStream | None:
"""Resolve an embed UUID to a stream configuration."""
try:
embed_url = f"{EMBED_BASE}/embed/{embed_uuid}"
resp = await client.get(embed_url)
if resp.status_code != 200:
logger.debug(
"[pitsport] Embed page %s returned HTTP %d",
embed_uuid,
resp.status_code,
)
return None
config = _parse_stream_config(resp.text)
if not config:
logger.debug(
"[pitsport] No stream config found in embed %s",
embed_uuid,
)
return None
# Build the stream title
stream_title = f"{event.category} - {event.title}"
if config.title:
stream_title += f" ({config.title})"
if stream_num > 1:
stream_title += f" #{stream_num}"
# `safeStream` payload elides the link — fetch it from the
# pushembdz.store/api/stream/<slug> endpoint. Older `stream`
# payloads provided the link inline.
link = config.link
if not link and _is_m3u8_method(config.method):
api_url = f"{EMBED_BASE}/api/stream/{embed_uuid}"
try:
api_resp = await client.get(
api_url,
headers={"Referer": embed_url, "Accept": "application/json"},
)
if api_resp.status_code == 200:
link = (api_resp.json() or {}).get("link", "")
except Exception:
logger.debug(
"[pitsport] api/stream lookup failed for %s",
embed_uuid,
exc_info=True,
)
# Treat any HLS-ish URL (m3u8, or pushembdz's .css disguise) as m3u8.
looks_hls = link and (".m3u8" in link or link.endswith(".css") or "serveplay.site" in link)
if _is_m3u8_method(config.method) and looks_hls:
return ExtractedStream(
url=link,
site_key=self.site_key,
site_name=self.site_name,
quality="",
title=stream_title,
stream_type="m3u8",
)
else:
# Iframe embed fallback
return ExtractedStream(
url=embed_url,
site_key=self.site_key,
site_name=self.site_name,
quality="",
title=stream_title,
stream_type="embed",
embed_url=embed_url,
)
except Exception:
logger.debug(
"[pitsport] Failed to resolve embed %s",
embed_uuid,
exc_info=True,
)
return None

View file

@ -0,0 +1,270 @@
"""PPV.to extractor - fetches F1 streams via the public PPV API.
Returns embed URLs (pooembed.eu) for iframe playback.
The API at api.ppv.to/api/streams requires no authentication.
Falls back to api.ppv.st if the primary API is unreachable.
"""
import logging
import httpx
from backend.extractors.base import BaseExtractor
from backend.extractors.models import ExtractedStream
logger = logging.getLogger(__name__)
PRIMARY_API = "https://api.ppv.to/api/streams"
FALLBACK_API = "https://api.ppv.st/api/streams"
EMBED_BASE = "https://pooembed.eu/embed"
USER_AGENT = (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36"
)
# Category name for motorsport on PPV.to
MOTORSPORT_CATEGORY = "motorsports"
# Only include events matching these keywords (case-insensitive)
F1_KEYWORDS = {"formula 1", "formula one", "f1", "sky sports f1"}
# Grand Prix is shared with MotoGP/IndyCar — only match if no other series keywords
GP_KEYWORD = "grand prix"
NON_F1_KEYWORDS = {
"motogp", "moto gp", "moto2", "moto3", "motoe",
"indycar", "indy car", "firestone", "nascar",
"rally", "wrc", "wec", "lemans", "le mans",
"superbike", "dtm", "supercars",
}
def _is_f1_stream(name: str, category_name: str = "") -> bool:
"""Check if a stream is Formula 1 related.
Checks both the stream name and the category name.
A stream qualifies if:
- It is in the motorsport category AND matches F1 keywords, OR
- It matches F1 keywords regardless of category.
"""
lower_name = name.lower()
lower_cat = category_name.lower()
# Reject if it contains non-F1 motorsport keywords
if any(kw in lower_name for kw in NON_F1_KEYWORDS):
return False
# Direct F1 keyword match in the stream name
if any(kw in lower_name for kw in F1_KEYWORDS):
return True
# "grand prix" in the name, only if in motorsports category and no non-F1 keywords
if GP_KEYWORD in lower_name and MOTORSPORT_CATEGORY in lower_cat:
return True
# If the category is motorsport, also check category-level keywords
if MOTORSPORT_CATEGORY in lower_cat and any(kw in lower_cat for kw in F1_KEYWORDS):
return True
return False
class PPVExtractor(BaseExtractor):
"""Extracts embed URLs from PPV.to's public JSON API.
Uses the endpoint:
- GET https://api.ppv.to/api/streams -> all streams grouped by category
- Fallback: https://api.ppv.st/api/streams
Each stream object contains an `iframe` field with the embed URL,
or a `uri_name` from which the embed URL can be constructed.
"""
@property
def site_key(self) -> str:
return "ppv"
@property
def site_name(self) -> str:
return "PPV.to"
async def _fetch_streams(self, client: httpx.AsyncClient) -> dict | None:
"""Try primary and fallback APIs, return parsed JSON or None."""
for api_url in (PRIMARY_API, FALLBACK_API):
try:
resp = await client.get(api_url)
if resp.status_code == 200:
data = resp.json()
logger.info("[ppv] Fetched streams from %s", api_url)
return data
logger.warning(
"[ppv] %s returned HTTP %d", api_url, resp.status_code
)
except Exception:
logger.debug(
"[ppv] Failed to reach %s", api_url, exc_info=True
)
return None
async def extract(self) -> list[ExtractedStream]:
"""Fetch F1 streams and return embed URLs for iframe playback."""
streams: list[ExtractedStream] = []
try:
async with httpx.AsyncClient(
timeout=15.0,
follow_redirects=True,
headers={"User-Agent": USER_AGENT, "Accept": "application/json"},
) as client:
data = await self._fetch_streams(client)
if data is None:
logger.warning("[ppv] Could not fetch streams from any API")
return []
# The API returns:
# { "streams": [ { "category": "Name", "id": N, "streams": [...] }, ... ] }
# Flatten into (category_name, stream_obj) tuples.
all_streams = self._normalize_streams(data)
logger.info(
"[ppv] Found %d total stream(s) across all categories",
len(all_streams),
)
for category_name, stream_obj in all_streams:
name = stream_obj.get("name", "") or stream_obj.get("title", "")
if not _is_f1_stream(name, category_name):
continue
# Build the embed URL
embed_url = self._get_embed_url(stream_obj)
if not embed_url:
logger.debug("[ppv] No embed URL for stream: %s", name)
continue
# Extract quality from tag if present
tag = stream_obj.get("tag", "")
quality = tag if tag else ""
# Build descriptive title
title = name
viewers = stream_obj.get("viewers")
if viewers and int(viewers) > 0:
title += f" ({viewers} viewers)"
# Check for substreams (multiple quality/language options)
substreams = stream_obj.get("substreams")
if isinstance(substreams, list) and substreams:
for i, sub in enumerate(substreams):
sub_embed = sub.get("iframe", "") or sub.get("embed_url", "")
if not sub_embed:
# Fall back to the parent embed URL
sub_embed = embed_url
sub_name = sub.get("name", "") or sub.get("label", "")
sub_quality = sub.get("tag", "") or sub.get("quality", "") or quality
sub_title = f"{name}"
if sub_name:
sub_title += f" - {sub_name}"
elif i > 0:
sub_title += f" #{i + 1}"
streams.append(
ExtractedStream(
url=sub_embed,
site_key=self.site_key,
site_name=self.site_name,
quality=sub_quality,
title=sub_title,
stream_type="embed",
embed_url=sub_embed,
)
)
else:
# Single stream, no substreams
streams.append(
ExtractedStream(
url=embed_url,
site_key=self.site_key,
site_name=self.site_name,
quality=quality,
title=title,
stream_type="embed",
embed_url=embed_url,
)
)
except Exception:
logger.exception("[ppv] Failed to extract streams")
logger.info("[ppv] Extracted %d F1 stream(s)", len(streams))
return streams
@staticmethod
def _normalize_streams(data: dict | list) -> list[tuple[str, dict]]:
"""Normalize the API response into a flat list of (category_name, stream_dict) tuples.
The PPV API returns data in this shape:
{
"streams": [
{
"category": "Motorsports",
"id": 35,
"streams": [ { stream objects... } ]
},
...
]
}
Each category group has a "category" string and a nested "streams" list.
"""
result: list[tuple[str, dict]] = []
# Handle the top-level wrapper
if isinstance(data, dict):
categories = data.get("streams", [])
elif isinstance(data, list):
categories = data
else:
return result
for category_group in categories:
if not isinstance(category_group, dict):
continue
category_name = category_group.get("category", "")
# The nested streams within this category
inner_streams = category_group.get("streams", [])
if isinstance(inner_streams, list):
for stream_obj in inner_streams:
if isinstance(stream_obj, dict):
# Attach category_name to each stream for filtering
result.append((category_name, stream_obj))
elif isinstance(category_group, dict) and "name" in category_group:
# Fallback: the item itself is a stream (flat list format)
result.append((category_name, category_group))
return result
@staticmethod
def _get_embed_url(stream: dict) -> str:
"""Extract or construct the embed URL for a stream."""
# Prefer the iframe field directly
iframe = stream.get("iframe", "")
if iframe:
return iframe
# Construct from uri_name
uri_name = stream.get("uri_name", "") or stream.get("uri", "")
if uri_name:
# Strip leading slash if present
uri_name = uri_name.lstrip("/")
return f"{EMBED_BASE}/{uri_name}"
# Last resort: use the stream id
stream_id = stream.get("id")
if stream_id:
return f"{EMBED_BASE}/{stream_id}"
return ""

View file

@ -6,6 +6,7 @@ from datetime import datetime, timezone
from backend.extractors.models import ExtractedStream
from backend.extractors.registry import ExtractorRegistry
from backend.health import StreamHealthChecker
from backend.playback_verifier import PlaybackVerifier
logger = logging.getLogger(__name__)
@ -29,6 +30,11 @@ class ExtractionService:
self._last_run: str | None = None
self._last_run_stream_count: int = 0
self._health_checker = StreamHealthChecker()
self._playback_verifier = PlaybackVerifier()
async def shutdown(self) -> None:
"""Release the headless browser instance owned by the verifier."""
await self._playback_verifier.shutdown()
async def run_extraction(self) -> None:
"""Run all extractors, health-check results, and cache them.
@ -43,31 +49,93 @@ class ExtractionService:
streams = await self._registry.extract_all()
# Run health checks on all extracted streams
# Dedupe by canonical URL — pitsport surfaces every WRC stage as a
# separate event but they all point at the same RallyTV master.m3u8
# (and similar for MotoGP weekend sessions). Keep the first
# occurrence so the user sees one entry per actual stream.
deduped: list[ExtractedStream] = []
seen_urls: set[str] = set()
for stream in streams:
key = (stream.embed_url or "").strip() or (stream.url or "").strip()
if not key or key in seen_urls:
continue
seen_urls.add(key)
deduped.append(stream)
if len(deduped) < len(streams):
logger.info(
"Deduped streams: %d -> %d (collapsed %d duplicate URL(s))",
len(streams), len(deduped), len(streams) - len(deduped),
)
streams = deduped
# Run health checks + headless-browser playback verification.
# Both stream types are now verified end-to-end so the user only
# ever sees streams that actually play in a browser.
if streams:
# Separate m3u8 streams (need health check) from embed streams (skip)
m3u8_streams = [s for s in streams if s.stream_type != "embed"]
embed_streams = [s for s in streams if s.stream_type == "embed"]
# Mark embed streams as live (no health check possible for iframes)
for stream in embed_streams:
stream.is_live = True
stream.response_time_ms = 0
stream.checked_at = start.isoformat()
# Health-check only m3u8 streams
# m3u8 streams: cheap structural health check (validates manifest,
# checks first variant playlist), then a headless-browser test
# to confirm hls.js can decode and render frames.
if m3u8_streams:
stream_dicts = [s.to_dict() for s in m3u8_streams]
health_map = await self._health_checker.check_all(stream_dicts)
for stream in m3u8_streams:
health = health_map.get(stream.url)
if health:
stream.is_live = health.is_live
stream.response_time_ms = health.response_time_ms
stream.checked_at = health.checked_at
if health.bitrate > 0:
stream.bitrate = health.bitrate
# tentatively mark live; final word comes from the verifier
stream.is_live = health.is_live
# Browser verification: applies to both m3u8 (only those that
# passed structural health) and embed (always — they have no
# other way to verify).
verify_items: list[tuple[str, str]] = []
for stream in m3u8_streams:
if stream.is_live:
verify_items.append((stream.url, "m3u8"))
for stream in embed_streams:
verify_items.append((stream.embed_url or stream.url, "embed"))
verdicts = await self._playback_verifier.verify_many(verify_items)
now_iso = datetime.now(timezone.utc).isoformat()
for stream in m3u8_streams:
if not stream.is_live:
continue # already failed health check
verdict = verdicts.get(stream.url)
if verdict is None:
continue # verifier disabled or unavailable
stream.is_live = verdict.is_playable
stream.checked_at = now_iso
# Curated streams skip the verifier — they are hand-picked
# 24/7 channels whose embed pages aggressively detect headless
# automation. We can't reliably confirm playback server-side,
# but we trust the curator. The user's real browser does NOT
# trigger the same anti-bot heuristics (real plugins, real
# mouse movements, etc.).
CURATED_BYPASS = {"curated"}
for stream in embed_streams:
stream.checked_at = now_iso
if stream.site_key in CURATED_BYPASS:
stream.is_live = True
stream.response_time_ms = 0
continue
key = stream.embed_url or stream.url
verdict = verdicts.get(key)
if verdict is None:
# Verifier unavailable — fall back to "trust extractor".
# This keeps the service usable even without playwright.
stream.is_live = True
stream.response_time_ms = 0
else:
stream.is_live = verdict.is_playable
stream.response_time_ms = verdict.elapsed_ms
# Group streams by site_key and update cache
new_cache: dict[str, list[ExtractedStream]] = {}

View file

@ -9,7 +9,9 @@ from backend.extractors.models import ExtractedStream
logger = logging.getLogger(__name__)
BASE_URL = "https://streamed.su"
# Site renamed from streamed.su → streamed.pk in 2026; the .su domain
# stopped resolving the API host (only the marketing page is left).
BASE_URL = "https://streamed.pk"
USER_AGENT = (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "

View file

@ -0,0 +1,161 @@
"""Stremio-addon-driven extractor.
Stremio addons expose a public HTTP API: each addon has a manifest at
`<base>/manifest.json` and per-resource endpoints like
`<base>/stream/<type>/<id>.json` returning `{streams:[{url,name,...}]}`.
This extractor calls a curated set of live-TV addons that surface F1
and Sky-Sports-class motorsport channels. We treat each returned URL as
an ExtractedStream and let the playback verifier confirm playability.
We don't need a Stremio client — we just call the documented HTTP API.
Findings from initial research (2026-05-07):
- **TvVoo** (`tvvoo.hayd.uk`) wraps the Vavoo IPTV network, lists
Sky Sports F1 (UK + IT + DE), DAZN F1, Movistar F1, Canal+ F1,
Viaplay F1. The returned m3u8 URLs are IP-bound at the Vavoo CDN
(`*.ngolpdkyoctjcddxshli469r.org/sunshine/...`); they're tokenised
to whichever IP fetched the manifest. Currently their SSL certs have
expired which fails most clients the addon framework is right but
delivery is degraded today.
- **StremVerse** (`stremverse.onrender.com`) returns 11+ streams per
catalog id (`stremevent_591`=F1, `stremevent_866`=MotoGP). Mix of
DRM-walled DASH, JW-Player-broken-chain JWT, and apar151 HuggingFace
proxy URLs. Master playlists parse; variant URLs sometimes return 404
if they're meant to be resolved by the addon's player rather than
directly.
Adding a new addon = one entry in `_ADDONS`. Each addon's resolver only
needs the manifest + stream endpoints; the addon does the heavy lifting.
"""
import asyncio
import logging
from dataclasses import dataclass
from typing import Iterable
import httpx
from backend.extractors.base import BaseExtractor
from backend.extractors.models import ExtractedStream
logger = logging.getLogger(__name__)
USER_AGENT = (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) "
"Version/17.4 Safari/605.1.15"
)
@dataclass(frozen=True)
class _Addon:
name: str
base: str # e.g. "https://tvvoo.hayd.uk"
stream_ids: tuple[tuple[str, str, str], ...]
"""(stream_type, stream_id, label) per F1/motorsport entry."""
# Curated addon list — see module docstring. These IDs are documented in
# the addons' manifests / channel lists. Update when channel names/IDs
# rotate.
_ADDONS: tuple[_Addon, ...] = (
_Addon(
name="TvVoo",
base="https://tvvoo.hayd.uk",
stream_ids=(
("tv", "vavoo_SKY%20SPORTS%20F1|group:uk", "Sky Sports F1 UK (Vavoo)"),
("tv", "vavoo_SKY%20SPORTS%20F1%20HD|group:uk", "Sky Sports F1 HD UK (Vavoo)"),
("tv", "vavoo_SKY%20SPORT%20F1|group:it", "Sky Sport F1 IT (Vavoo)"),
("tv", "vavoo_SKY%20SPORT%20F1%20HD|group:de", "Sky Sport F1 DE (Vavoo)"),
("tv", "vavoo_DAZN%20F1|group:es", "DAZN F1 ES (Vavoo)"),
),
),
_Addon(
name="StremVerse",
base="https://stremverse.onrender.com",
stream_ids=(
("tv", "stremevent_591", "Formula 1 (StremVerse)"),
("tv", "stremevent_866", "MotoGP (StremVerse)"),
),
),
)
class StremioAddonExtractor(BaseExtractor):
"""Pull F1 + Sky-class motorsport URLs from public Stremio addons."""
@property
def site_key(self) -> str:
return "stremio"
@property
def site_name(self) -> str:
return "Stremio Addon"
async def extract(self) -> list[ExtractedStream]:
async with httpx.AsyncClient(
timeout=15.0,
follow_redirects=True,
headers={"User-Agent": USER_AGENT},
# Some addons (TvVoo→Vavoo) hand back URLs whose origin certs
# are expired; honest-default verify=True is preserved here so
# the verifier sees the same TLS errors a browser would.
) as client:
tasks = []
for addon in _ADDONS:
for stype, sid, label in addon.stream_ids:
tasks.append(self._resolve(client, addon, stype, sid, label))
results = await asyncio.gather(*tasks, return_exceptions=True)
streams: list[ExtractedStream] = []
for r in results:
if isinstance(r, Exception):
logger.debug("[stremio] resolve failed: %s", r)
continue
streams.extend(r)
logger.info("[stremio] surfaced %d candidate stream URL(s) across %d addon(s)",
len(streams), len(_ADDONS))
return streams
async def _resolve(
self, client: httpx.AsyncClient, addon: _Addon,
stype: str, sid: str, label: str,
) -> list[ExtractedStream]:
url = f"{addon.base}/stream/{stype}/{sid}.json"
try:
resp = await client.get(url)
except Exception as e:
logger.debug("[stremio] %s fetch failed: %s", url, e)
return []
if resp.status_code != 200:
logger.debug("[stremio] %s -> HTTP %d", url, resp.status_code)
return []
try:
data = resp.json()
except Exception:
return []
out: list[ExtractedStream] = []
for idx, s in enumerate(data.get("streams") or []):
stream_url = (s.get("url") or "").strip()
if not stream_url:
continue
# Skip DRM-tagged entries — they need Widevine which neither
# our verifier nor a clean hls.js path can play.
if "DRM" in (s.get("name") or "").upper():
continue
title = label
if idx > 0:
title = f"{label} #{idx + 1}"
out.append(
ExtractedStream(
url=stream_url,
site_key=self.site_key,
site_name=f"{addon.name}",
quality="",
title=title,
stream_type="m3u8",
)
)
return out

View file

@ -0,0 +1,249 @@
"""Subreddit extractor — pulls community-curated live-stream URLs from
the *MotorsportsReplays* subreddit (and a few siblings).
The community follows a stable pattern: a single mod-curated post titled
`[Watch / Download] <Series> <Year> - <Round> | <Event>` goes up on or
near each race weekend with a `**Watch Online:**` link in the selftext,
pointing at an admin-run WordPress site (motomundo.net for MotoGP, the
F1 equivalent has rotated over the years). That WordPress page hosts
iframe embeds whose m3u8 is JS-computed at load time ideal target for
the chrome-service pipeline downstream.
This extractor:
- Hits Reddit with a real-browser User-Agent (httpx default UA + cluster
IP combo gets HTTP 403'd on r/motogp; a Safari UA does not).
- Searches for the `[Watch` thread pattern AND scans `/new.json` for
any flair set to LIVE.
- Pulls selftext URLs and returns each candidate as an `embed`-type
ExtractedStream. The verifier already drives chrome-service for embed
streams, so the m3u8 capture happens there.
"""
import asyncio
import logging
import re
import urllib.parse
from typing import NamedTuple
import httpx
from backend.extractors.base import BaseExtractor
from backend.extractors.models import ExtractedStream
logger = logging.getLogger(__name__)
USER_AGENT = (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) "
"Version/17.4 Safari/605.1.15"
)
# Subreddits to scan.
# - r/motorsportsstreams2 is the active 12.5k-sub successor to the banned
# r/motorsportstreams; race-weekend "[F1 STREAM]" posts include
# `boxboxbox.pro/stream-1` URLs and similar fresh aggregator links.
# - r/MotorsportsReplays runs the [Watch / Download] mod-post pattern
# linking to motomundo.net (MotoGP) and sister sites.
# - The rest are low-yield but cost nothing.
SUBREDDITS: tuple[str, ...] = (
"motorsportsstreams2",
"MotorsportsReplays",
"f1streams",
"motorsports",
"formula1",
"motogp",
)
# Search queries fired against r/motorsportsstreams2 + r/MotorsportsReplays.
# The first set captures the [Watch / Download] mod posts; the second set
# catches race-weekend live discussion threads.
SEARCH_QUERIES: tuple[str, ...] = (
"Watch Download F1 2026",
"Watch Download MotoGP 2026",
"Watch Online F1 2026",
"F1 STREAM live",
"Sky Sports F1 live",
"Sky F1 stream",
)
# Hosts we accept as "interesting" stream-page URLs. These are the
# admin-curated WordPress / aggregator sites the community links to.
# Anchored to what r/motorsportsstreams2 currently posts (May 2026 sweep).
_INTERESTING_HOSTS = (
# WordPress wrappers / community-run sites
"motomundo.net", # MotoGP — admin-curated WP
"motomundo.top", # MotoMundo embed host
"motomundo.upns.xyz", # MotoMundo embed host (newer)
"freemotorsports.com", # WAC successor curated link list
"boxboxbox.pro", # F1 race-weekend aggregator (community fav)
"boxboxbox.live", # boxboxbox sister
"boxboxbox.lol",
# Aggregators we already have direct extractors for, but Reddit may
# surface event-specific deeplinks (e.g. /watch/<UUID>) we'd miss
# otherwise.
"pitsport.xyz",
"pitsport.live",
"rerace.io",
"dd12streams.com",
"ppv.to",
"streamed.pk",
"acestrlms.pages.dev",
"aceztrims.pages.dev",
# Sport-specific direct CDNs that occasionally appear in posts
"racelive.jp", # Super Formula
"cdn.sfgo.jp", # Super Formula CDN
# Speculative F1 sister sites — pattern likely if motomundo for MotoGP
"f1mundo.net",
"f1.live",
"f1live",
"skystreams",
"raceon",
"watchf1",
)
# URLs we actively never try to scrape (auth-walled, social media,
# direct downloads with no live stream).
_REJECT_HOSTS = (
"discord.gg", "discord.com",
"twitter.com", "x.com",
"youtube.com", "youtu.be",
"instagram.com", "tiktok.com",
"f1tv.formula1.com",
"viktorbarzin.me",
"gofile.io",
"mega.nz", "drive.google.com",
"1fichier.com", "rapidgator", "uploaded.net",
"magnet:",
)
_URL_RE = re.compile(r"https?://[^\s\)\]\>\"']+")
class _Candidate(NamedTuple):
title: str
url: str
subreddit: str
flair: str
def _is_interesting(url: str) -> bool:
low = url.lower()
if any(host in low for host in _REJECT_HOSTS):
return False
return any(host in low for host in _INTERESTING_HOSTS)
def _has_live_marker(post: dict) -> bool:
title = (post.get("title") or "").lower()
flair = (post.get("link_flair_text") or "").lower()
if "[watch" in title or "watch online" in title or "live" in flair:
return True
return False
class SubredditExtractor(BaseExtractor):
"""Scan motorsport subreddits for community-curated live-stream URLs."""
@property
def site_key(self) -> str:
return "subreddit"
@property
def site_name(self) -> str:
return "Subreddit"
async def extract(self) -> list[ExtractedStream]:
# NB: do NOT send `Accept: application/json` — Reddit's anti-bot
# fingerprint flags that header from datacenter IPs and returns
# HTTP 403 with HTML. Default Accept (`*/*`) gets through fine
# and `.json` URLs always return JSON regardless.
async with httpx.AsyncClient(
timeout=15.0,
follow_redirects=True,
headers={"User-Agent": USER_AGENT},
) as client:
tasks = [self._fetch_new(client, sub) for sub in SUBREDDITS]
tasks.extend(self._search(client, q) for q in SEARCH_QUERIES)
results = await asyncio.gather(*tasks, return_exceptions=True)
candidates: list[_Candidate] = []
for r in results:
if isinstance(r, Exception):
logger.debug("[subreddit] fetch failed: %s", r)
continue
candidates.extend(r)
# Dedupe by URL, keep first occurrence.
seen: set[str] = set()
picks: list[_Candidate] = []
for c in candidates:
if c.url in seen:
continue
seen.add(c.url)
picks.append(c)
logger.info(
"[subreddit] scanned %d source(s) — %d unique candidate URL(s)",
len(SUBREDDITS) + len(SEARCH_QUERIES), len(picks),
)
return [
ExtractedStream(
url=c.url,
site_key=self.site_key,
site_name=f"r/{c.subreddit}",
quality="",
title=c.title[:100],
stream_type="embed",
embed_url=c.url,
)
for c in picks
]
async def _fetch_new(self, client: httpx.AsyncClient, sub: str) -> list[_Candidate]:
return await self._collect(
client,
f"https://www.reddit.com/r/{sub}/new.json?limit=25",
sub,
)
async def _search(self, client: httpx.AsyncClient, query: str) -> list[_Candidate]:
q = urllib.parse.quote_plus(query)
return await self._collect(
client,
f"https://www.reddit.com/r/MotorsportsReplays/search.json?q={q}&restrict_sr=on&sort=new&limit=10",
"MotorsportsReplays",
)
async def _collect(
self, client: httpx.AsyncClient, url: str, sub: str
) -> list[_Candidate]:
try:
resp = await client.get(url)
except Exception as e:
logger.debug("[subreddit] fetch %s failed: %s", url, e)
return []
if resp.status_code != 200:
logger.debug("[subreddit] %s -> HTTP %d", url, resp.status_code)
return []
try:
data = resp.json()
except Exception:
return []
out: list[_Candidate] = []
for child in (data.get("data", {}) or {}).get("children", []):
d = child.get("data", {}) or {}
if not _has_live_marker(d):
continue
text = (d.get("selftext") or "")
title = d.get("title") or ""
flair = d.get("link_flair_text") or ""
# First, the linked URL itself (if it's a recognised live site).
top = d.get("url") or ""
if top and _is_interesting(top):
out.append(_Candidate(title, top, sub, flair))
# Then any URL embedded in the selftext that points at a
# community-curated live page.
for u in _URL_RE.findall(text):
if _is_interesting(u):
out.append(_Candidate(title, u, sub, flair))
return out

View file

@ -0,0 +1,190 @@
"""TimStreams extractor - fetches F1 streams from the TimStreams JSON API.
Returns embed URLs from hmembeds.one for iframe playback.
The public API at stra.viaplus.site/main requires no authentication
and returns all events/channels across Events, Replays, and 24/7 categories.
"""
import logging
import httpx
from backend.extractors.base import BaseExtractor
from backend.extractors.models import ExtractedStream
logger = logging.getLogger(__name__)
API_URL = "https://stra.viaplus.site/main"
USER_AGENT = (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36"
)
# Direct F1 keyword matches (case-insensitive)
F1_KEYWORDS = {"formula 1", "formula one", "f1", "sky sports f1", "dazn f1"}
# "Grand prix" is F1-related only if non-F1 motorsport keywords are absent
GP_KEYWORD = "grand prix"
# Exclude these motorsport series when matching on "grand prix"
NON_F1_KEYWORDS = {
"motogp", "moto gp", "moto2", "moto3", "motoe",
"indycar", "indy car", "nascar",
"rally", "wrc", "wec", "lemans", "le mans",
"superbike", "dtm", "supercars",
}
# 24/7 channels that should always be included (embed hashes on hmembeds.one)
ALWAYS_INCLUDE_HASHES = {
"888520f36cd94c5da4c71fddc1a5fc9b", # Sky Sports F1
"fc3a54634d0867b0c02ee3223292e7c6", # DAZN F1
}
def _is_f1_event(name: str) -> bool:
"""Check if an event/channel is Formula 1 related by name.
Returns True when the name contains a direct F1 keyword, or contains
"grand prix" without non-F1 series keywords.
Note: The TimStreams API genre field (genre=2) covers ALL sports channels,
not just motorsport, so we rely solely on name-based matching.
"""
lower = name.lower()
# Direct F1 keyword match
if any(kw in lower for kw in F1_KEYWORDS):
return True
# Grand prix without competing series
if GP_KEYWORD in lower and not any(kw in lower for kw in NON_F1_KEYWORDS):
return True
return False
def _extract_embed_hash(url: str) -> str | None:
"""Extract the hash from an hmembeds.one embed URL.
Expected format: https://hmembeds.one/embed/{hash}
Returns the hash string, or None if the URL is not in the expected format.
"""
if not url:
return None
# Handle both with and without trailing slash
url = url.rstrip("/")
prefix = "https://hmembeds.one/embed/"
alt_prefix = "http://hmembeds.one/embed/"
if url.startswith(prefix):
return url[len(prefix):] or None
if url.startswith(alt_prefix):
return url[len(alt_prefix):] or None
return None
def _is_always_include(url: str) -> bool:
"""Check if a stream URL is one of the always-include 24/7 channels."""
embed_hash = _extract_embed_hash(url)
return embed_hash in ALWAYS_INCLUDE_HASHES if embed_hash else False
class TimStreamsExtractor(BaseExtractor):
"""Extracts embed URLs from TimStreams' public JSON API.
The API at stra.viaplus.site/main returns a JSON array of categories,
each containing events with stream URLs pointing to hmembeds.one embeds.
"""
@property
def site_key(self) -> str:
return "timstreams"
@property
def site_name(self) -> str:
return "TimStreams"
async def extract(self) -> list[ExtractedStream]:
"""Fetch F1 events/channels and return embed URLs for iframe playback."""
streams: list[ExtractedStream] = []
seen_urls: set[str] = set()
try:
async with httpx.AsyncClient(
timeout=15.0,
follow_redirects=True,
headers={"User-Agent": USER_AGENT, "Accept": "application/json"},
) as client:
resp = await client.get(API_URL)
if resp.status_code != 200:
logger.warning(
"[timstreams] API returned HTTP %d", resp.status_code
)
return []
data = resp.json()
if not isinstance(data, list):
logger.warning("[timstreams] Unexpected API response type: %s", type(data).__name__)
return []
logger.info("[timstreams] API returned %d categorie(s)", len(data))
for category in data:
category_name = category.get("category", "Unknown")
events = category.get("events", [])
if not isinstance(events, list):
continue
for event in events:
event_name = event.get("name", "Unknown")
event_streams = event.get("streams", [])
if not isinstance(event_streams, list) or not event_streams:
continue
# Check if any stream URL matches an always-include channel
always_include = any(
_is_always_include(s.get("url", ""))
for s in event_streams
)
# Filter: must be F1-related or an always-include channel
if not always_include and not _is_f1_event(event_name):
continue
for stream_info in event_streams:
stream_name = stream_info.get("name", "")
stream_url = stream_info.get("url", "")
if not stream_url:
continue
# Deduplicate by URL
if stream_url in seen_urls:
continue
seen_urls.add(stream_url)
# Build a descriptive title
title = event_name
if stream_name and stream_name.lower() != event_name.lower():
title = f"{event_name} - {stream_name}"
if category_name:
title = f"[{category_name}] {title}"
streams.append(
ExtractedStream(
url=stream_url,
site_key=self.site_key,
site_name=self.site_name,
quality="",
title=title,
stream_type="embed",
embed_url=stream_url,
)
)
except httpx.TimeoutException:
logger.warning("[timstreams] API request timed out")
except Exception:
logger.exception("[timstreams] Failed to fetch from API")
logger.info("[timstreams] Extracted %d stream(s)", len(streams))
return streams

View file

@ -3,6 +3,7 @@
import logging
import os
from contextlib import asynccontextmanager
from datetime import datetime, timedelta, timezone
from apscheduler.schedulers.asyncio import AsyncIOScheduler
from apscheduler.triggers.cron import CronTrigger
@ -13,6 +14,7 @@ from fastapi.staticfiles import StaticFiles
from pydantic import BaseModel
from starlette.responses import Response, StreamingResponse
from backend.embed_proxy import fetch_embed, relay_asset
from backend.extractors import create_extraction_service
from backend.proxy import proxy_playlist, relay_stream
from backend.schedule import ScheduleService
@ -117,10 +119,6 @@ async def lifespan(app: FastAPI):
# Startup: load schedule and start background scheduler
await schedule_service.initialize()
# Run initial extraction
logger.info("Running initial stream extraction...")
await extraction_service.run_extraction()
# Schedule daily schedule refresh
scheduler.add_job(
_scheduled_refresh,
@ -130,13 +128,18 @@ async def lifespan(app: FastAPI):
replace_existing=True,
)
# Schedule periodic stream extraction (default: every 30 minutes)
# Schedule periodic stream extraction (default: every 30 minutes).
# next_run_time fires the first run 8s after startup. We don't run
# extraction inline here because it calls the playback verifier,
# which hits http://127.0.0.1:8000/embed for embed streams — uvicorn
# isn't listening yet inside the lifespan startup phase.
scheduler.add_job(
_scheduled_extraction,
trigger=IntervalTrigger(minutes=30),
id="stream_extraction",
name="Extract streams from all registered sites",
replace_existing=True,
next_run_time=datetime.now(timezone.utc) + timedelta(seconds=8),
)
# Schedule token refresh every 4 minutes (safe margin for 5-min CDN tokens).
@ -159,6 +162,10 @@ async def lifespan(app: FastAPI):
# Shutdown
scheduler.shutdown(wait=False)
logger.info("APScheduler shut down")
try:
await extraction_service.shutdown()
except Exception:
logger.exception("extraction_service shutdown failed")
app = FastAPI(title="F1 Streams", lifespan=lifespan)
@ -409,6 +416,37 @@ async def relay_endpoint(
)
# --- Embed iframe-stripping proxy ---
@app.get("/embed")
async def embed_proxy(url: str = Query(..., description="Base64url-encoded embed URL")):
"""Proxy a third-party embed page so it can be iframed in our origin.
Strips X-Frame-Options and CSP frame-ancestors from the upstream
response, injects a base href + frame-buster-defeat script, and
forwards a plausible Referer/Origin to bypass upstream allowlists.
"""
body, headers, status_code = await fetch_embed(url)
return Response(content=body, headers=headers, status_code=status_code)
@app.get("/embed-asset")
async def embed_asset(
request: Request,
url: str = Query(..., description="Base64url-encoded subresource URL"),
):
"""Relay an upstream subresource (JS/CSS/image/etc.) for the embed proxy.
Used as a fallback when an upstream blocks hotlinked assets via Origin
or Referer checks. Most assets load directly via the injected <base>
tag without going through this endpoint.
"""
range_header = request.headers.get("range")
stream_gen, headers, status_code = await relay_asset(url, range_header)
return StreamingResponse(stream_gen, headers=headers, status_code=status_code)
# --- Frontend Static Files ---
# Mount the SvelteKit static build AFTER all API routes so API endpoints take priority.
# SvelteKit adapter-static with ssr=false produces {page}.html files and a fallback index.html.

View file

@ -0,0 +1,449 @@
"""Headless-browser playback verification for extracted streams.
The basic health checker (backend/health.py) only validates m3u8 syntax.
For embed/iframe streams it has nothing to check the previous code blindly
marked every embed `is_live=True`, which meant the stream list was full of
news articles and aggregator landing pages that never actually played.
This module loads each candidate stream URL in headless Chromium (via
Playwright) and looks for *codec-independent* signals that the upstream
serves a playable stream:
- For m3u8: hls.js receives MANIFEST_PARSED + at least one FRAG_LOADED
event. We don't wait for `<video>` to gain dimensions, because Playwright's
chromium build doesn't include the H.264/AAC codecs. The user's real
browser does, so confirming "manifest + segment fetch succeed" is the
right server-side signal.
- For embed: a `<video>` element appears at top level OR inside the iframe
(the embed proxy strips X-Frame-Options + frame-buster JS so we can
introspect the iframe content), OR the player has set up a MediaSource.
Designed to be called from the extraction service's run_extraction()
hook, with bounded concurrency. Each verification typically takes
4-12 seconds.
"""
import asyncio
import base64
import logging
import os
import time
from dataclasses import dataclass
logger = logging.getLogger(__name__)
# Toggle off in development by setting PLAYBACK_VERIFY_ENABLED=false.
VERIFY_ENABLED = os.getenv("PLAYBACK_VERIFY_ENABLED", "true").lower() in ("true", "1", "yes")
# Maximum number of concurrent browser pages.
MAX_CONCURRENCY = int(os.getenv("PLAYBACK_VERIFY_CONCURRENCY", "2"))
# Per-stream verification budget (seconds). Beyond this we declare unplayable.
PER_STREAM_TIMEOUT = float(os.getenv("PLAYBACK_VERIFY_TIMEOUT", "20"))
# Where the embed proxy lives, used to wrap embed URLs so they bypass
# X-Frame-Options/CSP/JS frame-busters during verification. Defaults to
# loopback because verification runs inside the same FastAPI process.
PROXY_BASE = os.getenv("PLAYBACK_VERIFY_PROXY_BASE", "http://127.0.0.1:8000")
USER_AGENT = (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36"
)
@dataclass
class PlaybackVerdict:
is_playable: bool
signal: str = "" # which check triggered the positive verdict
elapsed_ms: int = 0
error: str = ""
def _b64url(s: str) -> str:
"""URL-safe base64 with padding stripped — matches m3u8_rewriter.encode_url."""
return base64.urlsafe_b64encode(s.encode()).decode().rstrip("=")
def _hls_test_html(m3u8_url: str) -> str:
"""A self-contained HTML page that loads an m3u8 via hls.js into a <video>.
The page exposes window._verifier with manifest_parsed / frag_loaded
booleans the verifier polls. It also marks media-error or fatal-error
so we can distinguish 'upstream is unreachable' from 'codec missing'.
"""
return f"""<!doctype html>
<html><head><meta charset="utf-8"><title>verify</title>
<script src="https://cdn.jsdelivr.net/npm/hls.js@1.5/dist/hls.min.js"></script>
</head><body>
<video id="v" muted playsinline width="640" height="360"></video>
<script>
window._verifier = {{
manifest_parsed: false,
frag_loaded: false,
media_loaded: false, // true when MSE has appended any buffer
fatal_network_error: false, // upstream truly unreachable
manifest_incompatible: false, // codec missing separate from network reachability
hls_error_details: ""
}};
const v = document.getElementById('v');
const url = {m3u8_url!r};
function start() {{
if (window.Hls && Hls.isSupported()) {{
const hls = new Hls({{enableWorker: true}});
hls.on(Hls.Events.MANIFEST_PARSED, () => {{ window._verifier.manifest_parsed = true; }});
hls.on(Hls.Events.FRAG_LOADED, () => {{ window._verifier.frag_loaded = true; }});
hls.on(Hls.Events.BUFFER_APPENDED, () => {{ window._verifier.media_loaded = true; }});
hls.on(Hls.Events.ERROR, (_, d) => {{
window._verifier.hls_error_details = d.details || "";
if (d.fatal && d.type === Hls.ErrorTypes.NETWORK_ERROR) {{
window._verifier.fatal_network_error = true;
}}
if (d.details === Hls.ErrorDetails.MANIFEST_INCOMPATIBLE_CODECS_ERROR) {{
window._verifier.manifest_incompatible = true;
}}
}});
hls.loadSource(url);
hls.attachMedia(v);
}} else if (v.canPlayType('application/vnd.apple.mpegurl')) {{
v.src = url;
v.addEventListener('loadedmetadata', () => {{ window._verifier.manifest_parsed = true; window._verifier.frag_loaded = true; }});
v.addEventListener('error', () => {{ window._verifier.fatal_network_error = true; }});
}} else {{
window._verifier.hls_error_details = "no hls support";
}}
}}
window.addEventListener('load', start);
</script></body></html>"""
def _embed_test_html(_proxied_embed_url: str) -> str:
"""No longer used — verifier navigates the page directly to the proxy URL.
The earlier iframe-wrapper approach hit same-origin policy when inspecting
the iframe's contentDocument (the wrapper page was a data: URL, the iframe
was http://127.0.0.1:8000), so we couldn't read the embed's DOM.
"""
return ""
_M3U8_POLL_JS = """
() => {
const v = window._verifier || {};
const vid = document.querySelector('video');
return {
manifest_parsed: !!v.manifest_parsed,
frag_loaded: !!v.frag_loaded,
media_loaded: !!v.media_loaded,
fatal_network_error: !!v.fatal_network_error,
manifest_incompatible: !!v.manifest_incompatible,
hls_error_details: v.hls_error_details || "",
video_width: vid ? vid.videoWidth : 0,
video_ready: vid ? vid.readyState : 0,
};
}
"""
_EMBED_POLL_JS = """
() => {
try {
const vids = document.querySelectorAll('video');
if (vids.length > 0) {
const v = vids[0];
return {
has_video: true,
src: v.currentSrc || v.src || "",
width: v.videoWidth,
ready: v.readyState,
duration: isFinite(v.duration) ? v.duration : 0,
media_keys: !!v.mediaKeys,
sources: v.querySelectorAll('source').length,
};
}
return {has_video: false};
} catch (e) {
return {has_video: false, err: String(e)};
}
}
"""
async def _verify_m3u8(page, m3u8_url: str, deadline: float) -> PlaybackVerdict:
"""Confirm an m3u8 URL is fetchable via hls.js end-to-end.
Positive signal hierarchy:
1. media_loaded (MSE buffer appended) strongest, codec-supported.
2. frag_loaded (hls.js fetched at least one segment) upstream is OK
even if the local browser lacks codecs.
3. manifest_parsed without media_loaded but with manifest_incompatible
indicates upstream playlist is valid; player can't decode here
but a real user's browser will.
Negative signal:
- fatal_network_error: upstream is unreachable.
- timeout with no manifest_parsed: upstream did not respond.
"""
start = time.monotonic()
html = _hls_test_html(m3u8_url)
data_url = "data:text/html;base64," + base64.b64encode(html.encode()).decode()
try:
await page.goto(data_url, wait_until="domcontentloaded", timeout=10_000)
except Exception as e:
return PlaybackVerdict(
is_playable=False, error=f"goto failed: {e}",
elapsed_ms=int((time.monotonic() - start) * 1000),
)
last_state: dict = {}
while time.monotonic() < deadline:
try:
state = await page.evaluate(_M3U8_POLL_JS)
except Exception as e:
return PlaybackVerdict(
is_playable=False, error=f"evaluate failed: {e}",
elapsed_ms=int((time.monotonic() - start) * 1000),
)
last_state = state
if state.get("media_loaded"):
return PlaybackVerdict(
is_playable=True, signal="media_loaded",
elapsed_ms=int((time.monotonic() - start) * 1000),
)
if state.get("frag_loaded"):
return PlaybackVerdict(
is_playable=True, signal="frag_loaded",
elapsed_ms=int((time.monotonic() - start) * 1000),
)
# MANIFEST_INCOMPATIBLE_CODECS_ERROR fires after hls.js successfully
# fetched and parsed the manifest — the failure is purely local
# (chromium lacks H.264). The user's real browser has codecs, so
# this URL is playable from the user's perspective.
if state.get("manifest_incompatible"):
return PlaybackVerdict(
is_playable=True, signal="manifest_parsed_codec_missing_in_verifier",
elapsed_ms=int((time.monotonic() - start) * 1000),
)
if state.get("manifest_parsed"):
return PlaybackVerdict(
is_playable=True, signal="manifest_parsed",
elapsed_ms=int((time.monotonic() - start) * 1000),
)
if state.get("fatal_network_error"):
return PlaybackVerdict(
is_playable=False, error="upstream network error",
elapsed_ms=int((time.monotonic() - start) * 1000),
)
await asyncio.sleep(0.25)
err = "no playback signal"
if last_state.get("hls_error_details"):
err = f"hls.js error: {last_state['hls_error_details']}"
return PlaybackVerdict(
is_playable=False, error=err,
elapsed_ms=int((time.monotonic() - start) * 1000),
)
async def _verify_embed(page, proxied_url: str, deadline: float) -> PlaybackVerdict:
"""Navigate directly to the proxied embed and confirm a player rendered.
Positive signals (in priority order):
- <video> with src/sources/mediaKeys set (player wired up).
- <video> element exists with any state (script ran, player attaching).
- A player container div (jwplayer, video-js, [id*=player], etc.).
Loading the embed page directly (not via iframe wrapper) avoids the
same-origin policy that prevented earlier iframe-introspection runs
from seeing the embed DOM.
"""
start = time.monotonic()
try:
await page.goto(proxied_url, wait_until="domcontentloaded", timeout=15_000)
except Exception as e:
return PlaybackVerdict(
is_playable=False, error=f"goto failed: {e}",
elapsed_ms=int((time.monotonic() - start) * 1000),
)
# Track the best state seen across all polls. Some embeds load a player
# briefly then anti-bot JS tears the DOM down (hmembeds redirects to
# google.com if its devtool-detection trips). We accept any positive
# signal observed during the window, even if it's gone by timeout.
#
# We require an actual <video> element — a "player container div"
# is too weak (sportsurge has player-class divs but no real player).
seen_video_wired = False
seen_video_tag = False
last_err = ""
while time.monotonic() < deadline:
try:
r = await page.evaluate(_EMBED_POLL_JS)
except Exception as e:
return PlaybackVerdict(
is_playable=False, error=f"evaluate failed: {e}",
elapsed_ms=int((time.monotonic() - start) * 1000),
)
if r.get("has_video"):
seen_video_tag = True
if r.get("src") or r.get("width", 0) > 0 or r.get("media_keys") or r.get("sources", 0) > 0:
seen_video_wired = True
return PlaybackVerdict(
is_playable=True, signal="video.wired",
elapsed_ms=int((time.monotonic() - start) * 1000),
)
last_err = r.get("err", "")
await asyncio.sleep(0.5)
if seen_video_wired:
return PlaybackVerdict(is_playable=True, signal="video.wired",
elapsed_ms=int((time.monotonic() - start) * 1000))
if seen_video_tag:
return PlaybackVerdict(is_playable=True, signal="video.tag_only",
elapsed_ms=int((time.monotonic() - start) * 1000))
err = "no <video> element rendered"
if last_err:
err += f"; last_err: {last_err}"
return PlaybackVerdict(is_playable=False, error=err,
elapsed_ms=int((time.monotonic() - start) * 1000))
class PlaybackVerifier:
"""Verifies playability of m3u8 and embed URLs via headless Chromium.
Manages a single browser instance for the process lifetime (cheap per-page
contexts) and bounds concurrency with a semaphore.
"""
def __init__(self) -> None:
self._browser = None
self._playwright = None
self._sem = asyncio.Semaphore(MAX_CONCURRENCY)
self._lock = asyncio.Lock()
async def _ensure_browser(self):
if self._browser is not None:
return self._browser
async with self._lock:
if self._browser is not None:
return self._browser
try:
from playwright.async_api import async_playwright
except ImportError:
logger.error("playwright not installed — playback verification disabled")
return None
self._playwright = await async_playwright().start()
ws_base = os.getenv("CHROME_WS_URL")
ws_token = os.getenv("CHROME_WS_TOKEN")
if ws_base and ws_token:
self._browser = await self._playwright.chromium.connect(
f"{ws_base.rstrip('/')}/{ws_token}", timeout=15_000,
)
logger.info("connected to remote chrome-service (concurrency=%d)", MAX_CONCURRENCY)
else:
self._browser = await self._playwright.chromium.launch(
headless=True,
args=[
"--disable-dev-shm-usage",
"--disable-web-security",
"--no-sandbox",
"--disable-setuid-sandbox",
"--disable-features=IsolateOrigins,site-per-process",
"--autoplay-policy=no-user-gesture-required",
],
)
logger.warning("CHROME_WS_URL not set — using in-process Chromium (concurrency=%d)", MAX_CONCURRENCY)
return self._browser
async def shutdown(self) -> None:
if self._browser is not None:
try:
await self._browser.close()
except Exception:
logger.exception("error closing browser")
if self._playwright is not None:
try:
await self._playwright.stop()
except Exception:
logger.exception("error stopping playwright")
self._browser = None
self._playwright = None
async def verify(self, url: str, stream_type: str) -> PlaybackVerdict:
if not VERIFY_ENABLED:
return PlaybackVerdict(is_playable=True, error="disabled")
browser = await self._ensure_browser()
if browser is None:
return PlaybackVerdict(is_playable=False, error="playwright unavailable")
is_m3u8 = stream_type == "m3u8"
if not is_m3u8:
url = f"{PROXY_BASE}/embed?url={_b64url(url)}"
async with self._sem:
# Set the per-stream deadline AFTER acquiring the semaphore.
# Otherwise queued streams that wait behind earlier ones
# would have already-expired deadlines when they start.
deadline = time.monotonic() + PER_STREAM_TIMEOUT
try:
context = await browser.new_context(
user_agent=USER_AGENT,
viewport={"width": 1280, "height": 720},
bypass_csp=True,
)
from backend.stealth import STEALTH_JS
await context.add_init_script(STEALTH_JS)
page = await context.new_page()
except Exception as e:
return PlaybackVerdict(
is_playable=False, error=f"context create failed: {e}",
)
try:
if is_m3u8:
verdict = await _verify_m3u8(page, url, deadline)
else:
verdict = await _verify_embed(page, url, deadline)
except asyncio.TimeoutError:
verdict = PlaybackVerdict(is_playable=False, error="overall timeout")
except Exception as e:
verdict = PlaybackVerdict(
is_playable=False, error=f"verify exception: {e}",
)
finally:
try:
await page.close()
await context.close()
except Exception:
pass
logger.info(
"[verify] %s -> playable=%s signal=%s err=%s elapsed=%dms",
url[:120], verdict.is_playable, verdict.signal,
verdict.error, verdict.elapsed_ms,
)
return verdict
async def verify_many(self, items: list[tuple[str, str]]) -> dict[str, PlaybackVerdict]:
if not items:
return {}
if not VERIFY_ENABLED:
return {url: PlaybackVerdict(is_playable=True, error="disabled") for url, _ in items}
async def _run(url: str, stream_type: str):
verdict = await self.verify(url, stream_type)
return url, verdict
results = await asyncio.gather(
*[_run(url, st) for url, st in items], return_exceptions=True
)
out: dict[str, PlaybackVerdict] = {}
for r in results:
if isinstance(r, Exception):
logger.exception("verify task crashed: %s", r)
continue
url, verdict = r
out[url] = verdict
return out

View file

@ -3,3 +3,4 @@ uvicorn[standard]
httpx>=0.27.0
apscheduler>=3.10.0,<4.0
pydantic>=2.0.0
playwright==1.48.0

View file

@ -0,0 +1,43 @@
"""Vendored Playwright stealth init script.
Mirror of `stacks/chrome-service/files/stealth.js`. Kept in sync by hand
update both files together if the JS is changed.
"""
STEALTH_JS = r"""
(() => {
Object.defineProperty(Navigator.prototype, 'webdriver', { get: () => undefined });
if (!window.chrome) window.chrome = {};
window.chrome.runtime = window.chrome.runtime || {};
Object.defineProperty(navigator, 'plugins', {
get: () => [{ name: 'Chrome PDF Plugin' }, { name: 'Chrome PDF Viewer' }, { name: 'Native Client' }],
});
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
const origQuery = window.navigator.permissions && window.navigator.permissions.query;
if (origQuery) {
window.navigator.permissions.query = (parameters) =>
parameters && parameters.name === 'notifications'
? Promise.resolve({ state: Notification.permission })
: origQuery(parameters);
}
const spoofGl = (proto) => {
if (!proto) return;
const orig = proto.getParameter;
proto.getParameter = function (parameter) {
if (parameter === 37445) return 'Intel Inc.';
if (parameter === 37446) return 'Intel Iris OpenGL Engine';
return orig.apply(this, arguments);
};
};
spoofGl(window.WebGLRenderingContext && window.WebGLRenderingContext.prototype);
spoofGl(window.WebGL2RenderingContext && window.WebGL2RenderingContext.prototype);
// disable-devtool.js auto-init evasion: hide the marker attribute so the
// library's IIFE exits early. Without this, hmembeds-class players redirect
// to google.com when the Performance detector trips under Playwright.
const origQS = Document.prototype.querySelector;
Document.prototype.querySelector = function (sel) {
if (typeof sel === 'string' && sel.indexOf('disable-devtool-auto') !== -1) return null;
return origQS.apply(this, arguments);
};
})();
"""

View file

@ -44,6 +44,20 @@ export function getProxyUrl(m3u8Url) {
return `${API_BASE}/proxy?url=${encoded}`;
}
/**
* Get the embed-proxy URL for an upstream iframe embed page.
*
* The proxy strips X-Frame-Options / CSP frame-ancestors and injects a
* frame-buster-defeat script so the embed renders inside our iframe even
* when the upstream tries to block it.
* @param {string} embedUrl - The original embed page URL
* @returns {string} URL pointing at our /embed proxy
*/
export function getEmbedProxyUrl(embedUrl) {
const encoded = toBase64Url(embedUrl);
return `${API_BASE}/embed?url=${encoded}`;
}
/**
* Mark a stream as actively being watched (enables token refresh).
* @param {string} url - The stream URL

View file

@ -1,5 +1,5 @@
<script>
import { fetchStreams, fetchSchedule, getProxyUrl, activateStream, deactivateStream } from '$lib/api.js';
import { fetchStreams, fetchSchedule, getProxyUrl, getEmbedProxyUrl, activateStream, deactivateStream } from '$lib/api.js';
import { onMount, onDestroy } from 'svelte';
import { page } from '$app/state';
@ -107,12 +107,14 @@
}
if (stream.stream_type === 'embed') {
// Embed/iframe player — no hls.js needed
// Embed/iframe player — route through our /embed proxy so the
// upstream's X-Frame-Options / CSP / JS frame-busters can't
// block the iframe.
const newPlayer = {
id: Date.now(),
proxyUrl: '',
originalUrl: stream.embed_url,
embedUrl: stream.embed_url,
embedUrl: getEmbedProxyUrl(stream.embed_url),
streamType: 'embed',
siteKey: stream.site_key || '',
siteName: stream.site_name || stream.site_key || 'Unknown',
@ -173,9 +175,13 @@
if (!player || !player.videoEl) return;
if (Hls.isSupported()) {
// `lowLatencyMode` previously broke playback on regular (non-LL-HLS)
// providers like RallyTV — they don't ship the LL-HLS extensions
// hls.js needs in that mode. Default off; explicit per-stream flag
// can re-enable later.
const hlsInstance = new Hls({
enableWorker: true,
lowLatencyMode: true,
lowLatencyMode: false,
backBufferLength: 90
});

View file

@ -11,7 +11,8 @@ resource "kubernetes_namespace" "f1-stream" {
name = "f1-stream"
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
tier = local.tiers.aux
"chrome-service.viktorbarzin.me/client" = "true"
}
}
lifecycle {
@ -47,6 +48,35 @@ resource "kubernetes_manifest" "external_secret" {
depends_on = [kubernetes_namespace.f1-stream]
}
# Pull the chrome-service bearer token into this namespace as a separate
# Secret so the verifier can reach the in-cluster Playwright pool.
resource "kubernetes_manifest" "chrome_service_client_secret" {
manifest = {
apiVersion = "external-secrets.io/v1beta1"
kind = "ExternalSecret"
metadata = {
name = "chrome-service-client-secrets"
namespace = "f1-stream"
}
spec = {
refreshInterval = "15m"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = {
name = "chrome-service-client-secrets"
}
dataFrom = [{
extract = {
key = "chrome-service"
}
}]
}
}
depends_on = [kubernetes_namespace.f1-stream]
}
resource "kubernetes_persistent_volume_claim" "data_proxmox" {
wait_until_bound = false
metadata {
@ -104,11 +134,11 @@ resource "kubernetes_deployment" "f1-stream" {
name = "f1-stream"
resources {
limits = {
memory = "256Mi"
memory = "1Gi"
}
requests = {
cpu = "25m"
memory = "256Mi"
cpu = "100m"
memory = "1Gi"
}
}
port {
@ -127,6 +157,29 @@ resource "kubernetes_deployment" "f1-stream" {
name = "DISCORD_CHANNELS"
value = var.discord_f1_channel_ids
}
# Verifier connects to in-cluster headed Chromium pool see
# stacks/chrome-service/. Falls back to in-process headless if unset.
env {
name = "CHROME_WS_URL"
value = "ws://chrome-service.chrome-service.svc.cluster.local:3000"
}
env {
name = "CHROME_WS_TOKEN"
value_from {
secret_key_ref {
name = "chrome-service-client-secrets"
key = "api_bearer_token"
}
}
}
# The embed proxy (this pod's /embed?url=) must be reachable from
# the remote chrome-service pod. Default 127.0.0.1 only works for
# in-process Chromium for the remote browser we point it at our
# own ClusterIP service.
env {
name = "PLAYBACK_VERIFY_PROXY_BASE"
value = "http://f1.f1-stream.svc.cluster.local"
}
volume_mount {
name = "data"
mount_path = "/data"

View file

@ -8,7 +8,11 @@ variable "postgresql_host" { type = string }
locals {
namespace = "fire-planner"
image = "registry.viktorbarzin.me/fire-planner:${var.image_tag}"
# Phase 3 cutover 2026-05-07. NOTE: the registry-private repo for
# fire-planner has 0 tags first build via Woodpecker on the new Forgejo
# repo (viktor/fire-planner, Dockerfile + .woodpecker.yml added 2026-05-07)
# must succeed BEFORE the next pod restart, otherwise pulls will 404.
image = "forgejo.viktorbarzin.me/viktor/fire-planner:${var.image_tag}"
labels = {
app = "fire-planner"
}

123
stacks/forgejo/cleanup.tf Normal file
View file

@ -0,0 +1,123 @@
# Forgejo container-package retention CronJob.
#
# Forgejo's per-package "Cleanup Rules" UI is not exposed via Terraform
# it's per-user runtime state inside the Forgejo DB. Driving retention from
# a CronJob hitting the public API keeps the policy versioned in this repo.
#
# Auth: a write:package PAT belonging to ci-pusher (same user that pushes
# from CI). DELETE on packages requires write:package scope. PAT lives in
# Vault at secret/viktor/forgejo_cleanup_token.
data "vault_kv_secret_v2" "forgejo_viktor" {
mount = "secret"
name = "viktor"
}
locals {
# Flip to false after first 7 days of dry-run logs look correct.
forgejo_cleanup_dry_run = true
}
resource "kubernetes_config_map" "forgejo_cleanup_script" {
metadata {
name = "forgejo-cleanup-script"
namespace = kubernetes_namespace.forgejo.metadata[0].name
}
data = {
"cleanup.sh" = file("${path.module}/files/cleanup.sh")
}
}
resource "kubernetes_secret" "forgejo_cleanup_token" {
metadata {
name = "forgejo-cleanup-token"
namespace = kubernetes_namespace.forgejo.metadata[0].name
}
type = "Opaque"
data = {
# try() so the apply succeeds before the Vault key is populated during
# Phase 0 bootstrap (see docs/runbooks/forgejo-registry-setup.md). Empty
# token causes the cleanup CronJob to fail visibly that's intended.
FORGEJO_TOKEN = try(data.vault_kv_secret_v2.forgejo_viktor.data["forgejo_cleanup_token"], "")
}
}
resource "kubernetes_cron_job_v1" "forgejo_cleanup" {
metadata {
name = "forgejo-cleanup"
namespace = kubernetes_namespace.forgejo.metadata[0].name
}
spec {
concurrency_policy = "Forbid"
schedule = "0 4 * * *"
failed_jobs_history_limit = 3
successful_jobs_history_limit = 3
job_template {
metadata {}
spec {
backoff_limit = 1
ttl_seconds_after_finished = 3600
template {
metadata {}
spec {
container {
name = "cleanup"
image = "docker.io/library/alpine:3.20"
command = ["/bin/sh", "/scripts/cleanup.sh"]
env {
name = "FORGEJO_TOKEN"
value_from {
secret_key_ref {
name = kubernetes_secret.forgejo_cleanup_token.metadata[0].name
key = "FORGEJO_TOKEN"
}
}
}
env {
name = "FORGEJO_HOST"
value = "http://forgejo.forgejo.svc.cluster.local"
}
env {
name = "FORGEJO_OWNER"
value = "viktor"
}
env {
name = "KEEP_LAST_N"
value = "10"
}
env {
name = "DRY_RUN"
value = local.forgejo_cleanup_dry_run ? "true" : "false"
}
volume_mount {
name = "scripts"
mount_path = "/scripts"
}
resources {
requests = {
cpu = "10m"
memory = "32Mi"
}
limits = {
memory = "96Mi"
}
}
}
volume {
name = "scripts"
config_map {
name = kubernetes_config_map.forgejo_cleanup_script.metadata[0].name
default_mode = "0755"
}
}
restart_policy = "OnFailure"
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

View file

@ -0,0 +1,109 @@
#!/bin/sh
# Forgejo container-package retention.
#
# For each container package owned by ${FORGEJO_OWNER}, keep newest
# ${KEEP_LAST_N} versions + always keep tag "latest". Deletes the rest via
# DELETE /api/v1/packages/{owner}/container/{name}/{version}.
#
# DRY_RUN=true logs what would be deleted but issues no DELETE calls.
#
# Required env:
# FORGEJO_HOST e.g. http://forgejo.forgejo.svc.cluster.local
# FORGEJO_OWNER e.g. viktor
# FORGEJO_USER PAT owner (write:package scope)
# FORGEJO_TOKEN PAT
# KEEP_LAST_N integer (default 10)
# DRY_RUN true|false (default true)
set -eu
apk add --no-cache curl jq >/dev/null
OWNER="${FORGEJO_OWNER}"
KEEP="${KEEP_LAST_N:-10}"
DRY="${DRY_RUN:-true}"
BASE="${FORGEJO_HOST%/}/api/v1"
AUTH_HEADER="Authorization: token $FORGEJO_TOKEN"
echo "Forgejo cleanup: owner=$OWNER keep_last=$KEEP dry_run=$DRY"
echo "API base: $BASE"
# Page through ALL container packages.
TMPDIR=$(mktemp -d)
trap 'rm -rf "$TMPDIR"' EXIT
ALL="$TMPDIR/all.json"
echo "[]" > "$ALL"
PAGE=1
while :; do
RESP=$(curl -sf -H "$AUTH_HEADER" \
"$BASE/packages/$OWNER?type=container&limit=50&page=$PAGE")
COUNT=$(echo "$RESP" | jq 'length')
if [ "$COUNT" = "0" ]; then break; fi
jq -s '.[0] + .[1]' "$ALL" <(echo "$RESP") > "$TMPDIR/merged.json"
mv "$TMPDIR/merged.json" "$ALL"
PAGE=$((PAGE + 1))
# Safety: never run away.
if [ "$PAGE" -gt 100 ]; then break; fi
done
TOTAL=$(jq 'length' "$ALL")
echo "Found $TOTAL package version(s)."
if [ "$TOTAL" = "0" ]; then
echo "Nothing to do."
exit 0
fi
# Group by name and process each group.
NAMES=$(jq -r '.[].name' "$ALL" | sort -u)
DEL=0
KEPT=0
for NAME in $NAMES; do
# All versions of this name, sorted by created_at descending.
jq --arg n "$NAME" '
[.[] | select(.name == $n)]
| sort_by(.created_at) | reverse
' "$ALL" > "$TMPDIR/$NAME.json"
N_VERSIONS=$(jq 'length' "$TMPDIR/$NAME.json")
echo "[$NAME] $N_VERSIONS version(s)"
# Build the keep set: top $KEEP + anything tagged 'latest'.
jq -r --argjson keep "$KEEP" '
[.[0:$keep][].version] + [.[] | select(.version == "latest") | .version]
| unique
| .[]
' "$TMPDIR/$NAME.json" > "$TMPDIR/$NAME.keep"
# Build the delete set.
jq -r '.[].version' "$TMPDIR/$NAME.json" \
| grep -vxFf "$TMPDIR/$NAME.keep" > "$TMPDIR/$NAME.delete" || true
D_COUNT=$(wc -l < "$TMPDIR/$NAME.delete" | tr -d ' ')
K_COUNT=$(wc -l < "$TMPDIR/$NAME.keep" | tr -d ' ')
echo " keep=$K_COUNT delete=$D_COUNT"
KEPT=$((KEPT + K_COUNT))
while IFS= read -r VER; do
[ -z "$VER" ] && continue
URL="$BASE/packages/$OWNER/container/$NAME/$VER"
if [ "$DRY" = "true" ]; then
echo " DRY_RUN would DELETE $URL"
else
HTTP=$(curl -s -o /dev/null -w '%{http_code}' \
-X DELETE -H "$AUTH_HEADER" "$URL" || echo "000")
if [ "$HTTP" = "204" ] || [ "$HTTP" = "200" ]; then
echo " deleted $NAME:$VER"
else
echo " FAIL $NAME:$VER HTTP $HTTP"
fi
fi
DEL=$((DEL + 1))
done < "$TMPDIR/$NAME.delete"
done
echo "Summary: kept=$KEPT to_delete=$DEL dry_run=$DRY"

View file

@ -32,7 +32,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "50%"
"resize.topolvm.io/storage_limit" = "20Gi"
"resize.topolvm.io/storage_limit" = "50Gi"
}
}
spec {
@ -40,7 +40,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
storage_class_name = "proxmox-lvm-encrypted"
resources {
requests = {
storage = "5Gi"
storage = "15Gi"
}
}
}
@ -72,6 +72,14 @@ resource "kubernetes_deployment" "forgejo" {
}
}
spec {
# fsGroup chowns the mounted PVC to GID 1000 (the forgejo user) on
# mount. Without this, /data is owned by root and the
# `[packages].CHUNKED_UPLOAD_PATH` default at /data/tmp is not
# writable, crashlooping the pod when packages is enabled. Pre-23-day
# Forgejo ran without packages on so this never surfaced.
security_context {
fs_group = 1000
}
container {
name = "forgejo"
image = "codeberg.org/forgejo/forgejo:11"
@ -101,10 +109,30 @@ resource "kubernetes_deployment" "forgejo" {
name = "FORGEJO__openid__ENABLE_OPENID_SIGNIN"
value = "false"
}
# Allow webhook delivery to internal k8s services
# Allow webhook delivery to internal k8s services AND to the public
# ingress hostnames Forgejo's own webhooks point to (ci.viktorbarzin.me
# for Woodpecker pipelines).
env {
name = "FORGEJO__webhook__ALLOWED_HOST_LIST"
value = "*.svc.cluster.local"
value = "*.svc.cluster.local,ci.viktorbarzin.me,*.viktorbarzin.me"
}
# Default DELIVER_TIMEOUT is 5s too tight for the Cloudflare-tunnel
# round-trip on first request after pod restart (cold TLS handshake
# can hit 6-8s). 30s comfortably covers retries.
env {
name = "FORGEJO__webhook__DELIVER_TIMEOUT"
value = "30"
}
# OCI registry (container packages). Default-on in Forgejo v11 but
# explicit so it can't be silently disabled by an upstream config
# change. CHUNKED_UPLOAD_PATH defaults to `data/tmp/package-upload`
# under Forgejo's AppDataPath (resolves to a writable subdir of
# /data/gitea/) overriding to /data/tmp directly hits a perms
# issue because /data is the volume mount root and is not chowned
# to the forgejo user.
env {
name = "FORGEJO__packages__ENABLED"
value = "true"
}
volume_mount {
name = "data"
@ -113,10 +141,10 @@ resource "kubernetes_deployment" "forgejo" {
resources {
requests = {
cpu = "15m"
memory = "384Mi"
memory = "1Gi"
}
limits = {
memory = "384Mi"
memory = "1Gi"
}
}
port {
@ -165,6 +193,9 @@ module "ingress" {
namespace = kubernetes_namespace.forgejo.metadata[0].name
name = "forgejo"
tls_secret_name = var.tls_secret_name
# OCI registry pushes ship full image layer blobs in one request; default
# Traefik buffering chokes on anything past a few hundred MB.
max_body_size = "5g"
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Forgejo"

View file

@ -105,7 +105,8 @@ resource "kubernetes_deployment" "freedify" {
name = "registry-credentials"
}
container {
image = "registry.viktorbarzin.me/freedify:${var.tag}"
# Phase 3 cutover 2026-05-07 Forgejo registry consolidation.
image = "forgejo.viktorbarzin.me/viktor/freedify:${var.tag}"
name = "freedify"
port {

View file

@ -75,13 +75,13 @@ module "k8s-node-template" {
mkdir -p /etc/containerd/certs.d/ghcr.io
printf 'server = "https://ghcr.io"\n\n[host."http://10.0.20.10:5010"]\n capabilities = ["pull", "resolve"]\n\n[host."https://ghcr.io"]\n capabilities = ["pull", "resolve"]\n' > /etc/containerd/certs.d/ghcr.io/hosts.toml
# Create hosts.toml for private registry both IP and hostname entries
# IP-based (10.0.20.10:5050): direct access, skip TLS verify (wildcard cert, no IP SAN)
mkdir -p /etc/containerd/certs.d/10.0.20.10:5050
printf 'server = "https://10.0.20.10:5050"\n\n[host."https://10.0.20.10:5050"]\n capabilities = ["pull", "resolve", "push"]\n skip_verify = true\n' > /etc/containerd/certs.d/10.0.20.10:5050/hosts.toml
# Hostname-based (registry.viktorbarzin.me): redirects to LAN IP to avoid Traefik round-trip
mkdir -p /etc/containerd/certs.d/registry.viktorbarzin.me
printf 'server = "https://registry.viktorbarzin.me"\n\n[host."https://10.0.20.10:5050"]\n capabilities = ["pull", "resolve", "push"]\n skip_verify = true\n' > /etc/containerd/certs.d/registry.viktorbarzin.me/hosts.toml
# Forgejo OCI registry: redirect to in-cluster Traefik LB (10.0.20.200) so
# pulls don't hairpin out through the WAN gateway. Traefik serves the
# *.viktorbarzin.me wildcard so SNI verification still passes.
# registry.viktorbarzin.me / 10.0.20.10:5050 entries removed in Phase 4 of
# the forgejo-registry-consolidation 2026-05-07 registry-private is gone.
mkdir -p /etc/containerd/certs.d/forgejo.viktorbarzin.me
printf 'server = "https://forgejo.viktorbarzin.me"\n\n[host."https://10.0.20.200"]\n capabilities = ["pull", "resolve"]\n' > /etc/containerd/certs.d/forgejo.viktorbarzin.me/hosts.toml
# Low-traffic registries (registry.k8s.io, quay.io, reg.kyverno.io) pull directly.
# Pull-through cache removed: caused corrupted images (truncated downloads)

View file

@ -8,7 +8,8 @@ variable "postgresql_host" { type = string }
locals {
namespace = "job-hunter"
image = "registry.viktorbarzin.me/job-hunter:${var.image_tag}"
# Phase 3 cutover 2026-05-07 see infra/docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md.
image = "forgejo.viktorbarzin.me/viktor/job-hunter:${var.image_tag}"
labels = {
app = "job-hunter"
}

View file

@ -24,6 +24,14 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
]
}
provider "registry.terraform.io/goauthentik/authentik" {
version = "2024.12.1"
constraints = "~> 2024.10"
hashes = [
"h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
]
}
provider "registry.terraform.io/hashicorp/helm" {
version = "3.1.1"
hashes = [

View file

@ -24,16 +24,6 @@ module "tls_secret" {
tls_secret_name = var.tls_secret_name
}
resource "kubernetes_config_map" "kms-web-page" {
metadata {
name = "kms-web-page-config"
namespace = kubernetes_namespace.kms.metadata[0].name
}
data = {
"index.html" = var.index_html
}
}
resource "kubernetes_deployment" "kms-web-page" {
metadata {
name = "kms-web-page"
@ -59,8 +49,11 @@ resource "kubernetes_deployment" "kms-web-page" {
}
}
spec {
image_pull_secrets {
name = "registry-credentials"
}
container {
image = "nginx"
image = "forgejo.viktorbarzin.me/viktor/kms-website:${var.image_tag}"
name = "kms-web-page"
image_pull_policy = "IfNotPresent"
resources {
@ -76,29 +69,17 @@ resource "kubernetes_deployment" "kms-web-page" {
container_port = 80
protocol = "TCP"
}
volume_mount {
name = "config"
mount_path = "/usr/share/nginx/html/"
}
}
volume {
name = "config"
config_map {
name = "kms-web-page-config"
items {
key = "index.html"
path = "index.html"
}
}
}
}
}
}
depends_on = [kubernetes_config_map.kms-web-page]
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
spec[0].template[0].spec[0].dns_config,
# CI (Woodpecker) manages the live image tag via `kubectl set image`
spec[0].template[0].spec[0].container[0].image,
]
}
}

View file

@ -1,68 +1,5 @@
variable "index_html" {
default = <<EOT
<h1>How to activate windows</h1>
Open the following link and find a key for you version of windows: </br>
<b><a href="https://goo.gl/BcrPjW" target="_blank">https://goo.gl/BcrPjW</a></b>
</br>
</br>
Open cmd as <b>Administrator</b> and run the following: </br>
</br>
<b>slmgr.vbs /ipk key_for_your_windows</b>
</br>
<b>slmgr.vbs /skms kms.viktorbarzin.me </b>
<br>
<b>
slmgr /ato
</b>
<br>
<p>
<h3> If you have an evaluation windows, you need to change it to retail one. This is how:</h3>
<br>
From an elevated command prompt, determine the current edition name with the command <br>
<strong>DISM /online /Get-CurrentEdition</strong>.
<br>Make note of the edition ID, an abbreviated form of the edition name. Then run
<br>
<strong>DISM /online /Set-Edition:<edition ID> /ProductKey:XXXXX-XXXXX-XXXXX-XXXXX-XXXXX /AcceptEula</strong>
<br> providing the edition ID and a retail product key. The server will restart
</p>
<hr>
<h1>How to activate Microsoft Office</h1>
<br>
<b>
CD \Program Files\Microsoft Office\Office16 </b> OR <b>CD \Program Files (x86)\Microsoft Office\Office16
</b>
<br>
<b>
cscript ospp.vbs /sethst:kms.viktorbarzin.me
</b>
<br>
<b>
cscript ospp.vbs /inpkey:xxxxx-xxxxx-xxxxx-xxxxx-xxxxx
</b>
<br>
where 'xxxx' is a key for your office. Some examples for office 2016 - <a
href="https://www.techdee.com/microsoft-office-2016-product-key/">https://www.techdee.com/microsoft-office-2016-product-key/</a>
<br>
<b>
cscript ospp.vbs /act
</b>
<br>
<br>
If you messed up activation settings reset them using
<br>
slmgr /upk
<br>
slmgr /cpky
<br>
and
<br>
slmgr /rearm
<h3>Buy me a beer :P</h3>
EOT
variable "image_tag" {
type = string
default = "latest"
description = "kms-website image tag pushed to forgejo.viktorbarzin.me/viktor/kms-website. Use 8-char git SHA in CI."
}

View file

@ -20,14 +20,14 @@ resource "kubernetes_secret" "registry_credentials" {
data = {
".dockerconfigjson" = jsonencode({
auths = {
"registry.viktorbarzin.me" = {
auth = base64encode("${data.vault_kv_secret_v2.viktor.data["registry_user"]}:${data.vault_kv_secret_v2.viktor.data["registry_password"]}")
}
"registry.viktorbarzin.me:5050" = {
auth = base64encode("${data.vault_kv_secret_v2.viktor.data["registry_user"]}:${data.vault_kv_secret_v2.viktor.data["registry_password"]}")
}
"10.0.20.10:5050" = {
auth = base64encode("${data.vault_kv_secret_v2.viktor.data["registry_user"]}:${data.vault_kv_secret_v2.viktor.data["registry_password"]}")
# Phase 4 of forgejo-registry-consolidation 2026-05-07 registry-
# private decommissioned. Old auths entries (registry.viktorbarzin.me,
# registry.viktorbarzin.me:5050, 10.0.20.10:5050) removed to prevent
# silent fallback. If a pod somehow references the old hostname now,
# it will visibly fail with auth missing rather than silently pulling
# potentially-stale blobs.
"forgejo.viktorbarzin.me" = {
auth = base64encode("cluster-puller:${try(data.vault_kv_secret_v2.viktor.data["forgejo_pull_token"], "")}")
}
}
})

View file

@ -33,5 +33,10 @@ module "monitoring" {
kube_config_path = var.kube_config_path
registry_user = data.vault_kv_secret_v2.viktor.data["registry_user"]
registry_password = data.vault_kv_secret_v2.viktor.data["registry_password"]
tier = local.tiers.cluster
# try() so apply succeeds before the Vault key is populated during Phase 0
# bootstrap (see docs/runbooks/forgejo-registry-setup.md). Empty token =
# probe will report an auth failure and fire RegistryCatalogInaccessible
# that's the intended visible-broken state until the PAT is created.
forgejo_pull_token = try(data.vault_kv_secret_v2.viktor.data["forgejo_pull_token"], "")
tier = local.tiers.cluster
}

View file

@ -0,0 +1,476 @@
{
"annotations": {"list": []},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": null,
"links": [],
"liveNow": false,
"refresh": "30s",
"schemaVersion": 38,
"tags": ["openclaw", "ai", "codex"],
"time": {"from": "now-6h", "to": "now"},
"timepicker": {},
"timezone": "",
"title": "OpenClaw — Codex Usage",
"uid": "openclaw-codex",
"version": 1,
"panels": [
{
"type": "row",
"id": 100,
"title": "Now",
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 0},
"collapsed": false,
"panels": []
},
{
"type": "stat",
"id": 1,
"title": "Messages last 5h — gpt-5.4-mini",
"description": "Plus rate-card lower bound: 1,200 / 5h. Hard cap at the upper bound: 7,000 / 5h.",
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"gridPos": {"h": 5, "w": 6, "x": 0, "y": 1},
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false},
"textMode": "auto"
},
"fieldConfig": {
"defaults": {
"decimals": 0,
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 960},
{"color": "orange", "value": 1500},
{"color": "red", "value": 5600}
]
},
"unit": "short"
}
},
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "sum(increase(openclaw_codex_messages_total{provider=\"openai-codex\",model=\"gpt-5.4-mini\"}[5h]))",
"refId": "A"
}
]
},
{
"type": "gauge",
"id": 2,
"title": "% of Plus 5h floor (1,200 cap)",
"description": "Conservative gauge against the lower bound of the published rate-card. Real ceiling depends on dynamic allocation (1,2007,000). Re-baseline if you observe throttling at <80%.",
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"gridPos": {"h": 5, "w": 6, "x": 6, "y": 1},
"options": {
"orientation": "auto",
"showThresholdLabels": false,
"showThresholdMarkers": true,
"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}
},
"fieldConfig": {
"defaults": {
"min": 0,
"max": 100,
"decimals": 1,
"unit": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 60},
{"color": "orange", "value": 80},
{"color": "red", "value": 95}
]
}
}
},
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "100 * sum(increase(openclaw_codex_messages_total{provider=\"openai-codex\",model=\"gpt-5.4-mini\"}[5h])) / 1200",
"refId": "A"
}
]
},
{
"type": "stat",
"id": 3,
"title": "Tokens last 5h (input + output, codex)",
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"gridPos": {"h": 5, "w": 6, "x": 12, "y": 1},
"options": {
"colorMode": "value",
"graphMode": "area",
"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}
},
"fieldConfig": {
"defaults": {
"decimals": 0,
"unit": "short",
"thresholds": {"mode": "absolute", "steps": [{"color": "blue", "value": null}]}
}
},
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "sum(increase(openclaw_codex_input_tokens_total{provider=\"openai-codex\"}[5h])) + sum(increase(openclaw_codex_output_tokens_total{provider=\"openai-codex\"}[5h]))",
"refId": "A"
}
]
},
{
"type": "stat",
"id": 4,
"title": "Cache hit ratio (codex, 5h)",
"description": "cacheRead / (cacheRead + input). Higher is better — caching cuts effective Plus quota burn.",
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"gridPos": {"h": 5, "w": 6, "x": 18, "y": 1},
"options": {
"colorMode": "value",
"graphMode": "area",
"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}
},
"fieldConfig": {
"defaults": {
"min": 0,
"max": 100,
"decimals": 1,
"unit": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "red", "value": null},
{"color": "yellow", "value": 30},
{"color": "green", "value": 60}
]
}
}
},
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "100 * sum(increase(openclaw_codex_cache_read_tokens_total{provider=\"openai-codex\"}[5h])) / clamp_min(sum(increase(openclaw_codex_input_tokens_total{provider=\"openai-codex\"}[5h])) + sum(increase(openclaw_codex_cache_read_tokens_total{provider=\"openai-codex\"}[5h])), 1)",
"refId": "A"
}
]
},
{
"type": "stat",
"id": 5,
"title": "OAuth token expiry",
"description": "Days until the openai-codex OAuth token expires. Re-run `openclaw models auth login --provider openai-codex` before this hits 0.",
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"gridPos": {"h": 5, "w": 6, "x": 0, "y": 6},
"options": {
"colorMode": "background",
"graphMode": "none",
"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}
},
"fieldConfig": {
"defaults": {
"decimals": 1,
"unit": "d",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "red", "value": null},
{"color": "orange", "value": 1},
{"color": "yellow", "value": 3},
{"color": "green", "value": 5}
]
}
}
},
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "max(openclaw_codex_oauth_expiry_seconds{provider=\"openai-codex\"}) / 86400",
"refId": "A"
}
]
},
{
"type": "stat",
"id": 6,
"title": "Active sessions",
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"gridPos": {"h": 5, "w": 6, "x": 6, "y": 6},
"options": {
"colorMode": "value",
"graphMode": "none",
"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": true},
"textMode": "value_and_name"
},
"fieldConfig": {
"defaults": {
"unit": "short",
"thresholds": {"mode": "absolute", "steps": [{"color": "blue", "value": null}]}
}
},
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "openclaw_codex_active_sessions",
"legendFormat": "{{kind}}",
"refId": "A"
}
]
},
{
"type": "stat",
"id": 7,
"title": "Last assistant turn",
"description": "Time since the latest assistant message landed in any session.",
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"gridPos": {"h": 5, "w": 6, "x": 12, "y": 6},
"options": {
"colorMode": "background",
"graphMode": "none",
"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}
},
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 1800},
{"color": "orange", "value": 7200},
{"color": "red", "value": 86400}
]
}
}
},
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "time() - openclaw_codex_last_run_timestamp",
"refId": "A"
}
]
},
{
"type": "stat",
"id": 8,
"title": "Errors last 24h",
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"gridPos": {"h": 5, "w": 6, "x": 18, "y": 6},
"options": {
"colorMode": "background",
"graphMode": "area",
"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}
},
"fieldConfig": {
"defaults": {
"decimals": 0,
"unit": "short",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 1},
{"color": "red", "value": 10}
]
}
}
},
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "sum(increase(openclaw_codex_message_errors_total[24h]))",
"refId": "A"
}
]
},
{
"type": "row",
"id": 200,
"title": "Over time",
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 11},
"collapsed": false,
"panels": []
},
{
"type": "timeseries",
"id": 10,
"title": "Messages / min by model",
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 12},
"fieldConfig": {
"defaults": {
"color": {"mode": "palette-classic"},
"custom": {
"drawStyle": "bars",
"fillOpacity": 60,
"lineWidth": 1,
"stacking": {"mode": "normal"}
},
"unit": "short"
}
},
"options": {
"legend": {"displayMode": "table", "placement": "right", "showLegend": true, "calcs": ["sum"]},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "sum by (provider, model) (rate(openclaw_codex_messages_total[1m])) * 60",
"legendFormat": "{{provider}}/{{model}}",
"refId": "A"
}
]
},
{
"type": "timeseries",
"id": 11,
"title": "Tokens / min by type (codex)",
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 20},
"fieldConfig": {
"defaults": {
"color": {"mode": "palette-classic"},
"custom": {
"drawStyle": "line",
"fillOpacity": 25,
"lineWidth": 2,
"stacking": {"mode": "none"}
},
"unit": "short"
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom", "showLegend": true},
"tooltip": {"mode": "multi", "sort": "desc"}
},
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "sum(rate(openclaw_codex_input_tokens_total{provider=\"openai-codex\"}[5m])) * 60",
"legendFormat": "input",
"refId": "A"
},
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "sum(rate(openclaw_codex_output_tokens_total{provider=\"openai-codex\"}[5m])) * 60",
"legendFormat": "output",
"refId": "B"
},
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "sum(rate(openclaw_codex_cache_read_tokens_total{provider=\"openai-codex\"}[5m])) * 60",
"legendFormat": "cache_read",
"refId": "C"
}
]
},
{
"type": "bargauge",
"id": 12,
"title": "Messages / 5h by model",
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 20},
"options": {
"displayMode": "gradient",
"orientation": "horizontal",
"showUnfilled": true,
"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": false}
},
"fieldConfig": {
"defaults": {
"min": 0,
"decimals": 0,
"unit": "short",
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 100},
{"color": "orange", "value": 500},
{"color": "red", "value": 1000}
]
}
}
},
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "sum by (provider, model) (increase(openclaw_codex_messages_total[5h]))",
"legendFormat": "{{provider}}/{{model}}",
"refId": "A"
}
]
},
{
"type": "row",
"id": 300,
"title": "Errors",
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 28},
"collapsed": false,
"panels": []
},
{
"type": "table",
"id": 20,
"title": "Recent errors by model and reason",
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 29},
"options": {
"showHeader": true
},
"fieldConfig": {
"defaults": {
"custom": {"align": "auto", "displayMode": "auto"}
},
"overrides": [
{
"matcher": {"id": "byName", "options": "Value"},
"properties": [
{"id": "displayName", "value": "Errors (24h)"},
{"id": "custom.displayMode", "value": "color-background"},
{
"id": "thresholds",
"value": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 1},
{"color": "red", "value": 10}
]
}
}
]
}
]
},
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "sum by (provider, model, reason) (increase(openclaw_codex_message_errors_total[24h])) > 0",
"format": "table",
"instant": true,
"refId": "A"
}
],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {"Time": true, "__name__": true, "instance": true, "job": true, "namespace": true, "pod": true, "app": true},
"indexByName": {"provider": 0, "model": 1, "reason": 2, "Value": 3},
"renameByName": {}
}
}
]
}
]
}

File diff suppressed because it is too large Load diff

View file

@ -134,6 +134,7 @@ locals {
# Applications
"qbittorrent.json" = "Applications"
"realestate-crawler.json" = "Applications"
"openclaw.json" = "Applications"
"uk-payslip.json" = "Finance (Personal)"
"wealth.json" = "Finance (Personal)"
"job-hunter.json" = "Finance"

View file

@ -41,6 +41,11 @@ variable "registry_password" {
type = string
sensitive = true
}
variable "forgejo_pull_token" {
type = string
sensitive = true
description = "PAT for the cluster-puller user, used by the Forgejo registry integrity probe."
}
resource "kubernetes_namespace" "monitoring" {
metadata {
@ -238,27 +243,42 @@ resource "kubernetes_cron_job_v1" "dns_anomaly_monitor" {
}
# -----------------------------------------------------------------------------
# Registry manifest-integrity probe HEADs every tag in the private R/W
# registry's catalog, walks multi-platform image indexes, and reports blob
# availability. Catches the orphan-index failure mode seen 2026-04-13 and
# 2026-04-19 before downstream pipelines hit it.
# Phase 4 of forgejo-registry-consolidation 2026-05-07: registry-private
# decommissioned. The integrity probe below caught the orphan-index failure
# mode in `registry:2.8.3` (post-mortem 2026-04-19). With that engine
# retired, the probe is replaced by `forgejo_integrity_probe` below.
#
# Resource definitions stripped wholesale terragrunt apply destroys the
# in-cluster CronJob + Secret on the next run.
# See: docs/post-mortems/2026-04-19-registry-orphan-index.md
# -----------------------------------------------------------------------------
resource "kubernetes_secret" "registry_probe_credentials" {
# -----------------------------------------------------------------------------
# Forgejo registry integrity probe same algorithm as registry-integrity-probe
# above, but targets the Forgejo OCI registry instead of registry-private. Runs
# in parallel with the existing probe during the dual-push bake; once Phase 4
# decommissions registry-private, the registry-integrity-probe CronJob is
# deleted and only this one remains.
#
# Auth: HTTP Basic with cluster-puller PAT (read:package scope is enough to
# walk catalog + manifests). Reaches Forgejo via the in-cluster service so we
# don't hairpin out through Traefik for every probe run.
# -----------------------------------------------------------------------------
resource "kubernetes_secret" "forgejo_probe_credentials" {
metadata {
name = "registry-probe-credentials"
name = "forgejo-probe-credentials"
namespace = kubernetes_namespace.monitoring.metadata[0].name
}
type = "Opaque"
data = {
REG_USER = var.registry_user
REG_PASS = var.registry_password
REG_USER = "cluster-puller"
REG_PASS = var.forgejo_pull_token
}
}
resource "kubernetes_cron_job_v1" "registry_integrity_probe" {
resource "kubernetes_cron_job_v1" "forgejo_integrity_probe" {
metadata {
name = "registry-integrity-probe"
name = "forgejo-integrity-probe"
namespace = kubernetes_namespace.monitoring.metadata[0].name
}
spec {
@ -275,13 +295,13 @@ resource "kubernetes_cron_job_v1" "registry_integrity_probe" {
metadata {}
spec {
container {
name = "registry-integrity-probe"
name = "forgejo-integrity-probe"
image = "docker.io/library/alpine:3.20"
env {
name = "REG_USER"
value_from {
secret_key_ref {
name = kubernetes_secret.registry_probe_credentials.metadata[0].name
name = kubernetes_secret.forgejo_probe_credentials.metadata[0].name
key = "REG_USER"
}
}
@ -290,22 +310,26 @@ resource "kubernetes_cron_job_v1" "registry_integrity_probe" {
name = "REG_PASS"
value_from {
secret_key_ref {
name = kubernetes_secret.registry_probe_credentials.metadata[0].name
name = kubernetes_secret.forgejo_probe_credentials.metadata[0].name
key = "REG_PASS"
}
}
}
env {
name = "REGISTRY_HOST"
value = "10.0.20.10:5050"
value = "forgejo.forgejo.svc.cluster.local"
}
env {
name = "REGISTRY_SCHEME"
value = "http"
}
env {
name = "REGISTRY_INSTANCE"
value = "registry.viktorbarzin.me:5050"
value = "forgejo.viktorbarzin.me"
}
env {
name = "PUSHGATEWAY"
value = "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/registry-integrity-probe"
value = "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/forgejo-integrity-probe"
}
env {
name = "TAGS_PER_REPO"
@ -316,16 +340,16 @@ resource "kubernetes_cron_job_v1" "registry_integrity_probe" {
apk add --no-cache curl jq >/dev/null
REG="$REGISTRY_HOST"
SCHEME="$${REGISTRY_SCHEME:-https}"
INSTANCE="$REGISTRY_INSTANCE"
AUTH="$REG_USER:$REG_PASS"
ACCEPT='application/vnd.oci.image.index.v1+json,application/vnd.oci.image.manifest.v1+json,application/vnd.docker.distribution.manifest.list.v2+json,application/vnd.docker.distribution.manifest.v2+json'
push() {
# Prometheus pushgateway body ends with blank line. Ignore push errors.
curl -sf --max-time 10 --data-binary @- "$PUSHGATEWAY" >/dev/null 2>&1 || true
}
CATALOG=$(curl -sk -u "$AUTH" --max-time 30 "https://$REG/v2/_catalog?n=1000" || echo "")
CATALOG=$(curl -sk -u "$AUTH" --max-time 30 "$SCHEME://$REG/v2/_catalog?n=1000" || echo "")
REPOS=$(echo "$CATALOG" | jq -r '.repositories[]?' 2>/dev/null || echo "")
if [ -z "$REPOS" ]; then
@ -350,7 +374,7 @@ resource "kubernetes_cron_job_v1" "registry_integrity_probe" {
[ -z "$repo" ] && continue
REPOS_N=$((REPOS_N + 1))
TAGS_JSON=$(curl -sk -u "$AUTH" --max-time 15 "https://$REG/v2/$repo/tags/list" || echo "")
TAGS_JSON=$(curl -sk -u "$AUTH" --max-time 15 "$SCHEME://$REG/v2/$repo/tags/list" || echo "")
echo "$TAGS_JSON" | jq -r '.tags[]?' 2>/dev/null | tail -n "$TAGS_PER_REPO" > /tmp/tags.txt || true
while IFS= read -r tag; do
@ -359,7 +383,7 @@ resource "kubernetes_cron_job_v1" "registry_integrity_probe" {
HTTP=$(curl -sk -u "$AUTH" -o /tmp/m.json -w '%%{http_code}' \
-H "Accept: $ACCEPT" --max-time 15 \
"https://$REG/v2/$repo/manifests/$tag")
"$SCHEME://$REG/v2/$repo/manifests/$tag")
if [ "$HTTP" != "200" ]; then
echo "FAIL: $repo:$tag manifest HTTP $HTTP"
FAIL=$((FAIL + 1))
@ -374,7 +398,7 @@ resource "kubernetes_cron_job_v1" "registry_integrity_probe" {
[ -z "$d" ] && continue
CH=$(curl -sk -u "$AUTH" -o /dev/null -w '%%{http_code}' \
-H "Accept: $ACCEPT" --max-time 10 -I \
"https://$REG/v2/$repo/manifests/$d")
"$SCHEME://$REG/v2/$repo/manifests/$d")
if [ "$CH" != "200" ]; then
echo "FAIL: $repo:$tag index child $d HTTP $CH"
FAIL=$((FAIL + 1))

View file

@ -1656,22 +1656,22 @@ serverFiles:
labels:
severity: critical
annotations:
summary: "Registry has {{ $value }} broken manifest reference(s) — orphan index or missing blob"
description: "The registry-integrity-probe CronJob in the monitoring namespace found {{ $value }} manifest/blob references that return non-200 on the private registry. Almost certainly an orphan OCI-index child from the cleanup-tags.sh+GC race. Rebuild the affected image per docs/runbooks/registry-rebuild-image.md and investigate which tag(s) the probe logs flagged."
summary: "{{ $labels.instance }}: {{ $value }} broken manifest reference(s) — orphan index or missing blob"
description: "The forgejo-integrity-probe CronJob found {{ $value }} manifest/blob references that return non-200 on {{ $labels.instance }}. Rebuild the affected image per docs/runbooks/forgejo-registry-rebuild-image.md. (registry.viktorbarzin.me retired Phase 4 of forgejo-registry-consolidation 2026-05-07 — only forgejo.viktorbarzin.me remains.)"
- alert: RegistryIntegrityProbeStale
expr: time() - registry_manifest_integrity_last_run_timestamp > 3600
for: 15m
labels:
severity: warning
annotations:
summary: "Registry integrity probe has not reported in >1h — CronJob may be broken"
summary: "{{ $labels.instance }} integrity probe has not reported in >1h — CronJob may be broken"
- alert: RegistryCatalogInaccessible
expr: registry_manifest_integrity_catalog_accessible == 0
for: 15m
labels:
severity: critical
annotations:
summary: "Registry probe cannot fetch /v2/_catalog — auth failure or registry down"
summary: "{{ $labels.instance }} probe cannot fetch /v2/_catalog — auth failure or registry down"
- alert: NodeHighCPUUsage
expr: pve_cpu_usage_ratio * 100 > 60
for: 6h

View file

@ -0,0 +1,264 @@
#!/usr/bin/env python3
"""OpenClaw / Codex usage exporter.
Reads ~/.openclaw/agents/*/sessions/*.jsonl (assistant messages with usage)
and ~/.openclaw/agents/*/agent/auth-state.json (OAuth profiles), then exposes
Prometheus text-format metrics on :9099/metrics. Stdlib only no pip install
needed at startup.
Metrics (all cumulative-since-session-start; use Prometheus increase()/rate()
for windowed views):
openclaw_codex_messages_total{provider,model,session_kind} counter
openclaw_codex_input_tokens_total{provider,model} counter
openclaw_codex_output_tokens_total{provider,model} counter
openclaw_codex_cache_read_tokens_total{provider,model} counter
openclaw_codex_cache_write_tokens_total{provider,model} counter
openclaw_codex_message_errors_total{provider,model,reason} counter
openclaw_codex_active_sessions{kind} gauge
openclaw_codex_oauth_expiry_seconds{provider,account} gauge
openclaw_codex_last_run_timestamp gauge
openclaw_codex_exporter_scrape_duration_ms gauge
"""
import glob
import json
import os
import re
import time
from datetime import datetime
from http.server import BaseHTTPRequestHandler, HTTPServer
from threading import Lock
OPENCLAW_HOME = os.environ.get("OPENCLAW_HOME", "/home/node/.openclaw")
PORT = int(os.environ.get("METRICS_PORT", "9099"))
CACHE_SEC = float(os.environ.get("CACHE_SEC", "5"))
SKIP_FRAGMENTS = (".broken.", ".reset.", ".deleted.", ".bak.")
SESSION_RE = re.compile(r"^([0-9a-f-]{36})\.jsonl$")
_lock = Lock()
_cache = {"text": "", "ts": 0.0}
def _esc(value: str) -> str:
return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
def _line(name: str, labels: dict, value) -> str:
if labels:
rendered = ",".join(f'{k}="{_esc(v)}"' for k, v in sorted(labels.items()))
return f"{name}{{{rendered}}} {value}"
return f"{name} {value}"
def _kind_for(session_id: str, sessions_index: dict) -> str:
for key, val in sessions_index.items():
if val.get("sessionId") != session_id:
continue
if key.startswith("agent:main:cron:"):
return "cron"
if key.startswith("telegram:slash:"):
return "telegram-slash"
if key.startswith("agent:main:"):
return "main"
surface = (val.get("origin") or {}).get("surface")
if surface:
return surface
return key.split(":", 1)[0]
return "unknown"
def _parse_ts(value):
if isinstance(value, (int, float)):
return float(value)
if isinstance(value, str):
try:
return datetime.fromisoformat(value.replace("Z", "+00:00")).timestamp()
except ValueError:
return 0.0
return 0.0
def _build_text() -> str:
start = time.monotonic()
out = []
sessions_index: dict = {}
for sp in glob.glob(os.path.join(OPENCLAW_HOME, "agents/*/sessions/sessions.json")):
try:
with open(sp) as f:
sessions_index.update(json.load(f))
except Exception:
pass
msg_count: dict = {}
in_tok: dict = {}
out_tok: dict = {}
cr_tok: dict = {}
cw_tok: dict = {}
err_count: dict = {}
latest_ts = 0.0
for jsonl in glob.glob(os.path.join(OPENCLAW_HOME, "agents/*/sessions/*.jsonl")):
bn = os.path.basename(jsonl)
if any(s in bn for s in SKIP_FRAGMENTS):
continue
m = SESSION_RE.match(bn)
if not m:
continue
sid = m.group(1)
kind = _kind_for(sid, sessions_index)
try:
with open(jsonl) as f:
for line in f:
line = line.strip()
if not line:
continue
try:
obj = json.loads(line)
except Exception:
continue
if obj.get("type") != "message":
continue
msg = obj.get("message") or {}
if msg.get("role") != "assistant":
continue
provider = msg.get("provider") or "unknown"
model = msg.get("model") or "unknown"
usage = msg.get("usage") or {}
ts = _parse_ts(obj.get("timestamp"))
if ts > latest_ts:
latest_ts = ts
if msg.get("stopReason") == "error":
reason = (msg.get("errorMessage") or "unknown")[:80]
ek = (provider, model, reason)
err_count[ek] = err_count.get(ek, 0) + 1
continue
mk = (provider, model, kind)
msg_count[mk] = msg_count.get(mk, 0) + 1
pm = (provider, model)
in_tok[pm] = in_tok.get(pm, 0) + (usage.get("input") or 0)
out_tok[pm] = out_tok.get(pm, 0) + (usage.get("output") or 0)
cr_tok[pm] = cr_tok.get(pm, 0) + (usage.get("cacheRead") or 0)
cw_tok[pm] = cw_tok.get(pm, 0) + (usage.get("cacheWrite") or 0)
except Exception:
pass
out.append("# HELP openclaw_codex_messages_total Cumulative assistant messages")
out.append("# TYPE openclaw_codex_messages_total counter")
for (p, mdl, k), c in msg_count.items():
out.append(_line("openclaw_codex_messages_total",
{"provider": p, "model": mdl, "session_kind": k}, c))
for name, src, hlp in [
("openclaw_codex_input_tokens_total", in_tok, "Cumulative input tokens"),
("openclaw_codex_output_tokens_total", out_tok, "Cumulative output tokens"),
("openclaw_codex_cache_read_tokens_total", cr_tok, "Cumulative cache-read tokens"),
("openclaw_codex_cache_write_tokens_total", cw_tok, "Cumulative cache-write tokens"),
]:
out.append(f"# HELP {name} {hlp}")
out.append(f"# TYPE {name} counter")
for (p, mdl), c in src.items():
out.append(_line(name, {"provider": p, "model": mdl}, c))
out.append("# HELP openclaw_codex_message_errors_total Cumulative assistant errors")
out.append("# TYPE openclaw_codex_message_errors_total counter")
for (p, mdl, r), c in err_count.items():
out.append(_line("openclaw_codex_message_errors_total",
{"provider": p, "model": mdl, "reason": r}, c))
out.append("# HELP openclaw_codex_active_sessions Active sessions in sessions.json")
out.append("# TYPE openclaw_codex_active_sessions gauge")
kc: dict = {}
for k in sessions_index:
if k.startswith("agent:main:cron:"):
kk = "cron"
elif k.startswith("telegram:slash:"):
kk = "telegram-slash"
elif k.startswith("agent:main:"):
kk = "main"
else:
kk = k.split(":", 1)[0]
kc[kk] = kc.get(kk, 0) + 1
for k, c in kc.items():
out.append(_line("openclaw_codex_active_sessions", {"kind": k}, c))
if latest_ts:
out.append("# HELP openclaw_codex_last_run_timestamp Unix ts of newest assistant message")
out.append("# TYPE openclaw_codex_last_run_timestamp gauge")
out.append(_line("openclaw_codex_last_run_timestamp", {}, latest_ts))
out.append("# HELP openclaw_codex_oauth_expiry_seconds Seconds until OAuth token expires")
out.append("# TYPE openclaw_codex_oauth_expiry_seconds gauge")
now = time.time()
for af in glob.glob(os.path.join(OPENCLAW_HOME, "agents/*/agent/auth-profiles.json")):
try:
with open(af) as f:
data = json.load(f)
except Exception:
continue
# Schema: {"version": 1, "profiles": {"<id>": {...}}}.
# `expires` is Unix milliseconds.
for profile in (data.get("profiles") or {}).values():
exp_ms = profile.get("expires")
if not isinstance(exp_ms, (int, float)):
continue
exp_ts = exp_ms / 1000.0
out.append(_line(
"openclaw_codex_oauth_expiry_seconds",
{
"provider": profile.get("provider", "unknown"),
"account": profile.get("email") or profile.get("account") or "unknown",
"plan": profile.get("chatgptPlanType") or "unknown",
},
max(0, exp_ts - now),
))
out.append("# HELP openclaw_codex_exporter_scrape_duration_ms Last scrape duration ms")
out.append("# TYPE openclaw_codex_exporter_scrape_duration_ms gauge")
out.append(_line("openclaw_codex_exporter_scrape_duration_ms", {},
(time.monotonic() - start) * 1000))
return "\n".join(out) + "\n"
class Handler(BaseHTTPRequestHandler):
def do_GET(self):
if self.path == "/healthz":
self.send_response(200)
self.send_header("Content-Type", "text/plain")
self.end_headers()
self.wfile.write(b"ok\n")
return
if self.path != "/metrics":
self.send_response(404)
self.end_headers()
return
with _lock:
now = time.time()
if now - _cache["ts"] > CACHE_SEC:
try:
_cache["text"] = _build_text()
except Exception as exc: # noqa: BLE001
_cache["text"] = (
f'openclaw_codex_exporter_errors_total{{kind="scrape"}} 1\n'
f'# scrape error: {_esc(str(exc))[:200]}\n'
)
_cache["ts"] = now
body = _cache["text"].encode()
self.send_response(200)
self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
self.send_header("Content-Length", str(len(body)))
self.end_headers()
self.wfile.write(body)
def log_message(self, *args, **kwargs):
pass
def main():
print(f"openclaw exporter listening on :{PORT}", flush=True)
HTTPServer(("0.0.0.0", PORT), Handler).serve_forever()
if __name__ == "__main__":
main()

View file

@ -131,8 +131,12 @@ resource "kubernetes_config_map" "openclaw_config" {
mode = "off"
}
model = {
primary = "nim/qwen/qwen3.5-397b-a17b"
fallbacks = ["nim/mistralai/mistral-large-3-675b-instruct-2512", "nim/nvidia/llama-3.1-nemotron-ultra-253b-v1", "modelrelay/auto-fastest"]
# ChatGPT Plus OAuth via openai-codex plugin (account: ancaelena98@gmail.com).
# gpt-5.4-mini is the only mini variant the Codex backend accepts for Plus tier;
# gpt-5-mini / gpt-5.1-codex-mini return model_not_found / "not supported with
# ChatGPT account". Plus rate-card: 1,2007,000 local msgs / 5h on gpt-5.4-mini.
primary = "openai-codex/gpt-5.4-mini"
fallbacks = ["openai-codex/gpt-5.5", "nim/qwen/qwen3-coder-480b-a35b-instruct", "modelrelay/auto-fastest"]
}
models = {
"modelrelay/auto-fastest" = {}
@ -146,6 +150,8 @@ resource "kubernetes_config_map" "openclaw_config" {
"llama-as-openai/Llama-4-Scout-17B-16E-Instruct-FP8" = {}
"openrouter/stepfun/step-3.5-flash:free" = {}
"openrouter/arcee-ai/trinity-large-preview:free" = {}
"openai-codex/gpt-5.4-mini" = {}
"openai-codex/gpt-5.5" = {}
}
}
}
@ -255,6 +261,19 @@ resource "random_password" "gateway_token" {
special = false
}
# Prometheus exporter script read by the openclaw-exporter sidecar.
# Stdlib-only Python so no pip install at startup. Reads sessions JSONL +
# auth-profiles.json from the NFS-backed openclaw home volume (mounted ro).
resource "kubernetes_config_map" "openclaw_exporter" {
metadata {
name = "openclaw-exporter"
namespace = kubernetes_namespace.openclaw.metadata[0].name
}
data = {
"exporter.py" = file("${path.module}/files/exporter.py")
}
}
module "nfs_tools_host" {
source = "../../modules/kubernetes/nfs_volume"
name = "openclaw-tools-host"
@ -344,6 +363,11 @@ resource "kubernetes_deployment" "openclaw" {
}
annotations = {
"reloader.stakater.com/search" = "true"
# Prometheus auto-discovers pods with these annotations.
# Scraped by the openclaw-exporter sidecar exposes /metrics on :9099.
"prometheus.io/scrape" = "true"
"prometheus.io/port" = "9099"
"prometheus.io/path" = "/metrics"
}
}
spec {
@ -383,8 +407,10 @@ resource "kubernetes_deployment" "openclaw" {
# Main container: OpenClaw
container {
name = "openclaw"
image = "ghcr.io/openclaw/openclaw:2026.2.26"
command = ["sh", "-c", "node openclaw.mjs doctor --fix 2>/dev/null; exec node openclaw.mjs gateway --allow-unconfigured --bind lan"]
image = "ghcr.io/openclaw/openclaw:2026.5.4"
# Doctor --fix auto-promotes the highest-tier codex model (gpt-5-pro) after
# auth-profile-based model discovery; pin gpt-5.4-mini back to default after it.
command = ["sh", "-c", "node openclaw.mjs doctor --fix 2>/dev/null; node openclaw.mjs models set openai-codex/gpt-5.4-mini 2>/dev/null; exec node openclaw.mjs gateway --allow-unconfigured --bind lan"]
port {
container_port = 18789
}
@ -510,6 +536,54 @@ resource "kubernetes_deployment" "openclaw" {
}
}
# Sidecar: openclaw-exporter Prometheus exporter for Codex/OAuth usage.
# Reads sessions JSONL files + auth-profiles.json, exposes /metrics on :9099.
# Stdlib-only Python; no pip install at startup.
container {
name = "openclaw-exporter"
image = "docker.io/library/python:3.12-slim"
command = ["python3", "/scripts/exporter.py"]
port {
container_port = 9099
name = "metrics"
}
env {
name = "OPENCLAW_HOME"
value = "/home/node/.openclaw"
}
env {
name = "METRICS_PORT"
value = "9099"
}
volume_mount {
name = "openclaw-exporter-script"
mount_path = "/scripts"
read_only = true
}
volume_mount {
name = "openclaw-home"
mount_path = "/home/node/.openclaw"
read_only = true
}
readiness_probe {
http_get {
path = "/healthz"
port = 9099
}
initial_delay_seconds = 5
period_seconds = 30
}
resources {
requests = {
cpu = "10m"
memory = "64Mi"
}
limits = {
memory = "128Mi"
}
}
}
# Sidecar: modelrelay auto-routes to fastest healthy free model
container {
name = "modelrelay"
@ -598,6 +672,13 @@ resource "kubernetes_deployment" "openclaw" {
name = kubernetes_config_map.openclaw_config.metadata[0].name
}
}
volume {
name = "openclaw-exporter-script"
config_map {
name = kubernetes_config_map.openclaw_exporter.metadata[0].name
default_mode = "0555"
}
}
}
}
}

View file

@ -8,7 +8,10 @@ variable "postgresql_host" { type = string }
locals {
namespace = "payslip-ingest"
image = "registry.viktorbarzin.me/payslip-ingest:${var.image_tag}"
# Phase 3 of forgejo-registry-consolidation image= flipped to Forgejo
# 2026-05-07. registry-private kept image at the same path, so the new
# Forgejo URL is `viktor/<name>` under forgejo.viktorbarzin.me.
image = "forgejo.viktorbarzin.me/viktor/payslip-ingest:${var.image_tag}"
labels = {
app = "payslip-ingest"
}

View file

@ -307,6 +307,12 @@ resource "kubernetes_config_map" "bot_block_proxy_config" {
server {
listen 8080;
location /auth {
access_by_lua_block {
ngx.req.clear_header("If-Match")
ngx.req.clear_header("If-None-Match")
ngx.req.clear_header("If-Modified-Since")
ngx.req.clear_header("If-Unmodified-Since")
}
proxy_pass http://poison_fountain;
proxy_connect_timeout 3s;
proxy_read_timeout 5s;
@ -373,7 +379,7 @@ resource "kubernetes_deployment" "bot_block_proxy" {
}
container {
name = "nginx"
image = "nginx:1-alpine"
image = "openresty/openresty:alpine"
port {
container_port = 8080

View file

@ -515,7 +515,12 @@ resource "kubernetes_cron_job_v1" "wealthfolio_sync" {
}
container {
name = "sync"
image = "registry.viktorbarzin.me/wealthfolio-sync:latest"
# Phase 4 of forgejo-registry-consolidation 2026-05-07 +
# post-cutover wealthfolio-sync rebuild: image is now
# produced by /home/wizard/code/broker-sync (Forgejo
# viktor/broker-sync, DockerHub viktorbarzin/broker-sync,
# Forgejo viktor/wealthfolio-sync as the cluster pull path).
image = "forgejo.viktorbarzin.me/viktor/wealthfolio-sync:latest"
env {
name = "IMAP_HOST"
value_from {

View file

@ -172,6 +172,31 @@ resource "helm_release" "woodpecker" {
depends_on = [kubernetes_manifest.db_external_secret]
}
# Patch hostAliases onto the woodpecker-server StatefulSet the chart 3.5.1
# does NOT expose this field, so we have to do it after the helm release.
# Keeps the OAuth/forge-API path off the WAN gateway (forgejo.viktorbarzin.me
# resolves to the public IP via DNS, which round-trips through Cloudflare
# and routinely tripped 30s context-deadline timeouts when fetching pipeline
# config). 10.0.20.200 is the Traefik LB that fronts forgejo internally;
# Traefik serves the *.viktorbarzin.me wildcard so SNI verification still
# passes.
resource "null_resource" "woodpecker_server_host_alias" {
triggers = {
helm_revision = helm_release.woodpecker.metadata[0].revision
}
provisioner "local-exec" {
command = <<-BASH
set -euo pipefail
kubectl -n woodpecker patch statefulset/woodpecker-server --type=strategic --patch '{"spec":{"template":{"spec":{"hostAliases":[{"ip":"10.0.20.200","hostnames":["forgejo.viktorbarzin.me"]}]}}}}'
kubectl -n woodpecker rollout status statefulset/woodpecker-server --timeout=120s
BASH
interpreter = ["/bin/bash", "-c"]
}
depends_on = [helm_release.woodpecker]
}
# ClusterRoleBinding - build pods need cluster-admin to PATCH deployments across namespaces
resource "kubernetes_cluster_role_binding" "woodpecker" {
metadata {

View file

@ -4,10 +4,19 @@ server:
reloader.stakater.com/search: "true"
statefulSet:
replicaCount: 1
# NOTE: hostAliases is NOT exposed by the woodpecker Helm chart (3.5.1 verified) —
# see main.tf null_resource.woodpecker_server_host_alias which applies the same
# via `kubectl patch` post-helm. Pinned to the in-cluster Traefik LB
# (10.0.20.200) so the forge-API fetch path never round-trips through
# Cloudflare ("context deadline exceeded" was failing every Forgejo
# pipeline trigger).
image:
registry: docker.io
repository: woodpeckerci/woodpecker-server
tag: "v3.13.0"
# Bumped 2026-05-07 from v3.13.0 → v3.14.0 to fix the
# "could not load config from forge: context deadline exceeded"
# issue when fetching .woodpecker.yml from Forgejo.
tag: "v3.14.0"
extraSecretNamesForEnvFrom:
- woodpecker-db-creds
env:
@ -27,6 +36,14 @@ server:
WOODPECKER_FORGEJO_CLIENT: "${forgejo_client_id}"
WOODPECKER_FORGEJO_SECRET: "${forgejo_client_secret}"
WOODPECKER_FORGEJO_URL: "${forgejo_url}"
# Default is 3s (cmd/server/flags.go @ default `--forge-timeout`).
# Forgejo responses on this cluster spike to 1-2s under load and the
# config-loader makes 4-6 sequential calls (.woodpecker dir, .woodpecker.yaml,
# .woodpecker.yml, raw .woodpecker/build.yml, etc.); occasionally the cumulative
# overhead trips the 3s deadline → "could not load config from forge: context
# deadline exceeded" on every pipeline. 30s removes the false-positive timeouts
# without regressing the legitimate-failure detection window meaningfully.
WOODPECKER_FORGE_TIMEOUT: "30s"
service:
type: ClusterIP
port: 80
@ -46,7 +63,7 @@ agent:
image:
registry: docker.io
repository: woodpeckerci/woodpecker-agent
tag: "v3.13.0"
tag: "v3.14.0"
env:
WOODPECKER_BACKEND: "kubernetes"
WOODPECKER_BACKEND_K8S_NAMESPACE: "woodpecker"

File diff suppressed because one or more lines are too long