fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]

6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-09 08:45:33 +00:00
parent 6d224861c4
commit fd0f4a0365
1166 changed files with 358546 additions and 0 deletions

@ -0,0 +1,119 @@
# chrome-service
In-cluster headed Chromium exposed over the Chrome DevTools Protocol
(CDP) on TCP :9222. Sibling services drive it instead of running their
own in-process browser — useful when the upstream tries to detect
headless mode (e.g. hmembeds' `disable-devtool.js` redirect-to-google
trap). Also publishes an hourly snapshot of cookies + localStorage so
external dev-box Claude Code sessions can warm their isolated
playwright contexts from the same logged-in profile.
## Connect (in-cluster callers)
```python
import asyncio
from pathlib import Path

from playwright.async_api import async_playwright

CDP_URL = "http://chrome-service.chrome-service.svc.cluster.local:9222"
# Vendored stealth script (see "Add a new in-cluster caller"); this path is an assumption.
STEALTH_JS = Path("stealth.js").read_text()

async def main() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.connect_over_cdp(CDP_URL, timeout=15_000)
        # browser.contexts[0] is the persistent default context (the one
        # the user logs into via noVNC). For bot work that should NOT share
        # cookies, create a fresh incognito context:
        context = await browser.new_context()
        await context.add_init_script(STEALTH_JS)
        page = await context.new_page()
        ...
        await browser.close()

asyncio.run(main())
```
NetworkPolicy is the only gate on the CDP endpoint: it admits labelled
client namespaces plus an explicit fallback (`f1-stream`). No bearer
token is required for the connection itself.
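
Before handing the URL to Playwright, a caller may want a cheap reachability check. The sketch below is one way to do that with the standard library, relying on the `/json/version` document that Chrome's DevTools HTTP endpoint serves. `parse_version` and `probe` are illustrative names, not part of this repo.

```python
import json
from urllib.request import urlopen

CDP_URL = "http://chrome-service.chrome-service.svc.cluster.local:9222"


def parse_version(raw: bytes) -> dict:
    """Extract the browser string and WebSocket URL from a /json/version body."""
    info = json.loads(raw)
    return {
        "browser": info.get("Browser"),
        "ws_url": info.get("webSocketDebuggerUrl"),
    }


def probe(cdp_url: str = CDP_URL, timeout: float = 5.0) -> dict:
    """Fetch /json/version from the CDP endpoint (only works from an admitted namespace)."""
    with urlopen(f"{cdp_url}/json/version", timeout=timeout) as resp:
        return parse_version(resp.read())
```

Calling `probe()` from a pod in a non-admitted namespace should time out rather than return, which makes it a quick way to confirm the namespace label took effect.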
## Snapshot endpoint (external callers)
```bash
# Bearer token comes from Vault secret/chrome-service.api_bearer_token.
TOKEN=$(vault kv get -field=api_bearer_token secret/chrome-service)
curl -fsSL \
-H "Authorization: Bearer $TOKEN" \
https://chrome.viktorbarzin.me/api/snapshot \
> storage-state.json
# Use the snapshot with @playwright/mcp:
npx @playwright/mcp@latest --port 8931 --host localhost \
--headless --browser chrome \
--isolated --storage-state ./storage-state.json
```
The snapshot is refreshed hourly by the `chrome-service-snapshot-harvester`
CronJob (schedule `23 * * * *`), which calls `context.storage_state()` over
the CDP endpoint, writes to a temporary file, and atomically renames it to
`/profile/snapshots/storage-state.json`. The `snapshot-server` sidecar serves that file.
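
The write-then-rename step can be sketched as follows. This is a minimal illustration of the pattern, not the harvester itself; `write_atomically` and the `.tmp` naming are assumptions made for the example.

```python
import json
import os
from pathlib import Path


def write_atomically(target: Path, payload: dict) -> None:
    """Write JSON next to the target, then swap it in with a single rename."""
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(payload))
    # os.replace is atomic when source and destination are on the same filesystem,
    # so a concurrent reader sees either the old file or the new one, never a partial write.
    os.replace(tmp, target)
```

Keeping the temporary file in the same directory as the target matters here: a rename across filesystems would fall back to a copy and lose the atomicity.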
## Add a new in-cluster caller
1. **Label the caller's namespace** so the chrome-service NetworkPolicy
admits it:
```hcl
resource "kubernetes_namespace" "<ns>" {
  metadata {
    labels = {
      "chrome-service.viktorbarzin.me/client" = "true"
    }
  }
}
```
2. **Inject `CHROME_CDP_URL`** into the caller's pod env:
```hcl
env {
  name  = "CHROME_CDP_URL"
  value = "http://chrome-service.chrome-service.svc.cluster.local:9222"
}
```
3. **Vendor `stealth.js`** into the caller (or just paste — it's ~40
lines) and apply via `await context.add_init_script(STEALTH_JS)` after
every `new_context()`. Without it, hmembeds-class anti-bot still trips.
## Image pin
The server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
`main.tf`) and the client (`playwright==1.48.0` in callers' requirements)
must be on the same Playwright minor version. Bump them in lockstep,
because the Playwright protocol changes between minors.
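
A lockstep check is easy to automate. The sketch below compares the major.minor pair of an image tag and a requirements pin; `minor_of` and `same_minor` are hypothetical helpers, and the regex assumes both strings carry an `X.Y.Z` version.

```python
import re


def minor_of(version: str) -> tuple[int, int]:
    """Return (major, minor) from the first X.Y.Z version found in the string."""
    match = re.search(r"(\d+)\.(\d+)\.\d+", version)
    if match is None:
        raise ValueError(f"no X.Y.Z version found in {version!r}")
    return int(match.group(1)), int(match.group(2))


def same_minor(image_tag: str, client_pin: str) -> bool:
    """True when the server image and the client pin share major.minor."""
    return minor_of(image_tag) == minor_of(client_pin)
```

A check like this could run in CI against `main.tf` and each caller's requirements file, so a one-sided bump fails before it is applied.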
## Operations
- **Storage**: encrypted PVC at `/profile`. Chromium user-data-dir lives
at `/profile/chromium-data` — cookies + localStorage + IndexedDB
persist here. Snapshots at `/profile/snapshots/storage-state.json`.
Backed up tar+gzip every 6h to `/srv/nfs/chrome-service-backup/`,
30-day retention.
- **Probes**: TCP/9222. Chrome's CDP serves `/json/version` once it's
bound; TCP-open is enough for readiness.
- **Health page**: visit `https://chrome.viktorbarzin.me` (Authentik-
gated) to confirm the pod is up and to log into sites. The CDP port
stays internal-only.
- **Token rotation**: `vault kv put secret/chrome-service api_bearer_token=$(python3 -c 'import secrets; print(secrets.token_urlsafe(32))')`.
Reloader cascades to the snapshot-server sidecar. Update the cached
token on any dev box that pulls the snapshot:
`vault kv get -field=api_bearer_token secret/chrome-service > ~/.config/playwright/token`.
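
The TCP-open readiness check described under Probes can be reproduced from any admitted pod. This is a minimal sketch using only the Python standard library; `tcp_open` is a hypothetical helper name.

```python
import socket


def tcp_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False
```

For example, `tcp_open("chrome-service.chrome-service.svc.cluster.local", 9222)` mirrors what the kubelet's `tcp_socket` probe verifies, without speaking CDP.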
## Why headed (Xvfb) instead of headless?
`disable-devtool.js` and similar libraries detect `navigator.webdriver`,
console-clear timing, and the `HeadlessChrome/...` user-agent token.
Running headed inside `Xvfb :99` reports as a normal Chromium, and the
stealth init script handles the JS-visible giveaways.
## Why direct chromium (CDP) instead of `playwright launch-server`?
`playwright launch-server` creates ephemeral browser contexts per
`connect()` call — cookies and localStorage never persist to the PVC.
The `/profile` mount only ever held npm cache + fontconfig cache
despite the original docs claiming it held "cookies, localStorage,
IndexedDB". Switched 2026-06-04 to direct chromium launch with
`--user-data-dir=/profile/chromium-data --remote-debugging-port=9222`
so the persistent profile actually persists, and callers migrate
`chromium.connect(ws_url)` → `chromium.connect_over_cdp(cdp_url)`.

@ -0,0 +1,214 @@
#!/usr/bin/env python3
"""CDP-aware proxy: 0.0.0.0:9222 → 127.0.0.1:9223 with Host header rewriting.
Why this exists:
Stock Chrome binaries silently ignore --remote-debugging-address (the flag is
gated by a build-time switch most distributions don't set), so CDP always
binds 127.0.0.1:<port>. Worse, Chrome enforces DNS rebinding protection on
the HTTP DevTools endpoint: any Host header that isn't `localhost`,
`127.0.0.1`, or `[::1]` returns 500 "Host header is specified and is not an
IP address or localhost". There is no `--remote-allow-hosts` flag in stock
Chrome 130 (verified by binary string search).
This means a raw TCP forwarder doesn't work — clients hitting the K8s
Service DNS get 500 because Chrome rejects the Host header.
What this script does:
- Listens on 0.0.0.0:9222 (the public CDP port the K8s Service exposes).
- For each TCP connection from a CDP client:
1. Read the HTTP request line + headers.
2. Rewrite `Host: <whatever>` to `Host: localhost:9222`, remembering
the original value (for response rewriting).
3. Open a connection to Chrome at 127.0.0.1:9223 and forward the
modified request line + headers + body.
4. Read Chrome's HTTP response. If it's 101 Switching Protocols
(WebSocket upgrade), forward it as-is and switch to raw byte piping
in both directions (CDP frames are binary, no further parsing).
5. Otherwise it's a regular HTTP/JSON response. Substitute
`localhost:9222` (the URL Chrome composed from the rewritten Host)
back to the client's original Host header value. Forward.
- The Microsoft playwright image ships python3 but not socat, hence this
stdlib-only helper.
Limitations:
- Only HTTP/1.x supported (CDP doesn't use HTTP/2).
- Body is assumed to fit in one read for non-WS responses (CDP JSON
responses are kilobytes, well within limits).
- No SSL/TLS: the cluster network is the trust boundary.
"""
import os
import socket
import sys
import threading
LISTEN_ADDR = os.environ.get("BRIDGE_LISTEN_ADDR", "0.0.0.0")
LISTEN_PORT = int(os.environ.get("BRIDGE_LISTEN_PORT", "9222"))
TARGET_ADDR = os.environ.get("BRIDGE_TARGET_ADDR", "127.0.0.1")
TARGET_PORT = int(os.environ.get("BRIDGE_TARGET_PORT", "9223"))
INTERNAL_HOST = f"localhost:{LISTEN_PORT}"
def recv_until(sock: socket.socket, marker: bytes, max_bytes: int = 65536) -> bytes:
    """Read from sock until marker is seen or max_bytes hit. Returns everything read."""
    buf = b""
    while marker not in buf and len(buf) < max_bytes:
        chunk = sock.recv(4096)
        if not chunk:
            break
        buf += chunk
    return buf


def rewrite_host(headers: bytes, new_host: str) -> tuple[bytes, str | None]:
    """Replace the Host header. Returns (new_headers, original_host)."""
    lines = headers.split(b"\r\n")
    original = None
    out = []
    for line in lines:
        if line.lower().startswith(b"host:"):
            original = line.split(b":", 1)[1].strip().decode("latin-1")
            out.append(f"Host: {new_host}".encode("latin-1"))
        else:
            out.append(line)
    return b"\r\n".join(out), original


def pipe(src: socket.socket, dst: socket.socket) -> None:
    """Raw byte pipe used after WS upgrade."""
    try:
        while True:
            data = src.recv(65536)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            src.shutdown(socket.SHUT_RD)
        except OSError:
            pass
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass


def handle(client: socket.socket) -> None:
    upstream: socket.socket | None = None
    try:
        # Read until end-of-headers.
        head_buf = recv_until(client, b"\r\n\r\n")
        if b"\r\n\r\n" not in head_buf:
            return
        head, tail = head_buf.split(b"\r\n\r\n", 1)
        new_head, original_host = rewrite_host(head, INTERNAL_HOST)
        upstream = socket.create_connection((TARGET_ADDR, TARGET_PORT), timeout=5)
        # `create_connection(timeout=5)` sets the socket's timeout to 5s,
        # which then applies to all subsequent recv() calls too. After a WS
        # upgrade either side can stay silent for minutes, so leave timeouts
        # off so the pipe doesn't blow up the connection on idle.
        upstream.settimeout(None)
        upstream.sendall(new_head + b"\r\n\r\n" + tail)
        # Read response headers from upstream.
        resp_head_buf = recv_until(upstream, b"\r\n\r\n")
        if b"\r\n\r\n" not in resp_head_buf:
            return
        resp_head, resp_tail = resp_head_buf.split(b"\r\n\r\n", 1)
        first_line = resp_head.split(b"\r\n", 1)[0].decode("latin-1", errors="replace")
        # Match any 101 status (Chrome's CDP says "101 WebSocket Protocol
        # Handshake", not the canonical "101 Switching Protocols"). Sniff the
        # status code from the first line, e.g. "HTTP/1.1 101 ...".
        parts = first_line.split(" ", 2)
        status_code = parts[1] if len(parts) >= 2 else ""
        if status_code == "101":
            # WS upgrade. Forward as-is and start raw pipe.
            client.sendall(resp_head + b"\r\n\r\n" + resp_tail)
            t1 = threading.Thread(target=pipe, args=(client, upstream), daemon=True)
            t2 = threading.Thread(target=pipe, args=(upstream, client), daemon=True)
            t1.start()
            t2.start()
            t1.join()
            t2.join()
            return
        # Regular HTTP response. Determine body length (Content-Length only;
        # CDP doesn't use chunked encoding for /json/* endpoints) and rewrite.
        content_length = 0
        for line in resp_head.split(b"\r\n"):
            if line.lower().startswith(b"content-length:"):
                try:
                    content_length = int(line.split(b":", 1)[1].strip())
                except ValueError:
                    pass
                break
        body = resp_tail
        while len(body) < content_length:
            chunk = upstream.recv(65536)
            if not chunk:
                break
            body += chunk
        # Truncate any extra bytes that came past content_length (shouldn't
        # happen with stock chrome but defensive against pipelined responses).
        if content_length and len(body) > content_length:
            body = body[:content_length]
        # Rewrite the URLs Chrome composed using its localhost Host so callers
        # can follow them back through this bridge.
        if original_host:
            body = body.replace(INTERNAL_HOST.encode(), original_host.encode())
        # Rebuild response headers: drop any existing Content-Length / Connection
        # header and force `Connection: close` + the new Content-Length. This
        # keeps the bridge one-request-per-connection (no keep-alive); avoids a
        # whole class of upstream/downstream desync issues, especially because
        # Node's ws library will open a fresh TCP for the WS upgrade rather
        # than trying to reuse the HTTP probe's connection.
        new_lines = []
        for line in resp_head.split(b"\r\n"):
            l = line.lower()
            if l.startswith(b"content-length:") or l.startswith(b"connection:"):
                continue
            new_lines.append(line)
        new_lines.append(f"Content-Length: {len(body)}".encode())
        new_lines.append(b"Connection: close")
        resp_head = b"\r\n".join(new_lines)
        client.sendall(resp_head + b"\r\n\r\n" + body)
    except Exception as e:
        sys.stderr.write(f"[cdp-bridge] handle error: {e}\n")
    finally:
        try:
            client.close()
        except OSError:
            pass
        if upstream is not None:
            try:
                upstream.close()
            except OSError:
                pass


def main() -> int:
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((LISTEN_ADDR, LISTEN_PORT))
    listener.listen(64)
    sys.stderr.write(
        f"[cdp-bridge] HTTP-aware proxy listening on {LISTEN_ADDR}:{LISTEN_PORT}"
        f" → {TARGET_ADDR}:{TARGET_PORT} (rewriting Host → {INTERNAL_HOST})\n"
    )
    while True:
        client, _ = listener.accept()
        threading.Thread(target=handle, args=(client,), daemon=True).start()


if __name__ == "__main__":
    sys.exit(main() or 0)

@ -0,0 +1,19 @@
FROM docker.io/library/ubuntu:24.04
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
x11vnc \
novnc \
websockify \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
# noVNC ships /usr/share/novnc/vnc.html; alias to index.html so / works.
RUN ln -sf /usr/share/novnc/vnc.html /usr/share/novnc/index.html
EXPOSE 6080
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
CMD ["/entrypoint.sh"]

@ -0,0 +1,39 @@
#!/usr/bin/env bash
# Connect to the chrome-service container's Xvfb (shared pod network, TCP)
# and serve the noVNC HTML5 client + websockify bridge on :6080.
set -e
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do
if echo > /dev/tcp/127.0.0.1/6099 2>/dev/null; then
echo "Xvfb TCP up after attempt $i"
break
fi
echo "waiting for Xvfb TCP 6099 attempt=$i"
sleep 2
done
# websockify runs as PID 1; x11vnc is a child so its logs land on container stdout
# `-noshm` skips MIT-SHM probes that fail across container boundaries (each
# container has its own /dev/shm); `-noxdamage` skips XDAMAGE which Xvfb
# doesn't expose; `-quiet` keeps the polling chatter out of pod logs.
echo "starting x11vnc -> :5900"
x11vnc -display localhost:99 -nopw -listen 0.0.0.0 -rfbport 5900 \
-forever -shared -noshm -noxdamage -quiet 2>&1 &
X11VNC_PID=$!
for i in 1 2 3 4 5 6 7 8 9 10; do
if echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
echo "x11vnc bound 5900 after attempt $i"
break
fi
echo "waiting for x11vnc :5900 attempt=$i"
sleep 2
done
if ! echo > /dev/tcp/127.0.0.1/5900 2>/dev/null; then
echo "ERROR: x11vnc did not bind 5900"
exit 1
fi
echo "starting websockify -> :6080"
exec websockify --web=/usr/share/novnc 6080 localhost:5900

@ -0,0 +1,69 @@
#!/usr/bin/env python3
"""Connect to chrome-service via CDP, dump storage state, write atomically.
Runs hourly as a Kubernetes CronJob. Mounts the chrome-service encrypted
PVC at /profile (same node via pod-affinity) and writes the snapshot to
/profile/snapshots/storage-state.json. The snapshot-server sidecar reads
from the same path and serves it bearer-gated.
CDP endpoint is plain HTTP; protection is the chrome-service
NetworkPolicy (allow only labelled client namespaces). Same security model
as the previous WS endpoint, just unauthenticated within the trust zone.
"""
import asyncio
import logging
import os
import pathlib
import sys
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("snapshot-harvester")
CDP_URL = os.environ.get(
"CDP_URL", "http://chrome-service.chrome-service.svc.cluster.local:9222"
)
SNAPSHOT_DIR = pathlib.Path(os.environ.get("SNAPSHOT_DIR", "/profile/snapshots"))
SNAPSHOT_FILE = SNAPSHOT_DIR / "storage-state.json"
TMP_FILE = SNAPSHOT_DIR / "storage-state.json.tmp"
async def main() -> int:
    try:
        from playwright.async_api import async_playwright
    except ImportError:
        log.error("playwright not installed in image")
        return 2
    SNAPSHOT_DIR.mkdir(parents=True, exist_ok=True)
    async with async_playwright() as p:
        try:
            browser = await p.chromium.connect_over_cdp(CDP_URL, timeout=20_000)
        except Exception:
            log.exception("connect_over_cdp failed (%s)", CDP_URL)
            return 3
        try:
            contexts = browser.contexts
            if not contexts:
                log.error("no browser contexts found: chrome-service may not have launched a persistent context yet")
                return 4
            ctx = contexts[0]
            # storage_state writes cookies + localStorage to a JSON file.
            # IndexedDB and sessionStorage are NOT included (known Playwright limitation).
            await ctx.storage_state(path=str(TMP_FILE))
            os.replace(TMP_FILE, SNAPSHOT_FILE)
            size = SNAPSHOT_FILE.stat().st_size
            log.info("wrote snapshot (%d bytes) to %s", size, SNAPSHOT_FILE)
        finally:
            try:
                await browser.close()
            except Exception:
                pass
    return 0


if __name__ == "__main__":
    sys.exit(asyncio.run(main()))

@ -0,0 +1,68 @@
#!/usr/bin/env python3
"""Tiny HTTP server that exposes /api/snapshot, gated by a bearer token.
Runs as a sidecar in the chrome-service pod. Reads the persisted storage
state written hourly by the snapshot-harvester CronJob and returns it to
authenticated callers (the dev-box `playwright-snapshot-refresh` timer).
Token is read from the PW_TOKEN env var, same secret the legacy WS path
used. The endpoint is mounted behind Traefik on `chrome.viktorbarzin.me`
at the `/api/snapshot` path (auth=none at the ingress; the bearer check
is here).
"""
import os
import sys
from http.server import HTTPServer, BaseHTTPRequestHandler
TOKEN = os.environ.get("PW_TOKEN")
SNAPSHOT_PATH = os.environ.get(
"SNAPSHOT_PATH", "/profile/snapshots/storage-state.json"
)
PORT = int(os.environ.get("PORT", "8088"))
class Handler(BaseHTTPRequestHandler):
    server_version = "chrome-snapshot/1"

    def _short(self, status: int, body: bytes = b"") -> None:
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if body:
            self.wfile.write(body)

    def do_GET(self):
        if self.path == "/healthz":
            self._short(200, b"ok\n")
            return
        if self.path != "/api/snapshot":
            self._short(404)
            return
        if TOKEN is None:
            self._short(503, b"{\"error\":\"token not configured\"}\n")
            return
        if self.headers.get("Authorization", "") != f"Bearer {TOKEN}":
            self._short(401, b"{\"error\":\"invalid bearer\"}\n")
            return
        try:
            with open(SNAPSHOT_PATH, "rb") as f:
                data = f.read()
        except FileNotFoundError:
            self._short(404, b"{\"error\":\"snapshot not yet available\"}\n")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Cache-Control", "no-cache")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, fmt, *args):
        sys.stderr.write(
            "[snapshot-server] %s - %s\n" % (self.address_string(), fmt % args)
        )


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", PORT), Handler).serve_forever()

@ -0,0 +1,54 @@
// Minimal stealth init script for Playwright-driven Chromium.
// Vendored from puppeteer-extra-plugin-stealth/evasions/* (MIT) — covers:
// webdriver, chrome.runtime, navigator.plugins, navigator.languages,
// Permissions.query, WebGL getParameter (vendor + renderer spoof).
// Run via context.add_init_script() so it executes before any page script.
(() => {
// navigator.webdriver — most common detection, removed entirely.
Object.defineProperty(Navigator.prototype, 'webdriver', { get: () => undefined });
// window.chrome.runtime — many sites check that real Chrome exposes this.
if (!window.chrome) window.chrome = {};
window.chrome.runtime = window.chrome.runtime || {};
// navigator.plugins — headless reports zero; spoof a plausible PDF viewer.
Object.defineProperty(navigator, 'plugins', {
get: () => [{ name: 'Chrome PDF Plugin' }, { name: 'Chrome PDF Viewer' }, { name: 'Native Client' }],
});
// navigator.languages — headless returns empty array.
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
// Permissions.query — headless returns 'denied' for notifications instead of 'default'.
const origQuery = window.navigator.permissions && window.navigator.permissions.query;
if (origQuery) {
window.navigator.permissions.query = (parameters) =>
parameters && parameters.name === 'notifications'
? Promise.resolve({ state: Notification.permission })
: origQuery(parameters);
}
// WebGL getParameter — spoof vendor + renderer strings to a real GPU.
const spoofGl = (proto) => {
if (!proto) return;
const orig = proto.getParameter;
proto.getParameter = function (parameter) {
if (parameter === 37445) return 'Intel Inc.'; // UNMASKED_VENDOR_WEBGL
if (parameter === 37446) return 'Intel Iris OpenGL Engine'; // UNMASKED_RENDERER_WEBGL
return orig.apply(this, arguments);
};
};
spoofGl(window.WebGLRenderingContext && window.WebGLRenderingContext.prototype);
spoofGl(window.WebGL2RenderingContext && window.WebGL2RenderingContext.prototype);
// disable-devtool.js (theajack/disable-devtool) auto-inits via a script
// tag with `disable-devtool-auto`. Its Performance detector trips under
// Playwright (CDP adds console.log latency vs console.table) and the
// redirect URL is hard-coded — for hmembeds that's google.com.
// Hide the auto-init marker so the library's IIFE exits early.
const origQS = Document.prototype.querySelector;
Document.prototype.querySelector = function (sel) {
if (typeof sel === 'string' && sel.indexOf('disable-devtool-auto') !== -1) return null;
return origQS.apply(this, arguments);
};
})();

@ -0,0 +1,833 @@
variable "tls_secret_name" {
type = string
sensitive = true
}
variable "nfs_server" { type = string }
locals {
namespace = "chrome-service"
labels = {
app = "chrome-service"
}
# Pin to the same Playwright minor that the Python client requires.
# If you bump this image, also bump `playwright==X.Y.Z` in callers'
# requirements (currently f1-stream, snapshot-harvester) and re-run the
# connect smoke test. Image ships chromium under /ms-playwright/.
image = "mcr.microsoft.com/playwright:v1.48.0-noble"
# Python image for the snapshot-harvester CronJob and the snapshot-server
# sidecar (the latter just runs a 60-line stdlib HTTP server).
python_image = "mcr.microsoft.com/playwright/python:v1.48.0-noble"
snapshot_dir = "/profile/snapshots"
}
# --- Namespace ---
resource "kubernetes_namespace" "chrome_service" {
metadata {
name = local.namespace
labels = {
"istio-injection" = "disabled"
tier = local.tiers.aux
"chrome-service.viktorbarzin.me/server" = "true"
"keel.sh/enrolled" = "true"
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# --- Secrets (single-key extract: api_bearer_token) ---
resource "kubernetes_manifest" "external_secret" {
manifest = {
apiVersion = "external-secrets.io/v1beta1"
kind = "ExternalSecret"
metadata = {
name = "chrome-service-secrets"
namespace = local.namespace
}
spec = {
refreshInterval = "15m"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = {
name = "chrome-service-secrets"
}
dataFrom = [{
extract = {
key = "chrome-service"
}
}]
}
}
depends_on = [kubernetes_namespace.chrome_service]
}
# tls-secret for the chrome.viktorbarzin.me ingress is auto-cloned into
# every namespace by Kyverno's `sync-tls-secret` ClusterPolicy, so no local
# module call is needed.
# --- Encrypted profile PVC ---
# Holds Chromium user data: cookies, localStorage, IndexedDB. Sites we
# drive may set auth tokens or session cookies, so encrypted storage is correct.
resource "kubernetes_persistent_volume_claim" "profile_encrypted" {
wait_until_bound = false
metadata {
name = "chrome-service-profile-encrypted"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "10Gi"
}
}
spec {
access_modes = ["ReadWriteOnce"]
storage_class_name = "proxmox-lvm-encrypted"
resources {
requests = {
storage = "2Gi"
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
# --- NFS backup target ---
module "nfs_chrome_service_backup_host" {
source = "../../modules/kubernetes/nfs_volume"
name = "chrome-service-backup-host"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
nfs_server = "192.168.1.127"
nfs_path = "/srv/nfs/chrome-service-backup"
}
# --- Deployment ---
resource "kubernetes_deployment" "chrome_service" {
metadata {
name = "chrome-service"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
labels = merge(local.labels, {
tier = local.tiers.aux
# Deliberate pin: chrome-service's playwright image MUST match
# the playwright Python version in f1-stream (see local.image
# comment above). Opt out of Keel auto-update via this label:
# the inject-keel-annotations ClusterPolicy excludes workloads
# selector-matching keel.sh/policy=never.
"keel.sh/policy" = "never"
})
annotations = {
"reloader.stakater.com/auto" = "true"
}
}
spec {
replicas = 1
strategy {
type = "Recreate"
}
selector {
match_labels = local.labels
}
template {
metadata {
labels = local.labels
}
spec {
# The noVNC sidecar pulls from registry.viktorbarzin.me which needs
# auth. Kyverno's `sync-registry-credentials` ClusterPolicy syncs
# the secret into every namespace.
image_pull_secrets {
name = "registry-credentials"
}
security_context {
run_as_user = 1000
run_as_group = 1000
fs_group = 1000
seccomp_profile {
type = "RuntimeDefault"
}
}
# Fix profile dir ownership (PVC may have root-owned files from prior run).
init_container {
name = "fix-perms"
image = "busybox:1.37"
command = ["sh", "-c", "chown -R 1000:1000 /profile"]
security_context {
run_as_user = 0
}
volume_mount {
name = "profile"
mount_path = "/profile"
}
resources {
requests = { memory = "32Mi" }
limits = { memory = "64Mi" }
}
}
container {
name = "chrome-service"
image = local.image
image_pull_policy = "IfNotPresent"
# Direct chromium launch (NOT `playwright launch-server`). Reason:
# launch-server creates ephemeral browser contexts per `connect()`
# call, so cookies/localStorage never persist to the PVC; the
# `/profile` mount only ever held npm cache + fontconfig.
# Replaced 2026-06-04 with a CDP+persistent-profile model so the
# warm browser (where Viktor logs in via noVNC) keeps cookies, and
# the hourly snapshot-harvester CronJob can dump them via the
# CDP endpoint. Callers migrate `chromium.connect()` →
# `chromium.connect_over_cdp()` (see f1-stream's playback_verifier).
#
# --remote-debugging-port=9223 : TCP CDP (vs default pipe) on
# loopback; cdp_bridge.py re-exposes
# it on 0.0.0.0:9222.
# --remote-debugging-address : NOT passed; stock Chrome ignores
# it (see "Why a bridge?" below).
# NetworkPolicy is the gate.
# --remote-allow-origins=* : Chrome 111+ requires for
# non-loopback CDP origins.
# --user-data-dir=/profile/chromium-data: persistent profile on
# the encrypted PVC.
command = ["bash", "-c"]
args = [
<<-EOT
set -e
# Locate chromium in the Microsoft image. The path is
# /ms-playwright/chromium-XXXX/chrome-linux/chrome where XXXX
# is the playwright-pinned build; resolve at runtime so a minor
# bump of the image doesn't break the launch line.
CHROMIUM=$(find /ms-playwright -maxdepth 4 -name 'chrome' -type f -executable -path '*/chrome-linux/*' 2>/dev/null | head -1)
if [ -z "$CHROMIUM" ]; then
echo "ERROR: chromium binary not found under /ms-playwright" >&2
exit 1
fi
echo "[chrome-service] using chromium: $CHROMIUM"
# -listen tcp enables localhost:6099 so the noVNC sidecar can
# attach over the pod's shared network ns (Ubuntu 24.04
# defaults Xvfb to -nolisten tcp). -ac disables X access
# control; safe because Xvfb only listens on the pod's lo.
Xvfb :99 -screen 0 1280x720x24 -listen tcp -ac &
sleep 1
mkdir -p /profile/chromium-data ${local.snapshot_dir}
# Why a bridge?
# Stock Chrome binaries silently ignore --remote-debugging-address
# (the flag is gated by a build-time switch most distributions don't
# set), so CDP always binds 127.0.0.1:<port> regardless of what we
# pass. The K8s liveness/readiness probe + cluster callers reach
# the pod via its pod-IP, never localhost.
# Fix: chromium listens on 127.0.0.1:9223 (hidden internal port),
# cdp_bridge.py listens on 0.0.0.0:9222 (the public CDP port) and
# transparently forwards. K8s Service, probes, NetworkPolicy all
# stay on 9222, so no caller-side changes are needed.
# (Microsoft playwright image ships python3 but not socat, so the
# bridge is a tiny stdlib script; see files/cdp_bridge.py.)
python3 /scripts/cdp_bridge.py &
BRIDGE_PID=$!
trap "kill $BRIDGE_PID 2>/dev/null" EXIT
exec "$CHROMIUM" \
--remote-debugging-port=9223 \
--remote-allow-origins=* \
--user-data-dir=/profile/chromium-data \
--no-sandbox \
--no-first-run \
--no-default-browser-check \
--disable-blink-features=AutomationControlled \
--disable-features=IsolateOrigins,site-per-process \
--autoplay-policy=no-user-gesture-required \
--disable-dev-shm-usage \
--password-store=basic \
--use-mock-keychain \
about:blank
EOT
]
env {
name = "DISPLAY"
value = ":99"
}
env {
name = "HOME"
value = "/profile"
}
port {
name = "cdp"
container_port = 9222
protocol = "TCP"
}
# Chrome's CDP endpoint serves /json/version once it's bound;
# TCP-open is enough for readiness.
liveness_probe {
tcp_socket { port = 9222 }
initial_delay_seconds = 30
period_seconds = 30
failure_threshold = 3
}
readiness_probe {
tcp_socket { port = 9222 }
initial_delay_seconds = 10
period_seconds = 10
}
startup_probe {
tcp_socket { port = 9222 }
period_seconds = 5
failure_threshold = 24 # up to 2 minutes
}
volume_mount {
name = "profile"
mount_path = "/profile"
}
volume_mount {
name = "dshm"
mount_path = "/dev/shm"
}
# /scripts/cdp_bridge.py provides the 0.0.0.0:9222 → 127.0.0.1:9223
# TCP forwarder (see entrypoint comment above for why).
volume_mount {
name = "scripts"
mount_path = "/scripts"
read_only = true
}
resources {
requests = {
cpu = "200m"
memory = "1500Mi"
}
limits = {
memory = "2Gi"
}
}
}
# noVNC sidecar exposes a live HTML5 view of the headed Chromium
# session via x11vnc + websockify, gated by the Authentik-protected
# ingress at chrome.viktorbarzin.me. CDP port 9222 (the new
# Playwright endpoint) stays internal-only.
container {
name = "novnc"
# Phase 3 cutover 2026-05-07: Forgejo registry consolidation.
image = "forgejo.viktorbarzin.me/viktor/chrome-service-novnc:v4"
image_pull_policy = "IfNotPresent"
port {
name = "http"
container_port = 6080
protocol = "TCP"
}
# x11vnc connects to the chrome-service container's Xvfb over
# localhost TCP (shared pod network). Same uid 1000 as chrome
# container so we can read MIT-MAGIC-COOKIE if Xvfb adds one.
resources {
requests = { cpu = "10m", memory = "32Mi" }
limits = { memory = "96Mi" }
}
}
# snapshot-server sidecar serves the hourly storage-state.json
# snapshot (written by the snapshot-harvester CronJob to the same
# PVC) over an HTTP endpoint, bearer-gated by PW_TOKEN. Mounted
# behind Traefik at chrome.viktorbarzin.me/api/snapshot with
# auth=none; the bearer check inside this server is the gate.
# Source: files/snapshot_server.py (60 lines, stdlib only).
container {
name = "snapshot-server"
image = local.python_image
image_pull_policy = "IfNotPresent"
command = ["python3", "/scripts/snapshot_server.py"]
env {
name = "PW_TOKEN"
value_from {
secret_key_ref {
name = "chrome-service-secrets"
key = "api_bearer_token"
}
}
}
env {
name = "SNAPSHOT_PATH"
value = "${local.snapshot_dir}/storage-state.json"
}
env {
name = "PORT"
value = "8088"
}
port {
name = "snap"
container_port = 8088
protocol = "TCP"
}
liveness_probe {
http_get {
path = "/healthz"
port = 8088
}
initial_delay_seconds = 5
period_seconds = 30
}
readiness_probe {
http_get {
path = "/healthz"
port = 8088
}
initial_delay_seconds = 2
period_seconds = 10
}
volume_mount {
name = "profile"
mount_path = "/profile"
read_only = true
}
volume_mount {
name = "scripts"
mount_path = "/scripts"
read_only = true
}
resources {
requests = { cpu = "5m", memory = "32Mi" }
limits = { memory = "96Mi" }
}
}
volume {
name = "profile"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.profile_encrypted.metadata[0].name
}
}
volume {
name = "dshm"
empty_dir {
medium = "Memory"
size_limit = "256Mi"
}
}
volume {
name = "scripts"
config_map {
name = kubernetes_config_map_v1.snapshot_scripts.metadata[0].name
default_mode = "0555"
}
}
}
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/match-tag"],
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
spec[0].template[0].spec[0].container[1].image,
spec[0].template[0].spec[0].init_container[0].image,
metadata[0].annotations["kubernetes.io/change-cause"],
metadata[0].annotations["deployment.kubernetes.io/revision"],
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1
]
}
}
# --- ConfigMap: sidecar + harvester scripts ---
resource "kubernetes_config_map_v1" "snapshot_scripts" {
metadata {
name = "snapshot-scripts"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
labels = local.labels
}
data = {
"snapshot_server.py" = file("${path.module}/files/snapshot_server.py")
"snapshot_harvester.py" = file("${path.module}/files/snapshot_harvester.py")
# Tiny TCP forwarder used by the chrome-service container to bridge
# 0.0.0.0:9222 -> 127.0.0.1:9223 (Chromium silently ignores
# --remote-debugging-address on stock builds; see cdp_bridge.py).
"cdp_bridge.py" = file("${path.module}/files/cdp_bridge.py")
}
}
# --- Services ---
# CDP endpoint (internal only, gated by NetworkPolicy). 2026-06-04: switched
# from Playwright WS (:3000) to direct chromium CDP (:9222) so the persistent
# user-data-dir actually persists cookies; callers use `connect_over_cdp()`.
resource "kubernetes_service" "chrome_service" {
metadata {
name = "chrome-service"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
labels = local.labels
}
spec {
selector = local.labels
port {
name = "cdp"
port = 9222
target_port = 9222
protocol = "TCP"
}
}
}
# noVNC view (Authentik-gated, exposed via ingress).
resource "kubernetes_service" "chrome_novnc" {
metadata {
name = "chrome"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
labels = local.labels
}
spec {
selector = local.labels
port {
name = "http"
port = 80
target_port = 6080
protocol = "TCP"
}
}
}
# Snapshot-server endpoint (bearer-gated, exposed via ingress sub-path
# chrome.viktorbarzin.me/api/snapshot; auth=none at the ingress layer
# because the bearer check happens inside snapshot_server.py).
resource "kubernetes_service" "chrome_snapshot" {
metadata {
name = "chrome-snapshot"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
labels = local.labels
}
spec {
selector = local.labels
port {
name = "snap"
port = 8088
target_port = 8088
protocol = "TCP"
}
}
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
name = "chrome"
tls_secret_name = var.tls_secret_name
auth = "required"
# noVNC defaults to /vnc.html; / auto-redirects there.
ingress_path = ["/"]
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Chrome Service"
"gethomepage.dev/description" = "Live noVNC view of headed Chromium"
"gethomepage.dev/icon" = "chromium.png"
"gethomepage.dev/group" = "Infrastructure"
}
}
# Second ingress on the same host (chrome.viktorbarzin.me) carving out
# /api/snapshot to the snapshot-server sidecar. Path-level carve-out
# pattern; see CLAUDE.md: "For path-level carve-outs (e.g. wrongmove has
# `/` behind Anubis but `/api` direct), declare a second ingress_factory
# with `ingress_path = ["/<path>"]` pointing at the bare backend service."
module "ingress_snapshot" {
source = "../../modules/kubernetes/ingress_factory"
# auth = "none": bearer-token gated inside snapshot_server.py; Authentik
# forward-auth would require an OIDC cookie that the dev-box refresh
# timer can't replay.
auth = "none"
dns_type = "none" # DNS already created by module.ingress
namespace = kubernetes_namespace.chrome_service.metadata[0].name
name = "chrome-snapshot"
host = "chrome"
service_name = kubernetes_service.chrome_snapshot.metadata[0].name
port = 8088
ingress_path = ["/api/snapshot"]
tls_secret_name = var.tls_secret_name
extra_annotations = {
"gethomepage.dev/enabled" = "false"
}
}
# --- NetworkPolicy: scoped ingress.
# - TCP/9222 (Chromium CDP): only from labelled client namespaces.
# - TCP/6080 (noVNC HTTP+WS): only from the traefik namespace (public path
# is chrome.viktorbarzin.me -> Traefik -> sidecar; Authentik forward-auth
# gates external access at the Traefik layer).
# - TCP/8088 (snapshot-server): only from the traefik namespace
# (chrome.viktorbarzin.me/api/snapshot -> Traefik -> sidecar; bearer token
# is the gate inside snapshot_server.py).
# The cluster has no default-deny, so this NP only takes effect inside
# the chrome-service ns; pods elsewhere remain unaffected.
resource "kubernetes_network_policy_v1" "ws_ingress" {
metadata {
name = "chrome-service-ws-ingress"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
}
spec {
pod_selector {
match_labels = local.labels
}
policy_types = ["Ingress"]
ingress {
from {
namespace_selector {
match_labels = {
"chrome-service.viktorbarzin.me/client" = "true"
}
}
}
# Explicit fallback list: admit f1-stream by name in case the label
# is removed by accident. Keep this in sync with the labels above.
from {
namespace_selector {
match_labels = {
"kubernetes.io/metadata.name" = "f1-stream"
}
}
}
# Also admit chrome-service's own namespace (the snapshot-harvester
# CronJob runs here and needs to reach the CDP endpoint).
from {
namespace_selector {
match_labels = {
"kubernetes.io/metadata.name" = "chrome-service"
}
}
}
ports {
port = "9222"
protocol = "TCP"
}
}
ingress {
from {
namespace_selector {
match_labels = {
"kubernetes.io/metadata.name" = "traefik"
}
}
}
ports {
port = "6080"
protocol = "TCP"
}
ports {
port = "8088"
protocol = "TCP"
}
}
}
}
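# Sketch only (an assumption, not part of this stack): a client stack could
# opt in to the TCP/9222 rule above by labelling its own namespace so the
# first namespace_selector matches it. The resource name and namespace name
# below are hypothetical; only the label key and value come from the policy.
#
# resource "kubernetes_namespace" "example_client" {
#   metadata {
#     name = "example-client"
#     labels = {
#       "chrome-service.viktorbarzin.me/client" = "true"
#     }
#   }
# }
#
# Labelling the namespace keeps the allow-list declarative: no change to this
# NetworkPolicy is needed when a new client is added.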
# --- Backup CronJob: tar+gzip the profile every 6h, 30-day retention. ---
resource "kubernetes_cron_job_v1" "chrome_service_backup" {
metadata {
name = "chrome-service-backup"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
}
spec {
concurrency_policy = "Replace"
failed_jobs_history_limit = 3
successful_jobs_history_limit = 1
schedule = "47 */6 * * *"
starting_deadline_seconds = 60
job_template {
metadata {}
spec {
backoff_limit = 2
ttl_seconds_after_finished = 300
template {
metadata {}
spec {
# PVC is RWO, so colocate the backup pod with the chrome-service
# pod so both can mount the volume on the same node.
affinity {
pod_affinity {
required_during_scheduling_ignored_during_execution {
label_selector {
match_labels = local.labels
}
topology_key = "kubernetes.io/hostname"
}
}
}
container {
name = "backup"
image = "docker.io/library/alpine:3.20"
command = ["/bin/sh", "-c", <<-EOT
set -euxo pipefail
ts=$(date +"%Y_%m_%d_%H")
tar -czf /backup/$${ts}.tar.gz -C /profile .
find /backup -maxdepth 1 -type f -name '*.tar.gz' -mtime +30 -delete
echo "Backup complete: $${ts}.tar.gz"
EOT
]
volume_mount {
name = "profile"
mount_path = "/profile"
read_only = true
}
volume_mount {
name = "backup"
mount_path = "/backup"
}
resources {
requests = { cpu = "10m", memory = "32Mi" }
limits = { memory = "64Mi" }
}
}
volume {
name = "profile"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.profile_encrypted.metadata[0].name
}
}
volume {
name = "backup"
persistent_volume_claim {
claim_name = module.nfs_chrome_service_backup_host.claim_name
}
}
restart_policy = "OnFailure"
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# --- Snapshot harvester CronJob: hourly storage_state() dump via CDP ---
# Connects to the live chrome-service CDP endpoint, accesses the
# persistent default browser context (where Viktor's noVNC logins live),
# and writes cookies + localStorage to /profile/snapshots/storage-state.json
# (atomic rename). The snapshot-server sidecar reads from the same file.
resource "kubernetes_cron_job_v1" "chrome_service_snapshot_harvester" {
metadata {
name = "chrome-service-snapshot-harvester"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
}
spec {
concurrency_policy = "Replace"
failed_jobs_history_limit = 3
successful_jobs_history_limit = 1
# Hourly, offset from the backup CronJob (which runs at :47 every 6h)
# so they don't fight for the encrypted PVC at the same minute.
schedule = "23 * * * *"
starting_deadline_seconds = 60
job_template {
metadata {}
spec {
backoff_limit = 2
ttl_seconds_after_finished = 300
template {
metadata {}
spec {
# PVC is RWO, so colocate with the chrome-service pod.
affinity {
pod_affinity {
required_during_scheduling_ignored_during_execution {
label_selector {
match_labels = local.labels
}
topology_key = "kubernetes.io/hostname"
}
}
}
container {
name = "harvester"
image = local.python_image
image_pull_policy = "IfNotPresent"
# The Microsoft playwright/python image ships only browsers +
# Python; the `playwright` pip package itself is NOT installed
# (it's meant for CI that brings its own requirements). We
# install at startup, caching to the PVC so subsequent runs
# are near-instant.
command = ["bash", "-c"]
args = [
<<-EOT
set -e
export PIP_CACHE_DIR=/profile/.cache/pip
export PIP_DISABLE_PIP_VERSION_CHECK=1
python3 -c 'import playwright' 2>/dev/null \
|| pip install --quiet --no-warn-script-location playwright==1.48.0
exec python3 /scripts/snapshot_harvester.py
EOT
]
env {
name = "CDP_URL"
value = "http://chrome-service.chrome-service.svc.cluster.local:9222"
}
env {
name = "SNAPSHOT_DIR"
value = local.snapshot_dir
}
# Don't try to download browsers; connect_over_cdp doesn't
# need them locally.
env {
name = "PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD"
value = "1"
}
volume_mount {
name = "profile"
mount_path = "/profile"
}
volume_mount {
name = "scripts"
mount_path = "/scripts"
read_only = true
}
resources {
requests = { cpu = "20m", memory = "128Mi" }
limits = { memory = "512Mi" }
}
}
volume {
name = "profile"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.profile_encrypted.metadata[0].name
}
}
volume {
name = "scripts"
config_map {
name = kubernetes_config_map_v1.snapshot_scripts.metadata[0].name
default_mode = "0555"
}
}
restart_policy = "OnFailure"
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}

@ -0,0 +1,8 @@
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}