infra

Author	SHA1	Message	Date
Viktor Barzin	f201e4573e	immich: fix slow context search — prewarm clip_index + latency alert/healthcheck Context (smart) search latency was caused by the 665MB vchord clip_index decaying out of PG shared_buffers (~33% resident -> ~1.8s cold ANN reads vs ~4ms warm), NOT by yesterday's ML MODEL_TTL/clip-keepalive change (CLIP textual is warm ~15ms on GPU). The postStart prewarm runs once at pod start and pg_prewarm.autoprewarm only re-warms at startup, so the index decays under job buffer-pressure over days. - clip-index-prewarm CronJob (immich, /5): pg_prewarm('clip_index') keeps the whole index resident -> searches stay ~4ms. - immich-search-probe CronJob (immich, /5): times a random-vector ANN query + reads clip_index residency, pushes gauges to the Pushgateway. - Prometheus alerts ImmichSmartSearchSlow / ImmichClipIndexColdCache / ImmichSearchProbeStale (+ inhibition when the probe is stale). - cluster_healthcheck.sh check #46 check_immich_search (TOTAL_CHECKS 45->46). - Docs: infra CLAUDE.md immich note, monitoring.md, cluster-health skill. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	a2fa912b44	cluster-health: add check #45 — HA Sofia Status Dashboard Mirrors the verdict of emo's curated Барзини → Статус Lovelace view (dashboard-barzini / path 'status', 8 sections, ~43 mushroom-template cards). Pulls the dashboard config via the HA WebSocket API (one-shot, shared cache), batch-renders every card's secondary Jinja template against /api/template in a single POST, and classifies the rendered text per card: FAIL — contains "Offline" / "Disconnected" / "Разкачен" / "— No data" WARN — contains "⚠️" / "Abnormal" / "Trouble (" / "(ниска)" / "Пълен резервоар" / "Грешка" / "attention" / "Внимание" Roll-up is a single check with a per-section breakdown (Сигурност 0F/0W/4P; Мрежа 0F/1W/10P; …). On WARN/FAIL the non-quiet non-JSON path lists each offending card with its rendered status line. Verified live against ha-sofia: 2 offline devices (Пералня, Гардероб спалня) and 1 degraded (NAS_Barzini volume attention, 7% free) surfaced correctly in both human and JSON output.	2026-06-05 09:19:06 +00:00
Viktor Barzin	98f29edf34	technitium: CoreDNS rewrite forgejo.viktorbarzin.me -> Traefik ClusterIP In-cluster pods resolved forgejo.viktorbarzin.me to the public IP (176.12.22.76) and hairpinned out through the WAN gateway, intermittently timing out buildkit pushes from Woodpecker build pods (which, unlike kubelet, don't use the per-node containerd Forgejo mirror). This silently failed CI build-and-push for Forgejo-hosted repos (recruiter-responder pipelines #15-#18 at the push step). Add a CoreDNS `rewrite name exact forgejo.viktorbarzin.me traefik.traefik.svc.cluster.local` so pods resolve to the Traefik ClusterIP (reachable in-cluster, unlike the ETP=Local LB .203; the Service-name target auto-tracks the ClusterIP so it can't rot on a Traefik renumber). Traefik's *.viktorbarzin.me wildcard keeps SNI/TLS valid. Makes the per-pod woodpecker-server hostAlias belt-and-suspenders. Applied via targeted apply (coredns ConfigMap only, to avoid reconciling 7 unrelated pre-existing drifts in the stack) + verified: - pod resolves forgejo.viktorbarzin.me -> 10.111.111.95 (Traefik ClusterIP) - recruiter-responder pipeline #20 build-and-push succeeds via ClusterIP Docs: networking.md (K8s cluster DNS path) + .claude/CLAUDE.md (forgejo registry quick-ref). Advances beads code-yh33. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 07:34:30 +00:00
Viktor Barzin	922d95af9c	Reapply "tripit: Gmail ingest (12-month) + vbarzin owner + plans@ forward-to-parse" This reverts commit a82ba46ad83e85a231d839564c2f009c700dc4d1.	2026-06-03 10:24:25 +00:00
Viktor Barzin	f0843e398b	Revert "tripit: Gmail ingest (12-month) + vbarzin owner + plans@ forward-to-parse" This reverts commit 4cc9229e716b6683418a148a0f896442d5ab07ad.	2026-06-03 10:24:25 +00:00
Viktor Barzin	0c7ec3d470	tripit: Gmail ingest (12-month) + vbarzin owner + plans@ forward-to-parse Reconciles the tripit stack source with live state and adds the forward flow. Ingest now polls vbarzin@gmail.com [Gmail]/All Mail read-only over a rolling 12-month X-GM-RAW travel-sender window (Croatia Jet2 refs excluded), filing trips under MAIL_DEFAULT_OWNER_EMAIL=vbarzin@gmail.com (Viktor's Authentik login identity). Adds an ingest-plans CronJob that polls spam@ filtered to To:plans@viktorbarzin.me (the @viktorbarzin.me catch-all target) so forwarded bookings are extracted and attached to the matching trip; IMAP_PASSWORD is overridden per-job to spam@'s creds (PLANS_IMAP_PASSWORD). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:25 +00:00
Viktor Barzin	01ea7d6fa1	immich: clip-keepalive CronJob to pin smart-search model warm MACHINE_LEARNING_MODEL_TTL=600 is a single global knob, so it unloads the CLIP textual (smart-search) encoder after idle exactly like OCR/face — immich has no per-model pin. This CronJob pings the textual encoder every 5 min (< the 600s TTL) via immich-ml /predict, so a search query never pays the ~1.5s cold-load, while idle OCR/face still free their VRAM on the shared T4. Textual-only (search = text->embedding->pgvector); the visual encoder is import-time and left to unload. curl baked into the image (no runtime install). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:24 +00:00
Viktor Barzin	fe8db19aaf	job-hunter: build-triggers-deploy model; CronJob :latest + docs CI now drives the Deployment rollout (kubectl set image to the build SHA in .woodpecker.yml), so the stack moves to image_tag = "latest": the Deployment runs whatever CI last set (image ignore_changes keeps TF from fighting it), and the CronJob uses :latest + imagePullPolicy=Always (fresh pod each weekly run). Keel stays enrolled in parallel as a redundant net. Docs: rewrite the runbook "Deploying" section for build-triggers-deploy; record the reversal of decision #12 in the auto-upgrade design doc (owned apps drive their own rollout, Keel parallel — upstream stays Keel-only); add the owned-app deploy model to infra/.claude/CLAUDE.md CI/CD section. [ci skip] — applied locally (stack-scoped); avoids a broad CI auto-apply. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 20:24:50 +00:00
Viktor Barzin	052c776eba	immich: set MACHINE_LEARNING_MODEL_TTL 0->600 to stop GPU VRAM hog immich-ml at TTL=0 never unloaded models; a heavy OCR library job inflated onnxruntime's CUDA arena to ~10.7GB and held it on the shared time-sliced T4, starving llama-swap (qwen3-8b) so recruiter-responder triage 502'd silently for hours (emails preserved unseen, no loss). TTL=600 lets idle ad-hoc models (OCR, face) free VRAM while preloaded CLIP/smart-search stays warm. Docs: correct stale llama-cpp GPU notes (T4 is time-sliced, no VRAM isolation; add qwen3-8b to model table), immich MODEL_TTL gotcha in .claude/CLAUDE.md, and a post-mortem. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 20:16:11 +00:00
Viktor Barzin	848cc7211f	t3code: track t3 nightly via health-checked auto-updater Move t3 from pinned stable (0.0.24, catalog capped at opus-4-7) to the nightly channel so new models (Opus 4.8) land as t3 ships them. t3-autoupdate (daily systemd timer) pulls t3@nightly, but applies the Keel-incident lesson: it health-checks the new binary on a throwaway serve and AUTO-ROLLS-BACK on failure, and restarts only IDLE per-user instances (defers any with an active agent child) so an in-flight session is never killed by an update. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	481585f6e6	immich: cap streaming transcode bitrate to fix 4K video stutter [ci skip] Transcodes were uncapped (ffmpeg maxBitrate=0 + preset=ultrafast + targetResolution=original) -> 77-264 Mbps 4K H.264 files. Mobile playback streams that copy off the shared 7200rpm sdc pool over inter-VLAN NFS; a single stream needs ~10-13.5 MB/s and stuttered for every client, local and remote. Fix (DB system-config, applied via API): maxBitrate=20000k, preset=medium, transcode=bitrate. 4K resolution preserved; originals never modified. Existing oversized transcodes regenerated by deleting their asset_file encoded_video rows + videoConversion force=false (concurrency 1). Document config + add runbook docs/runbooks/immich-transcode-bitrate.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	deec540fad	t3code: docs — auto-provisioning service-catalog entry + design status implemented Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	73cb0aab8b	t3code: per-user isolation via Authentik + nginx username dispatcher t3 is single-owner (no in-app multi-user), so each person runs their own `t3 serve` on the DevVM as their own OS user: wizard→:3773 (t3-serve.service), emo→:3774 (t3-serve-emo.service). An in-cluster nginx `t3-dispatch` maps the Authentik-injected X-authentik-username to the right instance; unmapped identities get 403 (no shared fallback). Flipped the ingress auth app→required (Authentik forward-auth) — the same-origin self-served UI works behind it (WS carries the Authentik cookie) and t3's own pairing/bearer stays the inner gate. Mirrors the terminal stack's per-user model. Verified: dispatcher routes vbarzin→:3773, emil.barzin→:3774, unmapped→403; t3.viktorbarzin.me now 302s to Authentik. Cross-origin native app / app.t3.codes intentionally unsupported here — deferred until the native app is published. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 21:38:06 +00:00
Viktor Barzin	a382683c0e	infra: fix containerd forgejo-registry redirect .200->.203 (+skip_verify) Traefik moved off shared .200 to its dedicated .203 on 2026-05-30, but the containerd hosts.toml redirect for forgejo.viktorbarzin.me still pointed at the now-dead .200:443 -> every FRESH forgejo pull failed (cached images kept running, so it stayed hidden until a new image tag was pulled). Retarget to .203 and add skip_verify (node dials Traefik by IP; cert is for forgejo.viktorbarzin.me) in both the new-node cloud-init and existing-node deploy scripts. Already rolled to all 7 nodes (rewrite + restart containerd, no drain). Doc fix in .claude/CLAUDE.md.	2026-06-01 21:22:05 +00:00
Viktor Barzin	32e1042ca8	t3code: expose `t3 serve` (DevVM) publicly at t3.viktorbarzin.me (app-tier) New stacks/t3code mirrors stacks/terminal: K8s Service + Endpoints → 10.0.10.10:3773 plus an ingress_factory route (dns_type=proxied, auth="app"). t3 ships its own owner-pairing + bearer-session auth, so Authentik forward-auth is intentionally omitted — it would break the cross-origin native mobile app and app.t3.codes (bearer-only, no Authentik cookie). CrowdSec + anti-AI (both default-on for app-tier) rate-limit the public surface; t3's pairing is the gate. TLS is auto-synced into the namespace by Kyverno's sync-tls-secret policy. Verified end-to-end: t3.viktorbarzin.me → CF → Traefik → devvm:3773 = 200. Trade-off (public RCE surface behind app-native auth, no Authentik SSO) accepted 2026-06-01 to keep the native app + app.t3.codes working. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 19:50:41 +00:00
Viktor Barzin	ddd582a28c	backup: stop offsite-copying regenerable data; shrink nextcloud backup; pin nextcloud image The offsite Synology hit 97% — the Backup share grew +670G in a week, traced to the 2026-05-26 change that began mirroring large regenerable services offsite, plus an unbounded nextcloud.log bloating its backups to 87G. - nfs-mirror: re-exclude ollama, prometheus-backup, audiblez, ebook2audiobook (regenerable; live-only on sdc). Keep *-backup DB dumps (real safety copies). - offsite-sync Step 2: nfs-ssd leg is now immich-only; ollama/llamacpp on the SSD no longer ship offsite (re-pullable models). - daily-backup: skip nextcloud/nextcloud-data-proxmox (orphaned pre-encryption PV, still backed up weekly). - nextcloud: cap+rotate the log (log_rotate_size=10MB); the dedicated backup now excludes html/ (app code, from image), logs, and preview cache and keeps only the latest copy (pvc-data holds version history) → <5G (was 87G). - nextcloud: pin image to 32.0.9 in chart_values. A 2026-05-26 Keel bump moved the live pod to 32.0.9 (data migrated to 32.0.9.2) but TF still defaulted to 32.0.3; reconciling that drift this session rolled a 32.0.3 pod that CrashLooped on the downgrade. Pinning eliminates the drift. Docs: backup-dr.md + infra CLAUDE.md updated (add nfs-mirror, new exclusions). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:26 +00:00
Viktor Barzin	0dd4a31eff	docs(immich): cap server-side job concurrency to protect sdc + log recurrence A library-wide Duplicate Detection run on 2026-06-01 fanned the ML/thumbnail backfill out at thumbnailGeneration concurrency 8, saturating the shared sdc HDD and starving etcd -> kube-apiserver down ~30 min (5th IO-pressure incident on sdc). Capped server-side thumbnailGeneration/metadataExtraction/library to 2 in the Immich DB system-config; documented in the Immich row and recorded the recurrence + still-TODO IO-isolation fixes in the 2026-05-25 post-mortem (this also commits that previously-untracked post-mortem). [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:26 +00:00
Viktor Barzin	e63a812062	kms: dedicated vlmcs.viktorbarzin.me endpoint + Anubis /scripts carve-out Internal split-horizon resolves kms.viktorbarzin.me to Traefik (10.0.20.203), which has no :1688 listener — so LAN clients pointed at kms.viktorbarzin.me:1688 failed with 0xC004F074 "no KMS could be contacted". Add a dedicated A-only vlmcs.viktorbarzin.me (cloudflare_record.vlmcs -> 176.12.22.76 for the public WAN NAT; Technitium -> 10.0.20.202 internal, set via API) so it resolves to vlmcsd both ways. Also carve /scripts/* out of Anubis (module.ingress_scripts -> bare kms-web-page service) so `iwr \| iex` downloads the real script instead of the PoW challenge HTML. Verified end-to-end on Win VM 300: reproduced 0xC004F074 on the old host, then slmgr + ospp + both PowerShell one-liners all -> Licensed via vlmcs (10.0.20.202). Docs: kms-public-exposure runbook + service-catalog entry. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	5bcb4525a4	traefik: uncap download duration (writeTimeout 60s->0), upload window 3600s [ci skip] Large Immich video downloads and uploads failed at a hard ~60s wall. The websecure entrypoint set respondingTimeouts.{read,write}Timeout=60s; unlike nginx proxy_*_timeout (per-read idle), Traefik respondingTimeouts are hard caps on total request/response duration, so every transfer slower than 60s was cut mid-stream. Reproduced: a 6 MB/s throttled 650MB download died at 386MB / 62s with an HTTP/2 stream reset. - writeTimeout=0 (Traefik's default, which Immich's reverse-proxy guidance assumes): unlimited download size/duration. - readTimeout=3600s: passes multi-GB uploads while keeping a slow-loris backstop (Immich has no resumable upload, so the window must exceed real upload times). Verified: the same 650MB download now completes fully (650MB / 102s, exit 0). IPv6 path needs no change - the pfSense bridge HAProxy 1h timeouts are inactivity-based, not total caps. Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source + docs only, hence [ci skip]. Docs: networking.md (Entrypoint Transport Timeouts + troubleshooting), .claude/CLAUDE.md networking note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 17:46:59 +00:00
Viktor Barzin	a222c024fd	docs: correct tripit DNS classification to proxied [ci skip] tripit's ingress is dns_type="proxied" (Cloudflare), not non-proxied. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 15:00:49 +00:00
Viktor Barzin	b78378eda9	docs: catalog tripit service (service-catalog + databases) [ci skip] Add tripit (self-hosted TripIt-clone travel-itinerary PWA) to the service catalog Optional tier and Non-Proxied DNS list, and to the CNPG consumer + PostgreSQL rotation lists in the databases doc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 14:59:01 +00:00
Viktor Barzin	e9046e5a26	traefik+pfsense: real IPv6 client IPs via HAProxy PROXY-v2 bridge Replace the pfSense socat IPv6 forwarder (which masked every IPv6 client as 10.0.20.1) with a standalone HAProxy bridge using send-proxy-v2, so real IPv6 client IPs reach Traefik/CrowdSec. Traefik now trusts PROXY-v2 only from 10.0.20.1 on the web/websecure entrypoints; real IPv4 clients (ETP=Local, own source IP) are unaffected. Mail-over-IPv6 routed through the mail NodePorts (send-proxy-v2) too. Bridge is TCP/h2 only (no QUIC over IPv6). Persistence on pfSense: rc.d/ipv6proxy + ipv6_proxy.sh (config.xml shellcmd), keeping the nginx-off-[::] patch. Also fixes stale networking.md: Traefik was still documented on the shared .200; it moved to dedicated .203/ETP=Local on 2026-05-30. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 09:51:23 +00:00
Viktor Barzin	16c9aafafa	docs: Traefik dedicated-IP + ETP=Local cutover SUCCEEDED (attempt 2) Records the successful cutover and the key fix that made it safe: decouple cloudflared from the LB IP first (point its tunnel ingress at the in-cluster Traefik Service), so moving Traefik 10.0.20.200 -> 10.0.20.203 no longer breaks proxied apps or Vault's ingress. Updates infra CLAUDE.md Networking notes with the new Traefik LB IP / ETP=Local / cloudflared->ClusterIP state. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 08:12:57 +00:00
Viktor Barzin	bc41fe572a	immich: GPU-accelerate video transcoding (NVENC + NVDEC) Pin immich-server to the GPU node with a time-sliced nvidia.com/gpu slice so ffmpeg uses hardware NVENC encode + NVDEC decode instead of software. This frees the ~3-4 CPU cores the software transcoder was burning inside the request-serving pod (which was slowing thumbnail/photo browsing), and makes incompatible (HEVC/iPhone) videos playable in seconds. Activation is ffmpeg.accel=nvenc + accelDecode=true in the DB system-config (Immich app config is DB-managed here, like oauth/smtp — not Terraform). Also give immich-frame the same Keel ignore_changes immich-server already has, so an untargeted apply no longer churns it (pre-existing drift). Docs: .claude/CLAUDE.md Immich row + compute.md GPU-workloads list. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 18:05:34 +00:00
Viktor Barzin	188bdd50a0	infra: decommission foolery agent UI User no longer actively using foolery. Removed: - TF stack stacks/foolery (Cloudflare DNS, Traefik IngressRoute, Authentik forward-auth integration, K8s Service+Endpoints) - Devvm systemd unit /etc/systemd/system/foolery.service - Runtime at ~/.local/share/foolery and launcher ~/.local/bin/foolery - Stale foolery reference in .claude/CLAUDE.md auth="required" examples Uptime Kuma [External] foolery monitor will auto-prune on next external-monitor-sync reconcile. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:08:41 +00:00
Viktor Barzin	8b4bcc0ca2	blog: Anubis carve-out for /net-diag.sh curl\|bash clients can't solve PoW, so /net-diag.sh has to bypass Anubis. Adds a second ingress_factory pointing /net-diag.sh at the bare blog service (port 80), keeping every other path on the existing Anubis chain. Path-prefix specificity wins in Traefik routing — / stays gated. dns_type = "none" because the apex viktorbarzin.me CF record already exists from the main ingress. Doc update: CLAUDE.md Anubis section notes blog now follows the wrongmove carve-out pattern.	2026-05-28 13:22:57 +00:00
Viktor Barzin	9277d71d81	nfs-mirror: append transferred files to offsite-sync manifest Step 1 of offsite-sync-backup is incremental on non-monthly days, driven by /mnt/backup/.changed-files which only daily-backup wrote to. nfs-mirror's writes were therefore invisible to Step 1 until the next monthly --delete pass — which would also wipe data pre-positioned on Synology pve-backup/ (e.g. the in-place btrfs rename we just did to relocate ~160G of NFS subtrees from /Backup/Viki/nfs/<svc>/ to /Backup/Viki/pve-backup/<svc>/). Fix: snapshot a timestamp before rsync, then after rsync use `find -newer $STAMP -type f -printf '%P\n'` to enumerate every file nfs-mirror created/modified and append to the manifest. Paths are relative to /mnt/backup/ (matches Step 1 --files-from expectation). State files are excluded. The current in-flight first run started before this patch was deployed, so its writes won't auto-populate the manifest — a one-off manual backfill will be done after it completes.	2026-05-24 15:32:22 +00:00
Viktor Barzin	247afdb220	cluster-health skill: document tightened #43 thermal threshold (65 C)	2026-05-22 14:09:12 +00:00
Viktor Barzin	6024cfb410	docs: update MySQL restore runbook + CLAUDE.md after 8.4.9 recovery Runbook rewritten for the standalone setup (InnoDB Cluster gone since 2026-04-16) and now covers the full disaster-recovery flow we just executed: stop pod, wipe PVC (incl. PV reclaim-policy flip from Retain → Delete), re-apply TF, restore via in-namespace Job, drop+create static users with fresh Vault passwords, restart dependents. CLAUDE.md MySQL row notes the 8.4.8 pin + links the runbook. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 22:51:52 +00:00
Viktor Barzin	01de3babd6	docs(security): wave 1 plan — Kyverno enforce, NetworkPolicy egress, audit logging, source-IP anomaly Locked design for wave 1 of cluster security hardening. Plan only — implementation lives in beads code-8ywc and follow-up commits. Captures: - security.md: Kyverno policy table updated (Audit → Enforce planned for the four security policies with the 31-namespace exclude list). New section "Audit Logging & Anomaly Detection" detailing the K8s API audit policy, Vault audit device + X-Forwarded-For trust, source-IP anomaly rules (K9, V7, S1), and the rejected-canary-tokens / rejected-K1 rationales. New section "NetworkPolicy Default-Deny Egress" describing the observe-then-enforce (γ) approach for tier 3+4. - monitoring.md: new "Security Alerts (Wave 1)" section listing the 16 rules (K2-K9, V1-V7, S1) and the Loki ruler → Alertmanager → #security routing path. - runbooks/security-incident.md (new): per-alert response playbook with LogQL queries, action steps, false-positive triage, and SEV1 escalation. - .claude/CLAUDE.md: new "Security Posture" section summarising the locked decisions: identity allowlist is me@viktorbarzin.me ONLY, source-IP allowlist CIDRs, no public-IP access policy, rationale for not adopting canary tokens. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 19:10:16 +00:00
Viktor Barzin	9a06a76883	k8s-version-upgrade: switch detection cron from weekly to daily Was `0 12 * * 0` (Sun 12:00 UTC) — patch releases waited up to 6 days before the chain picked them up. Now `0 12 * * *` (daily 12:00 UTC, still outside kured's 02:00-06:00 London window). Concurrency is bounded by Forbid + deterministic job-name idempotency (the detection job exits early if a preflight Job for the same target already exists), so back-to-back days can't pile up parallel runs. - stacks/k8s-version-upgrade/main.tf: var.schedule default + rationale comment - scripts/upgrade_state.sh: rename next_sunday_noon_utc -> next_daily_noon_utc (now returns "Tue 2026-05-19 12:00 UTC" form); change "(Sun cron)" label to "(daily cron)" - .claude/skills/upgrade-state/SKILL.md: cadence column + frontmatter Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 18:29:08 +00:00
Viktor Barzin	9e045e2c16	upgrade-state: skill + script + Keel scrape for periodic three-pipeline audit Three autonomous-upgrade pipelines run independently — Keel for apps (hourly registry polling), unattended-upgrades+kured for OS, and the k8s-version-check chain for kubeadm/kubelet/kubectl. Until now there was no single place to see whether each was healthy, what's pending, or whether anything's stuck. The /upgrade-state skill collapses the state of all three into one table you can run before each Sunday's k8s-version-check fires. - stacks/keel/main.tf: add Prometheus pod-annotation scrape on container port 9300. Surfaces pending_approvals, poll_trigger_tracked_images, and registries_scanned_total{image} so the skill has a real timeseries (also opens the door to a future "pending_approvals > 0 for 24h" alert). - scripts/upgrade_state.sh: collector + renderer. Three-row table (Apps / OS / K8s) + drill-down, --json for piping, exit 0/1/2. SSH fan-out (parallel subshells) to all five nodes for apt state + reboot-required + uu log; Prometheus query for Keel; Pushgateway parse for k8s_upgrade_* gauges. Read-only. - .claude/skills/upgrade-state/SKILL.md: hardlinked to ~/.claude/skills/upgrade-state/SKILL.md so the skill is discoverable from both monorepo-rooted and global sessions. Verification: ran the script, stress-tested the ✗ stalled path by pushing in_flight=1 + started_timestamp=-100min to Pushgateway and resetting after — script correctly raised ✗ and exit 2. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 10:50:43 +00:00
Viktor Barzin	9521bb0b17	paperless-mcp: deploy MCP for AI document search - New stack `paperless-mcp` running barryw/PaperlessMCP v0.1.19 (.NET, HTTP+SSE on :5000) wraps paperless-ngx's built-in FTS. 43 tools exposed. - In-cluster only egress to paperless-ngx svc; no Cloudflare hop on MCP-internal traffic. - Read-only at paperless layer: dedicated `claude-mcp` user (non-superuser) in new `claude-mcp-readers` group with view-only Django perms; existing 279 docs bulk-granted view perm via /api/documents/bulk_edit/; workflow #2 auto-grants the group on new docs (Consumption Added). - Gateway-level bearer auth via new Traefik plugin Aetherinox/traefik-api-token-middleware@v0.1.4 (loaded in traefik stack alongside crowdsec-bouncer); per-stack Middleware CRD `bearer-auth` pulls token list from Vault `secret/paperless-mcp/bearer_tokens`. - Vault `secret/paperless-mcp` holds: paperless_api_token (synced to K8s Secret via ESO; pod env via secret_key_ref), bearer_tokens (JSON array, read at plan time), bearer_token_viktor_laptop (mirror for laptop wiring), paperless_user_password (paperless UI fallback). - Image auto-update via Keel (semver minor policy, hourly poll). - Ingress dns_type=proxied → Uptime Kuma external monitor auto-created by external-monitor-sync CronJob. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 11:14:35 +00:00
Viktor Barzin	e030750507	openclaw: native MCP servers + daily claude-memory sync Wire ha-mcp, context7, and the in-pod playwright sidecar as native MCP servers on OpenClaw via `mcp set` in the container startup (ConfigMap-baked mcp.servers gets stripped by `doctor --fix`; CLI-set entries persist). HA URL pulled from new Vault key secret/openclaw.ha_sofia_mcp_url and passed via the HA_SOFIA_MCP_URL env var. Add a daily 03:00 UTC `memory-sync` CronJob in the openclaw namespace: pulls all non-sensitive memories from claude-memory.claude-memory.svc:80/api/memories, groups by category, writes 18 Markdown files into /workspace/memory/projects/claude- memory-sync/ (the path memory-core indexes), then triggers `openclaw memory index --force` via kubectl exec. Reuses the existing cluster-healthcheck SA (pods+pods/exec). Smoke test: 1488 memories synced, 25/25 files indexed, search returns hits. Also drops the legacy /app/extensions entry from plugins.load.paths (doctor warning), wires HA_SOFIA_MCP_URL env, and one-shot deletes the stale 2026-02-28 metaclaw-export.json from the openclaw home volume. claude_memory MCP intentionally NOT wired — its /mcp/mcp transport 404s on the deployed claude-memory-mcp:17 image (tracked as code-z1so). Shared knowledge is delivered via the CronJob's REST sync instead. Adding claude_memory to mcp.servers is a one-line follow-up once that's fixed.	2026-05-16 14:01:46 +00:00
Viktor Barzin	910167105e	Phase 0: install Keel + Kyverno auto-update annotation injector Foundation for opt-out-pure auto-update model per docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md. - New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6). Polls registries hourly per design decision #8. Default schedule overridable per-workload via keel.sh/pollSchedule annotation. - New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments, StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true` with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h. - Phase 0 enrolls no namespaces. Phase 1 (next session) labels the self-hosted set. - Per-workload opt-out: label `keel.sh/policy: never` (used by rollback runbook and chrome-service-style deliberate pins). - Keel namespace excluded from the mutate — supervisor self-update has too-bad a failure mode (decision #11). - AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the ignore_changes block enrolled workloads need. - .claude/CLAUDE.md: docker-images rule flagged as transitional. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 12:19:34 +00:00
Viktor Barzin	b1b14ee370	service-catalog: add aiostreams entry Stremio stream aggregator now has its own row in the Active Use tier. Captures the auth model (own UUID+password, not Authentik), monitoring posture (canary probe + 3 alerts), and backup pipeline (weekly NFS dumps of both decrypted config and the Stremio account addon collection). Follow-up from the 2026-05-15/16 hardening session: 5 commits on servarr/aiostreams, none previously catalogued.	2026-05-16 10:47:41 +00:00
Viktor Barzin	01bc16d592	k8s-version-upgrade: decompose into Job chain to fix self-preemption The agent-based v1 ran inside claude-agent-service (replicas=1, no nodeSelector) and self-evicted when it tried to drain its host (k8s-node4 on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers v1.34.2) until manual recovery. Rewrite the pipeline as a chain of nodeSelector-pinned Jobs: preflight (k8s-node1) → master (k8s-node1) drains k8s-master → worker × 4 (k8s-node1) drains k8s-node{4,3,2} → worker (k8s-master + control-plane toleration) drains k8s-node1 → postflight (no pinning) Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by envsubst-ing job-template.yaml into the next Job. Deterministic names (k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply` idempotent — a failed Job can be re-created without duplicating downstream. Also lands `predrain_unstick`: deletes pods on the target node whose PDB has 0 disruptionsAllowed. Without this, drain loops indefinitely on single-replica deployments (e.g. every Anubis instance — discovered the hard way during 2026-05-11 manual recovery of k8s-node3). Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min). Deprecates the agent prompt (renamed to *.deprecated.md with a header pointer to the new code). Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps), then monitoring (loads the new alert). Both applied 2026-05-11. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-11 23:54:22 +00:00
Viktor Barzin	0712a1b659	infra/scripts/tg: enforce ingress_factory auth-comment convention Every `tg plan/apply/destroy/refresh` now runs `scripts/check-ingress-auth-comments.py` against the current stack before invoking terragrunt. The check fails closed if any `auth = "app"` or `auth = "none"` line in the stack's .tf files lacks an immediately-preceding `# auth = "<tier>": ...` comment documenting what gates the app (for "app") or why the endpoint is intentionally public (for "none"). Why tg-level (not git pre-commit): tg is the universal entry point for all infra changes. CI runs it, headless agents run it, humans run it. A pre-commit hook only catches the human path. Wiring the check into tg means the anti-exposure guard fires regardless of who or what is invoking terragrunt. Stack-scoped: each stack documents itself the next time it's edited. The 30+ existing `auth = "none"` stacks that predate this guard are not blocked from operating today; they'll need the comment added the next time someone runs `tg plan` on them — at which point the gate forces a conscious "yes, this is intentional" moment before any state change can land. Skipped on: init, fmt, validate, output, etc. — anything that doesn't read or write infra state.	2026-05-11 19:18:27 +00:00
Viktor Barzin	459b00fa74	infra/ingress_factory: add auth = "app" mode for self-authed backends Adds a fourth auth tier alongside required/public/none. "app" is functionally identical to "none" — no Authentik middleware attached — but the distinct name records intent at the call site: this backend has its own user login (NextAuth, Django, OAuth, bearer-token API, etc.) and Authentik would only break it. Why the new tier: with only required/none, every "the app has its own auth so drop Authentik" decision looked identical at the call site to "this is an OAuth callback / webhook receiver / native-client API". Future readers couldn't tell whether a stack was intentionally unauthenticated or relying on backend auth. Now they can. Migrates the 8 stacks flipped earlier this session (novelapp, immich, linkwarden, tandoor, freshrss, affine, actualbudget, ebooks/audiobookshelf) from "none" to "app". Confirmed no-op: `tg plan` on novelapp showed "No changes" — same middleware chain, same live state. The variable description and the .claude/CLAUDE.md Auth section now spell out the anti-exposure rule: only pick "app" or "none" AFTER verifying the app has its own user auth ("app") or the endpoint is intentionally public ("none"). Default stays "required" so accidental omission fails closed. [ci skip]	2026-05-11 18:59:20 +00:00
Viktor Barzin	2db8bdac0d	state(dbaas): update encrypted state	2026-05-10 21:00:00 +00:00
Viktor Barzin	fecfa211fd	fix: pvc-autoresizer threshold should be 10%, not 80% topolvm/pvc-autoresizer's threshold annotation is the FREE-SPACE percentage below which expansion fires (per upstream README). Setting it to "80%" means "expand when free-space drops below 80%", i.e. as soon as the PVC crosses 20% utilization — which caused prometheus-data-proxmox to be repeatedly expanded from 200Gi to 433Gi in 70 minutes (six 10% bumps, all when the volume was only ~14% used). Once the SC opt-in fix landed (`1e4eac53`) and the inode metrics fix landed (`02a12f1a`), the autoresizer started actively misfiring across 75+ PVCs cluster-wide. Flip the value to "10%" everywhere — that's "expand when free-space drops below 10%", i.e. at 90% utilization, which is the conventional semantic and matches the alert thresholds in prometheus_chart_values.tpl (PVAutoExpanding fires at 80%, PVFillingUp at 95%). The CLAUDE.md PVC template was the source of the misconfig, so update it too. Live PVC annotations were patched in parallel via kubectl annotate; TF apply on each affected stack will be a no-op against those live values. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 19:56:16 +00:00
Viktor Barzin	988bfde45c	k8s-version-upgrade: trigger etcd snapshot via existing backup-etcd Job; broaden agent RBAC Stage 2 now reuses the existing default/backup-etcd CronJob (NFS-backed PV pointing at 192.168.1.127:/srv/nfs/etcd-backup) instead of trying to ssh into master and run etcdctl against a non-existent /mnt/main mount. The agent triggers a one-shot Job from cronjob/backup-etcd, waits up to 10 min, then parses the backup-manage container log for "Backup done" line + byte count. Test 2 (dry-run) surfaced 5 real cluster blockers — agent loop works end-to-end at the planning level. Expanded the claude-agent ServiceAccount's privileges via a sibling ClusterRole (claude-agent-upgrade-ops): - patch namespaces/k8s-upgrade (in-flight annotation) - create batch/jobs (trigger etcd snapshot Job) - patch nodes (cordon/uncordon) - create pods/eviction (drain) - delete pods (drain fallback)	2026-05-10 19:16:12 +00:00
Viktor Barzin	a58d777059	k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison on master for new patches + HEAD pkgs.k8s.io for next-minor availability, then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent. The agent (.claude/agents/k8s-version-upgrade.md) orchestrates: pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match) -> etcd snapshot save -> optional master containerd skew fix -> apt repo URL rewrite (minor bumps only) -> drain/upgrade/uncordon master via ssh < update_k8s.sh -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each -> post-flight verification Two new Upgrade Gates alerts catch failure modes: - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m) - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m) update_k8s.sh refactored to take --role / --release args; the agent shells it into each node via SSH pipe. update_node.sh annotated as OS-major path. Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section in docs/architecture/automated-upgrades.md. Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519 keypair distributed to all 5 nodes via authorized_keys; slack_webhook reuses kured webhook URL on initial deploy).	2026-05-10 19:07:42 +00:00
Viktor Barzin	6c4e096688	authentik: zero-endpoints alert + upgrade-validation checklist Add `AuthentikForwardAuthFallbackActive` Prometheus alert: fires on sustained 401/s spike on the websecure entrypoint (>5/s for 5m), which is the symptom of the auth-proxy Emergency-Access fallback firing — in turn caused by zero ready endpoints on the outpost service. Why this rule and not `kube_endpoint_address_available == 0`: kube-state-metrics endpoint metrics exist as series names but never have current values in this Prometheus pipeline (something is dropping them silently). Detecting the failure at the edge via Traefik is more reliable than instrumenting the broken middle. Also fix the pre-existing `AuthentikOutpostForwardAuth400Spike` regex — the service label is `authentik-ak-outpost-...`, not `authentik-authentik-outpost-...`, so the alert never matched any series and never could have fired. Verified in Prometheus before/after the fix. Add an "Upgrade Validation Checklist" section to `.claude/reference/authentik-state.md` with the seven-step smoke test to run after Authentik chart bumps, provider bumps, or outpost pod recreation. Covers the brittle surfaces (Service selector, JSON patches, postgres backend wiring, access_token_validity TTL, edge auth flow, plan-to-zero).	2026-05-10 16:54:48 +00:00
Viktor Barzin	117b99e28f	docs/authentik: document postgres session backend + close out 2026-04-18 post-mortem items Update `.claude/reference/authentik-state.md`: - Add `ProxyProvider.access_token_validity = "weeks=4"` to the Session Duration table with the gotcha that the gorilla session store binds the value once at outpost startup (rollout restart needed). - Replace the "session storage moved to Postgres in 2025.10" note that falsely implied the migration was automatic — explain that the `Outpost.managed` field gates the postgres path and our outpost silently stayed on `FilesystemStore` until 2026-05-10. - Document the goauthentik 2026.2.2 service-selector bug (service.py:52) and the JSON-patch workaround. - Document that the standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__*` env vars injected via JSON patch, plus the `app.kubernetes.io/component=server` pod label. - Note the "Terraform doesn't expose `Outpost.managed`" assumption that holds the `managed=embedded` value in place across applies. Close out post-mortem `2026-04-18-authentik-outpost-shm-full.md`: - P2 codify-in-Terraform: DONE. - P3 access_token_validity reduce: DONE-alt (we did the opposite — bumped to 4 weeks — because postgres backend mooted the storage concern). - P3 move-off-embedded-outpost: DONE-alt (postgres backend addresses the loss-of-state class on the embedded outpost itself).	2026-05-10 16:28:11 +00:00
Viktor Barzin	efd28ccce5	anubis: fix 500 on multi-replica + roll out to 6 more public sites Browser visits to viktorbarzin.me started returning HTTP 500 with `store: key not found: "challenge:..."` in pod logs. Root cause: each Anubis pod stores in-flight challenges in process memory; with 2 replicas behind a ClusterIP, the PoW-solved request can be routed to a different pod than the one that issued the challenge. Anubis upstream documents the same caveat ("when running multiple instances on the same base domain, the key must be the same across all instances" — true for the ed25519 signing key, but the challenge store is still pod-local without a shared backend). Drop module default replicas: 2 → 1. Worst-case: ~1s cold-start on pod restart. Real fix (Redis-backed challenge store) noted as a follow-up in CLAUDE.md. Roll Anubis out to: f1-stream, cyberchef (cc), jsoncrack (json), privatebin (pb), homepage (home), real-estate-crawler (wrongmove UI only — `/api` ingress stays direct via path-based ingress carve- out so XHRs from the SPA bypass the challenge). End-state: 9 public hosts now Anubis-fronted (blog, www, kms, travel, f1, cc, json, pb, home, wrongmove). All return the challenge HTML to bare curl/browser; verified-IP search engines and /robots.txt + /.well-known still skip via the strict-policy allowlist.	2026-05-10 00:50:30 +00:00
Viktor Barzin	f48da84770	anubis: per-site PoW reverse proxy on blog + kms + travel-blog Adds modules/kubernetes/anubis_instance/ — a per-site reverse proxy instance pinned to ghcr.io/techarohq/anubis:v1.25.0. Each instance issues a 30-day JWT cookie scoped to viktorbarzin.me after a tiny proof-of-work (difficulty 2 ≈ 250 ms desktop / 700 ms mobile). The shared ed25519 signing key (Vault: secret/viktor → anubis_ed25519_key) makes a single solve good across every Anubis-fronted subdomain. Wired into blog (viktorbarzin.me + www), kms.viktorbarzin.me, and travel.viktorbarzin.me — each with anti_ai_scraping=false on the ingress so the redundant ai-bot-block forwardAuth is dropped from the chain. Skipped forgejo (Git/API clients can't solve PoW) and resume (replicas=0). Also tightens bot-block-proxy nginx timeouts (3s/5s → 100ms/200ms) so any ingress still using the ai-bot-block forwardAuth pays at most ~150 ms when poison-fountain is scaled down, instead of 3 s. End-to-end TTFB on viktorbarzin.me dropped from ~3.2 s to ~150-200 ms. Docs: .claude/reference/patterns.md "Anti-AI Scraping" updated to 4 layers; .claude/CLAUDE.md adds the Anubis usage paragraph and Forgejo/API caveat.	2026-05-10 00:06:21 +00:00
Viktor Barzin	d62a9dcda1	docs: PVC templates need lifecycle.ignore_changes for autoresizer The canonical proxmox-lvm and proxmox-lvm-encrypted PVC templates were missing `lifecycle { ignore_changes = [spec[0].resources[0].requests] }`. Without it, every PVC created from these templates becomes a drift bomb the moment pvc-autoresizer expands it: the next `tg apply` on that stack will try to shrink the PVC back to the TF-declared size, K8s rejects the shrink, and apply fails. This was latent because pvc-autoresizer was silently broken cluster-wide (commit `9d5da4d8` fixed it by allow-listing kubelet_volume_stats_available_bytes in Prometheus). Now that the autoresizer actually works, every existing proxmox-lvm/encrypted PVC without ignore_changes is at risk. Sweep needed (separate task): grep for kubernetes_persistent_volume_claim across stacks/ and add ignore_changes to any with resize.topolvm.io annotations.	2026-05-09 12:02:18 +00:00
Viktor Barzin	3148d15d5a	[forgejo] Phases 3+4+5: cutover, decommission, docs sweep End of forgejo-registry-consolidation. After Phase 0/1 already landed (Forgejo ready, dual-push CI, integrity probe, retention CronJob, images migrated via forgejo-migrate-orphan-images.sh), this commit flips everything off registry.viktorbarzin.me onto Forgejo and removes the legacy infrastructure. Phase 3 — image= flips: * infra/stacks/{payslip-ingest,job-hunter,claude-agent-service, fire-planner,freedify/factory,chrome-service,beads-server}/main.tf — image= now points to forgejo.viktorbarzin.me/viktor/<name>. * infra/stacks/claude-memory/main.tf — also moved off DockerHub (viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...). * infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled from Forgejo. build-ci-image.yml dual-pushes still until next build cycle confirms Forgejo as canonical. * /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated. Phase 4 — decommission registry-private: * registry-credentials Secret: dropped registry.viktorbarzin.me / registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries. Forgejo entry is the only one left. * infra/stacks/infra/main.tf cloud-init: dropped containerd hosts.toml entries for registry.viktorbarzin.me + 10.0.20.10:5050. (Existing nodes already had the file removed manually by `setup-forgejo-containerd-mirror.sh` rollout — the cloud-init template only fires on new VM provision.) * infra/modules/docker-registry/docker-compose.yml: registry-private service block removed; nginx 5050 port mapping dropped. Pull- through caches for upstream registries (5000/5010/5020/5030/5040) stay on the VM permanently. * infra/modules/docker-registry/nginx_registry.conf: upstream `private` block + port 5050 server block removed. * infra/stacks/monitoring/modules/monitoring/main.tf: registry_ integrity_probe + registry_probe_credentials resources stripped. forgejo_integrity_probe is the only manifest probe now. 
Phase 5 — final docs sweep: * infra/docs/runbooks/registry-vm.md — VM scope reduced to pull- through caches; forgejo-registry-breakglass.md cross-ref added. * infra/docs/architecture/ci-cd.md — registry component table + diagram now reflect Forgejo. Pre-migration root-cause sentence preserved as historical context with a pointer to the design doc. * infra/docs/architecture/monitoring.md — Registry Integrity Probe row updated to point at the Forgejo probe. * infra/.claude/CLAUDE.md — Private registry section rewritten end- to-end (auth, retention, integrity, where the bake came from). * prometheus_chart_values.tpl — RegistryManifestIntegrityFailure alert annotation simplified now that only one registry is in scope. Operational follow-up (cannot be done from a TF apply): 1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to match the new template AND `docker compose up -d --remove-orphans` to actually stop the registry-private container. Memory id=1078 confirms cloud-init won't redeploy on TF apply alone. 2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/` on the VM (~2.6GB freed). 3. Open the dual-push step in build-ci-image.yml and drop registry.viktorbarzin.me:5050 from the `repo:` list — at that point the post-push integrity check at line 33-107 also needs to be repointed at Forgejo or removed (the per-build verify is redundant with the every-15min Forgejo probe). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 18:30:02 +00:00
Viktor Barzin	d77a02357c	chrome-service: in-cluster headed Chromium pool for f1-stream verifier The f1-stream verifier's in-process headless Chromium kept tripping hmembeds' disable-devtool.js Performance detector (CDP latency on console.log vs console.table) and getting redirected to google.com. This adds a single-replica chrome-service stack running Playwright launch-server under Xvfb so callers can connect via WS+token to a shared headed browser. f1-stream's _ensure_browser now prefers chromium.connect(CHROME_WS_URL/CHROME_WS_TOKEN) and adds a vendored stealth init script (webdriver/plugins/languages/Permissions/WebGL spoofs + querySelector hijack to disarm disable-devtool-auto) on every new context. Falls back to in-process headless if the env vars aren't set. Encrypted PVC for profile + npm cache, NetworkPolicy to TCP/3000 gated by client-namespace label, 6h tar.gz backup CronJob to NFS, Authentik-gated nginx sidecar at chrome.viktorbarzin.me for human liveness checks. Image pinned to playwright:v1.48.0-noble in lockstep with the Python client's playwright==1.48.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 10:43:40 +00:00
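
The threshold semantics described in commit fecfa211fd (the annotation is a free-space percentage, so expansion fires once free space drops below it) can be checked with a little arithmetic. The following is a minimal sketch of that arithmetic only. The function names are hypothetical and this is not pvc-autoresizer's own code.

```python
def utilization_trigger(free_space_threshold_pct: float) -> float:
    """Utilization (%) above which expansion fires, given a free-space threshold (%)."""
    if not 0 <= free_space_threshold_pct <= 100:
        raise ValueError("threshold must be a percentage between 0 and 100")
    return 100.0 - free_space_threshold_pct


def should_expand(used_pct: float, free_space_threshold_pct: float) -> bool:
    """True when free space has dropped below the configured free-space threshold."""
    free_pct = 100.0 - used_pct
    return free_pct < free_space_threshold_pct


# "80%" means "expand when less than 80% is free", i.e. from 20% utilization.
print(utilization_trigger(80))
# "10%" means "expand when less than 10% is free", i.e. from 90% utilization.
print(utilization_trigger(10))
```

Read this way, a value of "80%" behaves like a 20% utilization trigger, which is consistent with the misfiring the commit message reports.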
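
The auth-comment convention enforced by commit 0712a1b659 (every `auth = "app"` or `auth = "none"` line needs an immediately-preceding `# auth = "<tier>": ...` comment) could be checked roughly as follows. This is an illustrative sketch, not the contents of `scripts/check-ingress-auth-comments.py`. The function name, the regex, and the assumption that the tier named in the comment must match the tier on the line are all hypothetical.

```python
import re
from typing import List

# Matches uncommented lines such as:  auth = "app"   or   auth = "none"
AUTH_LINE = re.compile(r'^\s*auth\s*=\s*"(app|none)"')


def find_undocumented_auth(tf_source: str) -> List[int]:
    """Return 1-based line numbers of auth lines lacking the preceding comment."""
    lines = tf_source.splitlines()
    missing = []
    for i, line in enumerate(lines):
        match = AUTH_LINE.match(line)
        if not match:
            continue
        tier = match.group(1)
        prev = lines[i - 1].strip() if i > 0 else ""
        prefix = f'# auth = "{tier}":'
        # Assumption: the comment must name the same tier and carry a non-empty reason.
        if not (prev.startswith(prefix) and prev[len(prefix):].strip()):
            missing.append(i + 1)
    return missing
```

A wrapper that fails closed would exit non-zero whenever the returned list is non-empty, before handing control to terragrunt.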
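
The multi-replica failure described in commit efd28ccce5 (challenges are held in pod-local memory, so a solved challenge can land on a pod that never issued it) can be reproduced with a toy model. This is a simplified simulation, not Anubis code. The `Pod` class, the `run` helper, and the strict round-robin routing are assumptions made for illustration; real ClusterIP balancing is not strictly round-robin.

```python
import itertools


class Pod:
    """Toy stand-in for one replica with an in-process challenge store."""

    def __init__(self, name: str):
        self.name = name
        self.challenges = {}  # pod-local: other replicas cannot see these keys

    def issue(self, challenge_id: str) -> None:
        self.challenges[challenge_id] = "pending"

    def verify(self, challenge_id: str) -> int:
        if challenge_id not in self.challenges:
            return 500  # mirrors the "store: key not found" failure mode
        del self.challenges[challenge_id]
        return 200


def run(replicas: int) -> int:
    """Issue a challenge on one pod, then verify it on the next pod in rotation."""
    pods = [Pod(f"anubis-{i}") for i in range(replicas)]
    rotation = itertools.cycle(pods)  # assumed routing model
    next(rotation).issue("c1")
    return next(rotation).verify("c1")
```

With one replica the issuing and verifying pod are always the same, which is why dropping to a single replica avoids the error at the cost of a brief cold start on restart.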

253 commits