infra

Author	SHA1	Message	Date
Viktor Barzin	d77a02357c	chrome-service: in-cluster headed Chromium pool for f1-stream verifier The f1-stream verifier's in-process headless Chromium kept tripping hmembeds' disable-devtool.js Performance detector (CDP latency on console.log vs console.table) and getting redirected to google.com. This adds a single-replica chrome-service stack running Playwright launch-server under Xvfb so callers can connect via WS+token to a shared headed browser. f1-stream's _ensure_browser now prefers chromium.connect(CHROME_WS_URL/CHROME_WS_TOKEN) and adds a vendored stealth init script (webdriver/plugins/languages/Permissions/WebGL spoofs + querySelector hijack to disarm disable-devtool-auto) on every new context. Falls back to in-process headless if the env vars aren't set. Encrypted PVC for profile + npm cache, NetworkPolicy to TCP/3000 gated by client-namespace label, 6h tar.gz backup CronJob to NFS, Authentik-gated nginx sidecar at chrome.viktorbarzin.me for human liveness checks. Image pinned to playwright:v1.48.0-noble in lockstep with the Python client's playwright==1.48.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 10:43:40 +00:00
Viktor Barzin	7cb44d7264	[registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery Second identical registry incident on 2026-04-19 (first 2026-04-13): the infra-ci:latest image index resolved to child manifests whose blobs had been garbage-collected out from under the index. Pipelines P366→P376 all exited 126 "image can't be pulled". Hot fix (`a05d63e` / `6371e75` / `c113be4`) restored green CI but left the underlying bug unaddressed. Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at 02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly (distribution/distribution#3324 class). Nothing verified pushes end-to-end; nothing probed the registry for fetchability; nothing caught orphan indexes. Phase 1 — Detection: - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity step walks the just-pushed manifest (index + children + config + every layer blob) via HEAD and fails the pipeline on any non-200. Catches broken pushes at the source. - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and three alerts — RegistryManifestIntegrityFailure, RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the "registry serves 404 for a tag that exists" gap that masked the incident for 2+ hours. - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause, timeline, monitoring gaps, permanent fix. Phase 2 — Prevention: - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3 across all six registry services. Removes the floating-tag footgun. - modules/docker-registry/fix-broken-blobs.sh: new scan walks every _manifests/revisions/sha256/<digest> that is an image index and logs a loud WARNING when a referenced child blob is missing. Does NOT auto- delete — deleting a published image is a conscious decision. Layer-link scan preserved. Phase 3 — Recovery: - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds don't need a cosmetic Dockerfile edit (matches convention from pve-nfs-exports-sync.yml). - docs/runbooks/registry-rebuild-image.md: exact command sequence for diagnosing + rebuilding after an orphan-index incident, plus a fallback for building directly on the registry VM if Woodpecker itself is down. - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md: cross-references to the new runbook. Out of scope (verified healthy or intentionally deferred): - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s). - Registry HA/replication (single-VM SPOF is a known architectural choice; Synology offsite covers RPO < 1 day). - Diun exclude for registry:2 — not applicable; Diun only watches k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose. Verified locally: - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly flags both orphan layer links and orphan OCI-index children. - terraform fmt + validate on stacks/monitoring: success (only unrelated deprecation warnings). - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and modules/docker-registry/docker-compose.yml: both parse clean. Closes: code-4b8 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:08:28 +00:00
Viktor Barzin	b6cd83f85a	[redis] Phase 3-7: cutover to redis-v2, Nextcloud HAProxy-only Phase 3 — replication chain (old → v2): - Discovered the v2 cluster was running redis:7.4-alpine, but the Bitnami old master ships redis 8.6.2 which writes RDB format 13 — the 7.4 replicas rejected the stream with "Can't handle RDB format version 13". Bumped v2 image to redis:8-alpine (also 8.6.2) to restore PSYNC compatibility. - Discovered that sentinel on BOTH v2 and old Bitnami clusters auto-discovered the cross-cluster replication chain when v2-0 REPLICAOF'd the old master, triggering a failover that reparented old-master to a v2 replica and took HAProxy's backend offline. Mitigation: `SENTINEL REMOVE mymaster` on all 5 sentinels (both clusters) during the REPLICAOF surgery, then re-MONITOR after cutover. This must be done on the OLD sentinels too, not just v2 — they're the ones that kept fighting our REPLICAOF. - Set up the chain: v2-0 REPLICAOF old-master; v2-{1,2} REPLICAOF v2-0. All 76 keys (db0:76, db1:22, db4:16) synced including `immich_bull:` BullMQ queues and `_kombu.` Celery queues — the user-stated must-survive data class. Phase 4 — HAProxy cutover: - Updated `kubernetes_config_map.haproxy` to point at `redis-v2-{0,1,2}.redis-v2-headless` for both redis_master and redis_sentinel backends (removed redis-node-{0,1}). - Promoted v2-0 (`REPLICAOF NO ONE`) at the same time as the ConfigMap apply so HAProxy's 1s health-check interval found a role:master within a few seconds. Cutover disruption on HAProxy rollout was brief; old clients naturally moved to new HAProxy pods within the rolling update window. - Re-enabled sentinel monitoring on v2 with `SENTINEL MONITOR mymaster <hostname> 6379 2` after verifying `resolve-hostnames yes` + `announce-hostnames yes` were active — this ensures sentinel stores the hostname (not resolved IP) in its rewritten config, so pod-IP churn on restart doesn't break failover. Phase 5 — chaos: - Round 1: killed master v2-0 mid-probe. First run exposed the sentinel IP-storage issue (stored 10.10.107.222, went stale on restart) — ~12s probe disruption. Fixed hostname persistence and re-MONITORed. - Round 2: killed new master v2-2 with hostnames correctly stored. Sentinel elected v2-0, HAProxy re-routed, 1/40 probe failures over 60s — target <3s of actual user-visible disruption. Phase 6 — Nextcloud simplification: - `zzz-redis.config.php` no longer queries sentinel in-process — just points at `redis-master.redis.svc.cluster.local`. Removed 20 lines of PHP. HAProxy handles master tracking transparently now that it's scaled to 3 + PDB minAvailable=2. Phase 7 step 1: - `kubectl scale statefulset/redis-node --replicas=0` (transient — TF removal in a 24h follow-up). Old PVCs `redis-data-redis-node-{0,1}` preserved as cold rollback. Docs: - Rewrote `databases.md` Redis section to reflect post-cutover reality and the sentinel hostname gotcha (so future sessions don't relearn it). - `.claude/reference/service-catalog.md` entry updated. The parallel-bootstrap race documented in the previous commit is still worth watching — the init container now defaults to pod-0 as master when no peer reports role:master-with-slaves, so fresh boots land in a deterministic topology. Closes: code-7n4 Closes: code-9y6 Closes: code-cnf Closes: code-tc4 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:13:43 +00:00
Viktor Barzin	c33f597111	feat(upgrade-agent): add automated service upgrade pipeline with n8n + DIUN Pipeline: DIUN detects new image versions every 6h → webhook to n8n → n8n filters (skip databases/custom/infra/:latest) and rate-limits (max 5/6h) → SSH to dev VM → claude -p runs upgrade agent. Agent workflow: resolve GitHub repo → fetch changelogs → classify risk (SAFE/CAUTION) → backup DB if needed → bump version in .tf → commit+push → wait for CI → verify (pod ready + HTTP + Uptime Kuma) → rollback on failure. Changes: - stacks/n8n: add N8N_PORT=5678 to fix K8s env var conflict - stacks/n8n/workflows: version-controlled n8n workflow backup - docs/architecture/automated-upgrades.md: full pipeline documentation - AGENTS.md: add upgrade agent section - service-catalog.md: update DIUN description Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 21:38:27 +00:00
Viktor Barzin	1ef40daeec	docs: update for MySQL 3→1, CrowdSec/Technitium PG migration, PG tuning, NFS async, node OS tuning [ci skip]	2026-04-13 23:05:46 +01:00
Viktor Barzin	c740ed1301	docs: update Technitium DNS docs after cache optimization - Fix Technitium IP typo: 10.0.20.101 → 10.0.20.201 (service-catalog, vpn.md) - Fix PDB minAvailable: 1 → 2 (networking.md) - Add emrsn.org stub zone, cache TTL tuning, PG query logging, CronJobs - Update forwarders: was "Cloudflare + Google", actually Cloudflare DoH only - Update config storage: was generic PVC, now NFS path	2026-04-12 18:29:25 +01:00
Viktor Barzin	8cd8743140	docs: add phpIPAM, Kea DDNS, and DNS sync documentation - networking.md: Add phpIPAM IPAM section, Kea DDNS config, reverse DNS zones, Technitium dynamic update policy - CLAUDE.md: Add phpipam to DB rotation list, service notes, networking section - service-catalog.md: Add phpipam, mark netbox as disabled/replaced [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 16:01:32 +00:00
Viktor Barzin	fc233bd27f	docs: comprehensive audit and update of all architecture docs and runbooks [ci skip] Audited 14 documentation files against live cluster state and Terraform code. Architecture docs: - databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h), CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints - overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage, correct Vault paths (secret/ not kv/) - compute.md: 272GB physical host RAM, ~160GB allocated to VMs - secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config - networking.md: MetalLB pool 10.0.20.200-220 - ci-cd.md: 9 GHA projects, travel_blog 5.7GB Runbooks: - restore-mysql/postgresql: backup files are .sql.gz (not .sql) - restore-vault: weekly backup (not daily), auto-unseal sidecar note - restore-vaultwarden: PVC is proxmox (not iscsi) - restore-full-cluster: updated node roles, removed trading Reference docs: - CLAUDE.md: 7-day rotation, removed trading from PG list - AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell - service-catalog.md: 6 new stacks, 14 stack column updates	2026-04-06 13:21:05 +03:00
Viktor Barzin	c8de2c4803	[ci skip] Sunset Drone CI: remove all artifacts, DNS, configs, and references Drone CI has been fully replaced by Woodpecker CI at ci.viktorbarzin.me. Destroys K8s resources (12), removes DNS records, NFS exports, Uptime Kuma monitor, dashboard entry, and all code/doc references across 18 files.	2026-02-23 19:38:55 +00:00
Viktor Barzin	abe89c926e	[ci skip] Refactor knowledge: CLAUDE.md 881→190 lines, extract reference data CLAUDE.md changes: - Extract service catalog + Cloudflare domains → .claude/reference/service-catalog.md - Extract Proxmox VMs, hardware, network → .claude/reference/proxmox-inventory.md - Extract GitHub/Drone API patterns → .claude/reference/github-drone-api.md - Extract Authentik state snapshot → .claude/reference/authentik-state.md - Remove Init Container pattern (duplicates setup-project skill) - Remove Poison Fountain service notes (duplicates Anti-AI section) - Consolidate Authentik section (link to skills + reference) - Remove resource limit tables (kept tier definitions inline) Skill merges (37→32): - helm-release-force-rerender + helm-stuck-release-recovery → helm-release-troubleshooting - containerd-multi-registry-pull-through-cache + k8s-docker-registry-cache-bypass → k8s-container-image-caching - (traefik merges in previous commits)	2026-02-22 22:11:31 +00:00

10 commits