Image migration is complete (forgejo-migrate-orphan-images.sh ran;
all in-scope images now live under forgejo.viktorbarzin.me/viktor/) and
the cluster cutover landed in commit 3148d15d. registry-private is
no longer needed.
* infra/modules/docker-registry/docker-compose.yml — registry-private
service block removed; nginx 5050 port mapping dropped.
* infra/modules/docker-registry/nginx_registry.conf — upstream
private block + port 5050 server block removed.
* infra/.woodpecker/build-ci-image.yml — dual-push to
  registry.viktorbarzin.me:5050 dropped; pushes now go only to Forgejo.
  Verify-integrity step removed (the every-15min forgejo-integrity-probe
  in monitoring covers it). Break-glass tarball step still runs but now
  pulls from Forgejo (the only registry left).
The registry-config-sync.yml pipeline will pick this commit up and
sync the new compose+nginx to the VM. Manual final step on the VM:
ssh root@10.0.20.10 'cd /opt/registry && docker compose up -d --remove-orphans'
to actually destroy the registry-private container — compose does
NOT do orphan removal on a normal up -d.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Phase 4 docker-compose + nginx changes I landed earlier dropped
the registry-private container's port-5050 listener BEFORE migrating
the existing images to Forgejo. The registry-config-sync pipeline
applied the new nginx config, breaking pulls from registry-private —
which is the source of every image we still need to copy to Forgejo.
Restore registry-private + the 5050 listener until the migration
script has finished. A follow-up commit will drop them once the images
are confirmed in Forgejo.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
End of forgejo-registry-consolidation. After Phase 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.
Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
— image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
(viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci now pulled
  from Forgejo. build-ci-image.yml still dual-pushes until the next
  build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.
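A quick sanity check that no stack still points at the legacy hosts
(a hedged sketch; the grep pattern is illustrative, not part of the
commit):

    grep -rn --include='*.tf' -E 'registry\.viktorbarzin\.me|10\.0\.20\.10:5050' infra/stacks/ \
      && echo 'legacy references remain' || echo 'all image= references migrated'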
Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
hosts.toml entries for registry.viktorbarzin.me +
10.0.20.10:5050. (Existing nodes already had the file removed
manually by `setup-forgejo-containerd-mirror.sh` rollout — the
cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
service block removed; nginx 5050 port mapping dropped. Pull-
through caches for upstream registries (5000/5010/5020/5030/5040)
stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
`private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf:
  registry_integrity_probe + registry_probe_credentials resources
  stripped. forgejo_integrity_probe is the only manifest probe now.
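To confirm the Secret cleanup took effect, a hedged check (namespace
and jq usage are illustrative; only the Forgejo entry should remain):

    kubectl get secret registry-credentials -n default \
      -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq '.auths | keys'
    # expected: ["forgejo.viktorbarzin.me"]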
Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
diagram now reflect Forgejo. Pre-migration root-cause sentence
preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
alert annotation simplified now that only one registry is in
scope.
Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to
match the new template AND `docker compose up -d --remove-orphans`
to actually stop the registry-private container. Memory id=1078
confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
   registry.viktorbarzin.me:5050 from the `repo:` list; at that
   point the post-push integrity check at lines 33-107 also needs
   to be repointed at Forgejo or removed (the per-build verify is
   redundant with the every-15min Forgejo probe).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The original index-child scan checked if the child's blob data file
existed under /blobs/sha256/<child>/data. That's wrong in a subtle
way: registry:2 serves a per-repo manifest via the link file at
<repo>/_manifests/revisions/sha256/<child-digest>/link, NOT by blob
presence. When cleanup-tags.sh rmtrees a tag, the per-repo revision
links for its index's children also disappear — but the blob data
survives (GC owns that, and runs weekly). Result: blob present,
link absent, API 404 on HEAD — the exact 2026-04-19 failure mode.
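In script terms, the corrected check (a sketch assuming registry:2's
standard on-disk layout with two-character blob shard dirs; $repo and
$child are illustrative, $child a bare hex digest):

    root=/opt/registry/data/private/docker/registry/v2
    # old check (blob presence only): blobs survive tag cleanup, so orphans slip through
    test -f "$root/blobs/sha256/${child:0:2}/$child/data"
    # new check: the per-repo revision link the manifest API actually serves from
    test -f "$root/repositories/$repo/_manifests/revisions/sha256/$child/link"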
Live proof: the registry-integrity-probe CronJob just found 38 real
orphan children (including 98f718c8 from the original incident) while
the previous fix-broken-blobs.sh scan reported 0. After the fix, both
tools agree. The probe had been authoritative all along; the scan was
returning a false negative because it was asking the wrong question.
Post-mortem updated to reflect the true mechanism (link-file absence,
not blob deletion).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Live run on the registry VM surfaced 632 "orphaned" index children across
156 indexes in the pull-through caches (ghcr, immich, affine, linkwarden,
openclaw). These aren't bugs — pull-through caches only fetch what's been
requested, so missing arm64 / arm / attestation children are normal partial
state. Scanning them generates noise that would mask the real signal from
the private registry (where we push full manifests ourselves and a missing
child IS always a bug — the 2026-04-13 + 2026-04-19 failure mode).
Change: index-child scan is now gated on registry_name == "private". Layer-
link scan still runs across all registries (missing blob under a live link
is always a bug, regardless of pull-through semantics).
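The gate is a one-line guard in the scan loop (sketch; the helper names
are illustrative, registry_name as in the script):

    if [ "$registry_name" = "private" ]; then
      scan_index_children "$registry_dir"   # full manifests are pushed here; a gap is a bug
    fi
    scan_layer_links "$registry_dir"        # always runs: live link + missing blob is a bug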
Verified: live run now reports 0 orphans in private registry — consistent
with the hot-fix rebuild of infra-ci:latest earlier today. Layer scan
still inspects 425 links across all registries and finds 0 orphans.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.
Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, then registry:2's GC (Sunday 03:25) walks OCI index children
imperfectly (distribution/distribution#3324 class). Nothing verified
pushes end-to-end; nothing probed the registry for fetchability; nothing
caught orphan indexes.
Phase 1 — Detection:
- .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
step walks the just-pushed manifest (index + children + config + every
layer blob) via HEAD and fails the pipeline on any non-200. Catches
broken pushes at the source (see the sketch after this list).
- stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
three alerts — RegistryManifestIntegrityFailure,
RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
"registry serves 404 for a tag that exists" gap that masked the incident
for 2+ hours.
- docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
timeline, monitoring gaps, permanent fix.
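The verify-integrity walk amounts to HEAD-ing every object the
just-pushed tag references. A minimal sketch (host, repo name, and
credential variables are assumptions; the /v2/ endpoints are the
standard registry API):

    REG=registry.viktorbarzin.me:5050   # dual-push target at the time
    IMG=infra-ci TAG=latest
    ACCEPT='Accept: application/vnd.oci.image.index.v1+json, application/vnd.oci.image.manifest.v1+json'
    # fetch the index, then HEAD each child manifest it references
    index=$(curl -sf -u "$USER:$PASS" -H "$ACCEPT" "https://$REG/v2/$IMG/manifests/$TAG")
    for d in $(echo "$index" | jq -r '.manifests[].digest'); do
      curl -sfI -u "$USER:$PASS" -H "$ACCEPT" "https://$REG/v2/$IMG/manifests/$d" >/dev/null \
        || { echo "child manifest $d: not fetchable"; exit 1; }
    done
    # config + layer blobs get the same HEAD treatment via /v2/$IMG/blobs/<digest>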
Phase 2 — Prevention:
- modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
across all six registry services. Removes the floating-tag footgun.
- modules/docker-registry/fix-broken-blobs.sh: new scan walks every
_manifests/revisions/sha256/<digest> that is an image index and logs a
loud WARNING when a referenced child blob is missing. Does NOT auto-
delete — deleting a published image is a conscious decision. Layer-link
scan preserved.
Phase 3 — Recovery:
- build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
don't need a cosmetic Dockerfile edit (matches convention from
pve-nfs-exports-sync.yml).
- docs/runbooks/registry-rebuild-image.md: exact command sequence for
diagnosing + rebuilding after an orphan-index incident, plus a fallback
for building directly on the registry VM if Woodpecker itself is down.
- docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
cross-references to the new runbook.
Out of scope (verified healthy or intentionally deferred):
- Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
- Registry HA/replication (single-VM SPOF is a known architectural
choice; Synology offsite covers RPO < 1 day).
- Diun exclude for registry:2 — not applicable; Diun only watches
k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.
Verified locally:
- fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
flags both orphan layer links and orphan OCI-index children.
- terraform fmt + validate on stacks/monitoring: success (only unrelated
deprecation warnings).
- python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
modules/docker-registry/docker-compose.yml: both parse clean.
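For reference, the parse check is just (paths from this commit):

    for f in .woodpecker/build-ci-image.yml modules/docker-registry/docker-compose.yml; do
      python3 -c 'import sys, yaml; yaml.safe_load(open(sys.argv[1])); print(sys.argv[1], "parses clean")' "$f"
    done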
Closes: code-4b8
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Blob caching (content-addressed by SHA256) is unaffected — only manifest
re-validation changes. Every pull now checks upstream for the current
manifest digest, eliminating stale :latest tag issues.
The cleanup-tags.sh + garbage-collect cycle can delete blob data while
leaving _layers/ link files intact. The registry then returns HTTP 200
with 0 bytes for those layers, causing "unexpected EOF" on image pulls.
fix-broken-blobs.sh walks all repositories, checks each layer link
against actual blob data, and removes orphaned links so the registry
re-fetches from upstream on next pull.
Schedule: daily at 2:30am (after tag cleanup) and Sunday 3:30am
(after garbage collection). First run found 2335/2556 (91%) of
layer links were orphaned.
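The core of the scan, sketched (assumes registry:2's standard on-disk
layout; the data root shown is illustrative):

    cd /opt/registry/data/dockerhub/docker/registry/v2   # illustrative root
    find repositories -path '*/_layers/sha256/*/link' | while read -r link; do
      digest=$(cut -d: -f2 "$link")                      # link file holds "sha256:<hex>"
      if [ ! -f "blobs/sha256/${digest:0:2}/$digest/data" ]; then
        rm -v "$link"   # orphaned link: registry re-fetches from upstream on next pull
      fi
    done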
localhost resolves to IPv6 ::1 but containers bind to 0.0.0.0 (IPv4
only), causing wget to fail with "Connection refused". The nginx
proxy had 18,462 consecutive health check failures because of this.
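The fix is to make health checks hit the IPv4 loopback explicitly
(sketch; the port is illustrative):

    # before: wget resolves localhost to ::1 first, connection refused
    wget -q -O /dev/null http://localhost:5000/v2/
    # after: 127.0.0.1 matches the container's 0.0.0.0 (IPv4) bind
    wget -q -O /dev/null http://127.0.0.1:5000/v2/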
Also cleared corrupted pull-through cache for mghee/novelapp — the
registry had layer link files pointing to non-existent blob data,
causing containerd to get 200 responses with 0 bytes (unexpected EOF).
- Add proxy_intercept_errors + error_page for 502/503/504 on blob locations
to prevent caching truncated upstream responses (root cause of repeated
ImagePullBackOff across services)
- Reduce proxy_cache_lock_timeout from 15m to 5m — fail fast, let containerd
retry instead of all concurrent pulls waiting on a failed first download
- Add proxy_cache_valid any 0 — never cache error responses
- Add /healthz endpoints on Docker Hub and GHCR servers
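Inside the cached blob location, the new directives look roughly like
this (nginx sketch; the named location is illustrative):

    proxy_intercept_errors on;                  # nginx handles upstream errors itself
    error_page 502 503 504 = @upstream_down;    # illustrative named location, never cached
    proxy_cache_lock_timeout 5m;                # was 15m: fail fast, containerd retries
    proxy_cache_valid any 0;                    # error responses never enter the cache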
- Add draintimeout and proxy.ttl to registry proxy configs
- Add auth.htpasswd section to config-private.yml
- Mount htpasswd file in registry-private container, fix healthcheck for 401
- Rename registry UI from registry.viktorbarzin.me → docker.viktorbarzin.me
- Add Docker CLI ingress at registry.viktorbarzin.me (HTTPS backend, no rate-limit, unlimited body)
- Add docker to cloudflare_proxied_names (registry stays non-proxied)
- Add Kyverno ClusterPolicy to sync registry-credentials secret to all namespaces
- Update infra provisioning to install apache2-utils and generate htpasswd from Vault
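The htpasswd generation boils down to (sketch; registry:2's
auth.htpasswd only accepts bcrypt entries, hence -B; the credential
variables are illustrative, values come from Vault):

    htpasswd -Bbc /opt/registry/htpasswd "$REGISTRY_USER" "$REGISTRY_PASSWORD"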
Root cause: storage.filesystem.maxsize (5GiB) caused Docker Registry to
delete blob data while keeping metadata. The registry then served 200 OK
with the correct Content-Length but a zero-byte body. nginx cached these
broken responses.
Fixes:
- Remove maxsize from dockerhub/ghcr proxy configs (rely on weekly GC)
- nginx: don't cache 206 responses, require 2 requests before caching
- Wiped corrupted cache on registry VM and fixed corrupted pause container
blobs on node3/node4
Split /v2/ location into two: regex match for blobs (cached 24h, immutable
content-addressed by SHA256) and prefix match for everything else including
manifests (proxy_cache off, mutable tags). Also remove disabled registries
(quay, k8s, kyverno) whose containers/configs don't exist on the VM.
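The split in nginx terms (sketch; upstream and cache-zone names are
illustrative; the regex location wins over the prefix location under
nginx matching rules):

    # blobs: content-addressed by digest, immutable, cache hard
    location ~ ^/v2/.+/blobs/sha256: {
        proxy_pass http://registry_upstream;
        proxy_cache blob_cache;
        proxy_cache_valid 200 24h;
    }
    # manifests, tag lists, everything else: mutable, never cache
    location /v2/ {
        proxy_pass http://registry_upstream;
        proxy_cache off;
    }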
Replace individual `docker run` commands with Docker Compose stack managed
by systemd. Nginx now fronts all 5 registry ports (5000/5010/5020/5030/5040)
with proxy_cache_lock to serialize concurrent blob pulls and prevent
corrupt partial responses. Adds QEMU guest agent for remote management.
Replace deprecated wildcard containerd mirror with per-registry
config_path approach. Add proxy containers for ghcr.io, quay.io,
registry.k8s.io, and reg.kyverno.io on the docker-registry VM.
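Per-registry mirror config under containerd's config_path looks like
this (sketch; the ghcr mirror port is illustrative):

    # /etc/containerd/certs.d/ghcr.io/hosts.toml
    server = "https://ghcr.io"

    [host."http://10.0.20.10:5010"]
      capabilities = ["pull", "resolve"]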
Set static IP for docker-registry VM to avoid DHCP issues.
- Rewrite cleanup script to use filesystem deletion (shutil.rmtree)
since proxy registries don't support DELETE via API (405)
- Fix cron entry to invoke with python3
Deploy joxit/docker-registry-ui on port 8080 for browsing images/tags.
Add Python script to prune old registry tags (keeps last N per image),
scheduled daily at 2am via cron. Expose UI via reverse proxy at
registry.viktorbarzin.me with Authentik auth.
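The cron entry amounts to (sketch; the script path is hypothetical,
/etc/cron.d format):

    # /etc/cron.d/registry-cleanup: daily tag prune at 02:00
    0 2 * * * root python3 /opt/registry/cleanup-tags.py >> /var/log/registry-cleanup.log 2>&1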