Commit graph

25 commits

Author SHA1 Message Date
Viktor Barzin
8c73a0243a [forgejo] Phase 4 final decommission: drop registry-private container + port 5050
Image migration completed (forgejo-migrate-orphan-images.sh ran +
all in-scope images now under forgejo.viktorbarzin.me/viktor/) and
the cluster cutover landed in commit 3148d15d. registry-private is
no longer needed.

* infra/modules/docker-registry/docker-compose.yml — registry-private
  service block removed; nginx 5050 port mapping dropped.
* infra/modules/docker-registry/nginx_registry.conf — upstream
  private block + port 5050 server block removed.
* infra/.woodpecker/build-ci-image.yml — drop the dual-push to
  registry.viktorbarzin.me:5050; only push to Forgejo. Verify-
  integrity step removed (the every-15min forgejo-integrity-probe
  in monitoring covers it). Break-glass tarball step still runs but
  pulls from Forgejo (the only registry left).

The registry-config-sync.yml pipeline will pick this commit up and
sync the new compose+nginx to the VM. Manual final step on the VM:
  ssh root@10.0.20.10 'cd /opt/registry && docker compose up -d --remove-orphans'
to actually destroy the registry-private container — compose does
NOT do orphan removal on a normal up -d.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:34 +00:00
Viktor Barzin
56fbd281c9 [forgejo] Restore registry-private temporarily until image migration completes
The Phase 4 docker-compose + nginx changes I landed earlier dropped
the registry-private container's port-5050 listener BEFORE migrating
the existing images to Forgejo. The registry-config-sync pipeline
applied the new nginx config, breaking pulls from registry-private —
which is the source of every image we still need to copy to Forgejo.

Restore registry-private + the 5050 listener until the migration
script has finished. Subsequent commit will drop them once images
are confirmed in Forgejo.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:34 +00:00
Viktor Barzin
4ec40ea804 [forgejo] Phases 3+4+5: cutover, decommission, docs sweep
End of forgejo-registry-consolidation. After Phase 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.

Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
  fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
  — image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
  (viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
  from Forgejo. build-ci-image.yml dual-pushes still until next
  build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.

Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
  registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
  Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
  hosts.toml entries for registry.viktorbarzin.me +
  10.0.20.10:5050. (Existing nodes already had the file removed
  manually by `setup-forgejo-containerd-mirror.sh` rollout — the
  cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
  service block removed; nginx 5050 port mapping dropped. Pull-
  through caches for upstream registries (5000/5010/5020/5030/5040)
  stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
  `private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf: registry_
  integrity_probe + registry_probe_credentials resources stripped.
  forgejo_integrity_probe is the only manifest probe now.

Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
  through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
  diagram now reflect Forgejo. Pre-migration root-cause sentence
  preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
  row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
  to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
  alert annotation simplified now that only one registry is in
  scope.

Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to
   match the new template AND `docker compose up -d --remove-orphans`
   to actually stop the registry-private container. Memory id=1078
   confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
   on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
   registry.viktorbarzin.me:5050 from the `repo:` list — at that
   point the post-push integrity check at line 33-107 also needs
   to be repointed at Forgejo or removed (the per-build verify is
   redundant with the every-15min Forgejo probe).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:34 +00:00
Viktor Barzin
42961a5f58 [registry] fix-broken-blobs.sh — check revision-link, not blob data
The original index-child scan checked if the child's blob data file
existed under /blobs/sha256/<child>/data. That's wrong in a subtle
way: registry:2 serves a per-repo manifest via the link file at
<repo>/_manifests/revisions/sha256/<child-digest>/link, NOT by blob
presence. When cleanup-tags.sh rmtrees a tag, the per-repo revision
links for its index's children also disappear — but the blob data
survives (GC owns that, and runs weekly). Result: blob present,
link absent, API 404 on HEAD — the exact 2026-04-19 failure mode.

Live proof: the registry-integrity-probe CronJob just found 38 real
orphan children (including 98f718c8 from the original incident) while
the previous fix-broken-blobs.sh scan reported 0. After the fix, both
tools agree. The probe had been authoritative all along; the scan was
a false-negative because it was asking the wrong question.

Post-mortem updated to reflect the true mechanism (link-file absence,
not blob deletion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:43:35 +00:00
Viktor Barzin
9f9d7d10ff [registry] Scope OCI-index scan to private registry only
Live run on the registry VM surfaced 632 "orphaned" index children across
156 indexes in the pull-through caches (ghcr, immich, affine, linkwarden,
openclaw). These aren't bugs — pull-through caches only fetch what's been
requested, so missing arm64 / arm / attestation children are normal partial
state. Scanning them generates noise that would mask the real signal from
the private registry (where we push full manifests ourselves and a missing
child IS always a bug — the 2026-04-13 + 2026-04-19 failure mode).

Change: index-child scan is now gated on registry_name == "private". Layer-
link scan still runs across all registries (missing blob under a live link
is always a bug, regardless of pull-through semantics).

Verified: live run now reports 0 orphans in private registry — consistent
with the hot-fix rebuild of infra-ci:latest earlier today. Layer scan
still inspects 425 links across all registries and finds 0 orphans.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:23:04 +00:00
Viktor Barzin
7cb44d7264 [registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source.
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:08:28 +00:00
Viktor Barzin
d1059d6017 registry: set proxy TTL to 0 to prevent stale :latest images
Blob caching (content-addressed by SHA256) is unaffected — only manifest
re-validation changes. Every pull now checks upstream for the current
manifest digest, eliminating stale :latest tag issues.
2026-03-30 00:02:48 +03:00
Viktor Barzin
28587c674d fix-broken-blobs: use argparse for proper flag handling
--dry-run as first arg was being parsed as the BASE directory path.
2026-03-29 22:33:33 +03:00
Viktor Barzin
dd461beb33 add registry blob integrity checker to self-heal corrupted cache
The cleanup-tags.sh + garbage-collect cycle can delete blob data while
leaving _layers/ link files intact. The registry then returns HTTP 200
with 0 bytes for those layers, causing "unexpected EOF" on image pulls.

fix-broken-blobs.sh walks all repositories, checks each layer link
against actual blob data, and removes orphaned links so the registry
re-fetches from upstream on next pull.

Schedule: daily at 2:30am (after tag cleanup) and Sunday 3:30am
(after garbage collection). First run found 2335/2556 (91%) of
layer links were orphaned.
2026-03-29 22:31:39 +03:00
Viktor Barzin
facf959ecf fix registry healthchecks: use 127.0.0.1 instead of localhost
localhost resolves to IPv6 ::1 but containers bind to 0.0.0.0 (IPv4
only), causing wget to fail with "Connection refused". The nginx
proxy had 18,462 consecutive health check failures because of this.

Also cleared corrupted pull-through cache for mghee/novelapp — the
registry had layer link files pointing to non-existent blob data,
causing containerd to get 200 responses with 0 bytes (unexpected EOF).
2026-03-29 22:29:27 +03:00
Viktor Barzin
3f0ecda737 harden pull-through cache: intercept errors, reduce lock timeout, add healthz
- Add proxy_intercept_errors + error_page for 502/503/504 on blob locations
  to prevent caching truncated upstream responses (root cause of repeated
  ImagePullBackOff across services)
- Reduce proxy_cache_lock_timeout from 15m to 5m — fail fast, let containerd
  retry instead of all concurrent pulls waiting on a failed first download
- Add proxy_cache_valid any 0 — never cache error responses
- Add /healthz endpoints on Docker Hub and GHCR servers
- Add draintimeout and proxy.ttl to registry proxy configs
2026-03-23 11:33:06 +02:00
Viktor Barzin
36171bcda4 add htpasswd auth to private docker registry + expose at registry.viktorbarzin.me
- Add auth.htpasswd section to config-private.yml
- Mount htpasswd file in registry-private container, fix healthcheck for 401
- Rename registry UI from registry.viktorbarzin.me → docker.viktorbarzin.me
- Add Docker CLI ingress at registry.viktorbarzin.me (HTTPS backend, no rate-limit, unlimited body)
- Add docker to cloudflare_proxied_names (registry stays non-proxied)
- Add Kyverno ClusterPolicy to sync registry-credentials secret to all namespaces
- Update infra provisioning to install apache2-utils and generate htpasswd from Vault
2026-03-22 22:10:10 +02:00
Viktor Barzin
f8a36f0621 fix pull-through cache: remove maxsize, harden nginx caching [ci skip]
Root cause: storage.filesystem.maxsize (5GiB) caused Docker Registry to
delete blob data while keeping metadata. Registry then served 200 OK with
correct Content-Length but 0 bytes body. nginx cached these broken responses.

Fixes:
- Remove maxsize from dockerhub/ghcr proxy configs (rely on weekly GC)
- nginx: don't cache 206 responses, require 2 requests before caching
- Wiped corrupted cache on registry VM and fixed corrupted pause container
  blobs on node3/node4
2026-03-16 07:41:11 +00:00
Viktor Barzin
7e72a10848 exclude manifest requests from nginx registry cache
Split /v2/ location into two: regex match for blobs (cached 24h, immutable
content-addressed by SHA256) and prefix match for everything else including
manifests (proxy_cache off, mutable tags). Also remove disabled registries
(quay, k8s, kyverno) whose containers/configs don't exist on the VM.
2026-03-14 23:42:17 +00:00
Viktor Barzin
09a810f8fb [ci skip] fix: use $http_host in nginx to preserve port in registry redirects 2026-02-28 20:16:03 +00:00
Viktor Barzin
96c0353c13 [ci skip] add TLS to private registry, switch to registry.viktorbarzin.me 2026-02-28 19:40:38 +00:00
Viktor Barzin
925dbe39c1 [ci skip] add registry-private service to Docker Compose stack 2026-02-28 17:57:04 +00:00
Viktor Barzin
64c55a6710 [ci skip] add nginx upstream and server block for private registry on port 5050 2026-02-28 17:57:03 +00:00
Viktor Barzin
2102ffdb8b [ci skip] add private R/W registry config for CI build caching 2026-02-28 17:56:50 +00:00
Viktor Barzin
865b68ce77 [ci skip] Rebuild docker-registry with nginx serialization on all ports
Replace individual `docker run` commands with Docker Compose stack managed
by systemd. Nginx now fronts all 5 registry ports (5000/5010/5020/5030/5040)
with proxy_cache_lock to serialize concurrent blob pulls and prevent
corrupt partial responses. Adds QEMU guest agent for remote management.
2026-02-22 21:45:53 +00:00
Viktor Barzin
a67a6f350e [ci skip] Fix pull-through cache for all registries
Replace deprecated wildcard containerd mirror with per-registry
config_path approach. Add proxy containers for ghcr.io, quay.io,
registry.k8s.io, and reg.kyverno.io on the docker-registry VM.
Set static IP for docker-registry VM to avoid DHCP issues.
2026-02-15 14:35:52 +00:00
Viktor Barzin
375e3e115a [ci skip] Fix registry tag cleanup for pull-through cache
- Rewrite cleanup script to use filesystem deletion (shutil.rmtree)
  since proxy registries don't support DELETE via API (405)
- Fix cron entry to invoke with python3
2026-02-07 22:45:17 +00:00
Viktor Barzin
11d328fb99 Add Docker registry UI and tag cleanup automation
Deploy joxit/docker-registry-ui on port 8080 for browsing images/tags.
Add Python script to prune old registry tags (keeps last N per image),
scheduled daily at 2am via cron. Expose UI via reverse proxy at
registry.viktorbarzin.me with Authentik auth.
2026-02-07 22:38:15 +00:00
Viktor Barzin
3b7d295119 add nginx reverse proxy to serialize registyr requests for the same path to avoid race conditions [ci skip] 2025-12-29 20:16:13 +00:00
Viktor Barzin
b15246a2cb add docker registry vm and allow multiple provisioning cmds in templates [ci skip] 2025-10-12 18:54:29 +00:00