Commit graph

5 commits

Author SHA1 Message Date
Viktor Barzin
42961a5f58 [registry] fix-broken-blobs.sh — check revision-link, not blob data
The original index-child scan checked if the child's blob data file
existed under /blobs/sha256/<child>/data. That's wrong in a subtle
way: registry:2 serves a per-repo manifest via the link file at
<repo>/_manifests/revisions/sha256/<child-digest>/link, NOT by blob
presence. When cleanup-tags.sh rmtrees a tag, the per-repo revision
links for its index's children also disappear — but the blob data
survives (GC owns that, and runs weekly). Result: blob present,
link absent, API 404 on HEAD — the exact 2026-04-19 failure mode.

Live proof: the registry-integrity-probe CronJob just found 38 real
orphan children (including 98f718c8 from the original incident) while
the previous fix-broken-blobs.sh scan reported 0. After the fix, both
tools agree. The probe had been authoritative all along; the scan was
a false-negative because it was asking the wrong question.

Post-mortem updated to reflect the true mechanism (link-file absence,
not blob deletion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:43:35 +00:00
Viktor Barzin
9f9d7d10ff [registry] Scope OCI-index scan to private registry only
Live run on the registry VM surfaced 632 "orphaned" index children across
156 indexes in the pull-through caches (ghcr, immich, affine, linkwarden,
openclaw). These aren't bugs — pull-through caches only fetch what's been
requested, so missing arm64 / arm / attestation children are normal partial
state. Scanning them generates noise that would mask the real signal from
the private registry (where we push full manifests ourselves and a missing
child IS always a bug — the 2026-04-13 + 2026-04-19 failure mode).

Change: index-child scan is now gated on registry_name == "private". Layer-
link scan still runs across all registries (missing blob under a live link
is always a bug, regardless of pull-through semantics).

Verified: live run now reports 0 orphans in private registry — consistent
with the hot-fix rebuild of infra-ci:latest earlier today. Layer scan
still inspects 425 links across all registries and finds 0 orphans.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:23:04 +00:00
Viktor Barzin
7cb44d7264 [registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source.
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:08:28 +00:00
Viktor Barzin
28587c674d fix-broken-blobs: use argparse for proper flag handling
--dry-run as first arg was being parsed as the BASE directory path.
2026-03-29 22:33:33 +03:00
Viktor Barzin
dd461beb33 add registry blob integrity checker to self-heal corrupted cache
The cleanup-tags.sh + garbage-collect cycle can delete blob data while
leaving _layers/ link files intact. The registry then returns HTTP 200
with 0 bytes for those layers, causing "unexpected EOF" on image pulls.

fix-broken-blobs.sh walks all repositories, checks each layer link
against actual blob data, and removes orphaned links so the registry
re-fetches from upstream on next pull.

Schedule: daily at 2:30am (after tag cleanup) and Sunday 3:30am
(after garbage collection). First run found 2335/2556 (91%) of
layer links were orphaned.
2026-03-29 22:31:39 +03:00