[registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery

Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, and registry:2's GC (Sunday 03:25) then walks OCI index children
imperfectly (distribution/distribution#3324 class). Nothing verified pushes
end-to-end; nothing probed the registry for fetchability; nothing caught
orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source.
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.
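
A minimal sketch of the walk the verify-integrity step is described as doing, assuming the standard registry HTTP API v2 paths (`/v2/<repo>/manifests/<ref>` and `/v2/<repo>/blobs/<digest>`). The helper names `paths_to_check` and `failing_paths`, and the injected `get_manifest` / `head` callables, are hypothetical stand-ins for whatever HTTP client the pipeline step uses; this is not the step's actual implementation.

```python
INDEX_MEDIA_TYPES = (
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
)


def paths_to_check(repo, reference, get_manifest):
    """List every API path that must answer 200 for `reference` to be pullable.

    get_manifest(ref) is a hypothetical callable returning the parsed manifest
    JSON as a dict, or None when the registry cannot serve it.
    """
    paths = [f"/v2/{repo}/manifests/{reference}"]
    top = get_manifest(reference)
    if top is None:
        return paths  # the HEAD on the tag itself will report the failure

    if top.get("mediaType") in INDEX_MEDIA_TYPES:
        images = []
        for child in top.get("manifests", []):
            paths.append(f"/v2/{repo}/manifests/{child['digest']}")
            child_manifest = get_manifest(child["digest"])
            if child_manifest is not None:
                images.append(child_manifest)
    else:
        images = [top]  # single-platform image: no index level to walk

    for image in images:
        paths.append(f"/v2/{repo}/blobs/{image['config']['digest']}")
        for layer in image.get("layers", []):
            paths.append(f"/v2/{repo}/blobs/{layer['digest']}")
    return paths


def failing_paths(paths, head):
    """Return the paths whose status from the hypothetical head(path) is not 200."""
    return [path for path in paths if head(path) != 200]
```

A pipeline step built this way would exit non-zero whenever `failing_paths` returns a non-empty list, which is the "fail on any non-200" behaviour described above.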

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.
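
The lockstep pin could be guarded by a small check along these lines. This is a sketch only: `registry_image_tags` and `unpinned_or_mixed` are hypothetical helpers, not part of this commit, and the check reads the compose file as plain text rather than parsing YAML.

```python
import re

# Matches lines such as "    image: registry:2.8.3" anywhere in the compose text.
IMAGE_LINE = re.compile(r"^\s*image:\s*(registry:\S+)\s*$", re.MULTILINE)
# A fully pinned tag has major.minor.patch, e.g. registry:2.8.3.
PINNED = re.compile(r"^registry:\d+\.\d+\.\d+$")


def registry_image_tags(compose_text):
    """Return every registry:<tag> image reference found in the compose text."""
    return IMAGE_LINE.findall(compose_text)


def unpinned_or_mixed(compose_text):
    """Return a list of problems: floating tags, or services on different versions."""
    tags = registry_image_tags(compose_text)
    problems = [tag for tag in tags if not PINNED.match(tag)]
    if len(set(tags)) > 1:
        problems.append("mixed versions: " + ", ".join(sorted(set(tags))))
    return problems
```

An empty result means every registry service uses the same fully pinned tag.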

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Viktor Barzin 2026-04-19 17:08:28 +00:00
parent df2c53db8d
commit 7cb44d7264
10 changed files with 779 additions and 41 deletions


@@ -3,8 +3,12 @@ networks:
     driver: bridge
 services:
+  # registry:2 is pinned after the 2026-04-13 + 2026-04-19 orphan-index incidents.
+  # Floating tags were swapping to regressed versions between GC runs. Upgrade
+  # path: bump all six registry-* services in lockstep and bounce via
+  # `systemctl restart docker-compose-registry.service`.
   registry-dockerhub:
-    image: registry:2
+    image: registry:2.8.3
     container_name: registry-dockerhub
     restart: always
     volumes:
@@ -22,7 +26,7 @@ services:
       start_period: 10s
   registry-ghcr:
-    image: registry:2
+    image: registry:2.8.3
     container_name: registry-ghcr
     restart: always
     volumes:
@@ -38,7 +42,7 @@ services:
       start_period: 10s
   registry-quay:
-    image: registry:2
+    image: registry:2.8.3
     container_name: registry-quay
     restart: always
     volumes:
@@ -54,7 +58,7 @@ services:
       start_period: 10s
   registry-k8s:
-    image: registry:2
+    image: registry:2.8.3
     container_name: registry-k8s
     restart: always
     volumes:
@@ -70,7 +74,7 @@ services:
       start_period: 10s
   registry-kyverno:
-    image: registry:2
+    image: registry:2.8.3
     container_name: registry-kyverno
     restart: always
     volumes:
@@ -86,7 +90,7 @@ services:
       start_period: 10s
   registry-private:
-    image: registry:2
+    image: registry:2.8.3
     container_name: registry-private
     restart: always
     volumes:


@@ -1,25 +1,33 @@
 #!/usr/bin/env python3
-"""Finds and removes layer links that point to non-existent blobs.
+"""Registry integrity scanner — two classes of brokenness.
-When the cleanup-tags.sh + garbage-collect cycle runs, it can delete blob data
-while leaving _layers/ link files intact. The registry then returns HTTP 200
-with 0 bytes for those layers (it finds the link, trusts the blob exists, but
-the data is gone). This causes containerd to fail with "unexpected EOF".
+1. Orphaned layer links: the cleanup-tags.sh + garbage-collect cycle can delete
+   blob data while leaving _layers/ link files intact. The registry then returns
+   HTTP 200 with 0 bytes for those layers (it finds the link, trusts the blob
+   exists, but the data is gone). Containerd sees "unexpected EOF".
+   Action: delete the orphan link so the next pull re-fetches cleanly.
-This script walks all repositories, checks each layer link against the actual
-blobs directory, and removes any orphaned links. On next pull, the registry
-will re-fetch the missing blobs from the upstream registry.
+2. Orphaned OCI-index children: an image index (multi-platform manifest list)
+   references child manifests by digest. If a child's blob has been deleted —
+   by a cleanup-tags.sh tag rmtree followed by garbage-collect walking the
+   children wrong (distribution/distribution#3324 class), or by an incomplete
+   `buildx --push` whose partial blob was later purged by `uploadpurging`
+   the index survives but pulls fail with `manifest unknown`.
+   Action: log loudly. Deleting an index is a conscious decision (the image
+   was published; removing it breaks downstream consumers), so we surface
+   the problem and leave repair to a human or to the rebuild runbook.
-Run after garbage-collect (e.g., 3:15 AM Sunday) or daily.
+Run after garbage-collect (Sunday 03:30) and daily (Mon-Sat 02:30).
 """
 import argparse
+import json
 import os
 import sys
 sys.stdout.reconfigure(line_buffering=True)
-parser = argparse.ArgumentParser(description="Remove orphaned registry layer links")
+parser = argparse.ArgumentParser(description="Scan registry for orphaned blobs and indexes")
 parser.add_argument("base", nargs="?", default="/opt/registry/data", help="Registry data directory")
 parser.add_argument("--dry-run", action="store_true", help="Report but don't delete")
 args = parser.parse_args()
@@ -27,39 +35,101 @@ args = parser.parse_args()
 BASE = args.base
 DRY_RUN = args.dry_run
-total_removed = 0
-total_checked = 0
+INDEX_MEDIA_TYPES = (
+    "application/vnd.oci.image.index.v1+json",
+    "application/vnd.docker.distribution.manifest.list.v2+json",
+)
+total_layer_removed = 0
+total_layer_checked = 0
+total_index_scanned = 0
+total_index_orphans = 0
+def load_manifest_blob(blobs_root, digest_hex):
+    blob_path = os.path.join(blobs_root, digest_hex[:2], digest_hex, "data")
+    if not os.path.isfile(blob_path):
+        return None
+    try:
+        with open(blob_path, "rb") as f:
+            raw = f.read(1024 * 1024)
+    except OSError:
+        return None
+    try:
+        return json.loads(raw)
+    except (json.JSONDecodeError, UnicodeDecodeError):
+        return None
 for registry_name in sorted(os.listdir(BASE)):
     repos_dir = os.path.join(BASE, registry_name, "docker/registry/v2/repositories")
-    blobs_dir = os.path.join(BASE, registry_name, "docker/registry/v2/blobs")
+    blobs_root = os.path.join(BASE, registry_name, "docker/registry/v2/blobs/sha256")
     if not os.path.isdir(repos_dir):
         continue
-    for root, dirs, files in os.walk(repos_dir):
-        if not root.endswith("/_layers/sha256"):
-            continue
+    for root, _, _ in os.walk(repos_dir):
+        # --- Scan 1: orphan layer links ----------------------------------------
+        if root.endswith("/_layers/sha256"):
+            repo = root.replace(repos_dir + "/", "").replace("/_layers/sha256", "")
-        repo = root.replace(repos_dir + "/", "").replace("/_layers/sha256", "")
-        for digest_dir in os.listdir(root):
-            link_file = os.path.join(root, digest_dir, "link")
-            if not os.path.isfile(link_file):
-                continue
+            for digest_dir in os.listdir(root):
+                link_file = os.path.join(root, digest_dir, "link")
+                if not os.path.isfile(link_file):
+                    continue
+                total_layer_checked += 1
+                blob_data = os.path.join(blobs_root, digest_dir[:2], digest_dir, "data")
+                if os.path.isfile(blob_data):
+                    continue
-            total_checked += 1
-            # Check if the actual blob data exists
-            blob_data = os.path.join(blobs_dir, "sha256", digest_dir[:2], digest_dir, "data")
-            if not os.path.isfile(blob_data):
                 prefix = "[DRY RUN] " if DRY_RUN else ""
                 print(f"{prefix}[{registry_name}/{repo}] removing orphaned layer link: {digest_dir[:12]}...")
                 if not DRY_RUN:
                     # Remove the entire digest directory (contains the link file)
                     import shutil
                     shutil.rmtree(os.path.join(root, digest_dir))
-                total_removed += 1
+                total_layer_removed += 1
+        # --- Scan 2: orphan OCI-index children --------------------------------
+        elif root.endswith("/_manifests/revisions/sha256"):
+            repo = root.replace(repos_dir + "/", "").replace("/_manifests/revisions/sha256", "")
+            for digest_dir in os.listdir(root):
+                # Manifest revision entry. Load the blob it points to.
+                manifest = load_manifest_blob(blobs_root, digest_dir)
+                if manifest is None:
+                    continue
+                media_type = manifest.get("mediaType", "")
+                if media_type not in INDEX_MEDIA_TYPES:
+                    continue
+                total_index_scanned += 1
+                for child in manifest.get("manifests", []):
+                    child_digest = child.get("digest", "")
+                    if not child_digest.startswith("sha256:"):
+                        continue
+                    child_hex = child_digest[len("sha256:"):]
+                    child_blob = os.path.join(blobs_root, child_hex[:2], child_hex, "data")
+                    if os.path.isfile(child_blob):
+                        continue
+                    platform = child.get("platform", {})
+                    arch = platform.get("architecture", "?")
+                    os_ = platform.get("os", "?")
+                    print(
+                        f"WARNING [{registry_name}/{repo}] ORPHAN INDEX: "
+                        f"{digest_dir[:12]} references missing child {child_hex[:12]} "
+                        f"({arch}/{os_}) — rebuild required, will not auto-repair"
+                    )
+                    total_index_orphans += 1
 mode = "DRY RUN — " if DRY_RUN else ""
-print(f"\n{mode}Checked {total_checked} layer links, removed {total_removed} orphaned.")
+print(f"\n{mode}Layer scan: checked {total_layer_checked} links, removed {total_layer_removed} orphaned.")
+print(f"{mode}Index scan: inspected {total_index_scanned} image indexes, found {total_index_orphans} orphaned children.")
+if total_index_orphans > 0:
+    print(f"\nACTION REQUIRED: {total_index_orphans} orphan index child(ren) detected. "
+          "See docs/runbooks/registry-rebuild-image.md — the affected image must be rebuilt "
+          "(a registry DELETE on an index is a conscious decision, not an automated repair).")
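
The commit message mentions a `--dry-run` against a synthetic registry directory. A fixture for that kind of check might look like the sketch below, which mirrors the on-disk layout the scanner walks (`<base>/<registry>/docker/registry/v2/{repositories,blobs}`). The helper names, the registry directory name `private`, and the blob payloads are assumptions for illustration; they are not taken from the commit.

```python
import hashlib
import json
import os
import tempfile

V2 = "docker/registry/v2"


def write_blob(base, registry_name, payload):
    """Store payload at blobs/sha256/<aa>/<digest>/data and return the hex digest."""
    digest_hex = hashlib.sha256(payload).hexdigest()
    blob_dir = os.path.join(base, registry_name, V2, "blobs/sha256", digest_hex[:2], digest_hex)
    os.makedirs(blob_dir, exist_ok=True)
    with open(os.path.join(blob_dir, "data"), "wb") as f:
        f.write(payload)
    return digest_hex


def write_link(base, registry_name, repo, kind, digest_hex):
    """Create repositories/<repo>/<kind>/sha256/<digest>/link for a layer or revision."""
    link_dir = os.path.join(base, registry_name, V2, "repositories", repo, kind, "sha256", digest_hex)
    os.makedirs(link_dir, exist_ok=True)
    with open(os.path.join(link_dir, "link"), "w") as f:
        f.write(f"sha256:{digest_hex}")


def build_synthetic_registry(base, registry_name="private", repo="infra-ci"):
    """Build one orphan layer link and one index with a missing child."""
    # Orphan layer: the link exists but the blob was never written.
    orphan_layer = hashlib.sha256(b"missing layer").hexdigest()
    write_link(base, registry_name, repo, "_layers", orphan_layer)

    # Index with one child present on disk and one child missing.
    present_child = write_blob(base, registry_name, b'{"schemaVersion": 2}')
    missing_child = hashlib.sha256(b"missing child manifest").hexdigest()
    index = {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.index.v1+json",
        "manifests": [
            {"digest": f"sha256:{present_child}",
             "platform": {"architecture": "amd64", "os": "linux"}},
            {"digest": f"sha256:{missing_child}",
             "platform": {"architecture": "arm64", "os": "linux"}},
        ],
    }
    index_hex = write_blob(base, registry_name, json.dumps(index).encode())
    write_link(base, registry_name, repo, "_manifests/revisions", index_hex)
    return {"orphan_layer": orphan_layer, "index": index_hex, "missing_child": missing_child}


if __name__ == "__main__":
    base = tempfile.mkdtemp(prefix="synthetic-registry-")
    print(base, build_synthetic_registry(base))
```

Pointing the scanner at the printed directory with `--dry-run` should then report one orphaned layer link and one orphan index child, without deleting anything.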