infra/docs/runbooks/registry-rebuild-image.md
Viktor Barzin 7cb44d7264 [registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rm -rf's tag dirs on the registry VM daily at
02:00, and registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source.
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:08:28 +00:00

Runbook: Rebuild an Image After a Registry Orphan-Index Incident

Last updated: 2026-04-19

When to use this

Pipelines that pull from registry.viktorbarzin.me:5050 are failing with messages like:

  • failed to resolve reference … : not found
  • manifest unknown
  • image can't be pulled (Woodpecker exit 126)
  • error pulling image: HEAD on a child manifest digest returns 404

…and skopeo inspect --tls-verify --creds "$USER:$PASS" docker://registry.viktorbarzin.me:5050/<image>:<tag> returns an OCI image index whose manifests[].digest references are 404 on the registry.

This is the orphan OCI-index failure mode documented in docs/post-mortems/2026-04-19-registry-orphan-index.md. The fix is to rebuild the affected image from source so the registry receives a fresh, complete push.

If the symptom is different (e.g., registry container down, TLS expiry, auth failure), use docs/runbooks/registry-vm.md instead.
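
Before diving in, a quick triage check helps rule those out (a minimal sketch against the standard registry v2 API, using the same $USER/$PASS credentials as the commands below):

# 200 means the registry API and credentials are fine; continue with Phase 1.
# 401 means an auth problem; anything else points at the VM/TLS and belongs in registry-vm.md.
curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}\n' \
  "https://registry.viktorbarzin.me:5050/v2/"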

Phase 1 — Confirm the diagnosis

From any host with skopeo:

REG=registry.viktorbarzin.me:5050
IMAGE=infra-ci
TAG=latest

# 1. Confirm the index exists.
skopeo inspect --tls-verify --creds "$USER:$PASS" \
  --raw "docker://$REG/$IMAGE:$TAG" | jq '.mediaType, .manifests[].digest'

# 2. HEAD each child. Any non-200 = confirmed orphan.
for d in $(skopeo inspect --tls-verify --creds "$USER:$PASS" --raw \
           "docker://$REG/$IMAGE:$TAG" | jq -r '.manifests[].digest'); do
  code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
         -I "https://$REG/v2/$IMAGE/manifests/$d")
  echo "$d$code"
done

If every child is 200, the problem is elsewhere — stop here and check the registry VM, TLS, or auth.

The registry-integrity-probe CronJob in the monitoring namespace runs this same check every 15 minutes across every tag in the catalog; its last run is also a fast way to see which image(s) are affected:

kubectl -n monitoring logs \
  $(kubectl -n monitoring get pods -l job-name -o name \
     --sort-by=.metadata.creationTimestamp | grep registry-integrity-probe | tail -1)
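
If every child manifest returns 200 but pulls still fail, the check can go one level deeper: fetch a child manifest and HEAD its config and layer blobs. This is a minimal sketch against the standard registry v2 API; CHILD is a placeholder for one of the digests printed by the loop above:

# Any code other than 200 means that blob is missing from the registry.
CHILD=<child-digest-from-above>
curl -sk -u "$USER:$PASS" \
  -H 'Accept: application/vnd.oci.image.manifest.v1+json, application/vnd.docker.distribution.manifest.v2+json' \
  "https://$REG/v2/$IMAGE/manifests/$CHILD" \
  | jq -r '.config.digest, .layers[].digest' \
  | while read -r blob; do
      code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
             -I "https://$REG/v2/$IMAGE/blobs/$blob")
      echo "$blob $code"
    done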

Phase 2 — Rebuild

Option A (preferred): rebuild via CI

Find the build-*.yml pipeline that produces the image:

Image         Pipeline                          Repo ID
infra-ci      .woodpecker/build-ci-image.yml    1 (infra)
infra (cli)   .woodpecker/build-cli.yml         1 (infra)
k8s-portal    .woodpecker/k8s-portal.yml        1 (infra)

Trigger a manual build. The Woodpecker API expects a numeric repo ID (paths with owner/name return HTML):

WOODPECKER_TOKEN=$(vault kv get -field=woodpecker_admin_token secret/viktor)

# Kick off a manual build against master.
curl -s -X POST \
  -H "Authorization: Bearer $WOODPECKER_TOKEN" \
  -H "Content-Type: application/json" \
  "https://ci.viktorbarzin.me/api/repos/1/pipelines" \
  -d '{"branch":"master"}' | jq .number

# Follow the pipeline at https://ci.viktorbarzin.me/repos/1/pipeline/<number>

The pipeline's verify-integrity step walks every blob the push references. If it passes, the registry now has a clean index; pull consumers will recover on next attempt.
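
As an optional sanity check, the freshly pushed tag should resolve to a new top-level digest, which skopeo reports directly:

# The Digest field is the sha256 of the index the tag now points at; it should differ
# from the digest observed during the incident.
skopeo inspect --tls-verify --creds "$USER:$PASS" \
  "docker://registry.viktorbarzin.me:5050/infra-ci:latest" | jq -r .Digest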

Option B (fallback): build on the registry VM

Only use this if Woodpecker itself is broken (its own pipeline runs from the same infra-ci image, so a corrupted infra-ci:latest can prevent Option A from recovering).

ssh root@10.0.20.10 '
  # NOTE: the single quotes mean $USER/$PASS expand on the VM, not locally;
  # export the registry credentials in that remote shell before the login/push.
  cd /tmp
  git clone --depth 1 https://github.com/ViktorBarzin/infra
  cd infra/ci
  docker build \
    -t registry.viktorbarzin.me:5050/infra-ci:manual \
    -t registry.viktorbarzin.me:5050/infra-ci:latest .
  docker login -u "$USER" -p "$PASS" registry.viktorbarzin.me:5050
  docker push registry.viktorbarzin.me:5050/infra-ci:manual
  docker push registry.viktorbarzin.me:5050/infra-ci:latest
'

Then re-run any pipelines that failed — Woodpecker UI → Restart, or:

curl -s -X POST \
  -H "Authorization: Bearer $WOODPECKER_TOKEN" \
  "https://ci.viktorbarzin.me/api/repos/1/pipelines/<failed-pipeline-number>"

Phase 3 — Verify

# 1. Fetch the index straight from the registry (bypasses any local containerd cache) and check it.
REG=registry.viktorbarzin.me:5050
skopeo inspect --tls-verify --creds "$USER:$PASS" \
  --raw "docker://$REG/infra-ci:latest" \
  | jq '.manifests[] | {digest, platform}'

# 2. HEAD every child digest — all should be 200.
for d in $(skopeo inspect --tls-verify --creds "$USER:$PASS" --raw \
           "docker://$REG/infra-ci:latest" | jq -r '.manifests[].digest'); do
  code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
         -I "https://$REG/v2/infra-ci/manifests/$d")
  [ "$code" = "200" ] || echo "STILL BROKEN: $d$code"
done
echo "verified"

# 3. Kick off an ad-hoc probe run for good measure.
JOB="registry-integrity-probe-verify-$(date +%s)"
kubectl -n monitoring create job --from=cronjob/registry-integrity-probe "$JOB"
kubectl -n monitoring logs -f "job/$JOB"

The RegistryManifestIntegrityFailure alert clears automatically when the probe's next run returns zero failures.
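
To confirm the alert state directly instead of waiting for a dashboard refresh, query Alertmanager (the service name below is an assumption; substitute whatever runs in the monitoring namespace):

# Port-forward Alertmanager and count active RegistryManifestIntegrityFailure alerts (0 = cleared).
kubectl -n monitoring port-forward svc/alertmanager-operated 9093:9093 &
sleep 2
curl -sG 'http://localhost:9093/api/v2/alerts' \
  --data-urlencode 'filter=alertname="RegistryManifestIntegrityFailure"' | jq length
kill %1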

Phase 4 — Investigate orphans

Once the immediate fix is in, check whether any OTHER images on the registry have orphan children:

ssh root@10.0.20.10 'bash /opt/registry/fix-broken-blobs.sh --dry-run 2>&1 | grep "ORPHAN INDEX"'

Each hit is a separate image that will eventually fail to pull. Rebuild them in the same way (Option A preferred). If the list is long, open a beads task — do NOT batch-delete the indexes; that's a destructive registry operation outside this runbook's scope.
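
For each repository the scan flags, listing its tags and re-running the Phase 1 child-HEAD check per tag shows which tags are actually affected (standard registry v2 API; <image> is a placeholder):

curl -sk -u "$USER:$PASS" \
  "https://registry.viktorbarzin.me:5050/v2/<image>/tags/list" | jq .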

See also

  • docs/post-mortems/2026-04-19-registry-orphan-index.md — why this happens.
  • docs/runbooks/registry-vm.md — VM-level operations (DNS, docker compose restarts).
  • modules/docker-registry/fix-broken-blobs.sh — the scanner itself; runs nightly and after each GC.
  • stacks/monitoring/modules/monitoring/main.tf — registry_integrity_probe CronJob definition.