infra/docs/post-mortems/2026-04-19-registry-orphan-index.md
Viktor Barzin 7cb44d7264 [registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, and registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source.
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:08:28 +00:00


Post-Mortem: Private Registry Orphan OCI-Index — Repeat Incident

| Field | Value |
|---|---|
| Date | 2026-04-19 (first occurrence 2026-04-13) |
| Duration | ~40 min of blocked CI each time; only detected via pipeline failures |
| Severity | SEV2: all infra CI pipelines using infra-ci:latest failed (P366 → P376 all exit 126 "image can't be pulled") |
| Affected Services | Every Woodpecker pipeline that starts with image: registry.viktorbarzin.me:5050/infra-ci:latest (default.yml, build-cli.yml, renew-tls.yml, drift-detection.yml, provision-user.yml, k8s-portal.yml, postmortem-todos.yml, issue-automation.yml, pve-nfs-exports-sync.yml) |
| Status | Hot fix green (three commits: a05d63ee, 6371e75e, c113be4d; URL fix + rebuild). This doc captures the permanent fix landed in the same branch. |

Summary

On 2026-04-19 ~09:00 UTC, every infra CI pipeline started failing at the clone step with "image can't be pulled". The image in question — the CI toolchain image registry.viktorbarzin.me:5050/infra-ci:latest — resolved to an OCI image index whose linux/amd64 platform manifest (sha256:98f718c8…) and its in-toto attestation (sha256:27d5ab83…) returned HTTP 404 from the private registry. The index record itself still existed — it's the children that had been garbage-collected out from under it.

This is the second identical incident: the same failure mode occurred on 2026-04-13 against a different image. Both times the immediate fix was to rebuild the image from scratch; both times the root cause was left unaddressed.

Impact

  • User-facing: all CI pipelines failed. No automated Terraform applies, no TLS renewal, no drift detection. Manual workflows (Woodpecker UI reruns) all failed with the same error.
  • Blast radius: every pipeline that pulls infra-ci. Does NOT affect k8s workloads (those pull via containerd, which goes through the pull-through proxy on :5000/:5010 — a completely different code path).
  • Duration on 2026-04-19: from first P366 failure to the hot-fix commit c113be4d — roughly 40 min. Pipelines that had already been triggered queued up until the rebuild restored :latest.
  • Data loss: none. The registry has the index object; the child manifests are reproducible by rebuilding the source image.
  • Monitoring gap: nothing alerted. The only signal was the individual pipeline failures from Woodpecker. No Prometheus alert fires on "the registry served a 404 for a tag that exists".

Timeline (UTC, 2026-04-19)

| Time | Event |
|---|---|
| ~09:00 | P366 (default.yml on master) fails with exit 126. |
| 09:00-11:00 | P367, P368, … P376 all fail with the same error. Nobody pages; there's no alert configured. |
| 11:15 | User notices and investigates: skopeo inspect reveals the missing platform manifest. |
| 11:20 | Hot fix phase begins: a05d63ee fixes a push-URL misalignment, 6371e75e and c113be4d trigger a full rebuild. |
| 11:40 | Rebuild completes; infra-ci:latest resolves to a fresh, complete index. Pipelines green from P377 onward. |
| 11:45 | User requests a proper root-cause fix: "this is the second time - what's actually broken?" |
| 12:00 | Investigation begins (this document's work). |

Root Cause Chain

[1] cleanup-tags.sh runs daily at 02:00 on the registry VM
 └─> For each repository, keeps the last 10 tags by mtime, rmtrees the rest.
     This walks `_manifests/tags/<tag>` directly, bypassing the registry API.
         │
         ├─> [2] registry:2 garbage-collect runs weekly (Sun 03:25 for the
         │    private registry). Walks live manifests through refcounts, but
         │    distribution/distribution#3324 showed this walker has historical
         │    bugs with OCI image-index children — it can decrement a shared
         │    child's refcount below 1 and delete the blob even while the
         │    index that references it is still referenced.
         │
         └─> [3] Result: the `infra-ci:latest` index is intact
              (`_manifests/revisions/sha256/<A>/data` present on disk), but
              its `.manifests[0].digest` — the `linux/amd64` child — points
              to a `blobs/sha256/98/98f718c8…/` whose `data` file is gone.

[pull] containerd resolves `infra-ci:latest`
         │
         ├─> GET /v2/infra-ci/manifests/latest → 200 OK, returns the index
         │
         └─> GET /v2/infra-ci/manifests/sha256:98f718c8… → 404 Not Found
              └─> containerd fails the pull with "manifest unknown"
                    └─> woodpecker exit 126
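
The pull sequence above can be reproduced against an in-memory stand-in for the registry. This is a minimal sketch, not containerd's actual resolver: `resolve_platform_manifest`, `ManifestUnknown`, and the dictionary-backed `store` are hypothetical names introduced for illustration, and the digest is a placeholder. A missing key in `store` models an HTTP 404 from the manifests endpoint.

```python
class ManifestUnknown(Exception):
    """Stand-in for the registry's 404 'manifest unknown' response."""


def resolve_platform_manifest(store, reference, os_name="linux", arch="amd64"):
    """Resolve a tag or digest to the manifest for one platform.

    `store` maps tags and digests to parsed manifest documents.
    """
    if reference not in store:
        raise ManifestUnknown(reference)
    document = store[reference]

    # A plain image manifest has no children to resolve.
    if "manifests" not in document:
        return document

    for entry in document["manifests"]:
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == arch:
            child_digest = entry["digest"]
            if child_digest not in store:
                # The index answered 200, but its child is gone: an orphan index.
                raise ManifestUnknown(child_digest)
            return store[child_digest]
    raise LookupError(f"no {os_name}/{arch} entry in index {reference}")


# Placeholder digest; the index survives while its amd64 child is missing.
orphaned_store = {
    "latest": {
        "mediaType": "application/vnd.oci.image.index.v1+json",
        "manifests": [
            {
                "digest": "sha256:child-amd64",
                "platform": {"os": "linux", "architecture": "amd64"},
            },
        ],
    },
}
```

Calling `resolve_platform_manifest(orphaned_store, "latest")` raises `ManifestUnknown` for the child digest even though the tag lookup itself succeeds, which is why a check that only asks "does the tag exist?" would have reported the image as healthy.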

Why Existing Remediation Missed It

  1. fix-broken-blobs.sh only scans layer links. The existing cron walks _layers/sha256/ and removes link files whose blob data is missing. It does NOT inspect _manifests/revisions/sha256/ to see whether an image-index's referenced children still exist. That's exactly the class of orphan this incident represents.
  2. The registry:2 image tag was floating. docker-compose.yml referenced only registry:2, so the VM ran whatever Docker Inc. had last rebuilt as "v2-current", with no version pin. Any regression in the upstream walker would have been picked up silently.
  3. No integrity monitoring. Prometheus alerted on cache hit rate and registry-down, but nothing probes "are the manifests the registry advertises actually fetchable?"
  4. CI pipeline didn't verify its own push. buildx --push returns success as soon as it uploads. If a child blob upload ended up zero bytes long or the client disconnected mid-push (a different failure mode from the GC one, but with the same on-disk symptom), nothing would notice until the next pull.

Permanent Fix — Three Phases

Phase 1 — Detection (ship today)

  1. Post-push integrity check in .woodpecker/build-ci-image.yml. After build-and-push, a new step walks the just-pushed manifest (and every child of an image index) and HEADs every referenced blob. Any non-200 fails the pipeline immediately, catching broken pushes at the source rather than leaking them to consumers.
  2. Prometheus alert RegistryManifestIntegrityFailure. A new CronJob (registry-integrity-probe, every 15m, in the monitoring namespace) walks the private registry's catalog, HEADs every tag's manifest, follows each image index's children, and pushes registry_manifest_integrity_failures to Pushgateway. Accompanying alerts: RegistryIntegrityProbeStale, RegistryCatalogInaccessible.
  3. Post-mortem — this document. Linked from .claude/reference/service-catalog.md via the new runbook.
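
The integrity walk shared by the post-push check and the probe can be sketched as follows. This is an illustrative sketch, not the shipped step or CronJob: `walk_manifest` and `http_checkers` are hypothetical names, authentication is omitted, and manifests are fetched with GET rather than HEAD because the walk needs their bodies to find children. The endpoints used (`/v2/<repo>/manifests/<reference>` and `/v2/<repo>/blobs/<digest>`) are the standard registry HTTP API paths.

```python
import json
import urllib.error
import urllib.request

INDEX_TYPES = {
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
}
MANIFEST_ACCEPT = ", ".join(sorted(INDEX_TYPES | {
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.docker.distribution.manifest.v2+json",
}))


def walk_manifest(reference, fetch_manifest, blob_exists):
    """Return every manifest reference or blob digest that cannot be fetched.

    fetch_manifest(reference) -> dict, or None when the registry answers non-200.
    blob_exists(digest) -> bool, True only when HEAD on the blob answers 200.
    """
    manifest = fetch_manifest(reference)
    if manifest is None:
        return [reference]

    missing = []
    if manifest.get("mediaType") in INDEX_TYPES or "manifests" in manifest:
        # Image index: recurse into every child (platform images, attestations).
        for child in manifest.get("manifests", []):
            missing.extend(walk_manifest(child["digest"], fetch_manifest, blob_exists))
        return missing

    # Image manifest: check the config blob and every layer blob.
    descriptors = [manifest["config"]] if "config" in manifest else []
    descriptors.extend(manifest.get("layers", []))
    for descriptor in descriptors:
        if not blob_exists(descriptor["digest"]):
            missing.append(descriptor["digest"])
    return missing


def http_checkers(base_url, repository):
    """Build fetch_manifest/blob_exists callables backed by the registry API."""
    def fetch_manifest(reference):
        request = urllib.request.Request(
            f"{base_url}/v2/{repository}/manifests/{reference}",
            headers={"Accept": MANIFEST_ACCEPT},
        )
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                return json.load(response)
        except urllib.error.URLError:  # includes HTTPError (404 and friends)
            return None

    def blob_exists(digest):
        request = urllib.request.Request(
            f"{base_url}/v2/{repository}/blobs/{digest}", method="HEAD"
        )
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                return response.status == 200
        except urllib.error.URLError:
            return False

    return fetch_manifest, blob_exists
```

A pipeline step could build the two callables with `http_checkers(<registry URL>, "infra-ci")`, call `walk_manifest("latest", ...)`, and exit non-zero when the returned list is non-empty; a probe could publish the length of that list as its failure count. Keeping the walk separate from the HTTP helpers makes it testable without a live registry.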

Phase 2 — Prevention

  1. Pin registry:2 → registry:2.8.3 in modules/docker-registry/docker-compose.yml (all six registry services). Removes the floating-tag footgun.
  2. Extend fix-broken-blobs.sh to scan every _manifests/revisions/sha256/<digest> that is an image index and flag children whose blob data file is missing. The script prints a loud WARNING per orphan; it does not auto-delete the index, because deleting a published image is a conscious decision, not an automated repair.
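
The on-disk orphan scan in item 2 can be sketched in Python (the real change lives in the shell script fix-broken-blobs.sh; this is not that script). The sketch assumes the registry's filesystem storage layout, with `v2_root` pointing at the directory that contains `repositories/` and `blobs/`; `find_orphan_index_children` and `blob_data_path` are hypothetical names. Like the script, it only reports and never deletes.

```python
import json
from pathlib import Path

INDEX_MEDIA_TYPES = {
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
}


def blob_data_path(v2_root, digest):
    """Map 'sha256:<hex>' to <v2_root>/blobs/sha256/<hex[:2]>/<hex>/data."""
    algorithm, hex_digest = digest.split(":", 1)
    return Path(v2_root) / "blobs" / algorithm / hex_digest[:2] / hex_digest / "data"


def find_orphan_index_children(v2_root):
    """Return (repository, index_digest, missing_child_digest) tuples."""
    repositories = Path(v2_root) / "repositories"
    orphans = []
    for revision in sorted(repositories.glob("**/_manifests/revisions/sha256/*")):
        # <repo>/_manifests/revisions/sha256/<hex>: the repo is four levels up.
        repository = revision.parents[3].relative_to(repositories).as_posix()
        index_digest = f"sha256:{revision.name}"
        data_file = blob_data_path(v2_root, index_digest)
        if not data_file.is_file():
            continue  # the revision's own blob is gone; a separate failure mode
        try:
            manifest = json.loads(data_file.read_text(encoding="utf-8"))
        except ValueError:
            continue  # not JSON, so not a manifest this scan understands
        if not isinstance(manifest, dict):
            continue
        is_index = (manifest.get("mediaType") in INDEX_MEDIA_TYPES
                    or "manifests" in manifest)
        if not is_index:
            continue  # plain image manifest: left to the layer-link scan
        for child in manifest.get("manifests", []):
            if not blob_data_path(v2_root, child["digest"]).is_file():
                orphans.append((repository, index_digest, child["digest"]))
    return orphans
```

Each returned tuple corresponds to one WARNING line: the repository, the index that is still present, and the child digest whose `data` file is missing.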

Phase 3 — Recovery tooling

  1. Manual event trigger on build-ci-image.yml. Rebuilds no longer need a cosmetic Dockerfile edit — POST to the Woodpecker API or click "Run manually" in the UI.
  2. Runbook docs/runbooks/registry-rebuild-image.md — exact command sequence for the next time this happens, plus fallback paths.

Out of Scope

  • Pull-through caches. The DockerHub / GHCR mirrors on :5000 / :5010 are healthy (74.5% cache hit rate, no 404s). The orphan problem is private-registry-only. No changes to nginx or containerd hosts.toml.
  • Registry HA / replication. Single-VM SPOF is a known architectural choice. Harbor or a replicated registry would solve more than this incident requires, at multi-day cost. Synology offsite snapshots already give RPO < 1 day.
  • Disabling cleanup-tags.sh. Keeping storage bounded is still necessary; the fix is detection + rebuild, not "stop cleaning up".

Lessons

  • Repeat incidents deserve root-cause work, not a third hot-fix. The 2026-04-13 incident was closed when CI turned green. Without a probe and without a scan for orphan indexes, the next incident was inevitable — and it happened six days later against a different image.
  • "No alert fired, so it wasn't detected" is a monitoring gap, not an outage feature. The registry was serving 404s for 2+ hours before anyone noticed, because our only signal was "pipeline failures" and our eyes were elsewhere. The new probe closes that gap.
  • CI pipelines should verify their own output. The buildx --push "success" exit code is not a guarantee of pulled-back integrity — as this incident proves. A 30-second post-push HEAD walk is cheap insurance.

References

  • Prior incident (same failure mode, different image): memory 709 / 710 — 2026-04-13.
  • Runbook: docs/runbooks/registry-rebuild-image.md (new).
  • Hot-fix commits: a05d63ee, 6371e75e, c113be4d.
  • Upstream bug class: distribution/distribution#3324.