Viktor Barzin 7cb44d7264 [registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery

Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00; registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source.
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-19 17:08:28 +00:00

Post-Mortem: Private Registry Orphan OCI-Index — Repeat Incident

Field	Value
Date	2026-04-19 (first occurrence 2026-04-13)
Duration	~40 min of blocked CI each time; only detected via pipeline failures
Severity	SEV2 — all infra CI pipelines using `infra-ci:latest` failed (P366 → P376 all exit 126 "image can't be pulled")
Affected Services	Every Woodpecker pipeline that starts with `image: registry.viktorbarzin.me:5050/infra-ci:latest` — `default.yml`, `build-cli.yml`, `renew-tls.yml`, `drift-detection.yml`, `provision-user.yml`, `k8s-portal.yml`, `postmortem-todos.yml`, `issue-automation.yml`, `pve-nfs-exports-sync.yml`
Status	Hot fix green (three commits: `a05d63ee`, `6371e75e`, `c113be4d` — URL fix + rebuild). This doc captures the permanent fix landed in the same branch.

Summary

On 2026-04-19 ~09:00 UTC, every infra CI pipeline started failing at the clone step with "image can't be pulled". The image in question — the CI toolchain image registry.viktorbarzin.me:5050/infra-ci:latest — resolved to an OCI image index whose linux/amd64 platform manifest (sha256:98f718c8…) and its in-toto attestation (sha256:27d5ab83…) returned HTTP 404 from the private registry. The index record itself still existed — it's the children that had been garbage-collected out from under it.

This is the second identical incident: the same failure mode occurred on 2026-04-13 against a different image. Both times the immediate fix was to rebuild the image from scratch; both times the root cause was left unaddressed.

Impact

User-facing: all CI pipelines failed. No automated Terraform applies, no TLS renewal, no drift detection. Manual workflows (Woodpecker UI reruns) all failed with the same error.
Blast radius: every pipeline that pulls infra-ci. Does NOT affect k8s workloads (those pull via containerd, which goes through the pull-through proxy on :5000/:5010 — a completely different code path).
Duration on 2026-04-19: from first P366 failure to the hot-fix commit c113be4d — roughly 40 min. Pipelines that had already been triggered queued up until the rebuild restored :latest.
Data loss: none. The registry still has the index object; the child manifests are reproducible by rebuilding the source image.
Monitoring gap: nothing alerted. The only signal was the individual pipeline failures from Woodpecker. No Prometheus alert fires on "the registry served a 404 for a tag that exists".

Timeline (UTC, 2026-04-19)

Time	Event
~09:00	P366 (`default.yml` on master) fails with exit 126.
09:00–11:00	P367, P368, … P376 all fail with the same error. Nobody is paged because no alert is configured.
11:15	User notices and investigates: `skopeo inspect` reveals the missing platform manifest.
11:20	Hot fix phase begins: `a05d63ee` fixes a push-URL misalignment, `6371e75e` and `c113be4d` trigger a full rebuild.
11:40	Rebuild completes; `infra-ci:latest` resolves to a fresh, complete index. Pipelines green from P377 onward.
11:45	User requests a proper root-cause fix: "this is the second time — what's actually broken?"
12:00	Investigation begins (this document's work).

Root Cause Chain

[1] cleanup-tags.sh runs daily at 02:00 on the registry VM
 └─> For each repository, keeps the last 10 tags by mtime, rmtrees the rest.
     This walks `_manifests/tags/<tag>` directly, bypassing the registry API.
         │
         ├─> [2] registry:2 garbage-collect runs weekly (Sun 03:25 for the
         │    private registry). Walks live manifests through refcounts, but
         │    distribution/distribution#3324 showed this walker has historical
         │    bugs with OCI image-index children — it can decrement a shared
         │    child's refcount below 1 and delete the blob even while the
         │    index that references it is still referenced.
         │
         └─> [3] Result: the `infra-ci:latest` index is intact
              (`_manifests/revisions/sha256/<A>/data` present on disk), but
              its `.manifests[0].digest` — the `linux/amd64` child — points
              to a `blobs/sha256/98/98f718c8…/` whose `data` file is gone.

[pull] containerd resolves `infra-ci:latest`
         │
         ├─> GET /v2/infra-ci/manifests/latest → 200 OK, returns the index
         │
         └─> GET /v2/infra-ci/manifests/sha256:98f718c8… → 404 Not Found
              └─> containerd fails the pull with "manifest unknown"
                    └─> woodpecker exit 126
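
The [pull] half of the chain above can be sketched in a few lines of Python. This is a minimal model of how a client resolves a tag through an image index, not containerd's actual code: `fetch` is a hypothetical stand-in for `GET /v2/<repo>/manifests/<ref>` that returns the parsed body or `None` on 404, and the sketch assumes the index carries a `mediaType` field.

```python
from typing import Callable, Optional

# Media types that mark a manifest as a multi-platform index.
INDEX_TYPES = {
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
}


class ManifestUnknown(Exception):
    """Raised when the registry answers 404 for a manifest reference."""


def resolve_platform_manifest(
    fetch: Callable[[str], Optional[dict]],
    reference: str,
    os_name: str = "linux",
    arch: str = "amd64",
) -> dict:
    """Resolve a tag or digest to a single-platform manifest.

    `fetch(ref)` is a hypothetical helper: parsed JSON body, or None on 404.
    """
    top = fetch(reference)
    if top is None:
        raise ManifestUnknown(reference)
    if top.get("mediaType") not in INDEX_TYPES:
        return top  # already a single-platform manifest
    for child in top.get("manifests", []):
        platform = child.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == arch:
            body = fetch(child["digest"])
            if body is None:
                # The index was served, but its child is gone: the orphan case.
                raise ManifestUnknown(child["digest"])
            return body
    raise LookupError(f"no {os_name}/{arch} entry in index {reference}")
```

With a registry in the state described in [3], the first `fetch` succeeds and the second returns `None`, so the error names the child digest rather than the tag. That matches the observed symptom: the tag exists, yet the pull fails.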

Why Existing Remediation Missed It

fix-broken-blobs.sh only scans layer links. The existing cron walks _layers/sha256/ and removes link files whose blob data is missing. It does NOT inspect _manifests/revisions/sha256/ to see whether an image-index's referenced children still exist. That's exactly the class of orphan this incident represents.
registry:2 image tag was floating. docker-compose.yml referenced only the floating registry:2 tag, so whatever Docker Inc. last published under that tag was running. Any regression in the upstream GC walker would have been swapped in silently.
No integrity monitoring. Prometheus alerted on cache hit rate and registry-down, but nothing probes "are the manifests the registry advertises actually fetchable?"
CI pipeline didn't verify its own push. buildx --push returns success as soon as the upload finishes. If a child blob upload was truncated to zero bytes or the client disconnected mid-push (a different cause from the GC mode, but the same on-disk symptom), nothing would notice until the next pull.

Permanent Fix — Three Phases

Phase 1 — Detection (ship today)

Post-push integrity check in .woodpecker/build-ci-image.yml. After build-and-push, a new step walks the just-pushed manifest (and every child of an image index) and HEADs every referenced blob. Any non-200 fails the pipeline immediately, catching broken pushes at the source rather than leaking them to consumers.
Prometheus alert RegistryManifestIntegrityFailure. A new CronJob (registry-integrity-probe, every 15m, in the monitoring namespace) walks the private registry's catalog, HEADs every tag's manifest, follows each image index's children, and pushes registry_manifest_integrity_failures to Pushgateway. Accompanying alerts: RegistryIntegrityProbeStale, RegistryCatalogInaccessible.
Post-mortem — this document. Linked from .claude/reference/service-catalog.md via the new runbook.
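
The manifest walk shared by the post-push check and the probe can be sketched as follows. This is an illustration under stated assumptions, not the code that shipped: `get_manifest` and `head_blob` are hypothetical callables standing in for `GET /v2/<repo>/manifests/<ref>` and `HEAD /v2/<repo>/blobs/<digest>`, and the sketch uses GET for manifests because it has to read their bodies to find children, config, and layers.

```python
from typing import Callable, Optional

INDEX_TYPES = {
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
}


def referenced_blobs(manifest: dict) -> list[str]:
    """Digests of the config and layer blobs a single-platform manifest names."""
    digests = []
    config = manifest.get("config")
    if config:
        digests.append(config["digest"])
    digests.extend(layer["digest"] for layer in manifest.get("layers", []))
    return digests


def verify_image(
    get_manifest: Callable[[str], tuple[int, Optional[dict]]],
    head_blob: Callable[[str], int],
    reference: str,
) -> list[str]:
    """Walk index -> children -> config/layers; return one line per failure.

    get_manifest(ref) -> (HTTP status, parsed body or None)   [hypothetical]
    head_blob(digest) -> HTTP status                          [hypothetical]
    """
    status, top = get_manifest(reference)
    if status != 200 or top is None:
        return [f"manifest {reference}: HTTP {status}"]

    failures = []
    if top.get("mediaType") in INDEX_TYPES:
        manifests = []
        for child in top.get("manifests", []):
            digest = child["digest"]
            status, body = get_manifest(digest)
            if status != 200 or body is None:
                failures.append(f"manifest {digest}: HTTP {status}")
            else:
                manifests.append(body)
    else:
        manifests = [top]

    for manifest in manifests:
        for digest in referenced_blobs(manifest):
            status = head_blob(digest)
            if status != 200:
                failures.append(f"blob {digest}: HTTP {status}")
    return failures
```

In a CI step, a non-empty result would map to a non-zero exit code. In the probe, the total number of failures across all tags is the kind of value that registry_manifest_integrity_failures would carry.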

Phase 2 — Prevention

Pin registry:2 → registry:2.8.3 in modules/docker-registry/docker-compose.yml (all six registry services). Removes the floating-tag footgun.
Extend fix-broken-blobs.sh to scan every _manifests/revisions/sha256/<digest> that is an image index and flag children whose blob data file is missing. The script prints a loud WARNING per orphan; it does not auto-delete the index, because deleting a published image is a conscious decision, not an automated repair.
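
The on-disk scan can be illustrated in Python. This is a sketch of the idea, not the shell script itself, and it assumes the standard distribution filesystem layout: `root` is the `docker/registry/v2` directory, each revision directory holds a `link` file containing the digest, and blob bytes live under `blobs/sha256/<first two hex chars>/<digest>/data`. The function names are hypothetical.

```python
import json
from pathlib import Path

INDEX_TYPES = {
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
}


def blob_data_path(root: Path, digest: str) -> Path:
    """Map 'sha256:<hex>' to <root>/blobs/sha256/<hex[:2]>/<hex>/data."""
    algo, hex_digest = digest.split(":", 1)
    return root / "blobs" / algo / hex_digest[:2] / hex_digest / "data"


def find_orphan_index_children(root: Path) -> list[tuple[str, str, str]]:
    """Return (repository, index digest, missing child digest) tuples.

    Read-only by design: it reports orphans and never deletes anything.
    """
    orphans = []
    repos = root / "repositories"
    pattern = "**/_manifests/revisions/sha256/*/link"
    for link in sorted(repos.glob(pattern)):
        # link = <repos>/<repo...>/_manifests/revisions/sha256/<hex>/link
        repo = link.parents[4].relative_to(repos).as_posix()
        index_digest = link.read_text().strip()
        data = blob_data_path(root, index_digest)
        if not data.is_file():
            continue  # missing revision blob: a different class of orphan
        try:
            doc = json.loads(data.read_bytes())
        except ValueError:
            continue  # not JSON, so not a manifest this scan understands
        if not isinstance(doc, dict) or doc.get("mediaType") not in INDEX_TYPES:
            continue  # single-platform manifest, nothing to follow
        for child in doc.get("manifests", []):
            if not blob_data_path(root, child["digest"]).is_file():
                orphans.append((repo, index_digest, child["digest"]))
    return orphans
```

Printing one WARNING line per returned tuple mirrors the report-only behaviour described above. Whether to delete or rebuild the affected image stays a human decision.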

Phase 3 — Recovery tooling

Manual event trigger on build-ci-image.yml. Rebuilds no longer need a cosmetic Dockerfile edit — POST to the Woodpecker API or click "Run manually" in the UI.
Runbook docs/runbooks/registry-rebuild-image.md — exact command sequence for the next time this happens, plus fallback paths.

Out of Scope

Pull-through caches. The DockerHub / GHCR mirrors on :5000 / :5010 are healthy (74.5% cache hit rate, no 404s). The orphan problem is private-registry-only. No changes to nginx or containerd hosts.toml.
Registry HA / replication. Single-VM SPOF is a known architectural choice. Harbor or a replicated registry would solve more than this incident requires, at multi-day cost. Synology offsite snapshots already give RPO < 1 day.
Disabling cleanup-tags.sh. Keeping storage bounded is still necessary; the fix is detection + rebuild, not "stop cleaning up".

Lessons

Repeat incidents deserve root-cause work, not a third hot-fix. The 2026-04-13 incident was closed when CI turned green. Without a probe and without a scan for orphan indexes, the next incident was inevitable — and it happened six days later against a different image.
"No alert fired, so it wasn't detected" is a monitoring gap, not an outage feature. The registry was serving 404s for 2+ hours before anyone noticed, because our only signal was "pipeline failures" and our eyes were elsewhere. The new probe closes that gap.
CI pipelines should verify their own output. A successful buildx --push exit code does not guarantee that the image can be pulled back intact, and until this change nothing checked. A 30-second post-push HEAD walk is cheap insurance.

Prior incident (same failure mode, different image): memory 709 / 710 — 2026-04-13.
Runbook: docs/runbooks/registry-rebuild-image.md (new).
Hot-fix commits: a05d63ee, 6371e75e, c113be4d.
Upstream bug class: distribution/distribution#3324.
