Second identical registry incident on 2026-04-19 (first 2026-04-13): the infra-ci:latest image index resolved to child manifests whose blobs had been garbage-collected out from under the index. Pipelines P366→P376 all exited 126 "image can't be pulled". Hot fix (a05d63e/6371e75/c113be4) restored green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at 02:00; registry:2's GC (Sunday 03:25) walks OCI index children imperfectly (distribution/distribution#3324 class). Nothing verified pushes end-to-end; nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 (Detection):
- .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity step walks the just-pushed manifest (index + children + config + every layer blob) via HEAD and fails the pipeline on any non-200. Catches broken pushes at the source.
- stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and three alerts (RegistryManifestIntegrityFailure, RegistryIntegrityProbeStale, RegistryCatalogInaccessible), closing the "registry serves 404 for a tag that exists" gap that masked the incident for 2+ hours.
- docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause, timeline, monitoring gaps, permanent fix.

Phase 2 (Prevention):
- modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3 across all six registry services. Removes the floating-tag footgun.
- modules/docker-registry/fix-broken-blobs.sh: new scan walks every _manifests/revisions/sha256/<digest> that is an image index and logs a loud WARNING when a referenced child blob is missing. Does NOT auto-delete: deleting a published image is a conscious decision. Layer-link scan preserved.

Phase 3 (Recovery):
- build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds don't need a cosmetic Dockerfile edit (matches convention from pve-nfs-exports-sync.yml).
- docs/runbooks/registry-rebuild-image.md: exact command sequence for diagnosing + rebuilding after an orphan-index incident, plus a fallback for building directly on the registry VM if Woodpecker itself is down.
- docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md: cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
- Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
- Registry HA/replication (single-VM SPOF is a known architectural choice; Synology offsite covers RPO < 1 day).
- Diun exclude for registry:2: not applicable; Diun only watches k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
- fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly flags both orphan layer links and orphan OCI-index children.
- terraform fmt + validate on stacks/monitoring: success (only unrelated deprecation warnings).
- python3 yaml.safe_load on .woodpecker/build-ci-image.yml and modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-Mortem: Private Registry Orphan OCI-Index — Repeat Incident
| Field | Value |
|---|---|
| Date | 2026-04-19 (first occurrence 2026-04-13) |
| Duration | ~40 min of blocked CI each time; only detected via pipeline failures |
| Severity | SEV2 — all infra CI pipelines using infra-ci:latest failed (P366 → P376 all exit 126 "image can't be pulled") |
| Affected Services | Every Woodpecker pipeline that starts with image: registry.viktorbarzin.me:5050/infra-ci:latest — default.yml, build-cli.yml, renew-tls.yml, drift-detection.yml, provision-user.yml, k8s-portal.yml, postmortem-todos.yml, issue-automation.yml, pve-nfs-exports-sync.yml |
| Status | Hot fix green (three commits: a05d63ee, 6371e75e, c113be4d — URL fix + rebuild). This doc captures the permanent fix landed in the same branch. |
Summary
On 2026-04-19 ~09:00 UTC, every infra CI pipeline started failing at the
clone step with "image can't be pulled". The image in question — the CI
toolchain image registry.viktorbarzin.me:5050/infra-ci:latest — resolved
to an OCI image index whose linux/amd64 platform manifest
(sha256:98f718c8…) and its in-toto attestation
(sha256:27d5ab83…) returned HTTP 404 from the private registry.
The index record itself still existed — it's the children that had been
garbage-collected out from under it.
This is the second identical incident: the same failure mode occurred on 2026-04-13 against a different image. Both times the immediate fix was to rebuild the image from scratch; both times the root cause was left unaddressed.
Impact
- User-facing: all CI pipelines failed. No automated Terraform applies, no TLS renewal, no drift detection. Manual workflows (Woodpecker UI reruns) all failed with the same error.
- Blast radius: every pipeline that pulls infra-ci. Does NOT affect k8s workloads (those pull via containerd, which goes through the pull-through proxy on :5000/:5010, a completely different code path).
- Duration on 2026-04-19: from first P366 failure to the hot-fix commit c113be4d, roughly 40 min. Pipelines that had already been triggered queued up until the rebuild restored :latest.
- Data loss: none. The registry has the index object; the child manifests are reproducible by rebuilding the source image.
- Monitoring gap: nothing alerted. The only signal was the individual pipeline failures from Woodpecker. No Prometheus alert fires on "the registry served a 404 for a tag that exists".
Timeline (UTC, 2026-04-19)
| Time | Event |
|---|---|
| ~09:00 | P366 (default.yml on master) fails with exit 126. |
| 09:00–11:00 | P367, P368, … P376 all fail with the same error. Nobody pages — there's no alert configured. |
| 11:15 | User notices and investigates: skopeo inspect reveals the missing platform manifest. |
| 11:20 | Hot fix phase begins: a05d63ee fixes a push-URL misalignment, 6371e75e and c113be4d trigger a full rebuild. |
| 11:40 | Rebuild completes; infra-ci:latest resolves to a fresh, complete index. Pipelines green from P377 onward. |
| 11:45 | User requests a proper root-cause fix: "this is the second time — what's actually broken?" |
| 12:00 | Investigation begins (this document's work). |
Root Cause Chain
[1] cleanup-tags.sh runs daily at 02:00 on the registry VM
└─> For each repository, keeps the last 10 tags by mtime, rmtrees the rest.
This walks `_manifests/tags/<tag>` directly, bypassing the registry API.
│
├─> [2] registry:2 garbage-collect runs weekly (Sun 03:25 for the
│ private registry). Walks live manifests through refcounts, but
│ distribution/distribution#3324 showed this walker has historical
│ bugs with OCI image-index children — it can decrement a shared
│ child's refcount below 1 and delete the blob even while the
│ index that references it is still referenced.
│
└─> [3] Result: the `infra-ci:latest` index is intact
(`_manifests/revisions/sha256/<A>/data` present on disk), but
its `.manifests[0].digest` — the `linux/amd64` child — points
to a `blobs/sha256/98/98f718c8…/` whose `data` file is gone.
[pull] containerd resolves `infra-ci:latest`
│
├─> GET /v2/infra-ci/manifests/latest → 200 OK, returns the index
│
└─> GET /v2/infra-ci/manifests/sha256:98f718c8… → 404 Not Found
└─> containerd fails the pull with "manifest unknown"
└─> woodpecker exit 126
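The pull path in the chain above can be reproduced with a toy model. This is a minimal sketch, not the registry's or containerd's actual code: `REGISTRY`, `get_manifest`, and `pull` are hypothetical names, and the digest is a shortened placeholder rather than the real one from the incident. It shows why the tag lookup succeeds while the pull still fails.

```python
import json

# Hypothetical in-memory stand-in for the registry: reference -> manifest body.
INDEX = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {
            "digest": "sha256:child-amd64",  # placeholder digest
            "platform": {"architecture": "amd64", "os": "linux"},
        },
    ],
}
# The index is present, but the child manifest it points at is not.
REGISTRY = {"latest": json.dumps(INDEX)}


def get_manifest(reference):
    """Mimic GET /v2/<repo>/manifests/<reference>: return (status, body)."""
    body = REGISTRY.get(reference)
    return (200, body) if body is not None else (404, None)


def pull(tag, os_name="linux", arch="amd64"):
    """Resolve a tag the way a client does: index first, then platform child."""
    status, body = get_manifest(tag)
    if status != 200:
        return f"tag lookup failed: {status}"
    index = json.loads(body)
    for entry in index.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == arch:
            child_status, _ = get_manifest(entry["digest"])
            if child_status != 200:
                return f"manifest unknown: {entry['digest']} ({child_status})"
            return "ok"
    return "no matching platform"


print(pull("latest"))
```

A check that only asks "does the tag exist?" sees the first 200 and stops, which is why the orphan stayed invisible until a client followed the index to its child.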
Why Existing Remediation Missed It
- fix-broken-blobs.sh only scans layer links. The existing cron walks _layers/sha256/ and removes link files whose blob data is missing. It does NOT inspect _manifests/revisions/sha256/ to see whether an image-index's referenced children still exist. That's exactly the class of orphan this incident represents.
- registry:2 image tag was floating. docker-compose.yml pinned only to registry:2. Whatever Docker Inc. last rebuilt as "v2-current" was running, with no version pin. Any regression in the upstream walker would silently swap in.
- No integrity monitoring. Prometheus alerted on cache hit rate and registry-down, but nothing probes "are the manifests the registry advertises actually fetchable?"
- CI pipeline didn't verify its own push. buildx --push returns success as soon as it uploads. If a child blob upload 0-byted or the client disconnected mid-push (distinct from the GC mode but the same on-disk symptom), nothing would notice until the next pull.
Permanent Fix — Three Phases
Phase 1 — Detection (ship today)
- Post-push integrity check in .woodpecker/build-ci-image.yml. After build-and-push, a new step walks the just-pushed manifest (and every child of an image index) and HEADs every referenced blob. Any non-200 fails the pipeline immediately, catching broken pushes at the source rather than leaking them to consumers.
- Prometheus alert RegistryManifestIntegrityFailure. A new CronJob (registry-integrity-probe, every 15m, in the monitoring namespace) walks the private registry's catalog, HEADs every tag's manifest, follows each image index's children, and pushes registry_manifest_integrity_failures to Pushgateway. Accompanying alerts: RegistryIntegrityProbeStale, RegistryCatalogInaccessible.
- Post-mortem: this document. Linked from .claude/reference/service-catalog.md via the new runbook.
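The integrity walk shared by the post-push step and the probe could look roughly like the following. This is a sketch under stated assumptions, not the code that shipped: `http_fetch` and `verify` are hypothetical names, the registry is assumed to answer without authentication, and manifests are fetched with GET (so their children can be read) while blobs are only HEADed.

```python
import json
import urllib.error
import urllib.request

INDEX_TYPES = {
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
}
ACCEPT = ", ".join(sorted(INDEX_TYPES | {
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.docker.distribution.manifest.v2+json",
}))


def http_fetch(method, url):
    """Return (status, body). HTTP error responses become a status code;
    connection-level failures (urllib.error.URLError) still raise."""
    req = urllib.request.Request(url, method=method, headers={"Accept": ACCEPT})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, b""


def verify(base, repo, reference, fetch=http_fetch):
    """Walk a manifest or image index; return every URL that was not a 200."""
    manifest_url = f"{base}/v2/{repo}/manifests/{reference}"
    status, body = fetch("GET", manifest_url)
    if status != 200:
        return [manifest_url]
    doc = json.loads(body)
    failures = []
    if doc.get("mediaType") in INDEX_TYPES or "manifests" in doc:
        # Image index: recurse into every child manifest it references.
        for child in doc.get("manifests", []):
            failures += verify(base, repo, child["digest"], fetch)
        return failures
    # Single-platform manifest: check the config blob and every layer blob.
    blobs = [doc["config"]] if "config" in doc else []
    blobs += doc.get("layers", [])
    for blob in blobs:
        blob_url = f"{base}/v2/{repo}/blobs/{blob['digest']}"
        status, _ = fetch("HEAD", blob_url)
        if status != 200:
            failures.append(blob_url)
    return failures
```

In a pipeline step, a non-empty return value from `verify(...)` would translate to a non-zero exit; in the probe, its length would be the value reported as registry_manifest_integrity_failures. The `fetch` parameter is injectable so the walk can be exercised against a fake registry.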
Phase 2 — Prevention
- Pin registry:2 → registry:2.8.3 in modules/docker-registry/docker-compose.yml (all six registry services). Removes the floating-tag footgun.
- Extend fix-broken-blobs.sh to scan every _manifests/revisions/sha256/<digest> that is an image index and flag children whose blob data file is missing. The script prints a loud WARNING per orphan; it does not auto-delete the index, because deleting a published image is a conscious decision, not an automated repair.
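The index-children scan can be illustrated in Python (the real script is shell, and its exact logic is not reproduced here). The sketch assumes the usual registry filesystem layout, with `root` pointing at the directory that contains `repositories/` and `blobs/`, and with each manifest's body stored at `blobs/sha256/<first two hex chars>/<hex>/data`. `blob_data`, `orphan_index_children`, and `report` are hypothetical names.

```python
import json
from pathlib import Path


def blob_data(root, digest):
    """Path of a blob's data file under the assumed on-disk layout."""
    algo, hexval = digest.split(":", 1)
    return Path(root) / "blobs" / algo / hexval[:2] / hexval / "data"


def orphan_index_children(root):
    """Yield (repo, index_digest, child_digest) for each missing child blob."""
    root = Path(root)
    pattern = "repositories/**/_manifests/revisions/sha256/*"
    for rev in sorted(root.glob(pattern)):
        # Strip the trailing _manifests/revisions/sha256/<hex> to get the repo.
        repo = rev.relative_to(root / "repositories").parents[3].as_posix()
        index_digest = f"sha256:{rev.name}"
        data = blob_data(root, index_digest)
        if not data.is_file():
            continue  # the revision's own blob is gone: a different orphan class
        try:
            doc = json.loads(data.read_text())
        except ValueError:
            continue  # not JSON, so not a manifest we can walk
        if not isinstance(doc, dict):
            continue
        # Only image indexes carry a "manifests" list; plain manifests yield nothing.
        for child in doc.get("manifests", []):
            digest = child.get("digest", "")
            if ":" in digest and not blob_data(root, digest).is_file():
                yield repo, index_digest, digest


def report(root):
    """Print one WARNING per orphan child; never deletes anything."""
    count = 0
    for repo, index_digest, child in orphan_index_children(root):
        print(f"WARNING: {repo}: index {index_digest} "
              f"references missing child {child}")
        count += 1
    return count
```

Like the extended script, the sketch is read-only: it reports orphans and leaves the decision to delete or rebuild to a human.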
Phase 3 — Recovery tooling
- Manual event trigger on build-ci-image.yml. Rebuilds no longer need a cosmetic Dockerfile edit: POST to the Woodpecker API or click "Run manually" in the UI.
- Runbook docs/runbooks/registry-rebuild-image.md: exact command sequence for the next time this happens, plus fallback paths.
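For the API route, a manual trigger might be built as below. Treat the endpoint shape as an assumption: the path `/api/repos/<repo_id>/pipelines`, the `{"branch": ...}` body, and bearer-token auth reflect the Woodpecker REST API as commonly documented, but they should be checked against the API docs of the deployed Woodpecker version. `manual_pipeline_request` is a hypothetical helper, and the server URL, repo ID, and token are placeholders.

```python
import json
import urllib.request


def manual_pipeline_request(server, repo_id, token, branch="master"):
    """Build (but do not send) the POST that asks Woodpecker to start a pipeline.

    Assumed endpoint: POST <server>/api/repos/<repo_id>/pipelines
    """
    return urllib.request.Request(
        url=f"{server.rstrip('/')}/api/repos/{repo_id}/pipelines",
        data=json.dumps({"branch": branch}).encode(),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; sending would be urllib.request.urlopen(req).
req = manual_pipeline_request("https://ci.example.invalid", 1, "REDACTED")
print(req.get_method(), req.full_url)
```

Building the request separately from sending it keeps the sketch safe to run anywhere; the runbook remains the source of truth for the exact commands.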
Out of Scope
- Pull-through caches. The DockerHub / GHCR mirrors on :5000/:5010 are healthy (74.5% cache hit rate, no 404s). The orphan problem is private-registry-only. No changes to nginx or containerd hosts.toml.
- Registry HA / replication. Single-VM SPOF is a known architectural choice. Harbor or a replicated registry would solve more than this incident requires, at multi-day cost. Synology offsite snapshots already give RPO < 1 day.
- Disabling cleanup-tags.sh. Keeping storage bounded is still necessary; the fix is detection + rebuild, not "stop cleaning up".
Lessons
- Repeat incidents deserve root-cause work, not a third hot-fix. The 2026-04-13 incident was closed when CI turned green. Without a probe and without a scan for orphan indexes, the next incident was inevitable — and it happened six days later against a different image.
- "No alert fired, so it wasn't detected" is a monitoring gap, not an outage feature. The registry was serving 404s for 2+ hours before anyone noticed, because our only signal was "pipeline failures" and our eyes were elsewhere. The new probe closes that gap.
- CI pipelines should verify their own output. The buildx --push "success" exit code is not a guarantee of pulled-back integrity, as this incident proves. A 30-second post-push HEAD walk is cheap insurance.
Related
- Prior incident (same failure mode, different image): memory 709/710, 2026-04-13.
- Runbook: docs/runbooks/registry-rebuild-image.md (new).
- Hot-fix commits: a05d63ee, 6371e75e, c113be4d.
- Upstream bug class: distribution/distribution#3324.