Registry integrity probe surfaced 38 broken manifest references
(34 unique repo:tag pairs, same OCI-index orphan pattern as the 04-19
infra-ci incident). Deleted all via registry HTTP API + ran GC;
reclaimed ~3GB blob storage.
beads-server CronJobs were stuck ImagePullBackOff on
claude-agent-service:0c24c9b6 for >6h — bumped variable default to
2fd7670d (canonical tag in claude-agent-service stack, already healthy
in registry) so new ticks can fire.
Rebuilt in-use broken tags: freedify:{latest,c803de02} and
beadboard:{17a38e43,latest} on registry VM; priority-pass via
Woodpecker pipeline #8. wealthfolio-sync:latest deferred (monthly
CronJob, next run 2026-05-01).
Probe now reports 0/39 failures. RegistryManifestIntegrityFailure
alert cleared.
Closes: code-8hk
Closes: code-jh3c
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Post-Mortem: Private Registry Orphan OCI-Index — Repeat Incident
| Field | Value |
|---|---|
| Date | 2026-04-19 (first occurrence 2026-04-13) |
| Duration | ~40 min of blocked CI each time; only detected via pipeline failures |
| Severity | SEV2 — all infra CI pipelines using infra-ci:latest failed (P366 → P376 all exit 126 "image can't be pulled") |
| Affected Services | Every Woodpecker pipeline that starts with image: registry.viktorbarzin.me:5050/infra-ci:latest — default.yml, build-cli.yml, renew-tls.yml, drift-detection.yml, provision-user.yml, k8s-portal.yml, postmortem-todos.yml, issue-automation.yml, pve-nfs-exports-sync.yml |
| Status | Hot fix green (three commits: a05d63ee, 6371e75e, c113be4d — URL fix + rebuild). This doc captures the permanent fix landed in the same branch. |
## Summary
On 2026-04-19 ~09:00 UTC, every infra CI pipeline started failing at the
clone step with "image can't be pulled". The image in question — the CI
toolchain image registry.viktorbarzin.me:5050/infra-ci:latest — resolved
to an OCI image index whose linux/amd64 platform manifest
(sha256:98f718c8…) and its in-toto attestation
(sha256:27d5ab83…) returned HTTP 404 from the private registry.
The index record itself still existed — it's the children that had been
garbage-collected out from under it.
This is the second identical incident: the same failure mode occurred on 2026-04-13 against a different image. Both times the immediate fix was to rebuild the image from scratch; both times the root cause was left unaddressed.
## Impact
- User-facing: all CI pipelines failed. No automated Terraform applies, no TLS renewal, no drift detection. Manual workflows (Woodpecker UI reruns) all failed with the same error.
- Blast radius: every pipeline that pulls `infra-ci`. Does NOT affect k8s workloads (those pull via containerd, which goes through the pull-through proxy on `:5000`/`:5010` — a completely different code path).
- Duration on 2026-04-19: from the first P366 failure to the hot-fix commit `c113be4d` — roughly 40 min. Pipelines that had already been triggered queued up until the rebuild restored `:latest`.
- Data loss: none. The registry has the index object; the child manifests are reproducible by rebuilding the source image.
- Monitoring gap: nothing alerted. The only signal was the individual pipeline failures from Woodpecker. No Prometheus alert fires on "the registry served a 404 for a tag that exists".
## Timeline (UTC, 2026-04-19)
| Time | Event |
|---|---|
| ~09:00 | P366 (default.yml on master) fails with exit 126. |
| 09:00–11:00 | P367, P368, … P376 all fail with the same error. Nobody pages — there's no alert configured. |
| 11:15 | User notices and investigates: `skopeo inspect` reveals the missing platform manifest. |
| 11:20 | Hot fix phase begins: `a05d63ee` fixes a push-URL misalignment, `6371e75e` and `c113be4d` trigger a full rebuild. |
| 11:40 | Rebuild completes; infra-ci:latest resolves to a fresh, complete index. Pipelines green from P377 onward. |
| 11:45 | User requests a proper root-cause fix: "this is the second time — what's actually broken?" |
| 12:00 | Investigation begins (this document's work). |
## Root Cause Chain
```
[1] cleanup-tags.sh runs daily at 02:00 on the registry VM
    └─> For each repository, keeps the last 10 tags by mtime, rmtrees the rest.
        This walks _manifests/tags/<tag> directly, bypassing the registry API.
    │
    ├─> [2] Subtle on-disk asymmetry: a registry:2 tag rmtree removes
    │       BOTH the _manifests/tags/<tag>/ dir AND — on 2.8.x — the
    │       per-repo revision-link files under
    │       <repo>/_manifests/revisions/sha256/<child-digest>/link for
    │       every child referenced by that tag's index. The raw blob data
    │       under /var/lib/registry/docker/registry/v2/blobs/sha256/<.>/data
    │       is NOT touched — GC owns that, and GC only runs Sunday.
    │
    ├─> [3] If ANOTHER tag's index still references one of those same
    │       children (common — successive rebuilds share layers), the child
    │       blob survives. But the revision-link is gone, so the registry
    │       API can no longer map <repo>/manifests/sha256:<child> back
    │       to the blob. HEAD → 404, even though the bytes are on disk.
    │       distribution/distribution#3324 is the upstream class of this bug.
    │
    └─> [4] Result: the surviving index (e.g. infra-ci:5319f03e) is
            intact on disk, its children's blob data files are intact on
            disk, but HEAD /v2/infra-ci/manifests/sha256:98f718c8…
            returns 404. The registry has the bytes, but cannot find them
            through the API because the per-repo link bridge is gone.
```
```
[pull] containerd resolves infra-ci:latest
    │
    ├─> GET /v2/infra-ci/manifests/latest → 200 OK, returns the index
    │
    └─> GET /v2/infra-ci/manifests/sha256:98f718c8… → 404 Not Found
        └─> containerd fails the pull with "manifest unknown"
            └─> woodpecker exit 126
```
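The failure is reproducible from any client with two raw HTTP calls. A minimal sketch, assuming the registry takes basic auth (`CREDS` is a placeholder; the child digest is truncated exactly as recorded above, so substitute the full value):

```bash
#!/usr/bin/env bash
# Reproduce the orphan-index failure mode. CREDS is a placeholder;
# the digest is truncated as in the incident record — use the full value.
REG="https://registry.viktorbarzin.me:5050"
CREDS="user:pass"

# 1. The tag itself resolves: the registry still has the OCI index.
curl -fsS -u "$CREDS" \
  -H "Accept: application/vnd.oci.image.index.v1+json" \
  "$REG/v2/infra-ci/manifests/latest" | jq -r '.manifests[].digest'

# 2. HEAD-ing the linux/amd64 child the index references returns 404,
#    because the per-repo revision link was rmtree'd out from under it.
curl -sS -o /dev/null -w '%{http_code}\n' -u "$CREDS" -I \
  -H "Accept: application/vnd.oci.image.manifest.v1+json" \
  "$REG/v2/infra-ci/manifests/sha256:98f718c8…"
```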
Detection gotcha, uncovered 2026-04-19 while implementing
`fix-broken-blobs.sh`: a scan that checks `/blobs/sha256/<child>/data` for presence is NOT equivalent to "can the registry serve this child?" The authoritative check is whether `<repo>/_manifests/revisions/sha256/<child>/link` exists. The script was rewritten to check the per-repo link file after the HTTP probe caught 38 real orphans that the filesystem scan had reported clean.
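In script form the distinction is two file tests — a minimal sketch assuming registry:2's default filesystem layout (`REPO` and `DIGEST` are placeholders; `DIGEST` is the bare hex, no `sha256:` prefix):

```bash
#!/usr/bin/env bash
# Blob presence vs. API servability. REPO/DIGEST are placeholders.
ROOT=/var/lib/registry/docker/registry/v2
REPO=infra-ci
DIGEST=98f718c8…   # truncated as in the incident record — use the full hex

# Misleading check: blob data usually survives (GC only runs Sunday).
[ -f "$ROOT/blobs/sha256/${DIGEST:0:2}/$DIGEST/data" ] \
  && echo "blob data present"

# Authoritative check: without this per-repo revision link, the API 404s
# regardless of what the blobs directory contains.
if [ -f "$ROOT/repositories/$REPO/_manifests/revisions/sha256/$DIGEST/link" ]; then
  echo "servable via /v2/$REPO/manifests/sha256:$DIGEST"
else
  echo "ORPHAN: bytes on disk but unreachable through the API"
fi
```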
## Why Existing Remediation Missed It
- `fix-broken-blobs.sh` only scans layer links. The existing cron walks `_layers/sha256/` and removes link files whose blob `data` is missing. It does NOT inspect `_manifests/revisions/sha256/` to see whether an image index's referenced children still exist. That's exactly the class of orphan this incident represents.
- The `registry:2` image tag was floating. `docker-compose.yml` pinned only to `registry:2`. Whatever Docker Inc. last rebuilt as "v2-current" was running, with no version pin. Any regression in the upstream walker would silently swap in.
- No integrity monitoring. Prometheus alerted on cache hit rate and registry-down, but nothing probed "are the manifests the registry advertises actually fetchable?"
- CI pipeline didn't verify its own push. `buildx --push` returns success as soon as it uploads. If a child blob upload came through zero-byte, or the client disconnected mid-push (distinct from the GC mode but the same on-disk symptom), nothing would notice until the next pull.
## Permanent Fix — Three Phases
### Phase 1 — Detection (ship today)
- Post-push integrity check in `.woodpecker/build-ci-image.yml`. After `build-and-push`, a new step walks the just-pushed manifest (and every child of an image index) and HEADs every referenced blob. Any non-200 fails the pipeline immediately, catching broken pushes at the source rather than leaking them to consumers.
- Prometheus alert `RegistryManifestIntegrityFailure`. A new CronJob (`registry-integrity-probe`, every 15m, in the `monitoring` namespace) walks the private registry's catalog, HEADs every tag's manifest, follows each image index's children, and pushes `registry_manifest_integrity_failures` to Pushgateway. Accompanying alerts: `RegistryIntegrityProbeStale`, `RegistryCatalogInaccessible`. A condensed sketch of the walk follows this list.
- Post-mortem — this document. Linked from `.claude/reference/service-catalog.md` via the new runbook.
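The probe's walk, condensed, under assumptions: `CREDS` and the Pushgateway URL are placeholders, catalog pagination is elided, and the real CronJob script may differ in detail. The post-push check in `build-ci-image.yml` is the same walk scoped to the just-pushed tag:

```bash
#!/usr/bin/env bash
# Walk the catalog, HEAD every tag's manifest, follow image-index children,
# push the failure count to Pushgateway. Pagination elided for brevity.
set -uo pipefail
REG="https://registry.viktorbarzin.me:5050"
CREDS="user:pass"                              # placeholder
PUSHGW="http://pushgateway.monitoring:9091"    # placeholder
ACCEPT="application/vnd.oci.image.index.v1+json, application/vnd.oci.image.manifest.v1+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.docker.distribution.manifest.v2+json"

failures=0
for repo in $(curl -fsS -u "$CREDS" "$REG/v2/_catalog" | jq -r '.repositories[]'); do
  for tag in $(curl -fsS -u "$CREDS" "$REG/v2/$repo/tags/list" | jq -r '.tags[]? // empty'); do
    body=$(curl -fsS -u "$CREDS" -H "Accept: $ACCEPT" "$REG/v2/$repo/manifests/$tag") \
      || { echo "BROKEN $repo:$tag (tag manifest unfetchable)"; failures=$((failures+1)); continue; }
    # Image indexes list children under .manifests[]; plain manifests don't.
    for child in $(jq -r '.manifests[]?.digest // empty' <<<"$body"); do
      code=$(curl -sS -o /dev/null -w '%{http_code}' -u "$CREDS" -I \
             -H "Accept: $ACCEPT" "$REG/v2/$repo/manifests/$child")
      [ "$code" = 200 ] || { echo "ORPHAN $repo:$tag -> $child ($code)"; failures=$((failures+1)); }
    done
  done
done

# The alert fires on registry_manifest_integrity_failures > 0.
echo "registry_manifest_integrity_failures $failures" \
  | curl -fsS --data-binary @- "$PUSHGW/metrics/job/registry-integrity-probe"
```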
### Phase 2 — Prevention
- Pin `registry:2` → `registry:2.8.3` in `modules/docker-registry/docker-compose.yml` (all six registry services). Removes the floating-tag footgun.
- Extend `fix-broken-blobs.sh` to scan every `_manifests/revisions/sha256/<digest>` that is an image index and flag children whose blob `data` file is missing (sketch below). The script prints a loud WARNING per orphan; it does not auto-delete the index, because deleting a published image is a conscious decision, not an automated repair.
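What the extended scan looks like, as a sketch: it assumes registry:2's filesystem-driver layout and single-level repository names, and it only warns, per the no-auto-delete decision above:

```bash
#!/usr/bin/env bash
# For every manifest revision that is an image index, verify each child's
# blob data file still exists. Warn loudly; never delete.
# Assumes single-level repo names (no "org/name" nesting).
ROOT=/var/lib/registry/docker/registry/v2

for link in "$ROOT"/repositories/*/_manifests/revisions/sha256/*/link; do
  [ -f "$link" ] || continue
  digest=$(cat "$link")            # file contains "sha256:<hex>"
  hex=${digest#sha256:}
  blob="$ROOT/blobs/sha256/${hex:0:2}/$hex/data"
  [ -f "$blob" ] || continue
  # Only image indexes / manifest lists carry a .manifests[] array.
  for child in $(jq -r '.manifests[]?.digest // empty' "$blob" 2>/dev/null); do
    chex=${child#sha256:}
    if [ ! -f "$ROOT/blobs/sha256/${chex:0:2}/$chex/data" ]; then
      echo "WARNING: orphan index child $child (parent $digest in $link)"
    fi
  done
done
```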
### Phase 3 — Recovery tooling
- Manual event trigger on `build-ci-image.yml`. Rebuilds no longer need a cosmetic Dockerfile edit — POST to the Woodpecker API (hedged example below) or click "Run manually" in the UI.
- Runbook `docs/runbooks/registry-rebuild-image.md` — exact command sequence for the next time this happens, plus fallback paths.
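The API path, as a hedged sketch: the endpoint shape and bearer-token header are assumptions based on Woodpecker's REST API, and `WOODPECKER_HOST`, `REPO_ID`, and `WOODPECKER_TOKEN` are placeholders — verify against the instance's API docs before scripting around it:

```bash
# Hedged sketch: trigger a manual pipeline run via Woodpecker's REST API.
# Endpoint path and auth header are assumptions — verify against the
# instance's API docs. All three variables are placeholders.
curl -fsS -X POST \
  -H "Authorization: Bearer $WOODPECKER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"branch": "master"}' \
  "$WOODPECKER_HOST/api/repos/$REPO_ID/pipelines"
```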
## Out of Scope
- Pull-through caches. The DockerHub / GHCR mirrors on `:5000`/`:5010` are healthy (74.5% cache hit rate, no 404s). The orphan problem is private-registry-only. No changes to nginx or containerd `hosts.toml`.
- Registry HA / replication. Single-VM SPOF is a known architectural choice. Harbor or a replicated registry would solve more than this incident requires, at multi-day cost. Synology offsite snapshots already give RPO < 1 day.
- Disabling `cleanup-tags.sh`. Keeping storage bounded is still necessary; the fix is detection + rebuild, not "stop cleaning up".
## Lessons
- Repeat incidents deserve root-cause work, not a third hot-fix. The 2026-04-13 incident was closed when CI turned green. Without a probe and without a scan for orphan indexes, the next incident was inevitable — and it happened six days later against a different image.
- "No alert fired, so it wasn't detected" is a monitoring gap, not an outage feature. The registry was serving 404s for 2+ hours before anyone noticed, because our only signal was "pipeline failures" and our eyes were elsewhere. The new probe closes that gap.
- CI pipelines should verify their own output. The `buildx --push` "success" exit code is not a guarantee of pulled-back integrity — as this incident proves. A 30-second post-push HEAD walk is cheap insurance.
## Related
- Prior incident (same failure mode, different image): memory `709`/`710` — 2026-04-13.
- Runbook: `docs/runbooks/registry-rebuild-image.md` (new).
- Hot-fix commits: `a05d63ee`, `6371e75e`, `c113be4d`.
- Upstream bug class: distribution/distribution#3324.
## 2026-04-19 — Bulk cleanup sweep (beads code-8hk + code-jh3c)
Same failure class, broader scope. The registry-integrity-probe
surfaced 38 broken manifest references persisting after the 04-19
infra-ci fix. beads-dispatcher + beads-reaper CronJobs were stuck
ImagePullBackOff on claude-agent-service:0c24c9b6 for >6h. All 34
affected repo:tag pairs were OCI indexes whose linux/amd64 child
manifests were absent from blob storage (same orphan pattern).
Action taken:
- Bumped `beads-server/main.tf` var default `claude_agent_service_image_tag` from `0c24c9b6` → `2fd7670d` (the canonical tag in `claude-agent-service/main.tf`), reused — same image already healthy on the registry. `scripts/tg apply` on `beads-server`. Deleted the stuck Jobs so new CronJob ticks could fire.
- Enumerated 34 broken `(repo, tag, parent_digest)` triples via HTTP probe using the `registry-probe-credentials` K8s Secret. Deleted each via `DELETE /v2/<repo>/manifests/<digest>` (33× 202, 1× 404 — `claude-agent-service:latest` pointed at an already-deleted digest). The delete + GC sequence is sketched after this list.
- Ran `docker exec registry-private /bin/registry garbage-collect /etc/docker/registry/config.yml` — reclaimed ~3GB of orphan blob storage.
- Rebuilt the 3 in-use broken tags (all 3 OCI-index parents pointed at missing children, so no cached copies would survive a pod reschedule):
  - `freedify:latest` / `freedify:c803de02` — built on the registry VM directly (no CI pipeline exists for this image; Python FastAPI).
  - `beadboard:17a38e43` / `beadboard:latest` — GHA `workflow_dispatch` failed at registry login (missing `REGISTRY_USERNAME`/`REGISTRY_PASSWORD` GH secrets). Built on the registry VM directly as the fallback. The GitHub secret gap is a follow-up — beads `code-8hk` notes it.
  - `priority-pass-backend:ae1420a0` / `priority-pass-frontend:ae1420a0` — Woodpecker pipeline #8 on repo 81. The pipeline `kubectl set image`'d the Deployment to `ae1420a0` (drift vs the TF `v5`/`v8` defaults, but that drift is pre-existing, not introduced by this cleanup).
  - `wealthfolio-sync:latest` — not rebuilt. Monthly CronJob (next run 2026-05-01), no source tree or CI pipeline available in the monorepo; deferred as a separate follow-up.
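For reference, the delete + GC sequence — a sketch with `CREDS` and the repo/digest as placeholders; note the registry must be running with deletes enabled:

```bash
#!/usr/bin/env bash
# Delete a broken parent manifest by digest, then reclaim blobs via GC.
# CREDS and <repo>/<parent-digest> are placeholders; requires the registry
# to run with REGISTRY_STORAGE_DELETE_ENABLED=true (or config.yml equivalent).
REG="https://registry.viktorbarzin.me:5050"
CREDS="user:pass"

# DELETE by digest, not tag. 202 Accepted on success; 404 if the digest
# is already gone (seen once in this sweep).
curl -sS -o /dev/null -w '%{http_code}\n' -u "$CREDS" -X DELETE \
  "$REG/v2/<repo>/manifests/sha256:<parent-digest>"

# Blobs are only reclaimed by an explicit GC pass on the registry VM.
docker exec registry-private \
  /bin/registry garbage-collect /etc/docker/registry/config.yml
```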
Post-cleanup state:
- Probe: 39 tags, 0 failures. `registry_manifest_integrity_failures{} = 0`.
- Alert `RegistryManifestIntegrityFailure` cleared (was firing for 5h 32m).
- No `ImagePullBackOff` pods anywhere in the cluster.
- 28 of 34 deleted manifests were dangling tags not referenced by any workload — old `382d6b1*`, `v2`–`v7`, `yt-fallback`, etc. Safe deletes, no rebuilds needed.
Permanent fix still in flight: Phase 2/3 of this post-mortem (post-push verification in CI, atomic `cleanup-tags.sh`) — not addressed by this cleanup. The probe continues to be the authoritative detector.