Post-Mortem: Private Registry Orphan OCI-Index — Repeat Incident

| Field | Value |
| --- | --- |
| Date | 2026-04-19 (first occurrence 2026-04-13) |
| Duration | ~40 min of blocked CI each time; only detected via pipeline failures |
| Severity | SEV2 — all infra CI pipelines using infra-ci:latest failed (P366 → P376, all exit 126 "image can't be pulled") |
| Affected Services | Every Woodpecker pipeline that starts with `image: registry.viktorbarzin.me:5050/infra-ci:latest`: default.yml, build-cli.yml, renew-tls.yml, drift-detection.yml, provision-user.yml, k8s-portal.yml, postmortem-todos.yml, issue-automation.yml, pve-nfs-exports-sync.yml |
| Status | Hot fix green (three commits: a05d63ee, 6371e75e, c113be4d — URL fix + rebuild). This doc captures the permanent fix landed in the same branch. |

Summary

On 2026-04-19 ~09:00 UTC, every infra CI pipeline started failing at the clone step with "image can't be pulled". The image in question — the CI toolchain image registry.viktorbarzin.me:5050/infra-ci:latest — resolved to an OCI image index whose linux/amd64 platform manifest (sha256:98f718c8…) and its in-toto attestation (sha256:27d5ab83…) returned HTTP 404 from the private registry. The index record itself still existed — it's the children that had been garbage-collected out from under it.

This is the second identical incident: the same failure mode occurred on 2026-04-13 against a different image. Both times the immediate fix was to rebuild the image from scratch; both times the root cause was left unaddressed.
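
For concreteness, this is the shape of the manual check that surfaced the orphan (a minimal sketch using curl rather than skopeo; the credential variables, Accept headers, and the truncated child digest are illustrative placeholders, not verbatim values from the incident):

```bash
REG=registry.viktorbarzin.me:5050

# The tag itself still resolves: the registry happily returns the OCI image index.
curl -sI -u "$REGISTRY_USER:$REGISTRY_PASSWORD" \
  -H 'Accept: application/vnd.oci.image.index.v1+json' \
  "https://$REG/v2/infra-ci/manifests/latest"
# -> HTTP/1.1 200 OK

# But the linux/amd64 child manifest listed inside that index is gone.
curl -sI -u "$REGISTRY_USER:$REGISTRY_PASSWORD" \
  -H 'Accept: application/vnd.oci.image.manifest.v1+json' \
  "https://$REG/v2/infra-ci/manifests/sha256:98f718c8<rest-of-digest>"
# -> HTTP/1.1 404 Not Found
```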

Impact

  • User-facing: all CI pipelines failed. No automated Terraform applies, no TLS renewal, no drift detection. Manual workflows (Woodpecker UI reruns) all failed with the same error.
  • Blast radius: every pipeline that pulls infra-ci. Does NOT affect k8s workloads (those pull via containerd, which goes through the pull-through proxy on :5000/:5010 — a completely different code path).
  • Duration on 2026-04-19: roughly 40 min from the start of investigation to the hot-fix commit c113be4d and the rebuild restoring :latest; pipelines had, however, been failing undetected since the first P366 failure at ~09:00. Pipelines that had already been triggered queued up until the rebuild restored :latest.
  • Data loss: none. The registry has the index object; the child manifests are re-producible by rebuilding the source image.
  • Monitoring gap: nothing alerted. The only signal was the individual pipeline failures from Woodpecker. No Prometheus alert fires on "the registry served a 404 for a tag that exists".

Timeline (UTC, 2026-04-19)

| Time | Event |
| --- | --- |
| ~09:00 | P366 (default.yml on master) fails with exit 126. |
| 09:00–11:00 | P367, P368, … P376 all fail with the same error. Nobody pages — there's no alert configured. |
| 11:15 | User notices and investigates: skopeo inspect reveals the missing platform manifest. |
| 11:20 | Hot-fix phase begins: a05d63ee fixes a push-URL misalignment, 6371e75e and c113be4d trigger a full rebuild. |
| 11:40 | Rebuild completes; infra-ci:latest resolves to a fresh, complete index. Pipelines green from P377 onward. |
| 11:45 | User requests a proper root-cause fix: "this is the second time — what's actually broken?" |
| 12:00 | Investigation begins (this document's work). |

Root Cause Chain

[1] cleanup-tags.sh runs daily at 02:00 on the registry VM
 └─> For each repository, keeps the last 10 tags by mtime, rmtrees the rest.
     This walks `_manifests/tags/<tag>` directly, bypassing the registry API.
         │
         ├─> [2] Subtle on-disk asymmetry: a registry:2 tag rmtree removes
         │    BOTH the `_manifests/tags/<tag>/` dir AND — on 2.8.x — the
         │    per-repo revision-link files under
         │    `<repo>/_manifests/revisions/sha256/<child-digest>/link` for
         │    every child referenced by that tag's index. The raw blob data
         │    under `/var/lib/registry/docker/registry/v2/blobs/sha256/<.>/data`
         │    is NOT touched — GC owns that, and GC only runs Sunday.
         │
         ├─> [3] If ANOTHER tag's index still references one of those same
         │    children (common — successive rebuilds share layers), the child
         │    blob survives. But the revision-link is gone, so the registry
         │    API can no longer map `<repo>/manifests/sha256:<child>` back
         │    to the blob. HEAD → 404, even though the bytes are on disk.
         │    distribution/distribution#3324 is the upstream class of this bug.
         │
         └─> [4] Result: the surviving index (e.g. `infra-ci:5319f03e`) is
              intact on disk, its children's blob data files are intact on
              disk, but HEAD `/v2/infra-ci/manifests/sha256:98f718c8…`
              returns 404. The registry has the bytes, but cannot find them
              through the API because the per-repo link bridge is gone.

[pull] containerd resolves `infra-ci:latest`
         │
         ├─> GET /v2/infra-ci/manifests/latest → 200 OK, returns the index
         │
         └─> GET /v2/infra-ci/manifests/sha256:98f718c8… → 404 Not Found
              └─> containerd fails the pull with "manifest unknown"
                    └─> woodpecker exit 126

Detection-gotcha uncovered 2026-04-19 while implementing fix-broken-blobs.sh: a scan that checks /blobs/sha256/<child>/data for presence is NOT equivalent to "can the registry serve this child?" The authoritative check is whether <repo>/_manifests/revisions/sha256/<child>/link exists. The script was rewritten to check the per-repo link file after the HTTP probe caught 38 real orphans the filesystem scan had reported clean.
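
To make the distinction concrete, a minimal sketch of the two checks, assuming the stock registry:2 filesystem layout under /var/lib/registry and placeholder repo/digest values:

```bash
ROOT=/var/lib/registry/docker/registry/v2
repo=infra-ci
child=98f718c8aaaa...        # placeholder: full hex digest of the missing child

# Naive check: is the blob data on disk? Frequently YES, because another tag's
# index still shares the child, so the filesystem scan reports the repo clean.
test -f "$ROOT/blobs/sha256/${child:0:2}/$child/data" \
  && echo "blob data present"

# Authoritative check: does the per-repo revision link exist? Without it the
# registry API cannot map <repo>/manifests/sha256:<child> back to the blob.
test -f "$ROOT/repositories/$repo/_manifests/revisions/sha256/$child/link" \
  || echo "ORPHAN: $repo cannot serve sha256:$child (link missing)"
```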

Why Existing Remediation Missed It

  1. fix-broken-blobs.sh only scans layer links. The existing cron walks _layers/sha256/ and removes link files whose blob data is missing. It does NOT inspect _manifests/revisions/sha256/ to see whether an image-index's referenced children still exist. That's exactly the class of orphan this incident represents.
  2. registry:2 image tag was floating. docker-compose.yml pinned only to registry:2. Whatever Docker Inc. last rebuilt as "v2-current" was running, with no version pin. Any regression in the upstream walker would silently swap in.
  3. No integrity monitoring. Prometheus alerted on cache hit rate and registry-down, but nothing probes "are the manifests the registry advertises actually fetchable?"
  4. CI pipeline didn't verify its own push. buildx --push returns success as soon as it uploads. If a child blob upload 0-byted or the client disconnected mid-push (distinct from the GC mode but the same on-disk symptom), nothing would notice until the next pull.

Permanent Fix — Three Phases

Phase 1 — Detection (ship today)

  1. Post-push integrity check in .woodpecker/build-ci-image.yml. After build-and-push, a new step walks the just-pushed manifest (and every child of an image index) and HEADs every referenced blob. Any non-200 fails the pipeline immediately, catching broken pushes at the source rather than leaking them to consumers (see the sketch after this list).
  2. Prometheus alert RegistryManifestIntegrityFailure. A new CronJob (registry-integrity-probe, every 15m, in the monitoring namespace) walks the private registry's catalog, HEADs every tag's manifest, follows each image index's children, and pushes registry_manifest_integrity_failures to Pushgateway. Accompanying alerts: RegistryIntegrityProbeStale, RegistryCatalogInaccessible.
  3. Post-mortem — this document. Linked from .claude/reference/service-catalog.md via the new runbook.
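
The post-push step in item 1 amounts to re-walking what was just pushed. A minimal sketch, assuming curl and jq are available in the CI image and credentials arrive as REGISTRY_USER/REGISTRY_PASSWORD (variable names illustrative; the real step lives in .woodpecker/build-ci-image.yml):

```bash
#!/usr/bin/env bash
# Fail the pipeline if the just-pushed tag, or any child of its image index,
# cannot be fetched back from the registry.
set -euo pipefail
REG=registry.viktorbarzin.me:5050
IMAGE=infra-ci
TAG=latest
ACCEPT='application/vnd.oci.image.index.v1+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.oci.image.manifest.v1+json'

top=$(curl -sf -u "$REGISTRY_USER:$REGISTRY_PASSWORD" -H "Accept: $ACCEPT" \
  "https://$REG/v2/$IMAGE/manifests/$TAG")

# For an OCI index, .manifests[].digest lists the per-platform children
# (including attestation manifests). A plain manifest yields no entries.
for digest in $(echo "$top" | jq -r '.manifests[]?.digest'); do
  code=$(curl -s -o /dev/null -w '%{http_code}' -I \
    -u "$REGISTRY_USER:$REGISTRY_PASSWORD" -H "Accept: $ACCEPT" \
    "https://$REG/v2/$IMAGE/manifests/$digest")
  [ "$code" = 200 ] || { echo "FAIL: child $digest -> HTTP $code"; exit 1; }
done
# A fuller version would also HEAD each layer/config blob under /v2/$IMAGE/blobs/.
echo "OK: $IMAGE:$TAG and all index children are fetchable"
```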

Phase 2 — Prevention

  1. Pin registry:2 → registry:2.8.3 in modules/docker-registry/docker-compose.yml (all six registry services). Removes the floating-tag footgun.
  2. Extend fix-broken-blobs.sh to scan every _manifests/revisions/sha256/<digest> that is an image index and flag children whose blob data file is missing. The script prints a loud WARNING per orphan; it does not auto-delete the index, because deleting a published image is a conscious decision, not an automated repair.
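
A rough sketch of that extended scan, assuming it runs on the registry VM with jq installed and the default storage root (it only warns, per the design decision above):

```bash
ROOT=/var/lib/registry/docker/registry/v2

find "$ROOT/repositories" -path '*/_manifests/revisions/sha256/*/link' |
while read -r link; do
  digest=$(cat "$link"); hex=${digest#sha256:}
  data="$ROOT/blobs/sha256/${hex:0:2}/$hex/data"
  [ -f "$data" ] || continue          # missing revision data is a different failure

  # Only image indexes / manifest lists have children to verify.
  case "$(jq -r '.mediaType // empty' "$data" 2>/dev/null)" in
    application/vnd.oci.image.index.v1+json|\
    application/vnd.docker.distribution.manifest.list.v2+json)
      repo=${link#"$ROOT/repositories/"}; repo=${repo%%/_manifests*}
      jq -r '.manifests[].digest' "$data" | while read -r child; do
        chex=${child#sha256:}
        # Flag a child whose blob data OR per-repo revision link is gone;
        # the link is the authoritative signal (see the detection gotcha above).
        if [ ! -f "$ROOT/blobs/sha256/${chex:0:2}/$chex/data" ] ||
           [ ! -f "$ROOT/repositories/$repo/_manifests/revisions/sha256/$chex/link" ]; then
          echo "WARNING: $repo index $digest references unreachable child $child"
        fi
      done ;;
  esac
done
```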

Phase 3 — Recovery tooling

  1. Manual event trigger on build-ci-image.yml. Rebuilds no longer need a cosmetic Dockerfile edit — POST to the Woodpecker API or click "Run manually" in the UI.
  2. Runbook docs/runbooks/registry-rebuild-image.md — exact command sequence for the next time this happens, plus fallback paths.
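
For reference, the fallback path the runbook documents is essentially "rebuild on the registry VM, push, and verify before re-running CI". A hedged sketch (build-context path and tag are illustrative; the Woodpecker API trigger variant is documented in the runbook itself):

```bash
REG=registry.viktorbarzin.me:5050
IMAGE=infra-ci
TAG=latest

# Rebuild and push directly from the source checkout on the registry VM.
docker buildx build --platform linux/amd64 -t "$REG/$IMAGE:$TAG" --push .

# skopeo inspect (without --raw) resolves the platform child manifest, so it
# fails loudly on exactly the orphan-index condition this incident produced.
# Add --creds user:pass if the registry requires auth.
skopeo inspect "docker://$REG/$IMAGE:$TAG" >/dev/null && echo "tag resolvable"
```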

Out of Scope

  • Pull-through caches. The DockerHub / GHCR mirrors on :5000 / :5010 are healthy (74.5% cache hit rate, no 404s). The orphan problem is private-registry-only. No changes to nginx or containerd hosts.toml.
  • Registry HA / replication. Single-VM SPOF is a known architectural choice. Harbor or a replicated registry would solve more than this incident requires, at multi-day cost. Synology offsite snapshots already give RPO < 1 day.
  • Disabling cleanup-tags.sh. Keeping storage bounded is still necessary; the fix is detection + rebuild, not "stop cleaning up".

Lessons

  • Repeat incidents deserve root-cause work, not a third hot-fix. The 2026-04-13 incident was closed when CI turned green. Without a probe and without a scan for orphan indexes, the next incident was inevitable — and it happened six days later against a different image.
  • "No alert fired, so it wasn't detected" is a monitoring gap, not an outage feature. The registry was serving 404s for 2+ hours before anyone noticed, because our only signal was "pipeline failures" and our eyes were elsewhere. The new probe closes that gap.
  • CI pipelines should verify their own output. The buildx --push "success" exit code is not a guarantee of pulled-back integrity — as this incident proves. A 30-second post-push HEAD walk is cheap insurance.
  • Prior incident (same failure mode, different image): memory 709 / 710 — 2026-04-13.
  • Runbook: docs/runbooks/registry-rebuild-image.md (new).
  • Hot-fix commits: a05d63ee, 6371e75e, c113be4d.
  • Upstream bug class: distribution/distribution#3324.

2026-04-19 — Bulk cleanup sweep (beads code-8hk + code-jh3c)

Same failure class, broader scope. The registry-integrity-probe surfaced 38 broken manifest references persisting after the 04-19 infra-ci fix. beads-dispatcher + beads-reaper CronJobs were stuck ImagePullBackOff on claude-agent-service:0c24c9b6 for >6h. All 34 affected repo:tag pairs were OCI indexes whose linux/amd64 child manifests were absent from blob storage (same orphan pattern).

Action taken:

  1. Bumped the beads-server/main.tf var default claude_agent_service_image_tag from 0c24c9b6 → 2fd7670d (the canonical tag in claude-agent-service/main.tf, reused because the same image is already healthy in the registry). Ran scripts/tg apply on beads-server and deleted the stuck Jobs so new CronJob ticks could fire.
  2. Enumerated 34 broken (repo, tag, parent_digest) triples via HTTP probe using the registry-probe-credentials K8s Secret. Deleted each via DELETE /v2/<repo>/manifests/<digest> (33× 202, 1× 404 — claude-agent-service:latest pointed at an already-deleted digest). A sketch of the delete loop follows this list.
  3. Ran docker exec registry-private /bin/registry garbage-collect /etc/docker/registry/config.yml — reclaimed ~3GB of orphan blob storage.
  4. Rebuilt the 3 in-use broken tags (all 3 OCI-index parents pointed at missing children, so no cached copies would survive pod reschedule):
    • freedify:latest / freedify:c803de02 — built on registry VM directly (no CI pipeline exists for this image; python FastAPI).
    • beadboard:17a38e43 / beadboard:latest — GHA workflow_dispatch failed at registry login (missing REGISTRY_USERNAME/REGISTRY_PASSWORD GH secrets). Built on registry VM directly as the fallback. GitHub secret gap is a follow-up — beads code-8hk notes it.
    • priority-pass-backend:ae1420a0 / priority-pass-frontend:ae1420a0 — Woodpecker pipeline #8 on repo 81. Pipeline kubectl set image'd the Deployment to ae1420a0 (drift vs TF v5/v8 defaults, but that drift is pre-existing, not introduced by this cleanup).
    • wealthfolio-sync:latest — not rebuilt. Monthly CronJob (next run 2026-05-01), no source tree or CI pipeline available in the monorepo; deferred for separate follow-up.
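
A simplified sketch of the delete loop from step 2 (the (repo, digest) pairs came from the integrity probe's output; the file name and credential variables are illustrative):

```bash
REG=registry.viktorbarzin.me:5050

# broken.tsv: one "<repo><TAB><parent_digest>" line per orphaned index.
while IFS=$'\t' read -r repo digest; do
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -u "$PROBE_USER:$PROBE_PASS" -X DELETE \
    "https://$REG/v2/$repo/manifests/$digest")
  echo "$repo $digest -> HTTP $code"   # 202 = deleted, 404 = already gone
done < broken.tsv

# Step 3, verbatim: reclaim the now-unreferenced blobs.
docker exec registry-private /bin/registry garbage-collect /etc/docker/registry/config.yml
```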

Post-cleanup state:

  • Probe: 39 tags, 0 failures. registry_manifest_integrity_failures{} = 0.
  • Alert RegistryManifestIntegrityFailure cleared (was firing for 5h 32m).
  • No ImagePullBackOff pods anywhere in the cluster.
  • 28 of 34 deleted manifests were dangling tags not referenced by any workload — old 382d6b1*, v2-v7, yt-fallback, etc. Safe deletes, no rebuilds needed.

Permanent fix still in flight: Phase 2/3 of this post-mortem (post-push verification in CI, atomic cleanup-tags.sh) — not addressed by this cleanup. The probe continues to be the authoritative detector.