# Post-Mortem: Private Registry Orphan OCI-Index — Repeat Incident
| Field | Value |
|-------|-------|
| **Date** | 2026-04-19 (first occurrence 2026-04-13) |
| **Duration** | ~40 min of blocked CI each time; only detected via pipeline failures |
| **Severity** | SEV2 — all infra CI pipelines using `infra-ci:latest` failed (P366 → P376 all exit 126 "image can't be pulled") |
| **Affected Services** | Every Woodpecker pipeline that starts with `image: registry.viktorbarzin.me:5050/infra-ci:latest`: `default.yml`, `build-cli.yml`, `renew-tls.yml`, `drift-detection.yml`, `provision-user.yml`, `k8s-portal.yml`, `postmortem-todos.yml`, `issue-automation.yml`, `pve-nfs-exports-sync.yml` |
| **Status** | Hot fix green (three commits: `a05d63ee`, `6371e75e`, `c113be4d` — URL fix + rebuild). This doc captures the permanent fix landed in the same branch. |
## Summary
On 2026-04-19 ~09:00 UTC, every infra CI pipeline started failing at the
`clone` step with "image can't be pulled". The image in question — the CI
toolchain image `registry.viktorbarzin.me:5050/infra-ci:latest` — resolved
to an OCI image index whose `linux/amd64` platform manifest
(`sha256:98f718c8…`) and its in-toto attestation
(`sha256:27d5ab83…`) returned **HTTP 404** from the private registry.
The index record itself still existed — it's the children that had been
garbage-collected out from under it.
This is the **second identical incident**: the same failure mode occurred
on 2026-04-13 against a different image. Both times the immediate fix was
to rebuild the image from scratch; both times the root cause was left
unaddressed.
## Impact
- **User-facing**: all CI pipelines failed. No automated Terraform applies,
no TLS renewal, no drift detection. Manual workflows (Woodpecker UI
reruns) all failed with the same error.
- **Blast radius**: every pipeline that pulls `infra-ci`. Does NOT affect
k8s workloads (those pull via containerd, which goes through the
pull-through proxy on :5000/:5010 — a completely different code path).
- **Duration on 2026-04-19**: from first P366 failure to the hot-fix
commit `c113be4d` — roughly 40 min. Pipelines that had already been
triggered queued up until the rebuild restored `:latest`.
- **Data loss**: none. The registry has the index object; the child
manifests are re-producible by rebuilding the source image.
- **Monitoring gap**: nothing alerted. The only signal was the individual
pipeline failures from Woodpecker. No Prometheus alert fires on "the
registry served a 404 for a tag that exists".
## Timeline (UTC, 2026-04-19)
| Time | Event |
|------|-------|
| ~09:00 | P366 (`default.yml` on master) fails with exit 126. |
| 09:00–11:00 | P367, P368, … P376 all fail with the same error. Nobody pages — there's no alert configured. |
| 11:15 | User notices and investigates: `skopeo inspect` reveals the missing platform manifest. |
| 11:20 | Hot fix phase begins: `a05d63ee` fixes a push-URL misalignment, `6371e75e` and `c113be4d` trigger a full rebuild. |
| 11:40 | Rebuild completes; `infra-ci:latest` resolves to a fresh, complete index. Pipelines green from P377 onward. |
| 11:45 | User requests a proper root-cause fix: "this is the second time — what's actually broken?" |
| 12:00 | Investigation begins (this document's work). |
## Root Cause Chain
```
[1] cleanup-tags.sh runs daily at 02:00 on the registry VM
└─> For each repository, keeps the last 10 tags by mtime, rmtrees the rest.
This walks `_manifests/tags/<tag>` directly, bypassing the registry API.
├─> [2] Subtle on-disk asymmetry: a registry:2 tag rmtree removes
│ BOTH the `_manifests/tags/<tag>/` dir AND — on 2.8.x — the
│ per-repo revision-link files under
│ `<repo>/_manifests/revisions/sha256/<child-digest>/link` for
│ every child referenced by that tag's index. The raw blob data
│ under `/var/lib/registry/docker/registry/v2/blobs/sha256/<.>/data`
│ is NOT touched — GC owns that, and GC only runs Sunday.
├─> [3] If ANOTHER tag's index still references one of those same
│ children (common — successive rebuilds share layers), the child
│ blob survives. But the revision-link is gone, so the registry
│ API can no longer map `<repo>/manifests/sha256:<child>` back
│ to the blob. HEAD → 404, even though the bytes are on disk.
│ distribution/distribution#3324 is the upstream class of this bug.
└─> [4] Result: the surviving index (e.g. `infra-ci:5319f03e`) is
intact on disk, its children's blob data files are intact on
disk, but HEAD `/v2/infra-ci/manifests/sha256:98f718c8…`
returns 404. The registry has the bytes, but cannot find them
through the API because the per-repo link bridge is gone.
[pull] containerd resolves `infra-ci:latest`
├─> GET /v2/infra-ci/manifests/latest → 200 OK, returns the index
└─> GET /v2/infra-ci/manifests/sha256:98f718c8… → 404 Not Found
└─> containerd fails the pull with "manifest unknown"
└─> woodpecker exit 126
```
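The pull-path failure is easy to reproduce by hand against the registry HTTP API. A minimal sketch, assuming basic-auth credentials in `REGISTRY_USER`/`REGISTRY_PASS` (illustrative variable names) and `curl` + `jq` on the path:

```bash
REG=registry.viktorbarzin.me:5050
AUTH="${REGISTRY_USER}:${REGISTRY_PASS}"   # illustrative; use whatever auth the registry expects

# 1. The tag itself resolves fine and returns an OCI image index.
curl -fsS -u "$AUTH" \
  -H 'Accept: application/vnd.oci.image.index.v1+json' \
  "https://$REG/v2/infra-ci/manifests/latest" > index.json

# 2. Pull the linux/amd64 child digest out of the index.
CHILD=$(jq -r '.manifests[]
  | select(.platform.os == "linux" and .platform.architecture == "amd64")
  | .digest' index.json)

# 3. Request the child the way containerd would.
#    During the incident this printed 404 even though the blob bytes were on disk.
curl -s -o /dev/null -w '%{http_code}\n' -u "$AUTH" -I \
  -H 'Accept: application/vnd.oci.image.manifest.v1+json' \
  "https://$REG/v2/infra-ci/manifests/$CHILD"
```

A 200 on the index followed by a non-200 on one of its children is exactly the orphan signature the integrity probe (Phase 1 below) looks for.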
> **Detection-gotcha** uncovered 2026-04-19 while implementing
> `fix-broken-blobs.sh`: a scan that checks `/blobs/sha256/<child>/data` for
> presence is NOT equivalent to "can the registry serve this child?" The
> authoritative check is whether
> `<repo>/_manifests/revisions/sha256/<child>/link` exists. The script
> was rewritten to check the per-repo link file after the HTTP probe
> caught 38 real orphans the filesystem scan had reported clean.
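A minimal side-by-side of the two checks, assuming the default `registry:2` storage root on the registry VM (the script name and arguments are illustrative):

```bash
#!/usr/bin/env bash
# check-orphan.sh <repo> <sha256:digest>  (sketch; run on the registry VM)
set -euo pipefail
ROOT=/var/lib/registry/docker/registry/v2
REPO=$1
HEX=${2#sha256:}

# What a naive filesystem scan looks at: the shared blob data file.
BLOB="$ROOT/blobs/sha256/${HEX:0:2}/${HEX}/data"
# What the API actually needs to serve GET /v2/<repo>/manifests/sha256:<digest>:
# the per-repo revision link.
LINK="$ROOT/repositories/$REPO/_manifests/revisions/sha256/${HEX}/link"

[[ -f $BLOB ]] && echo "blob data: present"     || echo "blob data: MISSING"
[[ -f $LINK ]] && echo "revision link: present" || echo "revision link: MISSING (API will 404)"
```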
## Why Existing Remediation Missed It
1. **`fix-broken-blobs.sh` only scans layer links.** The existing cron
   walks `_layers/sha256/` and removes link files whose blob `data` is
   missing. It does NOT inspect `_manifests/revisions/sha256/` to see
   whether an image-index's referenced children still exist. That's
   exactly the class of orphan this incident represents (the current
   walk is sketched after this list).
2. **`registry:2` image tag was floating.** `docker-compose.yml` pinned
only to `registry:2`. Whatever Docker Inc. last rebuilt as
"v2-current" was running, with no version pin. Any regression in
the upstream walker would silently swap in.
3. **No integrity monitoring.** Prometheus alerted on cache hit rate
and registry-down, but nothing probes "are the manifests the registry
advertises actually fetchable?"
4. **CI pipeline didn't verify its own push.** `buildx --push` returns
   success as soon as it uploads. If a child blob upload wrote zero
   bytes or the client disconnected mid-push (a different cause from
   the GC path, but the same on-disk symptom), nothing would notice
   until the next pull.
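For reference, the existing scan from item 1 amounts to roughly the following. This is an approximation of the behaviour described above, not the script's actual source; it shows why children reachable only through `_manifests/revisions/` are never examined:

```bash
# Approximate shape of the current layer-link walk (illustrative only).
ROOT=/var/lib/registry/docker/registry/v2
find "$ROOT/repositories" -path '*/_layers/sha256/*/link' | while read -r link; do
  digest=$(cat "$link")                       # e.g. sha256:abcd...
  hex=${digest#sha256:}
  blob="$ROOT/blobs/sha256/${hex:0:2}/${hex}/data"
  if [[ ! -f "$blob" ]]; then
    echo "WARNING: layer link $link points at missing blob $digest; removing"
    rm -f "$link"
  fi
done
# Nothing above ever visits */_manifests/revisions/sha256/*, so an index whose
# child manifests lost their revision links is invisible to this scan.
```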
## Permanent Fix — Three Phases
### Phase 1 — Detection (ship today)
1. **Post-push integrity check** in `.woodpecker/build-ci-image.yml`.
After `build-and-push`, a new step walks the just-pushed manifest
(and every child of an image index) and HEADs every referenced blob.
Any non-200 fails the pipeline immediately, catching broken pushes at
the source rather than leaking them to consumers.
2. **Prometheus alert `RegistryManifestIntegrityFailure`.** A new
   CronJob (`registry-integrity-probe`, every 15m, in the `monitoring`
   namespace) walks the private registry's catalog, HEADs every tag's
   manifest, follows each image index's children, and pushes
   `registry_manifest_integrity_failures` to Pushgateway (the probe
   loop is sketched after this list). Accompanying alerts:
   `RegistryIntegrityProbeStale`, `RegistryCatalogInaccessible`.
3. **Post-mortem** — this document. Linked from
`.claude/reference/service-catalog.md` via the new runbook.
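A minimal sketch of the probe loop from item 2, assuming basic-auth credentials, `jq`, and a plain text-format push to Pushgateway (the Pushgateway address and env-var names are illustrative; catalog pagination and retries are omitted for brevity):

```bash
#!/usr/bin/env bash
# registry-integrity-probe (sketch): HEAD every advertised manifest and every
# image-index child, then push the failure count to Pushgateway.
set -uo pipefail
REG=https://registry.viktorbarzin.me:5050
AUTH="${REGISTRY_USER}:${REGISTRY_PASS}"        # illustrative env vars
PUSHGW=http://pushgateway.monitoring:9091       # assumed in-cluster address
ACCEPT='Accept: application/vnd.oci.image.index.v1+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.docker.distribution.manifest.v2+json'

failures=0
for repo in $(curl -fsS -u "$AUTH" "$REG/v2/_catalog?n=1000" | jq -r '.repositories[]'); do
  for tag in $(curl -fsS -u "$AUTH" "$REG/v2/$repo/tags/list" | jq -r '.tags[]?'); do
    body=$(curl -fsS -u "$AUTH" -H "$ACCEPT" "$REG/v2/$repo/manifests/$tag") \
      || { echo "BROKEN $repo:$tag (tag manifest)"; ((failures++)); continue; }
    # If this is an index / manifest list, check every child digest it references.
    for child in $(jq -r '.manifests[]?.digest' <<<"$body"); do
      code=$(curl -s -o /dev/null -w '%{http_code}' -u "$AUTH" -I -H "$ACCEPT" \
        "$REG/v2/$repo/manifests/$child")
      [[ $code == 200 ]] || { echo "BROKEN $repo:$tag child $child -> $code"; ((failures++)); }
    done
  done
done

# Push the gauge; RegistryManifestIntegrityFailure fires when it is > 0.
cat <<EOF | curl -fsS --data-binary @- "$PUSHGW/metrics/job/registry-integrity-probe"
registry_manifest_integrity_failures $failures
EOF
```

The same HEAD walk, restricted to the just-pushed tag, is what the post-push step from item 1 runs inside `build-ci-image.yml`.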
### Phase 2 — Prevention
4. **Pin `registry:2` → `registry:2.8.3`** in
`modules/docker-registry/docker-compose.yml` (all six registry
services). Removes the floating-tag footgun.
5. **Extend `fix-broken-blobs.sh`** to scan every
`_manifests/revisions/sha256/<digest>` that is an image index and
flag children whose blob `data` file is missing. The script prints a
loud WARNING per orphan; it does not auto-delete the index, because
deleting a published image is a conscious decision, not an automated
repair.
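A sketch of the extension from item 5, again assuming the default storage layout (illustrative, not the shipped script):

```bash
# Sketch: flag index manifests whose children's blob data is gone. WARN only;
# never auto-delete a published index.
ROOT=/var/lib/registry/docker/registry/v2
find "$ROOT/repositories" -path '*/_manifests/revisions/sha256/*/link' | while read -r link; do
  parent=$(cat "$link"); phex=${parent#sha256:}
  manifest="$ROOT/blobs/sha256/${phex:0:2}/${phex}/data"
  [[ -f "$manifest" ]] || continue
  # Only image indexes / manifest lists carry a top-level .manifests array.
  jq -e '.manifests? | type == "array"' "$manifest" >/dev/null 2>&1 || continue
  jq -r '.manifests[].digest' "$manifest" | while read -r child; do
    chex=${child#sha256:}
    if [[ ! -f "$ROOT/blobs/sha256/${chex:0:2}/${chex}/data" ]]; then
      echo "WARNING: index $parent (via $link) references missing child $child"
    fi
  done
done
```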
### Phase 3 — Recovery tooling
6. **Manual event trigger** on `build-ci-image.yml`. Rebuilds no longer
need a cosmetic Dockerfile edit — POST to the Woodpecker API or
click "Run manually" in the UI.
7. **Runbook** `docs/runbooks/registry-rebuild-image.md` — exact
command sequence for the next time this happens, plus fallback paths.
## Out of Scope
- **Pull-through caches.** The DockerHub / GHCR mirrors on
`:5000` / `:5010` are healthy (74.5% cache hit rate, no 404s). The
orphan problem is private-registry-only. No changes to nginx or
containerd `hosts.toml`.
- **Registry HA / replication.** Single-VM SPOF is a known
architectural choice. Harbor or a replicated registry would solve
more than this incident requires, at multi-day cost. Synology offsite
snapshots already give RPO < 1 day.
- **Disabling `cleanup-tags.sh`.** Keeping storage bounded is still
necessary; the fix is detection + rebuild, not "stop cleaning up".
## Lessons
- **Repeat incidents deserve root-cause work, not a third hot-fix.** The
  2026-04-13 incident was closed when CI turned green. Without a probe
  and without a scan for orphan indexes, the next incident was
  inevitable, and it happened six days later against a different image.
- **"No alert fired, so it wasn't detected" is a monitoring gap, not an
outage feature.** The registry was serving 404s for 2+ hours before
anyone noticed, because our only signal was "pipeline failures" and
our eyes were elsewhere. The new probe closes that gap.
- **CI pipelines should verify their own output.** The `buildx --push`
  "success" exit code is no guarantee the image can be pulled back
  intact, as this incident proves. A 30-second post-push HEAD walk is
  cheap insurance.
## Related
- **Prior incident (same failure mode, different image)**: memory `709`
  / `710`, 2026-04-13.
- **Runbook**: `docs/runbooks/registry-rebuild-image.md` (new).
- **Hot-fix commits**: `a05d63ee`, `6371e75e`, `c113be4d`.
- **Upstream bug class**: `distribution/distribution#3324`.
## 2026-04-19 — Bulk cleanup sweep (beads code-8hk + code-jh3c)
Same failure class, broader scope. The `registry-integrity-probe`
surfaced 38 broken manifest references persisting after the 04-19
infra-ci fix. `beads-dispatcher` + `beads-reaper` CronJobs were stuck
`ImagePullBackOff` on `claude-agent-service:0c24c9b6` for >6h. All 34
affected `repo:tag` pairs were OCI indexes whose `linux/amd64` child
manifests were absent from blob storage (same orphan pattern).
**Action taken**:
1. Bumped the `beads-server/main.tf` var default
   `claude_agent_service_image_tag` from `0c24c9b6` → `2fd7670d` (the
   canonical tag in `claude-agent-service/main.tf`; same image, already
   healthy on the registry). Ran `scripts/tg apply` on `beads-server`,
   then deleted the stuck Jobs so new CronJob ticks could fire.
2. Enumerated 34 broken `(repo, tag, parent_digest)` triples via HTTP
   probe using the `registry-probe-credentials` K8s Secret. Deleted
   each via `DELETE /v2/<repo>/manifests/<digest>` (33× 202, 1× 404 —
   `claude-agent-service:latest` pointed at an already-deleted digest);
   the deletion loop is sketched after this list.
3. Ran `docker exec registry-private /bin/registry garbage-collect
/etc/docker/registry/config.yml` — reclaimed ~3GB of orphan blob
storage.
4. Rebuilt the 3 in-use broken tags (all 3 OCI-index parents pointed
at missing children, so no cached copies would survive pod
reschedule):
- `freedify:latest` / `freedify:c803de02` — built on registry VM
directly (no CI pipeline exists for this image; python FastAPI).
- `beadboard:17a38e43` / `beadboard:latest` — GHA
`workflow_dispatch` failed at registry login (missing
`REGISTRY_USERNAME`/`REGISTRY_PASSWORD` GH secrets). Built on
registry VM directly as the fallback. GitHub secret gap is a
follow-up — beads `code-8hk` notes it.
- `priority-pass-backend:ae1420a0` / `priority-pass-frontend:ae1420a0`
— Woodpecker pipeline #8 on repo 81. Pipeline `kubectl set image`'d
the Deployment to `ae1420a0` (drift vs TF `v5`/`v8` defaults, but
that drift is pre-existing, not introduced by this cleanup).
- `wealthfolio-sync:latest` — **not rebuilt**. Monthly CronJob (next
run 2026-05-01), no source tree or CI pipeline available in the
monorepo; deferred for separate follow-up.
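The deletion loop from step 2, roughly (a sketch; `broken.tsv`, holding the enumerated tab-separated `repo`, `tag`, `parent_digest` triples, is an assumed intermediate file):

```bash
REG=https://registry.viktorbarzin.me:5050
AUTH="${REGISTRY_USER}:${REGISTRY_PASS}"   # from the registry-probe-credentials Secret

while IFS=$'\t' read -r repo tag digest; do
  code=$(curl -s -o /dev/null -w '%{http_code}' -u "$AUTH" -X DELETE \
    "$REG/v2/$repo/manifests/$digest")
  echo "$repo:$tag $digest -> $code"       # expect 202 Accepted; 404 = already gone
done < broken.tsv

# Then reclaim the now-unreferenced blobs (step 3):
# docker exec registry-private /bin/registry garbage-collect /etc/docker/registry/config.yml
```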
**Post-cleanup state**:
- Probe: 39 tags, 0 failures. `registry_manifest_integrity_failures{} = 0`.
- Alert `RegistryManifestIntegrityFailure` cleared (was firing for
5h 32m).
- No `ImagePullBackOff` pods anywhere in the cluster.
- 28 of 34 deleted manifests were **dangling tags not referenced by any
workload** — old `382d6b1*`, `v2`-`v7`, `yt-fallback`, etc. Safe
deletes, no rebuilds needed.
**Permanent fix still in flight**: Phase 2/3 of this post-mortem
(post-push verification in CI, atomic `cleanup-tags.sh`) — not
addressed by this cleanup. The probe continues to be the
authoritative detector.