247 lines
13 KiB
Markdown
247 lines
13 KiB
Markdown
|
|
# Post-Mortem: Private Registry Orphan OCI-Index — Repeat Incident
|
|||
|
|
|
|||
|
|
| Field | Value |
|
|||
|
|
|-------|-------|
|
|||
|
|
| **Date** | 2026-04-19 (first occurrence 2026-04-13) |
|
|||
|
|
| **Duration** | ~40 min of blocked CI each time; only detected via pipeline failures |
|
|||
|
|
| **Severity** | SEV2 — all infra CI pipelines using `infra-ci:latest` failed (P366 → P376 all exit 126 "image can't be pulled") |
|
|||
|
|
| **Affected Services** | Every Woodpecker pipeline that starts with `image: registry.viktorbarzin.me:5050/infra-ci:latest` — `default.yml`, `build-cli.yml`, `renew-tls.yml`, `drift-detection.yml`, `provision-user.yml`, `k8s-portal.yml`, `postmortem-todos.yml`, `issue-automation.yml`, `pve-nfs-exports-sync.yml` |
|
|||
|
|
| **Status** | Hot fix green (three commits: `a05d63ee`, `6371e75e`, `c113be4d` — URL fix + rebuild). This doc captures the permanent fix landed in the same branch. |
|
|||
|
|
|
|||
|
|
## Summary
|
|||
|
|
|
|||
|
|
On 2026-04-19 ~09:00 UTC, every infra CI pipeline started failing at the
|
|||
|
|
`clone` step with "image can't be pulled". The image in question — the CI
|
|||
|
|
toolchain image `registry.viktorbarzin.me:5050/infra-ci:latest` — resolved
|
|||
|
|
to an OCI image index whose `linux/amd64` platform manifest
|
|||
|
|
(`sha256:98f718c8…`) and its in-toto attestation
|
|||
|
|
(`sha256:27d5ab83…`) returned **HTTP 404** from the private registry.
|
|||
|
|
The index record itself still existed — it's the children that had been
|
|||
|
|
garbage-collected out from under it.
|
|||
|
|
|
|||
|
|
This is the **second identical incident**: the same failure mode occurred
|
|||
|
|
on 2026-04-13 against a different image. Both times the immediate fix was
|
|||
|
|
to rebuild the image from scratch; both times the root cause was left
|
|||
|
|
unaddressed.
|
|||
|
|
|
|||
|
|
## Impact
|
|||
|
|
|
|||
|
|
- **User-facing**: all CI pipelines failed. No automated Terraform applies,
|
|||
|
|
no TLS renewal, no drift detection. Manual workflows (Woodpecker UI
|
|||
|
|
reruns) all failed with the same error.
|
|||
|
|
- **Blast radius**: every pipeline that pulls `infra-ci`. Does NOT affect
|
|||
|
|
k8s workloads (those pull via containerd, which goes through the
|
|||
|
|
pull-through proxy on :5000/:5010 — a completely different code path).
|
|||
|
|
- **Duration on 2026-04-19**: from first P366 failure to the hot-fix
|
|||
|
|
commit `c113be4d` — roughly 40 min. Pipelines that had already been
|
|||
|
|
triggered queued up until the rebuild restored `:latest`.
|
|||
|
|
- **Data loss**: none. The registry has the index object; the child
|
|||
|
|
manifests are re-producible by rebuilding the source image.
|
|||
|
|
- **Monitoring gap**: nothing alerted. The only signal was the individual
|
|||
|
|
pipeline failures from Woodpecker. No Prometheus alert fires on "the
|
|||
|
|
registry served a 404 for a tag that exists".
|
|||
|
|
|
|||
|
|
## Timeline (UTC, 2026-04-19)
|
|||
|
|
|
|||
|
|
| Time | Event |
|
|||
|
|
|------|-------|
|
|||
|
|
| ~09:00 | P366 (`default.yml` on master) fails with exit 126. |
|
|||
|
|
| 09:00–11:00 | P367, P368, … P376 all fail with the same error. Nobody pages — there's no alert configured. |
|
|||
|
|
| 11:15 | User notices and investigates: `skopeo inspect` reveals the missing platform manifest. |
|
|||
|
|
| 11:20 | Hot fix phase begins: `a05d63ee` fixes a push-URL misalignment, `6371e75e` and `c113be4d` trigger a full rebuild. |
|
|||
|
|
| 11:40 | Rebuild completes; `infra-ci:latest` resolves to a fresh, complete index. Pipelines green from P377 onward. |
|
|||
|
|
| 11:45 | User requests a proper root-cause fix: "this is the second time — what's actually broken?" |
|
|||
|
|
| 12:00 | Investigation begins (this document's work). |
|
|||
|
|
|
|||
|
|
## Root Cause Chain
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
[1] cleanup-tags.sh runs daily at 02:00 on the registry VM
|
|||
|
|
└─> For each repository, keeps the last 10 tags by mtime, rmtrees the rest.
|
|||
|
|
This walks `_manifests/tags/<tag>` directly, bypassing the registry API.
|
|||
|
|
│
|
|||
|
|
├─> [2] Subtle on-disk asymmetry: a registry:2 tag rmtree removes
|
|||
|
|
│ BOTH the `_manifests/tags/<tag>/` dir AND — on 2.8.x — the
|
|||
|
|
│ per-repo revision-link files under
|
|||
|
|
│ `<repo>/_manifests/revisions/sha256/<child-digest>/link` for
|
|||
|
|
│ every child referenced by that tag's index. The raw blob data
|
|||
|
|
│ under `/var/lib/registry/docker/registry/v2/blobs/sha256/<.>/data`
|
|||
|
|
│ is NOT touched — GC owns that, and GC only runs Sunday.
|
|||
|
|
│
|
|||
|
|
├─> [3] If ANOTHER tag's index still references one of those same
|
|||
|
|
│ children (common — successive rebuilds share layers), the child
|
|||
|
|
│ blob survives. But the revision-link is gone, so the registry
|
|||
|
|
│ API can no longer map `<repo>/manifests/sha256:<child>` back
|
|||
|
|
│ to the blob. HEAD → 404, even though the bytes are on disk.
|
|||
|
|
│ distribution/distribution#3324 is the upstream class of this bug.
|
|||
|
|
│
|
|||
|
|
└─> [4] Result: the surviving index (e.g. `infra-ci:5319f03e`) is
|
|||
|
|
intact on disk, its children's blob data files are intact on
|
|||
|
|
disk, but HEAD `/v2/infra-ci/manifests/sha256:98f718c8…`
|
|||
|
|
returns 404. The registry has the bytes, but cannot find them
|
|||
|
|
through the API because the per-repo link bridge is gone.
|
|||
|
|
|
|||
|
|
[pull] containerd resolves `infra-ci:latest`
|
|||
|
|
│
|
|||
|
|
├─> GET /v2/infra-ci/manifests/latest → 200 OK, returns the index
|
|||
|
|
│
|
|||
|
|
└─> GET /v2/infra-ci/manifests/sha256:98f718c8… → 404 Not Found
|
|||
|
|
└─> containerd fails the pull with "manifest unknown"
|
|||
|
|
└─> woodpecker exit 126
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
> **Detection-gotcha** uncovered 2026-04-19 while implementing
|
|||
|
|
> `fix-broken-blobs.sh`: a scan that checks `/blobs/sha256/<child>/data` for
|
|||
|
|
> presence is NOT equivalent to "can the registry serve this child?" The
|
|||
|
|
> authoritative check is whether
|
|||
|
|
> `<repo>/_manifests/revisions/sha256/<child>/link` exists. The script
|
|||
|
|
> was rewritten to check the per-repo link file after the HTTP probe
|
|||
|
|
> caught 38 real orphans the filesystem scan had reported clean.
|
|||
|
|
|
|||
|
|
## Why Existing Remediation Missed It
|
|||
|
|
|
|||
|
|
1. **`fix-broken-blobs.sh` only scans layer links.** The existing cron
|
|||
|
|
walks `_layers/sha256/` and removes link files whose blob `data` is
|
|||
|
|
missing. It does NOT inspect `_manifests/revisions/sha256/` to see
|
|||
|
|
whether an image-index's referenced children still exist. That's
|
|||
|
|
exactly the class of orphan this incident represents.
|
|||
|
|
2. **`registry:2` image tag was floating.** `docker-compose.yml` pinned
|
|||
|
|
only to `registry:2`. Whatever Docker Inc. last rebuilt as
|
|||
|
|
"v2-current" was running, with no version pin. Any regression in
|
|||
|
|
the upstream walker would silently swap in.
|
|||
|
|
3. **No integrity monitoring.** Prometheus alerted on cache hit rate
|
|||
|
|
and registry-down, but nothing probes "are the manifests the registry
|
|||
|
|
advertises actually fetchable?"
|
|||
|
|
4. **CI pipeline didn't verify its own push.** `buildx --push` returns
|
|||
|
|
success as soon as it uploads. If a child blob upload 0-byted or
|
|||
|
|
the client disconnected mid-push (distinct from the GC mode but the
|
|||
|
|
same on-disk symptom), nothing would notice until the next pull.
|
|||
|
|
|
|||
|
|
## Permanent Fix — Three Phases
|
|||
|
|
|
|||
|
|
### Phase 1 — Detection (ship today)
|
|||
|
|
|
|||
|
|
1. **Post-push integrity check** in `.woodpecker/build-ci-image.yml`.
|
|||
|
|
After `build-and-push`, a new step walks the just-pushed manifest
|
|||
|
|
(and every child of an image index) and HEADs every referenced blob.
|
|||
|
|
Any non-200 fails the pipeline immediately, catching broken pushes at
|
|||
|
|
the source rather than leaking them to consumers.
|
|||
|
|
2. **Prometheus alert `RegistryManifestIntegrityFailure`.** A new
|
|||
|
|
CronJob (`registry-integrity-probe`, every 15m, in the `monitoring`
|
|||
|
|
namespace) walks the private registry's catalog, HEADs every tag's
|
|||
|
|
manifest, follows each image index's children, and pushes
|
|||
|
|
`registry_manifest_integrity_failures` to Pushgateway. Accompanying
|
|||
|
|
alerts: `RegistryIntegrityProbeStale`, `RegistryCatalogInaccessible`.
|
|||
|
|
3. **Post-mortem** — this document. Linked from
|
|||
|
|
`.claude/reference/service-catalog.md` via the new runbook.
|
|||
|
|
|
|||
|
|
### Phase 2 — Prevention
|
|||
|
|
|
|||
|
|
4. **Pin `registry:2` → `registry:2.8.3`** in
|
|||
|
|
`modules/docker-registry/docker-compose.yml` (all six registry
|
|||
|
|
services). Removes the floating-tag footgun.
|
|||
|
|
5. **Extend `fix-broken-blobs.sh`** to scan every
|
|||
|
|
`_manifests/revisions/sha256/<digest>` that is an image index and
|
|||
|
|
flag children whose blob `data` file is missing. The script prints a
|
|||
|
|
loud WARNING per orphan; it does not auto-delete the index, because
|
|||
|
|
deleting a published image is a conscious decision, not an automated
|
|||
|
|
repair.
|
|||
|
|
|
|||
|
|
### Phase 3 — Recovery tooling
|
|||
|
|
|
|||
|
|
6. **Manual event trigger** on `build-ci-image.yml`. Rebuilds no longer
|
|||
|
|
need a cosmetic Dockerfile edit — POST to the Woodpecker API or
|
|||
|
|
click "Run manually" in the UI.
|
|||
|
|
7. **Runbook** `docs/runbooks/registry-rebuild-image.md` — exact
|
|||
|
|
command sequence for the next time this happens, plus fallback paths.
|
|||
|
|
|
|||
|
|
## Out of Scope
|
|||
|
|
|
|||
|
|
- **Pull-through caches.** The DockerHub / GHCR mirrors on
|
|||
|
|
`:5000` / `:5010` are healthy (74.5% cache hit rate, no 404s). The
|
|||
|
|
orphan problem is private-registry-only. No changes to nginx or
|
|||
|
|
containerd `hosts.toml`.
|
|||
|
|
- **Registry HA / replication.** Single-VM SPOF is a known
|
|||
|
|
architectural choice. Harbor or a replicated registry would solve
|
|||
|
|
more than this incident requires, at multi-day cost. Synology offsite
|
|||
|
|
snapshots already give RPO < 1 day.
|
|||
|
|
- **Disabling `cleanup-tags.sh`.** Keeping storage bounded is still
|
|||
|
|
necessary; the fix is detection + rebuild, not "stop cleaning up".
|
|||
|
|
|
|||
|
|
## Lessons
|
|||
|
|
|
|||
|
|
- **Repeat incidents deserve root-cause work, not a third hot-fix.** The
|
|||
|
|
2026-04-13 incident was closed when CI turned green. Without a probe
|
|||
|
|
and without a scan for orphan indexes, the next incident was
|
|||
|
|
inevitable — and it happened six days later against a different image.
|
|||
|
|
- **"No alert fired, so it wasn't detected" is a monitoring gap, not an
|
|||
|
|
outage feature.** The registry was serving 404s for 2+ hours before
|
|||
|
|
anyone noticed, because our only signal was "pipeline failures" and
|
|||
|
|
our eyes were elsewhere. The new probe closes that gap.
|
|||
|
|
- **CI pipelines should verify their own output.** The `buildx --push`
|
|||
|
|
"success" exit code is not a guarantee of pulled-back integrity — as
|
|||
|
|
this incident proves. A 30-second post-push HEAD walk is cheap
|
|||
|
|
insurance.
|
|||
|
|
|
|||
|
|
## Related
|
|||
|
|
|
|||
|
|
- **Prior incident (same failure mode, different image)**: memory `709`
|
|||
|
|
/ `710` — 2026-04-13.
|
|||
|
|
- **Runbook**: `docs/runbooks/registry-rebuild-image.md` (new).
|
|||
|
|
- **Hot-fix commits**: `a05d63ee`, `6371e75e`, `c113be4d`.
|
|||
|
|
- **Upstream bug class**: `distribution/distribution#3324`.
|
|||
|
|
|
|||
|
|
## 2026-04-19 — Bulk cleanup sweep (beads code-8hk + code-jh3c)
|
|||
|
|
|
|||
|
|
Same failure class, broader scope. The `registry-integrity-probe`
|
|||
|
|
surfaced 38 broken manifest references persisting after the 04-19
|
|||
|
|
infra-ci fix. `beads-dispatcher` + `beads-reaper` CronJobs were stuck
|
|||
|
|
`ImagePullBackOff` on `claude-agent-service:0c24c9b6` for >6h. All 34
|
|||
|
|
affected `repo:tag` pairs were OCI indexes whose `linux/amd64` child
|
|||
|
|
manifests were absent from blob storage (same orphan pattern).
|
|||
|
|
|
|||
|
|
**Action taken**:
|
|||
|
|
1. Bumped `beads-server/main.tf` var default `claude_agent_service_image_tag`
|
|||
|
|
from `0c24c9b6` → `2fd7670d` (the canonical tag in
|
|||
|
|
`claude-agent-service/main.tf`), reused — same image already healthy
|
|||
|
|
on the registry. `scripts/tg apply` on `beads-server`. Deleted the
|
|||
|
|
stuck Jobs so new CronJob ticks could fire.
|
|||
|
|
2. Enumerated 34 broken `(repo, tag, parent_digest)` triples via HTTP
|
|||
|
|
probe using `registry-probe-credentials` K8s Secret. Deleted each
|
|||
|
|
via `DELETE /v2/<repo>/manifests/<digest>` (33× 202, 1× 404 —
|
|||
|
|
claude-agent-service:latest pointed at an already-deleted digest).
|
|||
|
|
3. Ran `docker exec registry-private /bin/registry garbage-collect
|
|||
|
|
/etc/docker/registry/config.yml` — reclaimed ~3GB of orphan blob
|
|||
|
|
storage.
|
|||
|
|
4. Rebuilt the 3 in-use broken tags (all 3 OCI-index parents pointed
|
|||
|
|
at missing children, so no cached copies would survive pod
|
|||
|
|
reschedule):
|
|||
|
|
- `freedify:latest` / `freedify:c803de02` — built on registry VM
|
|||
|
|
directly (no CI pipeline exists for this image; python FastAPI).
|
|||
|
|
- `beadboard:17a38e43` / `beadboard:latest` — GHA
|
|||
|
|
`workflow_dispatch` failed at registry login (missing
|
|||
|
|
`REGISTRY_USERNAME`/`REGISTRY_PASSWORD` GH secrets). Built on
|
|||
|
|
registry VM directly as the fallback. GitHub secret gap is a
|
|||
|
|
follow-up — beads `code-8hk` notes it.
|
|||
|
|
- `priority-pass-backend:ae1420a0` / `priority-pass-frontend:ae1420a0`
|
|||
|
|
— Woodpecker pipeline #8 on repo 81. Pipeline `kubectl set image`'d
|
|||
|
|
the Deployment to `ae1420a0` (drift vs TF `v5`/`v8` defaults, but
|
|||
|
|
that drift is pre-existing, not introduced by this cleanup).
|
|||
|
|
- `wealthfolio-sync:latest` — **not rebuilt**. Monthly CronJob (next
|
|||
|
|
run 2026-05-01), no source tree or CI pipeline available in the
|
|||
|
|
monorepo; deferred for separate follow-up.
|
|||
|
|
|
|||
|
|
**Post-cleanup state**:
|
|||
|
|
- Probe: 39 tags, 0 failures. `registry_manifest_integrity_failures{} = 0`.
|
|||
|
|
- Alert `RegistryManifestIntegrityFailure` cleared (was firing for
|
|||
|
|
5h 32m).
|
|||
|
|
- No `ImagePullBackOff` pods anywhere in the cluster.
|
|||
|
|
- 28 of 34 deleted manifests were **dangling tags not referenced by any
|
|||
|
|
workload** — old `382d6b1*`, `v2`-`v7`, `yt-fallback`, etc. Safe
|
|||
|
|
deletes, no rebuilds needed.
|
|||
|
|
|
|||
|
|
**Permanent fix still in flight**: Phase 2/3 of this post-mortem
|
|||
|
|
(post-push verification in CI, atomic `cleanup-tags.sh`) — not
|
|||
|
|
addressed by this cleanup. The probe continues to be the
|
|||
|
|
authoritative detector.
|