[registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had
been garbage-collected out from under the index. Pipelines P366→P376 all
exited 126 "image can't be pulled". Hot fix (a05d63e/6371e75/c113be4)
restored green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00; registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan
indexes.

Phase 1 — Detection:
- .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
  step walks the just-pushed manifest (index + children + config + every
  layer blob) via HEAD and fails the pipeline on any non-200. Catches
  broken pushes at the source.
- stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
  three alerts — RegistryManifestIntegrityFailure,
  RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
  "registry serves 404 for a tag that exists" gap that masked the incident
  for 2+ hours.
- docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
  timeline, monitoring gaps, permanent fix.

Phase 2 — Prevention:
- modules/docker-registry/docker-compose.yml: pin registry:2 →
  registry:2.8.3 across all six registry services. Removes the floating-tag
  footgun.
- modules/docker-registry/fix-broken-blobs.sh: new scan walks every
  _manifests/revisions/sha256/<digest> that is an image index and logs a
  loud WARNING when a referenced child blob is missing. Does NOT
  auto-delete — deleting a published image is a conscious decision.
  Layer-link scan preserved.

Phase 3 — Recovery:
- build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
  don't need a cosmetic Dockerfile edit (matches convention from
  pve-nfs-exports-sync.yml).
- docs/runbooks/registry-rebuild-image.md: exact command sequence for
  diagnosing + rebuilding after an orphan-index incident, plus a fallback
  for building directly on the registry VM if Woodpecker itself is down.
- docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
  cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
- Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
- Registry HA/replication (single-VM SPOF is a known architectural choice;
  Synology offsite covers RPO < 1 day).
- Diun exclude for registry:2 — not applicable; Diun only watches k8s
  (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
- fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
  flags both orphan layer links and orphan OCI-index children.
- terraform fmt + validate on stacks/monitoring: success (only unrelated
  deprecation warnings).
- python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
  modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent df2c53db8d
commit 7cb44d7264
10 changed files with 779 additions and 41 deletions
docs/post-mortems/2026-04-19-registry-orphan-index.md (new file, 175 lines)

@@ -0,0 +1,175 @@
# Post-Mortem: Private Registry Orphan OCI-Index — Repeat Incident

| Field | Value |
|-------|-------|
| **Date** | 2026-04-19 (first occurrence 2026-04-13) |
| **Duration** | ~2 h 40 m of blocked CI on 2026-04-19 (first failure ~09:00, recovery 11:40); only detected via pipeline failures, at 11:15 |
| **Severity** | SEV2 — all infra CI pipelines using `infra-ci:latest` failed (P366 → P376 all exit 126 "image can't be pulled") |
| **Affected Services** | Every Woodpecker pipeline that starts with `image: registry.viktorbarzin.me:5050/infra-ci:latest` — `default.yml`, `build-cli.yml`, `renew-tls.yml`, `drift-detection.yml`, `provision-user.yml`, `k8s-portal.yml`, `postmortem-todos.yml`, `issue-automation.yml`, `pve-nfs-exports-sync.yml` |
| **Status** | Hot fix green (three commits: `a05d63ee`, `6371e75e`, `c113be4d` — URL fix + rebuild). This doc captures the permanent fix landed in the same branch. |

## Summary

On 2026-04-19 ~09:00 UTC, every infra CI pipeline started failing at the
`clone` step with "image can't be pulled". The image in question — the CI
toolchain image `registry.viktorbarzin.me:5050/infra-ci:latest` — resolved
to an OCI image index whose `linux/amd64` platform manifest
(`sha256:98f718c8…`) and its in-toto attestation (`sha256:27d5ab83…`)
returned **HTTP 404** from the private registry. The index record itself
still existed — it's the children that had been garbage-collected out from
under it.

This is the **second identical incident**: the same failure mode occurred
on 2026-04-13 against a different image. Both times the immediate fix was
to rebuild the image from scratch; both times the root cause was left
unaddressed.
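
Concretely, the asymmetry looks like this from any client (a minimal
reproduction sketch; `$CHILD` stands for the `linux/amd64` child digest
that `skopeo inspect --raw` reports, and `$USER`/`$PASS` are the registry
credentials):

```sh
REG=registry.viktorbarzin.me:5050
CHILD=sha256:…        # linux/amd64 child digest from `skopeo inspect --raw`

# The tag itself resolves: the index manifest is still on disk and served.
curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}\n' \
  -I "https://$REG/v2/infra-ci/manifests/latest"            # → 200

# The child the index references is gone (its blob was garbage-collected).
curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}\n' \
  -I "https://$REG/v2/infra-ci/manifests/$CHILD"            # → 404
```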
## Impact

- **User-facing**: all CI pipelines failed. No automated Terraform applies,
  no TLS renewal, no drift detection. Manual workflows (Woodpecker UI
  reruns) all failed with the same error.
- **Blast radius**: every pipeline that pulls `infra-ci`. Does NOT affect
  k8s workloads (those pull via containerd, which goes through the
  pull-through proxy on :5000/:5010 — a completely different code path).
- **Duration on 2026-04-19**: from the first P366 failure (~09:00) to the
  hot-fix rebuild restoring `:latest` (11:40) — roughly 2 h 40 m end to
  end, only ~25 min of it after detection at 11:15. Pipelines that had
  already been triggered queued up until the rebuild restored `:latest`.
- **Data loss**: none. The registry has the index object; the child
  manifests are reproducible by rebuilding the source image.
- **Monitoring gap**: nothing alerted. The only signal was the individual
  pipeline failures from Woodpecker. No Prometheus alert fires on "the
  registry served a 404 for a tag that exists".
## Timeline (UTC, 2026-04-19)

| Time | Event |
|------|-------|
| ~09:00 | P366 (`default.yml` on master) fails with exit 126. |
| 09:00–11:00 | P367, P368, … P376 all fail with the same error. Nobody pages — there's no alert configured. |
| 11:15 | User notices and investigates: `skopeo inspect` reveals the missing platform manifest. |
| 11:20 | Hot-fix phase begins: `a05d63ee` fixes a push-URL misalignment; `6371e75e` and `c113be4d` trigger a full rebuild. |
| 11:40 | Rebuild completes; `infra-ci:latest` resolves to a fresh, complete index. Pipelines green from P377 onward. |
| 11:45 | User requests a proper root-cause fix: "this is the second time — what's actually broken?" |
| 12:00 | Investigation begins (this document's work). |

## Root Cause Chain

```
[1] cleanup-tags.sh runs daily at 02:00 on the registry VM
    └─> For each repository, keeps the last 10 tags by mtime, rmtrees the rest.
        This walks `_manifests/tags/<tag>` directly, bypassing the registry API.
        │
        ├─> [2] registry:2 garbage-collect runs weekly (Sun 03:25 for the
        │       private registry). Walks live manifests through refcounts, but
        │       distribution/distribution#3324 showed this walker has historical
        │       bugs with OCI image-index children — it can decrement a shared
        │       child's refcount below 1 and delete the blob even while the
        │       index that references it is still referenced.
        │
        └─> [3] Result: the `infra-ci:latest` index is intact
                (`_manifests/revisions/sha256/<A>/data` present on disk), but
                its `.manifests[0].digest` — the `linux/amd64` child — points
                to a `blobs/sha256/98/98f718c8…/` whose `data` file is gone.

[pull] containerd resolves `infra-ci:latest`
    │
    ├─> GET /v2/infra-ci/manifests/latest → 200 OK, returns the index
    │
    └─> GET /v2/infra-ci/manifests/sha256:98f718c8… → 404 Not Found
        └─> containerd fails the pull with "manifest unknown"
            └─> woodpecker exit 126
```

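For contrast, a tag deletion that goes through the registry API keeps the
GC's refcounts consistent. A sketch of what that would look like (not what
this commit does — `cleanup-tags.sh` stays, see Out of Scope; assumes the
registry runs with `REGISTRY_STORAGE_DELETE_ENABLED=true` and `$TAG` is
illustrative):

```sh
# Resolve the tag to its digest via the Docker-Content-Digest header,
# then delete the manifest through the API instead of rm -rf on disk.
REG=registry.viktorbarzin.me:5050; IMAGE=infra-ci; TAG=old-tag
digest=$(curl -sk -u "$USER:$PASS" -I \
  -H 'Accept: application/vnd.oci.image.index.v1+json' \
  "https://$REG/v2/$IMAGE/manifests/$TAG" \
  | awk 'tolower($1) == "docker-content-digest:" {print $2}' | tr -d '\r')
curl -sk -u "$USER:$PASS" -X DELETE "https://$REG/v2/$IMAGE/manifests/$digest"
```
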
## Why Existing Remediation Missed It

1. **`fix-broken-blobs.sh` only scans layer links.** The existing cron
   walks `_layers/sha256/` and removes link files whose blob `data` is
   missing. It does NOT inspect `_manifests/revisions/sha256/` to see
   whether an image index's referenced children still exist. That is
   exactly the class of orphan this incident represents.
2. **The `registry:2` image tag was floating.** `docker-compose.yml` pinned
   only to `registry:2`. Whatever Docker Inc. last rebuilt as "v2-current"
   was running, with no version pin. Any regression in the upstream walker
   would silently swap in.
3. **No integrity monitoring.** Prometheus alerted on cache hit rate and
   registry-down, but nothing probed "are the manifests the registry
   advertises actually fetchable?"
4. **The CI pipeline didn't verify its own push.** `buildx --push` returns
   success as soon as it uploads. If a child blob upload was truncated to
   zero bytes or the client disconnected mid-push (a distinct cause from
   the GC failure mode, but the same on-disk symptom), nothing would
   notice until the next pull.

## Permanent Fix — Three Phases

### Phase 1 — Detection (ship today)

1. **Post-push integrity check** in `.woodpecker/build-ci-image.yml`.
   After `build-and-push`, a new step walks the just-pushed manifest
   (and every child of an image index) and HEADs every referenced blob.
   Any non-200 fails the pipeline immediately, catching broken pushes at
   the source rather than leaking them to consumers. (A sketch of the
   walk follows this list.)
2. **Prometheus alert `RegistryManifestIntegrityFailure`.** A new
   CronJob (`registry-integrity-probe`, every 15m, in the `monitoring`
   namespace) walks the private registry's catalog, HEADs every tag's
   manifest, follows each image index's children, and pushes
   `registry_manifest_integrity_failures` to Pushgateway. Accompanying
   alerts: `RegistryIntegrityProbeStale`, `RegistryCatalogInaccessible`.
3. **Post-mortem** — this document. Linked from
   `.claude/reference/service-catalog.md` via the new runbook.
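
A minimal sketch of what such a walk amounts to (illustrative only — the
real step lives in `.woodpecker/build-ci-image.yml` and its exact shape is
not reproduced here; `$USER`/`$PASS` are registry credentials):

```sh
#!/bin/sh
# Sketch: HEAD-walk a just-pushed tag. Exits non-zero on any missing piece.
set -u
REG=registry.viktorbarzin.me:5050
IMAGE=infra-ci
TAG=latest

head_code() { curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' -I "$1"; }

top=$(curl -sk -u "$USER:$PASS" \
  -H 'Accept: application/vnd.oci.image.index.v1+json' \
  "https://$REG/v2/$IMAGE/manifests/$TAG")

fail=0
# For an image index, check each child manifest plus its config and layers.
for d in $(printf '%s' "$top" | jq -r '.manifests[]?.digest'); do
  if [ "$(head_code "https://$REG/v2/$IMAGE/manifests/$d")" != 200 ]; then
    echo "BROKEN: child manifest $d"; fail=1; continue
  fi
  child=$(curl -sk -u "$USER:$PASS" \
    -H 'Accept: application/vnd.oci.image.manifest.v1+json' \
    "https://$REG/v2/$IMAGE/manifests/$d")
  for b in $(printf '%s' "$child" | jq -r '.config.digest, .layers[]?.digest'); do
    if [ "$(head_code "https://$REG/v2/$IMAGE/blobs/$b")" != 200 ]; then
      echo "BROKEN: blob $b (child $d)"; fail=1
    fi
  done
done
exit $fail
```

The probe CronJob runs the same walk across every tag in the catalog and,
instead of failing a pipeline, pushes the failure count to Pushgateway as
`registry_manifest_integrity_failures`.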
### Phase 2 — Prevention

4. **Pin `registry:2` → `registry:2.8.3`** in
   `modules/docker-registry/docker-compose.yml` (all six registry
   services). Removes the floating-tag footgun.
5. **Extend `fix-broken-blobs.sh`** to scan every
   `_manifests/revisions/sha256/<digest>` that is an image index and
   flag children whose blob `data` file is missing. The script prints a
   loud WARNING per orphan; it does not auto-delete the index, because
   deleting a published image is a conscious decision, not an automated
   repair. (A sketch of the scan follows this list.)
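
A minimal sketch of that scan, assuming the registry's default
filesystem-driver layout under `/var/lib/registry/docker/registry/v2`
(the path root and the WARNING wording are illustrative, not the script's
exact output):

```sh
#!/bin/sh
# Sketch: flag image indexes whose referenced child blobs are missing.
ROOT=/var/lib/registry/docker/registry/v2

blob_data() {  # map a bare hex digest to its blob data path
  printf '%s/blobs/sha256/%.2s/%s/data' "$ROOT" "$1" "$1"
}

find "$ROOT/repositories" -type d -path '*/_manifests/revisions/sha256/*' |
while read -r rev; do
  digest=$(basename "$rev")          # bare hex, no "sha256:" prefix
  data=$(blob_data "$digest")
  [ -f "$data" ] || continue         # missing revision blob: layer-link scan's job
  # Only image indexes / manifest lists carry a top-level .manifests array.
  jq -e '.manifests' "$data" >/dev/null 2>&1 || continue
  jq -r '.manifests[].digest' "$data" | sed 's/^sha256://' |
  while read -r child; do
    [ -f "$(blob_data "$child")" ] || \
      echo "WARNING: orphan index sha256:$digest missing child sha256:$child ($rev)"
  done
done
```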
### Phase 3 — Recovery tooling

6. **Manual event trigger** on `build-ci-image.yml`. Rebuilds no longer
   need a cosmetic Dockerfile edit — POST to the Woodpecker API or
   click "Run manually" in the UI.
7. **Runbook** `docs/runbooks/registry-rebuild-image.md` — the exact
   command sequence for the next time this happens, plus fallback paths.

## Out of Scope

- **Pull-through caches.** The DockerHub / GHCR mirrors on
  `:5000` / `:5010` are healthy (74.5% cache hit rate, no 404s). The
  orphan problem is private-registry-only. No changes to nginx or
  containerd `hosts.toml`.
- **Registry HA / replication.** Single-VM SPOF is a known
  architectural choice. Harbor or a replicated registry would solve
  more than this incident requires, at multi-day cost. Synology offsite
  snapshots already give RPO < 1 day.
- **Disabling `cleanup-tags.sh`.** Keeping storage bounded is still
  necessary; the fix is detection + rebuild, not "stop cleaning up".

## Lessons

- **Repeat incidents deserve root-cause work, not a third hot fix.** The
  2026-04-13 incident was closed when CI turned green. Without a probe
  and without a scan for orphan indexes, the next incident was
  inevitable — and it happened six days later against a different image.
- **"No alert fired, so it wasn't detected" is a monitoring gap, not an
  outage feature.** The registry was serving 404s for 2+ hours before
  anyone noticed, because our only signal was pipeline failures and our
  eyes were elsewhere. The new probe closes that gap.
- **CI pipelines should verify their own output.** The `buildx --push`
  success exit code is not a guarantee of pulled-back integrity — as
  this incident proves. A 30-second post-push HEAD walk is cheap
  insurance.

## Related

- **Prior incident (same failure mode, different image)**: memory `709` /
  `710` — 2026-04-13.
- **Runbook**: `docs/runbooks/registry-rebuild-image.md` (new).
- **Hot-fix commits**: `a05d63ee`, `6371e75e`, `c113be4d`.
- **Upstream bug class**: `distribution/distribution#3324`.

docs/runbooks/registry-rebuild-image.md (new file, 170 lines)

@@ -0,0 +1,170 @@
# Runbook: Rebuild an Image After a Registry Orphan-Index Incident

Last updated: 2026-04-19

## When to use this

Pipelines that pull from `registry.viktorbarzin.me:5050` are failing with
messages like:

- `failed to resolve reference … : not found`
- `manifest unknown`
- `image can't be pulled` (Woodpecker exit 126)
- `error pulling image`: HEAD on a child manifest digest returns 404

…and `skopeo inspect --tls-verify --creds "$USER:$PASS" docker://registry.viktorbarzin.me:5050/<image>:<tag>`
returns an OCI image index whose `manifests[].digest` references are 404
on the registry.

This is the **orphan OCI-index** failure mode documented in
`docs/post-mortems/2026-04-19-registry-orphan-index.md`. The fix is to
rebuild the affected image from source so the registry receives a fresh,
complete push.

If the symptom is different (e.g., registry container down, TLS expiry,
auth failure), use `docs/runbooks/registry-vm.md` instead.

## Phase 1 — Confirm the diagnosis

From any host with `skopeo`:

```sh
REG=registry.viktorbarzin.me:5050
IMAGE=infra-ci
TAG=latest

# 1. Confirm the index exists.
skopeo inspect --tls-verify --creds "$USER:$PASS" \
  --raw "docker://$REG/$IMAGE:$TAG" | jq '.mediaType, .manifests[].digest'

# 2. HEAD each child. Any non-200 = confirmed orphan.
for d in $(skopeo inspect --tls-verify --creds "$USER:$PASS" --raw \
    "docker://$REG/$IMAGE:$TAG" | jq -r '.manifests[].digest'); do
  code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
    -I "https://$REG/v2/$IMAGE/manifests/$d")
  echo "$d → $code"
done
```

If every child is 200, the problem is elsewhere — stop here and check
the registry VM, TLS, or auth.

The `registry-integrity-probe` CronJob in the `monitoring` namespace
runs this same check every 15 minutes across every tag in the catalog;
its last run is also a fast way to see which image(s) are affected:

```sh
kubectl -n monitoring logs \
  $(kubectl -n monitoring get pods -l job-name -o name \
    | grep registry-integrity-probe | head -1)
```

## Phase 2 — Rebuild

### Option A (preferred): rebuild via CI

Find the `build-*.yml` pipeline that produces the image:

| Image | Pipeline | Repo ID |
|---|---|---|
| `infra-ci` | `.woodpecker/build-ci-image.yml` | 1 (infra) |
| `infra` (cli) | `.woodpecker/build-cli.yml` | 1 (infra) |
| `k8s-portal` | `.woodpecker/k8s-portal.yml` | 1 (infra) |

Trigger a manual build. The Woodpecker API expects a numeric repo ID
(paths with `owner/name` return HTML):

```sh
WOODPECKER_TOKEN=$(vault kv get -field=woodpecker_admin_token secret/viktor)

# Kick off a manual build against master.
curl -s -X POST \
  -H "Authorization: Bearer $WOODPECKER_TOKEN" \
  -H "Content-Type: application/json" \
  "https://ci.viktorbarzin.me/api/repos/1/pipelines" \
  -d '{"branch":"master"}' | jq .number

# Follow the pipeline at https://ci.viktorbarzin.me/repos/1/pipeline/<number>
```

The pipeline's `verify-integrity` step walks every blob the push
references. If it passes, the registry now has a clean index; pull
consumers will recover on the next attempt.

### Option B (fallback): build on the registry VM

Only use this if Woodpecker itself is broken (its own pipeline runs
from the same `infra-ci` image, so a corrupted `infra-ci:latest` can
prevent Option A from recovering).

```sh
# Note: the single quotes mean $USER/$PASS expand on the VM, so export the
# registry credentials in the remote shell before the `docker login`.
ssh root@10.0.20.10 '
  cd /tmp
  git clone --depth 1 https://github.com/ViktorBarzin/infra
  cd infra/ci
  docker build -t registry.viktorbarzin.me:5050/infra-ci:manual -t registry.viktorbarzin.me:5050/infra-ci:latest .
  docker login -u "$USER" -p "$PASS" registry.viktorbarzin.me:5050
  docker push registry.viktorbarzin.me:5050/infra-ci:manual
  docker push registry.viktorbarzin.me:5050/infra-ci:latest
'
```

Then re-run any pipelines that failed — Woodpecker UI → Restart, or:

```sh
curl -s -X POST \
  -H "Authorization: Bearer $WOODPECKER_TOKEN" \
  "https://ci.viktorbarzin.me/api/repos/1/pipelines/<failed-pipeline-number>"
```

## Phase 3 — Verify

```sh
# 1. Inspect the image fresh from the registry (bypasses containerd's
#    cache) and check its index.
REG=registry.viktorbarzin.me:5050
skopeo inspect --tls-verify --creds "$USER:$PASS" \
  --raw "docker://$REG/infra-ci:latest" \
  | jq '.manifests[] | {digest, platform}'

# 2. HEAD every child digest — all should be 200.
for d in $(skopeo inspect --tls-verify --creds "$USER:$PASS" --raw \
    "docker://$REG/infra-ci:latest" | jq -r '.manifests[].digest'); do
  code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
    -I "https://$REG/v2/infra-ci/manifests/$d")
  [ "$code" = "200" ] || echo "STILL BROKEN: $d → $code"
done
echo "verified"

# 3. Kick off the next scheduled probe for good measure. Capture the job
#    name once — calling $(date +%s) twice would yield two different names,
#    so the log selector would never match the job that was created.
JOB=registry-integrity-probe-verify-$(date +%s)
kubectl -n monitoring create job --from=cronjob/registry-integrity-probe "$JOB"
kubectl -n monitoring logs -f "job/$JOB"
```

The `RegistryManifestIntegrityFailure` alert clears automatically when
the probe's next run returns zero failures.

## Phase 4 — Investigate orphans

Once the immediate fix is in, check whether any OTHER images on the
registry have orphan children:

```sh
# fix-broken-blobs.sh is a shell script, so run it with a shell.
ssh root@10.0.20.10 'bash /opt/registry/fix-broken-blobs.sh --dry-run 2>&1 | grep "ORPHAN INDEX"'
```

Each hit is a separate image that will eventually fail to pull. Rebuild
them in the same way (Option A preferred). If the list is long, open a
beads task — do NOT batch-delete the indexes; that's a destructive
registry operation outside this runbook's scope.

## Related

- `docs/post-mortems/2026-04-19-registry-orphan-index.md` — why this
  happens.
- `docs/runbooks/registry-vm.md` — VM-level operations (DNS,
  `docker compose` restarts).
- `modules/docker-registry/fix-broken-blobs.sh` — the scanner cron
  itself; runs nightly and after each GC.
- `stacks/monitoring/modules/monitoring/main.tf` —
  `registry_integrity_probe` CronJob definition.

docs/runbooks/registry-vm.md
@@ -145,3 +145,7 @@ ssh root@10.0.20.10 '

- `docs/architecture/dns.md` — resolver IP assignments per subnet.
- `.claude/CLAUDE.md` (at repo root) — notes on the private registry
  and `containerd` `hosts.toml` redirects.
- `docs/runbooks/registry-rebuild-image.md` — rebuild an image after an
  orphan OCI-index incident (different class of problem than DNS).
- `docs/post-mortems/2026-04-19-registry-orphan-index.md` — root cause
  + detection gaps behind the recurring missing-blob incidents.