Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e/6371e75/c113be4) restored green
CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 (Detection):
- .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
  step walks the just-pushed manifest (index + children + config + every
  layer blob) via HEAD and fails the pipeline on any non-200. Catches broken
  pushes at the source.
- stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
  three alerts (RegistryManifestIntegrityFailure, RegistryIntegrityProbeStale,
  RegistryCatalogInaccessible), closing the "registry serves 404 for a tag
  that exists" gap that masked the incident for 2+ hours.
- docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
  timeline, monitoring gaps, permanent fix.

Phase 2 (Prevention):
- modules/docker-registry/docker-compose.yml: pin registry:2 →
  registry:2.8.3 across all six registry services. Removes the floating-tag
  footgun.
- modules/docker-registry/fix-broken-blobs.sh: new scan walks every
  _manifests/revisions/sha256/<digest> that is an image index and logs a
  loud WARNING when a referenced child blob is missing. Does NOT
  auto-delete: deleting a published image is a conscious decision.
  Layer-link scan preserved.

Phase 3 (Recovery):
- build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
  don't need a cosmetic Dockerfile edit (matches convention from
  pve-nfs-exports-sync.yml).
- docs/runbooks/registry-rebuild-image.md: exact command sequence for
  diagnosing + rebuilding after an orphan-index incident, plus a fallback
  for building directly on the registry VM if Woodpecker itself is down.
- docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
  cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
- Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
- Registry HA/replication (single-VM SPOF is a known architectural choice;
  Synology offsite covers RPO < 1 day).
- Diun exclude for registry:2: not applicable; Diun only watches k8s
  (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
- fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
  flags both orphan layer links and orphan OCI-index children.
- terraform fmt + validate on stacks/monitoring: success (only unrelated
  deprecation warnings).
- python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
  modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Runbook: Rebuild an Image After a Registry Orphan-Index Incident

Last updated: 2026-04-19

## When to use this

Pipelines that pull from `registry.viktorbarzin.me:5050` are failing with
messages like:

- `failed to resolve reference … : not found`
- `manifest unknown`
- `image can't be pulled` (Woodpecker exit 126)
- `error pulling image`: HEAD on a child manifest digest returns 404

…and `skopeo inspect --raw --tls-verify --creds "$USER:$PASS" docker://registry.viktorbarzin.me:5050/<image>:<tag>`
returns an OCI image index whose `manifests[].digest` references return 404
on the registry.

This is the **orphan OCI-index** failure mode documented in
`docs/post-mortems/2026-04-19-registry-orphan-index.md`. The fix is to
rebuild the affected image from source so the registry receives a fresh,
complete push.

If the symptom is different (e.g., registry container down, TLS expiry,
auth failure), use `docs/runbooks/registry-vm.md` instead.

## Phase 1: Confirm the diagnosis

From any host with `skopeo`, `curl`, and `jq`:

```sh
REG=registry.viktorbarzin.me:5050
IMAGE=infra-ci
TAG=latest
# USER and PASS must hold the registry credentials. Most shells preset
# $USER to your login name, so set both explicitly before running this.

# 1. Confirm the index exists.
skopeo inspect --tls-verify --creds "$USER:$PASS" \
  --raw "docker://$REG/$IMAGE:$TAG" | jq '.mediaType, .manifests[].digest'

# 2. HEAD each child. Any non-200 = confirmed orphan.
#    The Accept headers matter: registry:2 can answer 404 for an OCI
#    manifest when the client does not advertise the OCI media type.
for d in $(skopeo inspect --tls-verify --creds "$USER:$PASS" --raw \
    "docker://$REG/$IMAGE:$TAG" | jq -r '.manifests[].digest'); do
  code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
    -H 'Accept: application/vnd.oci.image.manifest.v1+json' \
    -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' \
    -I "https://$REG/v2/$IMAGE/manifests/$d")
  echo "$d → $code"
done
```

If every child is 200, the problem is elsewhere. Stop here and check
the registry VM, TLS, or auth.

The `registry-integrity-probe` CronJob in the `monitoring` namespace
runs this same check every 15 minutes across every tag in the catalog.
Its most recent run is also a fast way to see which image(s) are affected:

```sh
# Logs of the newest probe pod (sorted by creation time, newest last).
kubectl -n monitoring logs \
  "$(kubectl -n monitoring get pods -l job-name \
      --sort-by=.metadata.creationTimestamp -o name \
    | grep registry-integrity-probe | tail -1)"
```

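The per-child HEAD loop in this phase prints one line per digest and leaves the verdict to the reader. As a minimal sketch, the verdict can be automated by piping whitespace-separated `digest code` pairs into a small function. `report_orphans` is a hypothetical helper name, not something that exists in the repo.

```sh
# Hypothetical helper (not part of the repo): reads "digest http_code" pairs
# on stdin, prints every digest whose code is not 200, and returns non-zero
# if at least one such orphan was seen.
report_orphans() {
  orphans=0
  while read -r digest code; do
    [ -n "$digest" ] || continue            # skip blank lines
    if [ "$code" != "200" ]; then
      echo "orphan child: $digest ($code)"
      orphans=$((orphans + 1))
    fi
  done
  [ "$orphans" -eq 0 ]
}
```

To use it with the loop, change the loop's `echo` to print `"$d $code"` and pipe the whole loop into `report_orphans`; a non-zero exit status then confirms the orphan-index diagnosis.
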
## Phase 2: Rebuild

### Option A (preferred): rebuild via CI

Find the `build-*.yml` pipeline that produces the image:

| Image | Pipeline | Repo ID |
|---|---|---|
| `infra-ci` | `.woodpecker/build-ci-image.yml` | 1 (infra) |
| `infra` (cli) | `.woodpecker/build-cli.yml` | 1 (infra) |
| `k8s-portal` | `.woodpecker/k8s-portal.yml` | 1 (infra) |

Trigger a manual build. The Woodpecker API expects a numeric repo ID
(paths with `owner/name` return HTML):

```sh
WOODPECKER_TOKEN=$(vault kv get -field=woodpecker_admin_token secret/viktor)

# Kick off a manual build against master.
curl -s -X POST \
  -H "Authorization: Bearer $WOODPECKER_TOKEN" \
  -H "Content-Type: application/json" \
  "https://ci.viktorbarzin.me/api/repos/1/pipelines" \
  -d '{"branch":"master"}' | jq .number

# Follow the pipeline at https://ci.viktorbarzin.me/repos/1/pipeline/<number>
```

The pipeline's `verify-integrity` step walks every blob the push
references. If it passes, the registry now has a clean index; pull
consumers will recover on next attempt.

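The walk that `verify-integrity` performs (index, then each child manifest, then each child's config and layer blobs) is only described in prose. The sketch below covers the blob half of that walk by hand. It assumes `jq` is available and that each child is a standard OCI or Docker v2 image manifest. `manifest_blobs` and `blob_status` are hypothetical names, and this is not the pipeline's actual script.

```sh
# Hypothetical helper: reads one image manifest (JSON) on stdin and prints
# the digests of its config blob and every layer blob, one per line.
manifest_blobs() {
  jq -r '.config.digest, .layers[].digest'
}

# Hypothetical helper: HEAD one blob and print its HTTP status code.
# Arguments: registry host:port, image name, blob digest.
# Expects registry credentials in $USER and $PASS, as elsewhere in this runbook.
blob_status() {
  curl -s -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
    -I "https://$1/v2/$2/blobs/$3"
}
```

For each child digest listed in the index, fetch the manifest body (for example with `skopeo inspect --raw "docker://$REG/$IMAGE@<digest>"`), pipe it into `manifest_blobs`, and call `blob_status` on each result. Anything other than 200 means the push is incomplete.
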
### Option B (fallback): build on the registry VM

Only use this if Woodpecker itself is broken (its own pipeline runs
from the same `infra-ci` image, so a corrupted `infra-ci:latest` can
prevent Option A from recovering).

```sh
# The single-quoted script below runs on the VM, so $USER/$PASS inside it
# would expand there (to root and, most likely, nothing). Log in first from
# the local shell, sending the password over stdin, not the command line.
printf '%s' "$PASS" | ssh root@10.0.20.10 \
  "docker login -u '$USER' --password-stdin registry.viktorbarzin.me:5050"

ssh root@10.0.20.10 '
  set -e                 # stop at the first failing command
  cd "$(mktemp -d)"      # fresh directory, so a stale clone cannot collide
  git clone --depth 1 https://github.com/ViktorBarzin/infra
  cd infra/ci
  docker build -t registry.viktorbarzin.me:5050/infra-ci:manual -t registry.viktorbarzin.me:5050/infra-ci:latest .
  docker push registry.viktorbarzin.me:5050/infra-ci:manual
  docker push registry.viktorbarzin.me:5050/infra-ci:latest
'
```

Then re-run any pipelines that failed, either via Woodpecker UI → Restart, or:

```sh
curl -s -X POST \
  -H "Authorization: Bearer $WOODPECKER_TOKEN" \
  "https://ci.viktorbarzin.me/api/repos/1/pipelines/<failed-pipeline-number>"
```

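When an incident fails a whole run of pipelines (the 2026-04-19 one took out P366 through P376), restarting them one at a time gets tedious. The sketch below loops over a range using the same restart endpoint. `restart_pipelines` and the `DRY_RUN` switch are illustrative names, not existing tooling, and the function assumes `WOODPECKER_TOKEN` is already set.

```sh
# Hypothetical helper: restart every pipeline number from $1 to $2 (inclusive)
# in repo 1. With DRY_RUN=1 it only prints the URLs it would POST to.
restart_pipelines() {
  n=$1
  while [ "$n" -le "$2" ]; do
    url="https://ci.viktorbarzin.me/api/repos/1/pipelines/$n"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "POST $url"
    else
      curl -s -X POST -H "Authorization: Bearer $WOODPECKER_TOKEN" "$url" >/dev/null
    fi
    n=$((n + 1))
  done
}
```

Running `DRY_RUN=1 restart_pipelines 366 376` previews the requests. Drop `DRY_RUN` to send them, and only include pipeline numbers that actually failed on the image pull.
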
## Phase 3: Verify

```sh
# 1. Fetch the index straight from the registry (skopeo talks to the
#    registry directly, so no containerd cache is involved).
REG=registry.viktorbarzin.me:5050
skopeo inspect --tls-verify --creds "$USER:$PASS" \
  --raw "docker://$REG/infra-ci:latest" \
  | jq '.manifests[] | {digest, platform}'

# 2. HEAD every child digest; all should be 200.
broken=0
for d in $(skopeo inspect --tls-verify --creds "$USER:$PASS" --raw \
    "docker://$REG/infra-ci:latest" | jq -r '.manifests[].digest'); do
  code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
    -H 'Accept: application/vnd.oci.image.manifest.v1+json' \
    -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' \
    -I "https://$REG/v2/infra-ci/manifests/$d")
  [ "$code" = "200" ] || { echo "STILL BROKEN: $d → $code"; broken=1; }
done
# Only report success when no child failed.
[ "$broken" -eq 0 ] && echo "verified"

# 3. Trigger an ad-hoc probe run for good measure. Store the job name once:
#    calling $(date +%s) twice can produce two different names.
JOB=registry-integrity-probe-verify-$(date +%s)
kubectl -n monitoring create job --from=cronjob/registry-integrity-probe "$JOB"
kubectl -n monitoring logs -f -l "job-name=$JOB"
```

The `RegistryManifestIntegrityFailure` alert clears automatically when
the probe's next run returns zero failures.

## Phase 4: Investigate orphans

Once the immediate fix is in, check whether any OTHER images on the
registry have orphan children:

```sh
ssh root@10.0.20.10 'python3 /opt/registry/fix-broken-blobs.sh --dry-run 2>&1 | grep "ORPHAN INDEX"'
```

Each hit is a separate image that will eventually fail to pull. Rebuild
them in the same way (Option A preferred). If the list is long, open a
beads task. Do NOT batch-delete the indexes; that's a destructive
registry operation outside this runbook's scope.

## Related

- `docs/post-mortems/2026-04-19-registry-orphan-index.md`: why this
  happens.
- `docs/runbooks/registry-vm.md`: VM-level operations (DNS,
  `docker compose` restarts).
- `modules/docker-registry/fix-broken-blobs.sh`: the scanner itself; cron
  runs it nightly and after each GC.
- `stacks/monitoring/modules/monitoring/main.tf`:
  `registry_integrity_probe` CronJob definition.