[registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery

Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, and registry:2's GC (Sunday 03:25) then walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source.
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Viktor Barzin 2026-04-19 17:08:28 +00:00
parent df2c53db8d
commit 7cb44d7264
10 changed files with 779 additions and 41 deletions


@@ -137,3 +137,18 @@ jellyfin, jellyseerr, tdarr, affine, health, family, openclaw
- `*.viktor.actualbudget` - Actualbudget factory instances
- `*.freedify` - Freedify factory instances
- `mailserver.*` - Mail server components (antispam, admin)
## Key Runbooks
Operational surfaces that aren't k8s services (VMs, pipelines, host-side
procedures) are documented in `infra/docs/runbooks/`:
| Surface | Runbook |
|---|---|
| Private Docker registry VM (10.0.20.10) | [registry-vm.md](../../docs/runbooks/registry-vm.md) |
| Rebuild after orphan-index incident | [registry-rebuild-image.md](../../docs/runbooks/registry-rebuild-image.md) |
| PVE host operations (backups, LVM) | [proxmox-host.md](../../docs/runbooks/proxmox-host.md) |
| NFS prerequisites and CSI mount options | [nfs-prerequisites.md](../../docs/runbooks/nfs-prerequisites.md) |
| pfSense + Unbound DNS | [pfsense-unbound.md](../../docs/runbooks/pfsense-unbound.md) |
| Mailserver PROXY-protocol / HAProxy | [mailserver-pfsense-haproxy.md](../../docs/runbooks/mailserver-pfsense-haproxy.md) |
| Technitium apply flow | [technitium-apply.md](../../docs/runbooks/technitium-apply.md) |


@@ -1,12 +1,14 @@
# Build the CI tools Docker image used by all infra pipelines.
# Triggers on push that touches ci/Dockerfile, or manual (API/UI) so
# rebuilds after a registry incident don't need a cosmetic Dockerfile edit.
when:
  - event: push
    branch: master
    path:
      include:
        - 'ci/Dockerfile'
  - event: manual

steps:
  - name: build-and-push
@@ -27,6 +29,83 @@
      password:
        from_secret: registry_password
# Post-push integrity check. Re-resolves the image we just pushed and HEADs
# every blob it references — top-level manifest (index or single), each child
# platform manifest, each config blob, each layer blob. If any returns !=200
# the pipeline fails loudly here so we never ship a broken index downstream.
# Historical context: 2026-04-13 and 2026-04-19 incidents both shipped indexes
# whose platform/attestation children had been GC-orphaned on the registry VM.
- name: verify-integrity
image: alpine:3.20
environment:
REG_USER:
from_secret: registry_user
REG_PASS:
from_secret: registry_password
commands:
- apk add --no-cache curl jq
- REG=registry.viktorbarzin.me:5050
- REPO=infra-ci
- SHA=${CI_COMMIT_SHA:0:8}
- AUTH="$REG_USER:$REG_PASS"
- |
set -euo pipefail
ACCEPT='Accept: application/vnd.oci.image.index.v1+json,application/vnd.oci.image.manifest.v1+json,application/vnd.docker.distribution.manifest.list.v2+json,application/vnd.docker.distribution.manifest.v2+json'
fetch_manifest() {
# Prints the body to $2, returns the HTTP code as stdout.
curl -sk -u "$AUTH" -H "$ACCEPT" \
-o "$2" -w '%{http_code}' \
"https://$REG/v2/$REPO/manifests/$1"
}
head_blob() {
curl -sk -u "$AUTH" -o /dev/null -w '%{http_code}' \
-I "https://$REG/v2/$REPO/blobs/$1"
}
verify_single_manifest() {
local ref="$1" tmp=/tmp/m-$$.json
local rc cfg
rc=$(fetch_manifest "$ref" "$tmp")
if [ "$rc" != "200" ]; then
echo "FAIL: manifest $ref returned HTTP $rc"; return 1
fi
cfg=$(jq -r '.config.digest // empty' "$tmp")
if [ -n "$cfg" ]; then
rc=$(head_blob "$cfg")
[ "$rc" = "200" ] || { echo "FAIL: config blob $cfg returned HTTP $rc"; return 1; }
fi
jq -r '.layers[]?.digest' "$tmp" > /tmp/layers-$$.txt
while IFS= read -r layer; do
[ -z "$layer" ] && continue
rc=$(head_blob "$layer")
[ "$rc" = "200" ] || { echo "FAIL: layer blob $layer returned HTTP $rc"; return 1; }
done < /tmp/layers-$$.txt
return 0
}
echo "=== Verifying push integrity for $REPO:$SHA ==="
TOP=/tmp/top-$$.json
rc=$(fetch_manifest "$SHA" "$TOP")
[ "$rc" = "200" ] || { echo "FAIL: top manifest :$SHA returned HTTP $rc"; exit 1; }
MT=$(jq -r '.mediaType // empty' "$TOP")
echo "Top-level media type: ${MT:-<unset>}"
if echo "$MT" | grep -Eq 'manifest\.list|image\.index'; then
jq -r '.manifests[].digest' "$TOP" > /tmp/children-$$.txt
echo "Multi-platform index: $(wc -l </tmp/children-$$.txt) child manifest(s)"
while IFS= read -r d; do
echo "--- child $d ---"
verify_single_manifest "$d" || exit 1
done < /tmp/children-$$.txt
else
echo "Single-platform manifest — verifying directly"
verify_single_manifest "$SHA" || exit 1
fi
echo "=== All manifests + blobs verified. Push integrity intact. ==="
  - name: slack
    image: curlimages/curl
    commands:


@@ -0,0 +1,175 @@
# Post-Mortem: Private Registry Orphan OCI-Index — Repeat Incident
| Field | Value |
|-------|-------|
| **Date** | 2026-04-19 (first occurrence 2026-04-13) |
| **Duration** | Blocked CI from ~09:00 to 11:40 UTC on 2026-04-19 (undetected for 2+ hours); only detected via pipeline failures |
| **Severity** | SEV2 — all infra CI pipelines using `infra-ci:latest` failed (P366 → P376 all exit 126 "image can't be pulled") |
| **Affected Services** | Every Woodpecker pipeline that starts with `image: registry.viktorbarzin.me:5050/infra-ci:latest`: `default.yml`, `build-cli.yml`, `renew-tls.yml`, `drift-detection.yml`, `provision-user.yml`, `k8s-portal.yml`, `postmortem-todos.yml`, `issue-automation.yml`, `pve-nfs-exports-sync.yml` |
| **Status** | Hot fix green (three commits: `a05d63ee`, `6371e75e`, `c113be4d` — URL fix + rebuild). This doc captures the permanent fix landed in the same branch. |
## Summary
On 2026-04-19 ~09:00 UTC, every infra CI pipeline started failing at the
`clone` step with "image can't be pulled". The image in question — the CI
toolchain image `registry.viktorbarzin.me:5050/infra-ci:latest` — resolved
to an OCI image index whose `linux/amd64` platform manifest
(`sha256:98f718c8…`) and its in-toto attestation
(`sha256:27d5ab83…`) returned **HTTP 404** from the private registry.
The index record itself still existed — it's the children that had been
garbage-collected out from under it.
This is the **second identical incident**: the same failure mode occurred
on 2026-04-13 against a different image. Both times the immediate fix was
to rebuild the image from scratch; both times the root cause was left
unaddressed.
## Impact
- **User-facing**: all CI pipelines failed. No automated Terraform applies,
no TLS renewal, no drift detection. Manual workflows (Woodpecker UI
reruns) all failed with the same error.
- **Blast radius**: every pipeline that pulls `infra-ci`. Does NOT affect
k8s workloads (those pull via containerd, which goes through the
pull-through proxy on :5000/:5010 — a completely different code path).
- **Duration on 2026-04-19**: first P366 failure at ~09:00, detection at 11:15,
  hot-fix commit `c113be4d` and rebuild restoring `:latest` by 11:40: roughly
  2 h 40 min of blocked CI end to end, ~25 min from detection to fix. Pipelines
  that had already been triggered queued up until the rebuild restored `:latest`.
- **Data loss**: none. The registry has the index object; the child
manifests are re-producible by rebuilding the source image.
- **Monitoring gap**: nothing alerted. The only signal was the individual
pipeline failures from Woodpecker. No Prometheus alert fires on "the
registry served a 404 for a tag that exists".
## Timeline (UTC, 2026-04-19)
| Time | Event |
|------|-------|
| ~09:00 | P366 (`default.yml` on master) fails with exit 126. |
| 09:00–11:00 | P367, P368, … P376 all fail with the same error. Nobody pages — there's no alert configured. |
| 11:15 | User notices and investigates: `skopeo inspect` reveals the missing platform manifest. |
| 11:20 | Hot fix phase begins: `a05d63ee` fixes a push-URL misalignment, `6371e75e` and `c113be4d` trigger a full rebuild. |
| 11:40 | Rebuild completes; `infra-ci:latest` resolves to a fresh, complete index. Pipelines green from P377 onward. |
| 11:45 | User requests a proper root-cause fix: "this is the second time — what's actually broken?" |
| 12:00 | Investigation begins (this document's work). |
## Root Cause Chain
```
[1] cleanup-tags.sh runs daily at 02:00 on the registry VM
└─> For each repository, keeps the last 10 tags by mtime, rmtrees the rest.
This walks `_manifests/tags/<tag>` directly, bypassing the registry API.
├─> [2] registry:2 garbage-collect runs weekly (Sun 03:25 for the
│ private registry). Walks live manifests through refcounts, but
│ distribution/distribution#3324 showed this walker has historical
│ bugs with OCI image-index children — it can decrement a shared
│ child's refcount below 1 and delete the blob even while the
│ index that references it is still referenced.
└─> [3] Result: the `infra-ci:latest` index is intact
(`_manifests/revisions/sha256/<A>/data` present on disk), but
its `.manifests[0].digest` — the `linux/amd64` child — points
to a `blobs/sha256/98/98f718c8…/` whose `data` file is gone.
[pull] containerd resolves `infra-ci:latest`
├─> GET /v2/infra-ci/manifests/latest → 200 OK, returns the index
└─> GET /v2/infra-ci/manifests/sha256:98f718c8… → 404 Not Found
└─> containerd fails the pull with "manifest unknown"
└─> woodpecker exit 126
```
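For reference, the pruning in step [1] amounts to roughly the sketch below. The real `cleanup-tags.sh` lives on the registry VM and is not reproduced in this repo; treat the loop as illustrative of the mechanism (keep the last 10 tags by mtime, `rm -rf` the rest, no registry API involved), not as the actual script.
```sh
# Illustrative sketch only: approximates the behaviour described in step [1].
DATA=/opt/registry/data   # same base directory that fix-broken-blobs.sh scans
KEEP=10                   # "keep the last 10 tags by mtime"
for repo in "$DATA"/*/docker/registry/v2/repositories/*; do
  tags_dir="$repo/_manifests/tags"
  [ -d "$tags_dir" ] || continue
  # Newest first by mtime; everything past the first $KEEP is deleted on disk,
  # bypassing the registry API entirely, which is what sets up the GC mis-walk in [2].
  ls -1t "$tags_dir" | tail -n +$((KEEP + 1)) | while read -r tag; do
    rm -rf "$tags_dir/$tag"
  done
done
```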
## Why Existing Remediation Missed It
1. **`fix-broken-blobs.sh` only scans layer links.** The existing cron
walks `_layers/sha256/` and removes link files whose blob `data` is
missing. It does NOT inspect `_manifests/revisions/sha256/` to see
whether an image-index's referenced children still exist. That's
exactly the class of orphan this incident represents.
2. **`registry:2` image tag was floating.** `docker-compose.yml` pinned
only to `registry:2`. Whatever Docker Inc. last rebuilt as
"v2-current" was running, with no version pin. Any regression in
the upstream walker would silently swap in.
3. **No integrity monitoring.** Prometheus alerted on cache hit rate
and registry-down, but nothing probes "are the manifests the registry
advertises actually fetchable?"
4. **CI pipeline didn't verify its own push.** `buildx --push` returns
success as soon as it uploads. If a child blob upload 0-byted or
the client disconnected mid-push (distinct from the GC mode but the
same on-disk symptom), nothing would notice until the next pull.
## Permanent Fix — Three Phases
### Phase 1 — Detection (ship today)
1. **Post-push integrity check** in `.woodpecker/build-ci-image.yml`.
After `build-and-push`, a new step walks the just-pushed manifest
(and every child of an image index) and HEADs every referenced blob.
Any non-200 fails the pipeline immediately, catching broken pushes at
the source rather than leaking them to consumers.
2. **Prometheus alert `RegistryManifestIntegrityFailure`.** A new
CronJob (`registry-integrity-probe`, every 15m, in the `monitoring`
namespace) walks the private registry's catalog, HEADs every tag's
manifest, follows each image index's children, and pushes
`registry_manifest_integrity_failures` to Pushgateway. Accompanying
alerts: `RegistryIntegrityProbeStale`, `RegistryCatalogInaccessible`.
3. **Post-mortem** — this document. Linked from
`.claude/reference/service-catalog.md` via the new runbook.
### Phase 2 — Prevention
4. **Pin `registry:2` → `registry:2.8.3`** in
`modules/docker-registry/docker-compose.yml` (all six registry
services). Removes the floating-tag footgun.
5. **Extend `fix-broken-blobs.sh`** to scan every
`_manifests/revisions/sha256/<digest>` that is an image index and
flag children whose blob `data` file is missing. The script prints a
loud WARNING per orphan; it does not auto-delete the index, because
deleting a published image is a conscious decision, not an automated
repair.
### Phase 3 — Recovery tooling
6. **Manual event trigger** on `build-ci-image.yml`. Rebuilds no longer
need a cosmetic Dockerfile edit — POST to the Woodpecker API or
click "Run manually" in the UI.
7. **Runbook** `docs/runbooks/registry-rebuild-image.md` — exact
command sequence for the next time this happens, plus fallback paths.
## Out of Scope
- **Pull-through caches.** The DockerHub / GHCR mirrors on
`:5000` / `:5010` are healthy (74.5% cache hit rate, no 404s). The
orphan problem is private-registry-only. No changes to nginx or
containerd `hosts.toml`.
- **Registry HA / replication.** Single-VM SPOF is a known
architectural choice. Harbor or a replicated registry would solve
more than this incident requires, at multi-day cost. Synology offsite
snapshots already give RPO < 1 day.
- **Disabling `cleanup-tags.sh`.** Keeping storage bounded is still
necessary; the fix is detection + rebuild, not "stop cleaning up".
## Lessons
- **Repeat incidents deserve root-cause work, not a third hot-fix.** The
2026-04-13 incident was closed when CI turned green. Without a probe
and without a scan for orphan indexes, the next incident was
inevitable — and it happened six days later against a different image.
- **"No alert fired, so it wasn't detected" is a monitoring gap, not an
outage feature.** The registry was serving 404s for 2+ hours before
anyone noticed, because our only signal was "pipeline failures" and
our eyes were elsewhere. The new probe closes that gap.
- **CI pipelines should verify their own output.** The `buildx --push`
"success" exit code is not a guarantee of pulled-back integrity — as
this incident proves. A 30-second post-push HEAD walk is cheap
insurance.
## Related
- **Prior incident (same failure mode, different image)**: memory `709`
/ `710` — 2026-04-13.
- **Runbook**: `docs/runbooks/registry-rebuild-image.md` (new).
- **Hot-fix commits**: `a05d63ee`, `6371e75e`, `c113be4d`.
- **Upstream bug class**: `distribution/distribution#3324`.


@@ -0,0 +1,170 @@
# Runbook: Rebuild an Image After a Registry Orphan-Index Incident
Last updated: 2026-04-19
## When to use this
Pipelines that pull from `registry.viktorbarzin.me:5050` are failing with
messages like:
- `failed to resolve reference … : not found`
- `manifest unknown`
- `image can't be pulled` (Woodpecker exit 126)
- `error pulling image`: HEAD on a child manifest digest returns 404
…and `skopeo inspect --tls-verify --creds "$USER:$PASS" docker://registry.viktorbarzin.me:5050/<image>:<tag>`
returns an OCI image index whose `manifests[].digest` references are 404
on the registry.
This is the **orphan OCI-index** failure mode documented in
`docs/post-mortems/2026-04-19-registry-orphan-index.md`. The fix is to
rebuild the affected image from source so the registry receives a fresh,
complete push.
If the symptom is different (e.g., registry container down, TLS expiry,
auth failure), use `docs/runbooks/registry-vm.md` instead.
## Phase 1 — Confirm the diagnosis
From any host with `skopeo`:
```sh
REG=registry.viktorbarzin.me:5050
IMAGE=infra-ci
TAG=latest
# 1. Confirm the index exists.
skopeo inspect --tls-verify --creds "$USER:$PASS" \
--raw "docker://$REG/$IMAGE:$TAG" | jq '.mediaType, .manifests[].digest'
# 2. HEAD each child. Any non-200 = confirmed orphan.
for d in $(skopeo inspect --tls-verify --creds "$USER:$PASS" --raw \
"docker://$REG/$IMAGE:$TAG" | jq -r '.manifests[].digest'); do
code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
-I "https://$REG/v2/$IMAGE/manifests/$d")
echo "$d → $code"
done
```
If every child is 200, the problem is elsewhere — stop here and check
the registry VM, TLS, or auth.
The `registry-integrity-probe` CronJob in the `monitoring` namespace
runs this same check every 15 minutes across every tag in the catalog;
its last run is also a fast way to see which image(s) are affected:
```sh
kubectl -n monitoring logs \
$(kubectl -n monitoring get pods -l job-name -o name \
| grep registry-integrity-probe | head -1)
```
## Phase 2 — Rebuild
### Option A (preferred): rebuild via CI
Find the `build-*.yml` pipeline that produces the image:
| Image | Pipeline | Repo ID |
|---|---|---|
| `infra-ci` | `.woodpecker/build-ci-image.yml` | 1 (infra) |
| `infra` (cli) | `.woodpecker/build-cli.yml` | 1 (infra) |
| `k8s-portal` | `.woodpecker/k8s-portal.yml` | 1 (infra) |
Trigger a manual build. The Woodpecker API expects a numeric repo ID
(paths with `owner/name` return HTML):
```sh
WOODPECKER_TOKEN=$(vault kv get -field=woodpecker_admin_token secret/viktor)
# Kick off a manual build against master.
curl -s -X POST \
-H "Authorization: Bearer $WOODPECKER_TOKEN" \
-H "Content-Type: application/json" \
"https://ci.viktorbarzin.me/api/repos/1/pipelines" \
-d '{"branch":"master"}' | jq .number
# Follow the pipeline at https://ci.viktorbarzin.me/repos/1/pipeline/<number>
```
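If the image isn't in the table above and its numeric repo ID is unknown, it can usually be resolved through the API first. The lookup endpoint below is assumed from current Woodpecker releases; verify against your instance if it 404s:
```sh
# Assumed endpoint (not exercised in this incident): resolve owner/name to the
# numeric repo ID that the pipelines API expects.
curl -s -H "Authorization: Bearer $WOODPECKER_TOKEN" \
  "https://ci.viktorbarzin.me/api/repos/lookup/ViktorBarzin/infra" | jq .id
```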
The pipeline's `verify-integrity` step walks every blob the push
references. If it passes, the registry now has a clean index; pull
consumers will recover on next attempt.
### Option B (fallback): build on the registry VM
Only use this if Woodpecker itself is broken (its own pipeline runs
from the same `infra-ci` image, so a corrupted `infra-ci:latest` can
prevent Option A from recovering).
```sh
ssh root@10.0.20.10 '
cd /tmp
git clone --depth 1 https://github.com/ViktorBarzin/infra
cd infra/ci
docker build -t registry.viktorbarzin.me:5050/infra-ci:manual -t registry.viktorbarzin.me:5050/infra-ci:latest .
# NOTE: inside this single-quoted block $USER / $PASS expand on the VM, not locally;
# export the registry credentials in that shell before running the login.
docker login -u "$USER" -p "$PASS" registry.viktorbarzin.me:5050
docker push registry.viktorbarzin.me:5050/infra-ci:manual
docker push registry.viktorbarzin.me:5050/infra-ci:latest
'
```
Then re-run any pipelines that failed — Woodpecker UI → Restart, or:
```sh
curl -s -X POST \
-H "Authorization: Bearer $WOODPECKER_TOKEN" \
"https://ci.viktorbarzin.me/api/repos/1/pipelines/<failed-pipeline-number>"
```
## Phase 3 — Verify
```sh
# 1. Pull the image fresh (bypassing containerd cache) and check its index.
REG=registry.viktorbarzin.me:5050
skopeo inspect --tls-verify --creds "$USER:$PASS" \
--raw "docker://$REG/infra-ci:latest" \
| jq '.manifests[] | {digest, platform}'
# 2. HEAD every child digest — all should be 200.
for d in $(skopeo inspect --tls-verify --creds "$USER:$PASS" --raw \
"docker://$REG/infra-ci:latest" | jq -r '.manifests[].digest'); do
code=$(curl -sk -u "$USER:$PASS" -o /dev/null -w '%{http_code}' \
-I "https://$REG/v2/infra-ci/manifests/$d")
[ "$code" = "200" ] || echo "STILL BROKEN: $d → $code"
done
echo "verified"
# 3. Kick off the next scheduled probe for good measure (store the name so the
#    logs command matches the job we just created).
JOB=registry-integrity-probe-verify-$(date +%s)
kubectl -n monitoring create job --from=cronjob/registry-integrity-probe "$JOB"
kubectl -n monitoring logs -f -l job-name="$JOB"
```
The `RegistryManifestIntegrityFailure` alert clears automatically when
the probe's next run returns zero failures.
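To check the gauge without waiting for the alert to resolve, the probe's last push is also visible directly on the Pushgateway (service name taken from the CronJob's `PUSHGATEWAY` env var); a quick sketch:
```sh
# Scrape the Pushgateway's /metrics and pick out the probe's gauge; a value of 0
# means the last probe run found no broken references. The pipe runs locally on
# the streamed output.
kubectl -n monitoring run probe-check -i --rm --restart=Never --image=curlimages/curl -- \
  curl -s http://prometheus-prometheus-pushgateway.monitoring:9091/metrics \
  | grep registry_manifest_integrity_failures
```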
## Phase 4 — Investigate orphans
Once the immediate fix is in, check whether any OTHER images on the
registry have orphan children:
```sh
ssh root@10.0.20.10 'python3 /opt/registry/fix-broken-blobs.sh --dry-run 2>&1 | grep "ORPHAN INDEX"'
```
Each hit is a separate image that will eventually fail to pull. Rebuild
them in the same way (Option A preferred). If the list is long, open a
beads task — do NOT batch-delete the indexes; that's a destructive
registry operation outside this runbook's scope.
## Related
- `docs/post-mortems/2026-04-19-registry-orphan-index.md` — why this
happens.
- `docs/runbooks/registry-vm.md` — VM-level operations (DNS,
`docker compose` restarts).
- `modules/docker-registry/fix-broken-blobs.sh` — the scanner cron
itself, runs nightly and after each GC.
- `stacks/monitoring/modules/monitoring/main.tf` — the
  `registry_integrity_probe` CronJob definition.


@@ -145,3 +145,7 @@ ssh root@10.0.20.10 '
- `docs/architecture/dns.md` — resolver IP assignments per subnet.
- `.claude/CLAUDE.md` (at repo root) — notes on the private registry
  and `containerd` `hosts.toml` redirects.
- `docs/runbooks/registry-rebuild-image.md` — rebuild an image after an
orphan OCI-index incident (different class of problem than DNS).
- `docs/post-mortems/2026-04-19-registry-orphan-index.md` — root cause
+ detection gaps behind the recurring missing-blob incidents.


@@ -3,8 +3,12 @@ networks:
    driver: bridge

services:
  # registry:2 is pinned after the 2026-04-13 + 2026-04-19 orphan-index incidents.
  # Floating tags were swapping to regressed versions between GC runs. Upgrade
  # path: bump all six registry-* services in lockstep and bounce via
  # `systemctl restart docker-compose-registry.service`.
  registry-dockerhub:
    image: registry:2.8.3
    container_name: registry-dockerhub
    restart: always
    volumes:
@@ -22,7 +26,7 @@
      start_period: 10s

  registry-ghcr:
    image: registry:2.8.3
    container_name: registry-ghcr
    restart: always
    volumes:
@@ -38,7 +42,7 @@
      start_period: 10s

  registry-quay:
    image: registry:2.8.3
    container_name: registry-quay
    restart: always
    volumes:
@@ -54,7 +58,7 @@
      start_period: 10s

  registry-k8s:
    image: registry:2.8.3
    container_name: registry-k8s
    restart: always
    volumes:
@@ -70,7 +74,7 @@
      start_period: 10s

  registry-kyverno:
    image: registry:2.8.3
    container_name: registry-kyverno
    restart: always
    volumes:
@@ -86,7 +90,7 @@
      start_period: 10s

  registry-private:
    image: registry:2.8.3
    container_name: registry-private
    restart: always
    volumes:

@@ -1,25 +1,33 @@
#!/usr/bin/env python3
"""Registry integrity scanner — two classes of brokenness.

1. Orphaned layer links: the cleanup-tags.sh + garbage-collect cycle can delete
   blob data while leaving _layers/ link files intact. The registry then returns
   HTTP 200 with 0 bytes for those layers (it finds the link, trusts the blob
   exists, but the data is gone). Containerd sees "unexpected EOF".
   Action: delete the orphan link so the next pull re-fetches cleanly.

2. Orphaned OCI-index children: an image index (multi-platform manifest list)
   references child manifests by digest. If a child's blob has been deleted —
   by a cleanup-tags.sh tag rmtree followed by garbage-collect walking the
   children wrong (distribution/distribution#3324 class), or by an incomplete
   `buildx --push` whose partial blob was later purged by `uploadpurging` —
   the index survives but pulls fail with `manifest unknown`.
   Action: log loudly. Deleting an index is a conscious decision (the image
   was published; removing it breaks downstream consumers), so we surface
   the problem and leave repair to a human or to the rebuild runbook.

Run after garbage-collect (Sunday 03:30) and daily (Mon-Sat 02:30).
"""
import argparse
import json
import os
import sys

sys.stdout.reconfigure(line_buffering=True)

parser = argparse.ArgumentParser(description="Scan registry for orphaned blobs and indexes")
parser.add_argument("base", nargs="?", default="/opt/registry/data", help="Registry data directory")
parser.add_argument("--dry-run", action="store_true", help="Report but don't delete")
args = parser.parse_args()
@@ -27,39 +35,101 @@ args = parser.parse_args()
BASE = args.base
DRY_RUN = args.dry_run

INDEX_MEDIA_TYPES = (
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
)

total_layer_removed = 0
total_layer_checked = 0
total_index_scanned = 0
total_index_orphans = 0


def load_manifest_blob(blobs_root, digest_hex):
    blob_path = os.path.join(blobs_root, digest_hex[:2], digest_hex, "data")
    if not os.path.isfile(blob_path):
        return None
    try:
        with open(blob_path, "rb") as f:
            raw = f.read(1024 * 1024)
    except OSError:
        return None
    try:
        return json.loads(raw)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return None


for registry_name in sorted(os.listdir(BASE)):
    repos_dir = os.path.join(BASE, registry_name, "docker/registry/v2/repositories")
    blobs_root = os.path.join(BASE, registry_name, "docker/registry/v2/blobs/sha256")
    if not os.path.isdir(repos_dir):
        continue

    for root, _, _ in os.walk(repos_dir):
        # --- Scan 1: orphan layer links ----------------------------------------
        if root.endswith("/_layers/sha256"):
            repo = root.replace(repos_dir + "/", "").replace("/_layers/sha256", "")
            for digest_dir in os.listdir(root):
                link_file = os.path.join(root, digest_dir, "link")
                if not os.path.isfile(link_file):
                    continue
                total_layer_checked += 1
                blob_data = os.path.join(blobs_root, digest_dir[:2], digest_dir, "data")
                if os.path.isfile(blob_data):
                    continue
                prefix = "[DRY RUN] " if DRY_RUN else ""
                print(f"{prefix}[{registry_name}/{repo}] removing orphaned layer link: {digest_dir[:12]}...")
                if not DRY_RUN:
                    import shutil
                    shutil.rmtree(os.path.join(root, digest_dir))
                total_layer_removed += 1

        # --- Scan 2: orphan OCI-index children --------------------------------
        elif root.endswith("/_manifests/revisions/sha256"):
            repo = root.replace(repos_dir + "/", "").replace("/_manifests/revisions/sha256", "")
            for digest_dir in os.listdir(root):
                # Manifest revision entry. Load the blob it points to.
                manifest = load_manifest_blob(blobs_root, digest_dir)
                if manifest is None:
                    continue
                media_type = manifest.get("mediaType", "")
                if media_type not in INDEX_MEDIA_TYPES:
                    continue
                total_index_scanned += 1
                for child in manifest.get("manifests", []):
                    child_digest = child.get("digest", "")
                    if not child_digest.startswith("sha256:"):
                        continue
                    child_hex = child_digest[len("sha256:"):]
                    child_blob = os.path.join(blobs_root, child_hex[:2], child_hex, "data")
                    if os.path.isfile(child_blob):
                        continue
                    platform = child.get("platform", {})
                    arch = platform.get("architecture", "?")
                    os_ = platform.get("os", "?")
                    print(
                        f"WARNING [{registry_name}/{repo}] ORPHAN INDEX: "
                        f"{digest_dir[:12]} references missing child {child_hex[:12]} "
                        f"({arch}/{os_}) — rebuild required, will not auto-repair"
                    )
                    total_index_orphans += 1

mode = "DRY RUN — " if DRY_RUN else ""
print(f"\n{mode}Layer scan: checked {total_layer_checked} links, removed {total_layer_removed} orphaned.")
print(f"{mode}Index scan: inspected {total_index_scanned} image indexes, found {total_index_orphans} orphaned children.")
if total_index_orphans > 0:
    print(f"\nACTION REQUIRED: {total_index_orphans} orphan index child(ren) detected. "
          "See docs/runbooks/registry-rebuild-image.md — the affected image must be rebuilt "
          "(a registry DELETE on an index is a conscious decision, not an automated repair).")


@@ -30,5 +30,7 @@ module "monitoring" {
  haos_api_token         = data.vault_kv_secret_v2.secrets.data["haos_api_token"]
  pve_password           = data.vault_kv_secret_v2.secrets.data["pve_password"]
  grafana_admin_password = data.vault_kv_secret_v2.secrets.data["grafana_admin_password"]
  registry_user          = data.vault_kv_secret_v2.viktor.data["registry_user"]
  registry_password      = data.vault_kv_secret_v2.viktor.data["registry_password"]
  tier                   = local.tiers.cluster
}


@@ -29,6 +29,14 @@ variable "grafana_admin_password" {
}
variable "tier" { type = string }
variable "mysql_host" { type = string }

variable "registry_user" {
  type      = string
  sensitive = true
}
variable "registry_password" {
  type      = string
  sensitive = true
}

resource "kubernetes_namespace" "monitoring" {
  metadata {
@@ -225,6 +233,195 @@ resource "kubernetes_cron_job_v1" "dns_anomaly_monitor" {
  }
}
# -----------------------------------------------------------------------------
# Registry manifest-integrity probe HEADs every tag in the private R/W
# registry's catalog, walks multi-platform image indexes, and reports blob
# availability. Catches the orphan-index failure mode seen 2026-04-13 and
# 2026-04-19 before downstream pipelines hit it.
# See: docs/post-mortems/2026-04-19-registry-orphan-index.md
# -----------------------------------------------------------------------------
resource "kubernetes_secret" "registry_probe_credentials" {
metadata {
name = "registry-probe-credentials"
namespace = kubernetes_namespace.monitoring.metadata[0].name
}
type = "Opaque"
data = {
REG_USER = var.registry_user
REG_PASS = var.registry_password
}
}
resource "kubernetes_cron_job_v1" "registry_integrity_probe" {
metadata {
name = "registry-integrity-probe"
namespace = kubernetes_namespace.monitoring.metadata[0].name
}
spec {
concurrency_policy = "Forbid"
failed_jobs_history_limit = 3
successful_jobs_history_limit = 3
schedule = "*/15 * * * *"
job_template {
metadata {}
spec {
backoff_limit = 1
ttl_seconds_after_finished = 600
template {
metadata {}
spec {
container {
name = "registry-integrity-probe"
image = "docker.io/library/alpine:3.20"
env {
name = "REG_USER"
value_from {
secret_key_ref {
name = kubernetes_secret.registry_probe_credentials.metadata[0].name
key = "REG_USER"
}
}
}
env {
name = "REG_PASS"
value_from {
secret_key_ref {
name = kubernetes_secret.registry_probe_credentials.metadata[0].name
key = "REG_PASS"
}
}
}
env {
name = "REGISTRY_HOST"
value = "10.0.20.10:5050"
}
env {
name = "REGISTRY_INSTANCE"
value = "registry.viktorbarzin.me:5050"
}
env {
name = "PUSHGATEWAY"
value = "http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/registry-integrity-probe"
}
env {
name = "TAGS_PER_REPO"
value = "5"
}
command = ["/bin/sh", "-c", <<-EOT
set -eu
apk add --no-cache curl jq >/dev/null
REG="$REGISTRY_HOST"
INSTANCE="$REGISTRY_INSTANCE"
AUTH="$REG_USER:$REG_PASS"
ACCEPT='application/vnd.oci.image.index.v1+json,application/vnd.oci.image.manifest.v1+json,application/vnd.docker.distribution.manifest.list.v2+json,application/vnd.docker.distribution.manifest.v2+json'
push() {
# Prometheus pushgateway body ends with blank line. Ignore push errors.
curl -sf --max-time 10 --data-binary @- "$PUSHGATEWAY" >/dev/null 2>&1 || true
}
CATALOG=$(curl -sk -u "$AUTH" --max-time 30 "https://$REG/v2/_catalog?n=1000" || echo "")
REPOS=$(echo "$CATALOG" | jq -r '.repositories[]?' 2>/dev/null || echo "")
if [ -z "$REPOS" ]; then
echo "ERROR: empty catalog or auth failure — cannot probe"
NOW=$(date +%s)
push <<METRICS
# TYPE registry_manifest_integrity_catalog_accessible gauge
registry_manifest_integrity_catalog_accessible{instance="$INSTANCE"} 0
# TYPE registry_manifest_integrity_last_run_timestamp gauge
registry_manifest_integrity_last_run_timestamp{instance="$INSTANCE"} $NOW
METRICS
exit 1
fi
FAIL=0
REPOS_N=0
TAGS_N=0
INDEXES_N=0
printf '%s\n' $REPOS > /tmp/repos.txt
while IFS= read -r repo; do
[ -z "$repo" ] && continue
REPOS_N=$((REPOS_N + 1))
TAGS_JSON=$(curl -sk -u "$AUTH" --max-time 15 "https://$REG/v2/$repo/tags/list" || echo "")
echo "$TAGS_JSON" | jq -r '.tags[]?' 2>/dev/null | tail -n "$TAGS_PER_REPO" > /tmp/tags.txt || true
while IFS= read -r tag; do
[ -z "$tag" ] && continue
TAGS_N=$((TAGS_N + 1))
HTTP=$(curl -sk -u "$AUTH" -o /tmp/m.json -w '%%{http_code}' \
-H "Accept: $ACCEPT" --max-time 15 \
"https://$REG/v2/$repo/manifests/$tag")
if [ "$HTTP" != "200" ]; then
echo "FAIL: $repo:$tag manifest HTTP $HTTP"
FAIL=$((FAIL + 1))
continue
fi
MT=$(jq -r '.mediaType // empty' /tmp/m.json 2>/dev/null || echo "")
if echo "$MT" | grep -Eq 'manifest\.list|image\.index'; then
INDEXES_N=$((INDEXES_N + 1))
jq -r '.manifests[].digest' /tmp/m.json > /tmp/children.txt 2>/dev/null || true
while IFS= read -r d; do
[ -z "$d" ] && continue
CH=$(curl -sk -u "$AUTH" -o /dev/null -w '%%{http_code}' \
-H "Accept: $ACCEPT" --max-time 10 -I \
"https://$REG/v2/$repo/manifests/$d")
if [ "$CH" != "200" ]; then
echo "FAIL: $repo:$tag index child $d HTTP $CH"
FAIL=$((FAIL + 1))
fi
done < /tmp/children.txt
fi
done < /tmp/tags.txt
done < /tmp/repos.txt
NOW=$(date +%s)
push <<METRICS
# TYPE registry_manifest_integrity_failures gauge
registry_manifest_integrity_failures{instance="$INSTANCE"} $FAIL
# TYPE registry_manifest_integrity_catalog_accessible gauge
registry_manifest_integrity_catalog_accessible{instance="$INSTANCE"} 1
# TYPE registry_manifest_integrity_repos_checked gauge
registry_manifest_integrity_repos_checked{instance="$INSTANCE"} $REPOS_N
# TYPE registry_manifest_integrity_tags_checked gauge
registry_manifest_integrity_tags_checked{instance="$INSTANCE"} $TAGS_N
# TYPE registry_manifest_integrity_indexes_checked gauge
registry_manifest_integrity_indexes_checked{instance="$INSTANCE"} $INDEXES_N
# TYPE registry_manifest_integrity_last_run_timestamp gauge
registry_manifest_integrity_last_run_timestamp{instance="$INSTANCE"} $NOW
METRICS
echo "Probe complete: $FAIL failures across $REPOS_N repos / $TAGS_N tags / $INDEXES_N indexes"
if [ "$FAIL" -gt 0 ]; then exit 1; fi
EOT
]
resources {
requests = {
cpu = "10m"
memory = "48Mi"
}
limits = {
memory = "96Mi"
}
}
}
restart_policy = "OnFailure"
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# Expose Pushgateway via NodePort so the PVE host can push LVM snapshot metrics
resource "kubernetes_service" "pushgateway_nodeport" {
  metadata {


@@ -1471,6 +1471,28 @@ serverFiles:
            severity: info
          annotations:
            summary: "Registry cache hit rate: {{ $value | printf \"%.0f\" }}% (threshold: 25%)"
        - alert: RegistryManifestIntegrityFailure
          expr: registry_manifest_integrity_failures > 0
          for: 30m
          labels:
            severity: critical
          annotations:
            summary: "Registry has {{ $value }} broken manifest reference(s) — orphan index or missing blob"
            description: "The registry-integrity-probe CronJob in the monitoring namespace found {{ $value }} manifest/blob references that return non-200 on the private registry. Almost certainly an orphan OCI-index child from the cleanup-tags.sh+GC race. Rebuild the affected image per docs/runbooks/registry-rebuild-image.md and investigate which tag(s) the probe logs flagged."
        - alert: RegistryIntegrityProbeStale
          expr: time() - registry_manifest_integrity_last_run_timestamp > 3600
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Registry integrity probe has not reported in >1h — CronJob may be broken"
        - alert: RegistryCatalogInaccessible
          expr: registry_manifest_integrity_catalog_accessible == 0
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "Registry probe cannot fetch /v2/_catalog — auth failure or registry down"
        - alert: NodeHighCPUUsage
          expr: pve_cpu_usage_ratio * 100 > 60
          for: 6h