Second identical registry incident on 2026-04-19 (first: 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
with code 126 ("image can't be pulled"). The hot fix (a05d63e/6371e75/c113be4)
restored green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, while registry:2's GC (Sunday 03:25) walks OCI index children
imperfectly (distribution/distribution#3324 class). Nothing verified pushes
end-to-end, nothing probed the registry for fetchability, and nothing caught
orphan indexes.

Phase 1: Detection

- .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
  step walks the just-pushed manifest (index + children + config + every
  layer blob) via HEAD and fails the pipeline on any non-200 response.
  Catches broken pushes at the source.
- stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
  three alerts (RegistryManifestIntegrityFailure, RegistryIntegrityProbeStale,
  RegistryCatalogInaccessible), closing the "registry serves 404 for a tag
  that exists" gap that masked the incident for 2+ hours.
- docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
  timeline, monitoring gaps, permanent fix.

Phase 2: Prevention

- modules/docker-registry/docker-compose.yml: pin registry:2 to
  registry:2.8.3 across all six registry services. Removes the floating-tag
  footgun.
- modules/docker-registry/fix-broken-blobs.sh: a new scan walks every
  _manifests/revisions/sha256/<digest> that is an image index and logs a
  loud WARNING when a referenced child blob is missing. It does NOT
  auto-delete; deleting a published image is a conscious decision. The
  existing layer-link scan is preserved.

Phase 3: Recovery

- build-ci-image.yml: accept the `manual` event so Woodpecker API/UI rebuilds
  don't need a cosmetic Dockerfile edit (matches the convention from
  pve-nfs-exports-sync.yml).
- docs/runbooks/registry-rebuild-image.md: exact command sequence for
  diagnosing and rebuilding after an orphan-index incident, plus a fallback
  for building directly on the registry VM if Woodpecker itself is down.
- docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
  cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):

- Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
- Registry HA/replication: the single-VM SPOF is a known architectural
  choice; Synology offsite covers RPO < 1 day.
- Diun exclude for registry:2: not applicable, since Diun only watches k8s
  (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:

- fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
  flags both orphan layer links and orphan OCI-index children.
- terraform fmt + validate on stacks/monitoring: success (only unrelated
  deprecation warnings).
- python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
  modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
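The Phase 1 alerts might be wired up roughly like the following Prometheus rule sketch. The metric names and thresholds here are assumptions for illustration; the probe CronJob's actual exported metrics live in stacks/monitoring:

```yaml
groups:
  - name: registry-integrity
    rules:
      # Hypothetical metric names; the real probe's exports may differ.
      - alert: RegistryManifestIntegrityFailure
        expr: registry_integrity_check_failures > 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Registry serves a broken manifest/blob chain for a published tag"
      - alert: RegistryIntegrityProbeStale
        expr: time() - registry_integrity_last_success_timestamp_seconds > 3600
        labels:
          severity: warning
        annotations:
          summary: "Integrity probe has not succeeded in the last hour"
```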
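Phase 2's orphan-index scan can be illustrated with a self-contained sketch. This is not the real fix-broken-blobs.sh (whose flags and output format may differ); it assumes registry:2's standard on-disk layout (`blobs/sha256/<xx>/<digest>/data`, `_manifests/revisions/sha256/<digest>/link`), and all digests and paths below are synthetic stand-ins built in a temp dir:

```shell
#!/bin/sh
# Sketch of the orphan-index scan on a synthetic registry tree.
set -eu

ROOT=$(mktemp -d)
BLOBS="$ROOT/blobs/sha256"
REVS="$ROOT/repositories/infra-ci/_manifests/revisions/sha256"

idx=aaaa11; child_ok=bbbb22; child_gone=cccc33  # fake digests

# An image index referencing one present child and one GC-orphaned child.
mkdir -p "$BLOBS/aa/$idx" "$BLOBS/bb/$child_ok" "$REVS/$idx"
printf 'sha256:%s' "$idx" > "$REVS/$idx/link"
cat > "$BLOBS/aa/$idx/data" <<EOF
{"mediaType":"application/vnd.oci.image.index.v1+json",
 "manifests":[{"digest":"sha256:$child_ok"},{"digest":"sha256:$child_gone"}]}
EOF
echo '{"mediaType":"application/vnd.oci.image.manifest.v1+json"}' \
  > "$BLOBS/bb/$child_ok/data"

# The scan: for every revision that is an OCI image index, check each
# referenced child digest for a backing blob. Warn loudly; never delete.
warnings=0
for rev in "$REVS"/*; do
  digest=$(cat "$rev/link")
  hex=${digest#sha256:}
  data="$BLOBS/$(printf '%s' "$hex" | cut -c1-2)/$hex/data"
  grep -q 'image.index' "$data" || continue
  for child in $(grep -o 'sha256:[0-9a-f]*' "$data"); do
    chex=${child#sha256:}
    cdata="$BLOBS/$(printf '%s' "$chex" | cut -c1-2)/$chex/data"
    if [ ! -f "$cdata" ]; then
      echo "WARNING: index $digest references missing child $child"
      warnings=$((warnings + 1))
    fi
  done
done
echo "scan complete: $warnings orphaned child reference(s)"
```

On this synthetic tree the scan flags exactly the deliberately missing child, which is the shape of failure both incidents produced.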
# Build the CI tools Docker image used by all infra pipelines.
# Triggers on push that touches ci/Dockerfile, or manual (API/UI) so
# rebuilds after a registry incident don't need a cosmetic Dockerfile edit.

when:
  - event: push
    branch: master
    path:
      include:
        - 'ci/Dockerfile'
  - event: manual

steps:
  - name: build-and-push
    image: woodpeckerci/plugin-docker-buildx
    settings:
      repo: registry.viktorbarzin.me:5050/infra-ci
      dockerfile: ci/Dockerfile
      context: ci/
      tags:
        - latest
        - "${CI_COMMIT_SHA:0:8}"
      platforms: linux/amd64
      registry: registry.viktorbarzin.me:5050
      logins:
        - registry: registry.viktorbarzin.me:5050
          username:
            from_secret: registry_user
          password:
            from_secret: registry_password

  # Post-push integrity check. Re-resolves the image we just pushed and HEADs
  # every blob it references: top-level manifest (index or single), each child
  # platform manifest, each config blob, each layer blob. If any returns != 200
  # the pipeline fails loudly here so we never ship a broken index downstream.
  # Historical context: the 2026-04-13 and 2026-04-19 incidents both shipped
  # indexes whose platform/attestation children had been GC-orphaned on the
  # registry VM.
  - name: verify-integrity
    image: alpine:3.20
    environment:
      REG_USER:
        from_secret: registry_user
      REG_PASS:
        from_secret: registry_password
    commands:
      - apk add --no-cache curl jq
      - REG=registry.viktorbarzin.me:5050
      - REPO=infra-ci
      - SHA=${CI_COMMIT_SHA:0:8}
      - AUTH="$REG_USER:$REG_PASS"
      - |
        set -euo pipefail
        ACCEPT='Accept: application/vnd.oci.image.index.v1+json,application/vnd.oci.image.manifest.v1+json,application/vnd.docker.distribution.manifest.list.v2+json,application/vnd.docker.distribution.manifest.v2+json'

        fetch_manifest() {
          # Writes the body to $2; prints the HTTP status code on stdout.
          curl -sk -u "$AUTH" -H "$ACCEPT" \
            -o "$2" -w '%{http_code}' \
            "https://$REG/v2/$REPO/manifests/$1"
        }
        head_blob() {
          curl -sk -u "$AUTH" -o /dev/null -w '%{http_code}' \
            -I "https://$REG/v2/$REPO/blobs/$1"
        }

        verify_single_manifest() {
          local ref="$1" tmp=/tmp/m-$$.json
          local rc cfg
          rc=$(fetch_manifest "$ref" "$tmp")
          if [ "$rc" != "200" ]; then
            echo "FAIL: manifest $ref returned HTTP $rc"; return 1
          fi
          cfg=$(jq -r '.config.digest // empty' "$tmp")
          if [ -n "$cfg" ]; then
            rc=$(head_blob "$cfg")
            [ "$rc" = "200" ] || { echo "FAIL: config blob $cfg returned HTTP $rc"; return 1; }
          fi
          jq -r '.layers[]?.digest' "$tmp" > /tmp/layers-$$.txt
          while IFS= read -r layer; do
            [ -z "$layer" ] && continue
            rc=$(head_blob "$layer")
            [ "$rc" = "200" ] || { echo "FAIL: layer blob $layer returned HTTP $rc"; return 1; }
          done < /tmp/layers-$$.txt
          return 0
        }

        echo "=== Verifying push integrity for $REPO:$SHA ==="
        TOP=/tmp/top-$$.json
        rc=$(fetch_manifest "$SHA" "$TOP")
        [ "$rc" = "200" ] || { echo "FAIL: top manifest :$SHA returned HTTP $rc"; exit 1; }

        MT=$(jq -r '.mediaType // empty' "$TOP")
        echo "Top-level media type: ${MT:-<unset>}"

        if echo "$MT" | grep -Eq 'manifest\.list|image\.index'; then
          jq -r '.manifests[].digest' "$TOP" > /tmp/children-$$.txt
          echo "Multi-platform index: $(wc -l < /tmp/children-$$.txt) child manifest(s)"
          while IFS= read -r d; do
            echo "--- child $d ---"
            verify_single_manifest "$d" || exit 1
          done < /tmp/children-$$.txt
        else
          echo "Single-platform manifest; verifying directly"
          verify_single_manifest "$SHA" || exit 1
        fi

        echo "=== All manifests + blobs verified. Push integrity intact. ==="

  - name: slack
    image: curlimages/curl
    commands:
      - |
        curl -s -X POST -H 'Content-type: application/json' \
          --data "{\"text\":\"CI image built: registry.viktorbarzin.me:5050/infra-ci:${CI_COMMIT_SHA:0:8}\"}" \
          "$SLACK_WEBHOOK" || true
    environment:
      SLACK_WEBHOOK:
        from_secret: slack_webhook
    when:
      status: [success]