Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had
been garbage-collected out from under the index. Pipelines P366→P376 all
exited 126 "image can't be pulled". Hot fix (a05d63e/6371e75/c113be4)
restored green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, and registry:2's GC (Sunday 03:25) walks OCI index children
imperfectly (distribution/distribution#3324 class). Nothing verified
pushes end-to-end; nothing probed the registry for fetchability; nothing
caught orphan indexes.

Phase 1 - Detection:
- .woodpecker/build-ci-image.yml: after build-and-push, a
  verify-integrity step walks the just-pushed manifest (index + children
  + config + every layer blob) via HEAD and fails the pipeline on any
  non-200. Catches broken pushes at the source.
- stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
  three alerts (RegistryManifestIntegrityFailure,
  RegistryIntegrityProbeStale, RegistryCatalogInaccessible), closing the
  "registry serves 404 for a tag that exists" gap that masked the
  incident for 2+ hours.
- docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
  timeline, monitoring gaps, permanent fix.

Phase 2 - Prevention:
- modules/docker-registry/docker-compose.yml: pin registry:2 →
  registry:2.8.3 across all six registry services. Removes the
  floating-tag footgun.
- modules/docker-registry/fix-broken-blobs.sh: new scan walks every
  _manifests/revisions/sha256/<digest> that is an image index and logs a
  loud WARNING when a referenced child blob is missing. Does NOT
  auto-delete: deleting a published image is a conscious decision.
  Layer-link scan preserved.

Phase 3 - Recovery:
- build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
  don't need a cosmetic Dockerfile edit (matches convention from
  pve-nfs-exports-sync.yml).
- docs/runbooks/registry-rebuild-image.md: exact command sequence for
  diagnosing + rebuilding after an orphan-index incident, plus a fallback
  for building directly on the registry VM if Woodpecker itself is down.
- docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
  cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
- Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
- Registry HA/replication (single-VM SPOF is a known architectural
  choice; Synology offsite covers RPO < 1 day).
- Diun exclude for registry:2: not applicable; Diun only watches k8s
  (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
- fix-broken-blobs.sh --dry-run on a synthetic registry directory
  correctly flags both orphan layer links and orphan OCI-index children.
- terraform fmt + validate on stacks/monitoring: success (only unrelated
  deprecation warnings).
- python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
  modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
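
The manifest walk behind the verify-integrity step can be sketched roughly as follows. This is a minimal illustration, not the script shipped in .woodpecker/build-ci-image.yml: the function names (`http_request`, `verify_image`) are hypothetical, authentication is omitted, and manifests are fetched with GET rather than HEAD because their bodies are needed to discover children, config and layers. Blobs are checked with HEAD. The URL layout (`/v2/<name>/manifests/<reference>`, `/v2/<name>/blobs/<digest>`) is the standard registry HTTP API.

```python
import json
import urllib.error
import urllib.request

# Media types that mark a manifest as an index (multi-arch list).
INDEX_TYPES = {
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
}
ACCEPT = ", ".join(sorted(INDEX_TYPES | {
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.docker.distribution.manifest.v2+json",
}))


def http_request(method, url):
    """Return (status, body); HTTP error statuses are returned, not raised."""
    req = urllib.request.Request(url, method=method, headers={"Accept": ACCEPT})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, b""


def verify_image(base, repo, ref, request=http_request):
    """Walk index -> child manifests -> config + layers.

    Returns the list of URLs that did not answer 200 (empty means intact).
    """
    failures = []

    def check_blob(digest):
        url = f"{base}/v2/{repo}/blobs/{digest}"
        status, _ = request("HEAD", url)
        if status != 200:
            failures.append(url)

    def walk(reference):
        url = f"{base}/v2/{repo}/manifests/{reference}"
        status, body = request("GET", url)
        if status != 200:
            failures.append(url)
            return
        doc = json.loads(body)
        if doc.get("mediaType") in INDEX_TYPES or "manifests" in doc:
            # Image index: recurse into every child manifest by digest.
            for child in doc.get("manifests", []):
                walk(child["digest"])
        else:
            # Single-platform manifest: config blob plus every layer blob.
            check_blob(doc["config"]["digest"])
            for layer in doc.get("layers", []):
                check_blob(layer["digest"])

    walk(ref)
    return failures
```

In a pipeline step, a non-empty return value would translate into a non-zero exit code. The `request` parameter exists only so the walk can be exercised against a fake registry without network access.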
169 lines
4.7 KiB
YAML
networks:
  registry:
    driver: bridge

services:
  # registry:2 is pinned after the 2026-04-13 + 2026-04-19 orphan-index incidents.
  # Floating tags were swapping to regressed versions between GC runs. Upgrade
  # path: bump all six registry-* services in lockstep and bounce via
  # `systemctl restart docker-compose-registry.service`.
  registry-dockerhub:
    image: registry:2.8.3
    container_name: registry-dockerhub
    restart: always
    volumes:
      - /opt/registry/data/dockerhub:/var/lib/registry
      - /opt/registry/config-dockerhub.yml:/etc/docker/registry/config.yml:ro
    networks:
      - registry
    ports:
      - "5001:5001"
    healthcheck:
      test: ["CMD", "sh", "-c", "wget -qO- http://127.0.0.1:5000/v2/ >/dev/null 2>&1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

  registry-ghcr:
    image: registry:2.8.3
    container_name: registry-ghcr
    restart: always
    volumes:
      - /opt/registry/data/ghcr:/var/lib/registry
      - /opt/registry/config-ghcr.yml:/etc/docker/registry/config.yml:ro
    networks:
      - registry
    healthcheck:
      test: ["CMD", "sh", "-c", "wget -qO- http://127.0.0.1:5000/v2/ >/dev/null 2>&1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

  registry-quay:
    image: registry:2.8.3
    container_name: registry-quay
    restart: always
    volumes:
      - /opt/registry/data/quay:/var/lib/registry
      - /opt/registry/config-quay.yml:/etc/docker/registry/config.yml:ro
    networks:
      - registry
    healthcheck:
      test: ["CMD", "sh", "-c", "wget -qO- http://127.0.0.1:5000/v2/ >/dev/null 2>&1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

  registry-k8s:
    image: registry:2.8.3
    container_name: registry-k8s
    restart: always
    volumes:
      - /opt/registry/data/k8s:/var/lib/registry
      - /opt/registry/config-k8s.yml:/etc/docker/registry/config.yml:ro
    networks:
      - registry
    healthcheck:
      test: ["CMD", "sh", "-c", "wget -qO- http://127.0.0.1:5000/v2/ >/dev/null 2>&1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

  registry-kyverno:
    image: registry:2.8.3
    container_name: registry-kyverno
    restart: always
    volumes:
      - /opt/registry/data/kyverno:/var/lib/registry
      - /opt/registry/config-kyverno.yml:/etc/docker/registry/config.yml:ro
    networks:
      - registry
    healthcheck:
      test: ["CMD", "sh", "-c", "wget -qO- http://127.0.0.1:5000/v2/ >/dev/null 2>&1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

  registry-private:
    image: registry:2.8.3
    container_name: registry-private
    restart: always
    volumes:
      - /opt/registry/data/private:/var/lib/registry
      - /opt/registry/config-private.yml:/etc/docker/registry/config.yml:ro
      - /opt/registry/htpasswd:/auth/htpasswd:ro
    networks:
      - registry
    healthcheck:
      # 401 is expected (auth required); any HTTP response means the registry is healthy
      test: ["CMD", "sh", "-c", "wget -qS -O /dev/null http://127.0.0.1:5000/v2/ 2>&1 | grep -q 'HTTP/'"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

  nginx:
    image: nginx:alpine
    container_name: registry-nginx
    restart: always
    ports:
      - "5000:5000"
      - "5010:5010"
      - "5020:5020"
      - "5030:5030"
      - "5040:5040"
      - "5050:5050"
    volumes:
      - /opt/registry/nginx.conf:/etc/nginx/nginx.conf:ro
      - /opt/registry/tls:/etc/nginx/tls:ro
      - nginx-cache:/var/cache/nginx
    networks:
      - registry
    depends_on:
      registry-dockerhub:
        condition: service_healthy
      registry-ghcr:
        condition: service_healthy
      registry-quay:
        condition: service_healthy
      registry-k8s:
        condition: service_healthy
      registry-kyverno:
        condition: service_healthy
      registry-private:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "sh", "-c", "wget -qO- http://127.0.0.1:5000/v2/ >/dev/null 2>&1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 15s

  registry-ui:
    image: joxit/docker-registry-ui:latest
    container_name: registry-ui
    restart: always
    ports:
      - "8080:80"
    environment:
      - NGINX_PROXY_PASS_URL=http://registry-dockerhub:5000
      - DELETE_IMAGES=true
      - SINGLE_REGISTRY=true
      - SHOW_CONTENT_DIGEST=true
      - SHOW_CATALOG_NB_TAGS=true
      - CATALOG_ELEMENTS_LIMIT=1000
      - TAGLIST_PAGE_SIZE=100
      - REGISTRY_TITLE=viktorbarzin.me
    networks:
      - registry
    depends_on:
      registry-dockerhub:
        condition: service_healthy

volumes:
  nginx-cache:
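
The "bump all six registry-* services in lockstep" rule from the comment at the top of the file could be guarded by a small pre-merge check along these lines. This is a hedged sketch, not part of the repository: `registry_tags` and `check_lockstep` are hypothetical names, and the parser is a deliberately simple regex over `image:` lines rather than a full YAML load, so it assumes one unqualified `registry:<tag>` image per line with no registry host prefix.

```python
import re

# Matches "image: <value>" lines at any indentation.
IMAGE_LINE = re.compile(r"^\s*image:\s*(\S+)\s*$", re.MULTILINE)
# A fully pinned tag looks like MAJOR.MINOR.PATCH, e.g. 2.8.3.
FULL_VERSION = re.compile(r"\d+\.\d+\.\d+")


def registry_tags(compose_text):
    """Return the set of tags used by services running the `registry` image."""
    tags = set()
    for image in IMAGE_LINE.findall(compose_text):
        repo, _, tag = image.strip("'\"").partition(":")
        if repo == "registry":
            # An image with no tag resolves to :latest.
            tags.add(tag or "latest")
    return tags


def check_lockstep(compose_text):
    """Return a list of problems: floating tags or mixed versions."""
    tags = registry_tags(compose_text)
    problems = []
    for tag in sorted(tags):
        if not FULL_VERSION.fullmatch(tag):
            problems.append(f"registry:{tag} is not pinned to a full version")
    if len(tags) > 1:
        problems.append("registry services use mixed tags: " + ", ".join(sorted(tags)))
    return problems
```

Other images such as `nginx:alpine` are ignored on purpose, since the lockstep rule in the comment only concerns the registry services.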