2026-02-22 21:45:53 +00:00
networks:
  registry:
    driver: bridge

services:
[registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.
Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, and registry:2's GC (Sunday 03:25) then walks OCI index children
imperfectly (distribution/distribution#3324 class). Nothing verified pushes
end-to-end, nothing probed the registry for fetchability, and nothing caught
orphan indexes.
Phase 1 — Detection:
- .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
step walks the just-pushed manifest (index + children + config + every
layer blob) via HEAD and fails the pipeline on any non-200. Catches
broken pushes at the source.
- stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
three alerts — RegistryManifestIntegrityFailure,
RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
"registry serves 404 for a tag that exists" gap that masked the incident
for 2+ hours.
- docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
timeline, monitoring gaps, permanent fix.
Phase 2 — Prevention:
- modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
across all six registry services. Removes the floating-tag footgun.
- modules/docker-registry/fix-broken-blobs.sh: new scan walks every
_manifests/revisions/sha256/<digest> that is an image index and logs a
loud WARNING when a referenced child blob is missing. Does NOT auto-
delete — deleting a published image is a conscious decision. Layer-link
scan preserved.
Phase 3 — Recovery:
- build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
don't need a cosmetic Dockerfile edit (matches convention from
pve-nfs-exports-sync.yml).
- docs/runbooks/registry-rebuild-image.md: exact command sequence for
diagnosing + rebuilding after an orphan-index incident, plus a fallback
for building directly on the registry VM if Woodpecker itself is down.
- docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
cross-references to the new runbook.
Out of scope (verified healthy or intentionally deferred):
- Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
- Registry HA/replication (single-VM SPOF is a known architectural
choice; Synology offsite covers RPO < 1 day).
- Diun exclude for registry:2 — not applicable; Diun only watches
k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.
Verified locally:
- fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
flags both orphan layer links and orphan OCI-index children.
- terraform fmt + validate on stacks/monitoring: success (only unrelated
deprecation warnings).
- python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
modules/docker-registry/docker-compose.yml: both parse clean.
Closes: code-4b8
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:08:28 +00:00
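The verify-integrity walk described in the commit message above (index, then children, then config and every layer blob, each via HEAD) can be sketched roughly as follows. This is a minimal illustration and not the actual pipeline step: `find_broken`, `blob_paths`, and the injected `head` / `get_json` callables are hypothetical names, and the in-memory `FAKE` registry stands in for real HTTP calls. Only the `/v2/<name>/manifests/<reference>` and `/v2/<name>/blobs/<digest>` paths come from the registry HTTP API.

```python
from typing import Callable, Dict, List


def blob_paths(repo: str, manifest: Dict) -> List[str]:
    """Blob URLs (config + layers) referenced by one image manifest."""
    digests = []
    if "config" in manifest:
        digests.append(manifest["config"]["digest"])
    digests += [layer["digest"] for layer in manifest.get("layers", [])]
    return [f"/v2/{repo}/blobs/{d}" for d in digests]


def find_broken(
    repo: str,
    ref: str,
    head: Callable[[str], int],       # hypothetical: returns the HTTP status of a HEAD
    get_json: Callable[[str], Dict],  # hypothetical: returns the parsed manifest body
) -> List[str]:
    """Return every path under the tag that does not answer 200."""
    top = f"/v2/{repo}/manifests/{ref}"
    if head(top) != 200:
        return [top]
    doc = get_json(top)
    # An image index lists child manifests; a plain manifest is its own child.
    if "manifests" in doc:
        manifest_paths = [f"/v2/{repo}/manifests/{m['digest']}" for m in doc["manifests"]]
    else:
        manifest_paths = [top]
    broken = []
    for path in manifest_paths:
        if head(path) != 200:
            broken.append(path)  # orphaned child: the index points at nothing
            continue
        manifest = doc if path == top else get_json(path)
        broken += [bp for bp in blob_paths(repo, manifest) if head(bp) != 200]
    return broken


# Stand-in registry: the index references two children, one of which is gone.
FAKE = {
    "/v2/infra-ci/manifests/latest": {
        "manifests": [{"digest": "sha256:aaa"}, {"digest": "sha256:bbb"}]
    },
    "/v2/infra-ci/manifests/sha256:aaa": {
        "config": {"digest": "sha256:c1"},
        "layers": [{"digest": "sha256:l1"}],
    },
    "/v2/infra-ci/blobs/sha256:c1": {},
    "/v2/infra-ci/blobs/sha256:l1": {},
}

print(find_broken("infra-ci", "latest",
                  lambda p: 200 if p in FAKE else 404,
                  lambda p: FAKE[p]))
# prints ['/v2/infra-ci/manifests/sha256:bbb']
```

A pipeline step built on this idea would exit non-zero whenever the returned list is non-empty, which matches the "fail on any non-200" behaviour the commit message describes.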
  # registry:2 is pinned after the 2026-04-13 + 2026-04-19 orphan-index incidents.
  # Floating tags were swapping to regressed versions between GC runs. Upgrade
  # path: bump all registry-* services in lockstep and bounce via
  # `systemctl restart docker-compose-registry.service`.
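The "bump in lockstep" rule from the comment above lends itself to a small lint. The sketch below assumes the compose file has already been parsed into a dict (for example with `yaml.safe_load`, as the commit message mentions); `registry_images` and `pin_problems` are hypothetical helper names, and treating "pinned" as a full `major.minor.patch` tag is an assumption rather than something the source states.

```python
import re
from typing import Dict, List

# Assumption: "pinned" means a full major.minor.patch tag on the official image.
PINNED = re.compile(r"^registry:\d+\.\d+\.\d+$")


def registry_images(compose: Dict) -> Dict[str, str]:
    """Map service name -> image for services running the `registry` image."""
    found = {}
    for name, svc in compose.get("services", {}).items():
        image = svc.get("image", "")
        if image.split(":", 1)[0] == "registry":
            found[name] = image
    return found


def pin_problems(compose: Dict) -> List[str]:
    """List floating tags and version skew across the registry services."""
    images = registry_images(compose)
    problems = [
        f"{name}: {image} is not pinned to a full version"
        for name, image in sorted(images.items())
        if not PINNED.match(image)
    ]
    versions = sorted(set(images.values()))
    if len(versions) > 1:
        problems.append("registry services are not in lockstep: " + ", ".join(versions))
    return problems


example = {
    "services": {
        "registry-dockerhub": {"image": "registry:2.8.3"},
        "registry-ghcr": {"image": "registry:2"},  # floating tag, flagged
        "registry-ui": {"image": "joxit/docker-registry-ui:latest"},  # different image, ignored
    }
}
for line in pin_problems(example):
    print(line)
```

The check only looks at services whose image is `registry`, so unrelated containers such as the UI or nginx do not trip it.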
2026-02-22 21:45:53 +00:00
  registry-dockerhub:
2026-04-19 17:08:28 +00:00
    image: registry:2.8.3
2026-02-22 21:45:53 +00:00
    container_name: registry-dockerhub
    restart: always
    volumes:
      - /opt/registry/data/dockerhub:/var/lib/registry
      - /opt/registry/config-dockerhub.yml:/etc/docker/registry/config.yml:ro
    networks:
      - registry
    ports:
      - "5001:5001"
    healthcheck:
2026-03-29 22:29:27 +03:00
      test: ["CMD", "sh", "-c", "wget -qO- http://127.0.0.1:5000/v2/ >/dev/null 2>&1"]
2026-02-22 21:45:53 +00:00
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

  registry-ghcr:
2026-04-19 17:08:28 +00:00
    image: registry:2.8.3
2026-02-22 21:45:53 +00:00
    container_name: registry-ghcr
    restart: always
    volumes:
      - /opt/registry/data/ghcr:/var/lib/registry
      - /opt/registry/config-ghcr.yml:/etc/docker/registry/config.yml:ro
    networks:
      - registry
    healthcheck:
2026-03-29 22:29:27 +03:00
      test: ["CMD", "sh", "-c", "wget -qO- http://127.0.0.1:5000/v2/ >/dev/null 2>&1"]
2026-02-22 21:45:53 +00:00
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

  registry-quay:
2026-04-19 17:08:28 +00:00
    image: registry:2.8.3
2026-02-22 21:45:53 +00:00
    container_name: registry-quay
    restart: always
    volumes:
      - /opt/registry/data/quay:/var/lib/registry
      - /opt/registry/config-quay.yml:/etc/docker/registry/config.yml:ro
    networks:
      - registry
    healthcheck:
2026-03-29 22:29:27 +03:00
      test: ["CMD", "sh", "-c", "wget -qO- http://127.0.0.1:5000/v2/ >/dev/null 2>&1"]
2026-02-22 21:45:53 +00:00
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

  registry-k8s:
2026-04-19 17:08:28 +00:00
    image: registry:2.8.3
2026-02-22 21:45:53 +00:00
    container_name: registry-k8s
    restart: always
    volumes:
      - /opt/registry/data/k8s:/var/lib/registry
      - /opt/registry/config-k8s.yml:/etc/docker/registry/config.yml:ro
    networks:
      - registry
    healthcheck:
2026-03-29 22:29:27 +03:00
      test: ["CMD", "sh", "-c", "wget -qO- http://127.0.0.1:5000/v2/ >/dev/null 2>&1"]
2026-02-22 21:45:53 +00:00
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

  registry-kyverno:
2026-04-19 17:08:28 +00:00
    image: registry:2.8.3
2026-02-22 21:45:53 +00:00
    container_name: registry-kyverno
    restart: always
    volumes:
      - /opt/registry/data/kyverno:/var/lib/registry
      - /opt/registry/config-kyverno.yml:/etc/docker/registry/config.yml:ro
    networks:
      - registry
    healthcheck:
2026-03-29 22:29:27 +03:00
      test: ["CMD", "sh", "-c", "wget -qO- http://127.0.0.1:5000/v2/ >/dev/null 2>&1"]
2026-02-22 21:45:53 +00:00
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

[forgejo] Phases 3+4+5: cutover, decommission, docs sweep
End of forgejo-registry-consolidation. After Phase 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.
Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
— image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
(viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
from Forgejo. build-ci-image.yml still dual-pushes until the next
build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.
Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
hosts.toml entries for registry.viktorbarzin.me +
10.0.20.10:5050. (Existing nodes already had the file removed
manually by `setup-forgejo-containerd-mirror.sh` rollout — the
cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
service block removed; nginx 5050 port mapping dropped. Pull-
through caches for upstream registries (5000/5010/5020/5030/5040)
stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
`private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf: registry_
integrity_probe + registry_probe_credentials resources stripped.
forgejo_integrity_probe is the only manifest probe now.
Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
diagram now reflect Forgejo. Pre-migration root-cause sentence
preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
alert annotation simplified now that only one registry is in
scope.
Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to
match the new template AND `docker compose up -d --remove-orphans`
to actually stop the registry-private container. Memory id=1078
confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
registry.viktorbarzin.me:5050 from the `repo:` list. At that
point the post-push integrity check at lines 33-107 also needs
to be repointed at Forgejo or removed (the per-build verify is
redundant with the every-15min Forgejo probe).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 18:30:02 +00:00
  # registry-private removed in Phase 4 of forgejo-registry-consolidation
  # 2026-05-07. The /v2/ private registry has migrated to Forgejo at
  # forgejo.viktorbarzin.me/viktor/<image>. Pull-through caches for upstream
  # registries (dockerhub, ghcr, quay, k8s, kyverno) stay on this VM.
  # Manual decommission step on the live VM:
  # ssh root@10.0.20.10 'cd /opt/registry && docker compose up -d --remove-orphans'
  # …and after 1 week of no incidents, `rm -rf /opt/registry/data/private/`.
2026-02-28 17:57:04 +00:00
2026-02-22 21:45:53 +00:00
  nginx:
    image: nginx:alpine
    container_name: registry-nginx
    restart: always
2026-05-07 18:30:02 +00:00
    # 5050 dropped Phase 4 of forgejo-registry-consolidation 2026-05-07.
2026-02-22 21:45:53 +00:00
    ports:
      - "5000:5000"
      - "5010:5010"
      - "5020:5020"
      - "5030:5030"
      - "5040:5040"
    volumes:
      - /opt/registry/nginx.conf:/etc/nginx/nginx.conf:ro
2026-02-28 19:40:38 +00:00
      - /opt/registry/tls:/etc/nginx/tls:ro
2026-02-22 21:45:53 +00:00
      - nginx-cache:/var/cache/nginx
    networks:
      - registry
    depends_on:
      registry-dockerhub:
        condition: service_healthy
      registry-ghcr:
        condition: service_healthy
      registry-quay:
        condition: service_healthy
      registry-k8s:
        condition: service_healthy
      registry-kyverno:
        condition: service_healthy
    healthcheck:
2026-03-29 22:29:27 +03:00
      test: ["CMD", "sh", "-c", "wget -qO- http://127.0.0.1:5000/v2/ >/dev/null 2>&1"]
2026-02-22 21:45:53 +00:00
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 15s

  registry-ui:
    image: joxit/docker-registry-ui:latest
    container_name: registry-ui
    restart: always
    ports:
      - "8080:80"
    environment:
      - NGINX_PROXY_PASS_URL=http://registry-dockerhub:5000
      - DELETE_IMAGES=true
      - SINGLE_REGISTRY=true
      - SHOW_CONTENT_DIGEST=true
      - SHOW_CATALOG_NB_TAGS=true
      - CATALOG_ELEMENTS_LIMIT=1000
      - TAGLIST_PAGE_SIZE=100
      - REGISTRY_TITLE=viktorbarzin.me
    networks:
      - registry
    depends_on:
      registry-dockerhub:
        condition: service_healthy

volumes:
  nginx-cache:
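The nginx and registry-ui services above gate their startup on `condition: service_healthy`, which only works when each dependency actually defines a `healthcheck`. A rough way to lint that relationship is sketched below, assuming the compose file has been parsed into a dict; `missing_healthchecks` is a hypothetical helper and not part of this repository.

```python
from typing import Dict, List


def missing_healthchecks(compose: Dict) -> List[str]:
    """Find `service_healthy` dependencies that cannot be satisfied."""
    services = compose.get("services", {})
    problems = []
    for name, svc in sorted(services.items()):
        deps = svc.get("depends_on", {})
        if isinstance(deps, list):
            continue  # short list form carries no conditions
        for dep, opts in sorted(deps.items()):
            if (opts or {}).get("condition") != "service_healthy":
                continue
            if dep not in services:
                problems.append(f"{name}: depends on unknown service {dep}")
            elif "healthcheck" not in services[dep]:
                problems.append(f"{name}: {dep} has no healthcheck")
    return problems


example = {
    "services": {
        "registry-dockerhub": {"healthcheck": {"interval": "30s"}},
        "registry-ghcr": {},  # healthcheck missing on purpose
        "nginx": {
            "depends_on": {
                "registry-dockerhub": {"condition": "service_healthy"},
                "registry-ghcr": {"condition": "service_healthy"},
            }
        },
    }
}
print(missing_healthchecks(example))
# prints ['nginx: registry-ghcr has no healthcheck']
```

Run against the file as shown here, a check like this would be expected to come back empty, since all five pull-through caches define a healthcheck.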