6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.
Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Registry integrity probe surfaced 38 broken manifest references
(34 unique repo:tag pairs, same OCI-index orphan pattern as the 04-19
infra-ci incident). Deleted all via registry HTTP API + ran GC;
reclaimed ~3GB blob storage.
beads-server CronJobs were stuck ImagePullBackOff on
claude-agent-service:0c24c9b6 for >6h — bumped variable default to
2fd7670d (canonical tag in claude-agent-service stack, already healthy
in registry) so new ticks can fire.
Rebuilt in-use broken tags: freedify:{latest,c803de02} and
beadboard:{17a38e43,latest} on registry VM; priority-pass via
Woodpecker pipeline #8. wealthfolio-sync:latest deferred (monthly
CronJob, next run 2026-05-01).
Probe now reports 0/39 failures. RegistryManifestIntegrityFailure
alert cleared.
Closes: code-8hk
Closes: code-jh3c
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original index-child scan checked if the child's blob data file
existed under /blobs/sha256/<child>/data. That's wrong in a subtle
way: registry:2 serves a per-repo manifest via the link file at
<repo>/_manifests/revisions/sha256/<child-digest>/link, NOT by blob
presence. When cleanup-tags.sh rmtrees a tag, the per-repo revision
links for its index's children also disappear — but the blob data
survives (GC owns that, and runs weekly). Result: blob present,
link absent, API 404 on HEAD — the exact 2026-04-19 failure mode.
Live proof: the registry-integrity-probe CronJob just found 38 real
orphan children (including 98f718c8 from the original incident) while
the previous fix-broken-blobs.sh scan reported 0. After the fix, both
tools agree. The probe had been authoritative all along; the scan was
a false-negative because it was asking the wrong question.
Post-mortem updated to reflect the true mechanism (link-file absence,
not blob deletion).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.
Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.
Phase 1 — Detection:
- .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
step walks the just-pushed manifest (index + children + config + every
layer blob) via HEAD and fails the pipeline on any non-200. Catches
broken pushes at the source.
- stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
three alerts — RegistryManifestIntegrityFailure,
RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
"registry serves 404 for a tag that exists" gap that masked the incident
for 2+ hours.
- docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
timeline, monitoring gaps, permanent fix.
Phase 2 — Prevention:
- modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
across all six registry services. Removes the floating-tag footgun.
- modules/docker-registry/fix-broken-blobs.sh: new scan walks every
_manifests/revisions/sha256/<digest> that is an image index and logs a
loud WARNING when a referenced child blob is missing. Does NOT auto-
delete — deleting a published image is a conscious decision. Layer-link
scan preserved.
Phase 3 — Recovery:
- build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
don't need a cosmetic Dockerfile edit (matches convention from
pve-nfs-exports-sync.yml).
- docs/runbooks/registry-rebuild-image.md: exact command sequence for
diagnosing + rebuilding after an orphan-index incident, plus a fallback
for building directly on the registry VM if Woodpecker itself is down.
- docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
cross-references to the new runbook.
Out of scope (verified healthy or intentionally deferred):
- Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
- Registry HA/replication (single-VM SPOF is a known architectural
choice; Synology offsite covers RPO < 1 day).
- Diun exclude for registry:2 — not applicable; Diun only watches
k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.
Verified locally:
- fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
flags both orphan layer links and orphan OCI-index children.
- terraform fmt + validate on stacks/monitoring: success (only unrelated
deprecation warnings).
- python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
modules/docker-registry/docker-compose.yml: both parse clean.
Closes: code-4b8
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>