infra/docs/post-mortems/2026-04-19-registry-orphan-index.md

# Post-Mortem: Private Registry Orphan OCI-Index — Repeat Incident

| Field | Value |
|-------|-------|
| **Date** | 2026-04-19 (first occurrence 2026-04-13) |
| **Duration** | ~40 min of blocked CI each time; only detected via pipeline failures |
| **Severity** | SEV2 — all infra CI pipelines using `infra-ci:latest` failed (P366 → P376 all exit 126 "image can't be pulled") |
| **Affected Services** | Every Woodpecker pipeline that starts with `image: registry.viktorbarzin.me:5050/infra-ci:latest` — `default.yml`, `build-cli.yml`, `renew-tls.yml`, `drift-detection.yml`, `provision-user.yml`, `k8s-portal.yml`, `postmortem-todos.yml`, `issue-automation.yml`, `pve-nfs-exports-sync.yml` |
| **Status** | Hot fix green (three commits: `a05d63ee`, `6371e75e`, `c113be4d` — URL fix + rebuild). This doc captures the permanent fix landed in the same branch. |

## Summary

On 2026-04-19 ~09:00 UTC, every infra CI pipeline started failing at the
`clone` step with "image can't be pulled". The image in question — the CI
toolchain image `registry.viktorbarzin.me:5050/infra-ci:latest` — resolved
to an OCI image index whose `linux/amd64` platform manifest
(`sha256:98f718c8…`) and its in-toto attestation
(`sha256:27d5ab83…`) returned **HTTP 404** from the private registry.
The index record itself still existed — it's the children that had been
garbage-collected out from under it.

This is the **second identical incident**: the same failure mode occurred
on 2026-04-13 against a different image. Both times the immediate fix was
to rebuild the image from scratch; both times the root cause was left
unaddressed.

## Impact

- **User-facing**: all CI pipelines failed. No automated Terraform applies,
  no TLS renewal, no drift detection. Manual workflows (Woodpecker UI
  reruns) all failed with the same error.
- **Blast radius**: every pipeline that pulls `infra-ci`. Does NOT affect
  k8s workloads (those pull via containerd, which goes through the
  pull-through proxy on :5000/:5010 — a completely different code path).
- **Duration on 2026-04-19**: from first P366 failure to the hot-fix
  commit `c113be4d` — roughly 40 min. Pipelines that had already been
  triggered queued up until the rebuild restored `:latest`.
- **Data loss**: none. The registry has the index object; the child
  manifests are re-producible by rebuilding the source image.
- **Monitoring gap**: nothing alerted. The only signal was the individual
  pipeline failures from Woodpecker. No Prometheus alert fires on "the
  registry served a 404 for a tag that exists".

## Timeline (UTC, 2026-04-19)

| Time | Event |
|------|-------|
| ~09:00 | P366 (`default.yml` on master) fails with exit 126. |
| 09:00–11:00 | P367, P368, … P376 all fail with the same error. Nobody pages — there's no alert configured. |
| 11:15 | User notices and investigates: `skopeo inspect` reveals the missing platform manifest. |
| 11:20 | Hot fix phase begins: `a05d63ee` fixes a push-URL misalignment, `6371e75e` and `c113be4d` trigger a full rebuild. |
| 11:40 | Rebuild completes; `infra-ci:latest` resolves to a fresh, complete index. Pipelines green from P377 onward. |
| 11:45 | User requests a proper root-cause fix: "this is the second time — what's actually broken?" |
| 12:00 | Investigation begins (this document's work). |

## Root Cause Chain

```
[1] cleanup-tags.sh runs daily at 02:00 on the registry VM
 └─> For each repository, keeps the last 10 tags by mtime, rmtrees the rest.
     This walks `_manifests/tags/<tag>` directly, bypassing the registry API.
         │
         ├─> [2] Subtle on-disk asymmetry: a registry:2 tag rmtree removes
         │    BOTH the `_manifests/tags/<tag>/` dir AND — on 2.8.x — the
         │    per-repo revision-link files under
         │    `<repo>/_manifests/revisions/sha256/<child-digest>/link` for
         │    every child referenced by that tag's index. The raw blob data
         │    under `/var/lib/registry/docker/registry/v2/blobs/sha256/<.>/data`
         │    is NOT touched — GC owns that, and GC only runs Sunday.
         │
         ├─> [3] If ANOTHER tag's index still references one of those same
         │    children (common — successive rebuilds share layers), the child
         │    blob survives. But the revision-link is gone, so the registry
         │    API can no longer map `<repo>/manifests/sha256:<child>` back
         │    to the blob. HEAD → 404, even though the bytes are on disk.
         │    distribution/distribution#3324 is the upstream class of this bug.
         │
         └─> [4] Result: the surviving index (e.g. `infra-ci:5319f03e`) is
              intact on disk, its children's blob data files are intact on
              disk, but HEAD `/v2/infra-ci/manifests/sha256:98f718c8…`
              returns 404. The registry has the bytes, but cannot find them
              through the API because the per-repo link bridge is gone.

[pull] containerd resolves `infra-ci:latest`
         │
         ├─> GET /v2/infra-ci/manifests/latest → 200 OK, returns the index
         │
         └─> GET /v2/infra-ci/manifests/sha256:98f718c8… → 404 Not Found
              └─> containerd fails the pull with "manifest unknown"
                    └─> woodpecker exit 126
```

> **Detection-gotcha** uncovered 2026-04-19 while implementing
> `fix-broken-blobs.sh`: a scan that checks `/blobs/sha256/<child>/data` for
> presence is NOT equivalent to "can the registry serve this child?" The
> authoritative check is whether
> `<repo>/_manifests/revisions/sha256/<child>/link` exists. The script
> was rewritten to check the per-repo link file after the HTTP probe
> caught 38 real orphans the filesystem scan had reported clean.

## Why Existing Remediation Missed It

1. **`fix-broken-blobs.sh` only scans layer links.** The existing cron
   walks `_layers/sha256/` and removes link files whose blob `data` is
   missing. It does NOT inspect `_manifests/revisions/sha256/` to see
   whether an image-index's referenced children still exist. That's
   exactly the class of orphan this incident represents.
2. **`registry:2` image tag was floating.** `docker-compose.yml` pinned
   only to `registry:2`. Whatever Docker Inc. last rebuilt as
   "v2-current" was running, with no version pin. Any regression in
   the upstream walker would silently swap in.
3. **No integrity monitoring.** Prometheus alerted on cache hit rate
   and registry-down, but nothing probes "are the manifests the registry
   advertises actually fetchable?"
4. **CI pipeline didn't verify its own push.** `buildx --push` returns
   success as soon as it uploads. If a child blob upload 0-byted or
   the client disconnected mid-push (distinct from the GC mode but the
   same on-disk symptom), nothing would notice until the next pull.

## Permanent Fix — Three Phases

### Phase 1 — Detection (ship today)

1. **Post-push integrity check** in `.woodpecker/build-ci-image.yml`.
   After `build-and-push`, a new step walks the just-pushed manifest
   (and every child of an image index) and HEADs every referenced blob.
   Any non-200 fails the pipeline immediately, catching broken pushes at
   the source rather than leaking them to consumers.
2. **Prometheus alert `RegistryManifestIntegrityFailure`.** A new
   CronJob (`registry-integrity-probe`, every 15m, in the `monitoring`
   namespace) walks the private registry's catalog, HEADs every tag's
   manifest, follows each image index's children, and pushes
   `registry_manifest_integrity_failures` to Pushgateway. Accompanying
   alerts: `RegistryIntegrityProbeStale`, `RegistryCatalogInaccessible`.
3. **Post-mortem** — this document. Linked from
   `.claude/reference/service-catalog.md` via the new runbook.

### Phase 2 — Prevention

4. **Pin `registry:2` → `registry:2.8.3`** in
   `modules/docker-registry/docker-compose.yml` (all six registry
   services). Removes the floating-tag footgun.
5. **Extend `fix-broken-blobs.sh`** to scan every
   `_manifests/revisions/sha256/<digest>` that is an image index and
   flag children whose blob `data` file is missing. The script prints a
   loud WARNING per orphan; it does not auto-delete the index, because
   deleting a published image is a conscious decision, not an automated
   repair.

### Phase 3 — Recovery tooling

6. **Manual event trigger** on `build-ci-image.yml`. Rebuilds no longer
   need a cosmetic Dockerfile edit — POST to the Woodpecker API or
   click "Run manually" in the UI.
7. **Runbook** `docs/runbooks/registry-rebuild-image.md` — exact
   command sequence for the next time this happens, plus fallback paths.

## Out of Scope

- **Pull-through caches.** The DockerHub / GHCR mirrors on
  `:5000` / `:5010` are healthy (74.5% cache hit rate, no 404s). The
  orphan problem is private-registry-only. No changes to nginx or
  containerd `hosts.toml`.
- **Registry HA / replication.** Single-VM SPOF is a known
  architectural choice. Harbor or a replicated registry would solve
  more than this incident requires, at multi-day cost. Synology offsite
  snapshots already give RPO < 1 day.
- **Disabling `cleanup-tags.sh`.** Keeping storage bounded is still
  necessary; the fix is detection + rebuild, not "stop cleaning up".

## Lessons

- **Repeat incidents deserve root-cause work, not a third hot-fix.** The
  2026-04-13 incident was closed when CI turned green. Without a probe
  and without a scan for orphan indexes, the next incident was
  inevitable — and it happened six days later against a different image.
- **"No alert fired, so it wasn't detected" is a monitoring gap, not an
  outage feature.** The registry was serving 404s for 2+ hours before
  anyone noticed, because our only signal was "pipeline failures" and
  our eyes were elsewhere. The new probe closes that gap.
- **CI pipelines should verify their own output.** The `buildx --push`
  "success" exit code is not a guarantee of pulled-back integrity — as
  this incident proves. A 30-second post-push HEAD walk is cheap
  insurance.

## Related

- **Prior incident (same failure mode, different image)**: memory `709`
  / `710` — 2026-04-13.
- **Runbook**: `docs/runbooks/registry-rebuild-image.md` (new).
- **Hot-fix commits**: `a05d63ee`, `6371e75e`, `c113be4d`.
- **Upstream bug class**: `distribution/distribution#3324`.

## 2026-04-19 — Bulk cleanup sweep (beads code-8hk + code-jh3c)

Same failure class, broader scope. The `registry-integrity-probe`
surfaced 38 broken manifest references persisting after the 04-19
infra-ci fix. `beads-dispatcher` + `beads-reaper` CronJobs were stuck
`ImagePullBackOff` on `claude-agent-service:0c24c9b6` for >6h. All 34
affected `repo:tag` pairs were OCI indexes whose `linux/amd64` child
manifests were absent from blob storage (same orphan pattern).

**Action taken**:
1. Bumped `beads-server/main.tf` var default `claude_agent_service_image_tag`
   from `0c24c9b6` → `2fd7670d` (the canonical tag in
   `claude-agent-service/main.tf`), reused — same image already healthy
   on the registry. `scripts/tg apply` on `beads-server`. Deleted the
   stuck Jobs so new CronJob ticks could fire.
2. Enumerated 34 broken `(repo, tag, parent_digest)` triples via HTTP
   probe using `registry-probe-credentials` K8s Secret. Deleted each
   via `DELETE /v2/<repo>/manifests/<digest>` (33× 202, 1× 404 —
   claude-agent-service:latest pointed at an already-deleted digest).
3. Ran `docker exec registry-private /bin/registry garbage-collect
   /etc/docker/registry/config.yml` — reclaimed ~3GB of orphan blob
   storage.
4. Rebuilt the 3 in-use broken tags (all 3 OCI-index parents pointed
   at missing children, so no cached copies would survive pod
   reschedule):
   - `freedify:latest` / `freedify:c803de02` — built on registry VM
     directly (no CI pipeline exists for this image; python FastAPI).
   - `beadboard:17a38e43` / `beadboard:latest` — GHA
     `workflow_dispatch` failed at registry login (missing
     `REGISTRY_USERNAME`/`REGISTRY_PASSWORD` GH secrets). Built on
     registry VM directly as the fallback. GitHub secret gap is a
     follow-up — beads `code-8hk` notes it.
   - `priority-pass-backend:ae1420a0` / `priority-pass-frontend:ae1420a0`
     — Woodpecker pipeline #8 on repo 81. Pipeline `kubectl set image`'d
     the Deployment to `ae1420a0` (drift vs TF `v5`/`v8` defaults, but
     that drift is pre-existing, not introduced by this cleanup).
   - `wealthfolio-sync:latest` — **not rebuilt**. Monthly CronJob (next
     run 2026-05-01), no source tree or CI pipeline available in the
     monorepo; deferred for separate follow-up.

**Post-cleanup state**:
- Probe: 39 tags, 0 failures. `registry_manifest_integrity_failures{} = 0`.
- Alert `RegistryManifestIntegrityFailure` cleared (was firing for
  5h 32m).
- No `ImagePullBackOff` pods anywhere in the cluster.
- 28 of 34 deleted manifests were **dangling tags not referenced by any
  workload** — old `382d6b1*`, `v2`-`v7`, `yt-fallback`, etc. Safe
  deletes, no rebuilds needed.

**Permanent fix still in flight**: Phase 2/3 of this post-mortem
(post-push verification in CI, atomic `cleanup-tags.sh`) — not
addressed by this cleanup. The probe continues to be the
authoritative detector.
-												[registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery

Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source.
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-19 17:08:28 +00:00
+								# Post-Mortem: Private Registry Orphan OCI-Index — Repeat Incident
 								| Field | Value |
 								|-------|-------|
 								| **Date** | 2026-04-19 (first occurrence 2026-04-13) |
 								| **Duration** | ~40 min of blocked CI each time; only detected via pipeline failures |
 								| **Severity** | SEV2 — all infra CI pipelines using `infra-ci:latest` failed (P366 → P376 all exit 126 "image can't be pulled") |
 								| **Affected Services** | Every Woodpecker pipeline that starts with `image: registry.viktorbarzin.me:5050/infra-ci:latest` — `default.yml`, `build-cli.yml`, `renew-tls.yml`, `drift-detection.yml`, `provision-user.yml`, `k8s-portal.yml`, `postmortem-todos.yml`, `issue-automation.yml`, `pve-nfs-exports-sync.yml` |
 								| **Status** | Hot fix green (three commits: `a05d63ee`, `6371e75e`, `c113be4d` — URL fix + rebuild). This doc captures the permanent fix landed in the same branch. |
 								## Summary
 								On 2026-04-19 ~09:00 UTC, every infra CI pipeline started failing at the
 								`clone` step with "image can't be pulled". The image in question — the CI
 								toolchain image `registry.viktorbarzin.me:5050/infra-ci:latest` — resolved
 								to an OCI image index whose `linux/amd64` platform manifest
 								(`sha256:98f718c8…`) and its in-toto attestation
 								(`sha256:27d5ab83…`) returned **HTTP 404** from the private registry.
 								The index record itself still existed — it's the children that had been
 								garbage-collected out from under it.
 								This is the **second identical incident**: the same failure mode occurred
 								on 2026-04-13 against a different image. Both times the immediate fix was
 								to rebuild the image from scratch; both times the root cause was left
 								unaddressed.
 								## Impact
 								- **User-facing**: all CI pipelines failed. No automated Terraform applies,
 								  no TLS renewal, no drift detection. Manual workflows (Woodpecker UI
 								  reruns) all failed with the same error.
 								- **Blast radius**: every pipeline that pulls `infra-ci`. Does NOT affect
 								  k8s workloads (those pull via containerd, which goes through the
 								  pull-through proxy on :5000/:5010 — a completely different code path).
 								- **Duration on 2026-04-19**: from first P366 failure to the hot-fix
 								  commit `c113be4d` — roughly 40 min. Pipelines that had already been
 								  triggered queued up until the rebuild restored `:latest`.
 								- **Data loss**: none. The registry has the index object; the child
 								  manifests are re-producible by rebuilding the source image.
 								- **Monitoring gap**: nothing alerted. The only signal was the individual
 								  pipeline failures from Woodpecker. No Prometheus alert fires on "the
 								  registry served a 404 for a tag that exists".
 								## Timeline (UTC, 2026-04-19)
 								| Time | Event |
 								|------|-------|
 								| ~09:00 | P366 (`default.yml` on master) fails with exit 126. |
 								| 09:00–11:00 | P367, P368, … P376 all fail with the same error. Nobody pages — there's no alert configured. |
 								| 11:15 | User notices and investigates: `skopeo inspect` reveals the missing platform manifest. |
 								| 11:20 | Hot fix phase begins: `a05d63ee` fixes a push-URL misalignment, `6371e75e` and `c113be4d` trigger a full rebuild. |
 								| 11:40 | Rebuild completes; `infra-ci:latest` resolves to a fresh, complete index. Pipelines green from P377 onward. |
 								| 11:45 | User requests a proper root-cause fix: "this is the second time — what's actually broken?" |
 								| 12:00 | Investigation begins (this document's work). |
 								## Root Cause Chain
 								```
 								[1] cleanup-tags.sh runs daily at 02:00 on the registry VM
 								 └─> For each repository, keeps the last 10 tags by mtime, rmtrees the rest.
 								     This walks `_manifests/tags/<tag>` directly, bypassing the registry API.
 								         │
-												[registry] fix-broken-blobs.sh — check revision-link, not blob data

The original index-child scan checked if the child's blob data file
existed under /blobs/sha256/<child>/data. That's wrong in a subtle
way: registry:2 serves a per-repo manifest via the link file at
<repo>/_manifests/revisions/sha256/<child-digest>/link, NOT by blob
presence. When cleanup-tags.sh rmtrees a tag, the per-repo revision
links for its index's children also disappear — but the blob data
survives (GC owns that, and runs weekly). Result: blob present,
link absent, API 404 on HEAD — the exact 2026-04-19 failure mode.

Live proof: the registry-integrity-probe CronJob just found 38 real
orphan children (including 98f718c8 from the original incident) while
the previous fix-broken-blobs.sh scan reported 0. After the fix, both
tools agree. The probe had been authoritative all along; the scan was
a false-negative because it was asking the wrong question.

Post-mortem updated to reflect the true mechanism (link-file absence,
not blob deletion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-19 17:43:35 +00:00
+								         ├─> [2] Subtle on-disk asymmetry: a registry:2 tag rmtree removes
 								         │    BOTH the `_manifests/tags/<tag>/` dir AND — on 2.8.x — the
 								         │    per-repo revision-link files under
 								         │    `<repo>/_manifests/revisions/sha256/<child-digest>/link` for
 								         │    every child referenced by that tag's index. The raw blob data
 								         │    under `/var/lib/registry/docker/registry/v2/blobs/sha256/<.>/data`
 								         │    is NOT touched — GC owns that, and GC only runs Sunday.
-												[registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery

Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source.
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-19 17:08:28 +00:00
+								         │
-												[registry] fix-broken-blobs.sh — check revision-link, not blob data

The original index-child scan checked if the child's blob data file
existed under /blobs/sha256/<child>/data. That's wrong in a subtle
way: registry:2 serves a per-repo manifest via the link file at
<repo>/_manifests/revisions/sha256/<child-digest>/link, NOT by blob
presence. When cleanup-tags.sh rmtrees a tag, the per-repo revision
links for its index's children also disappear — but the blob data
survives (GC owns that, and runs weekly). Result: blob present,
link absent, API 404 on HEAD — the exact 2026-04-19 failure mode.

Live proof: the registry-integrity-probe CronJob just found 38 real
orphan children (including 98f718c8 from the original incident) while
the previous fix-broken-blobs.sh scan reported 0. After the fix, both
tools agree. The probe had been authoritative all along; the scan was
a false-negative because it was asking the wrong question.

Post-mortem updated to reflect the true mechanism (link-file absence,
not blob deletion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-19 17:43:35 +00:00
+								         ├─> [3] If ANOTHER tag's index still references one of those same
 								         │    children (common — successive rebuilds share layers), the child
 								         │    blob survives. But the revision-link is gone, so the registry
 								         │    API can no longer map `<repo>/manifests/sha256:<child>` back
 								         │    to the blob. HEAD → 404, even though the bytes are on disk.
 								         │    distribution/distribution#3324 is the upstream class of this bug.
 								         │
 								         └─> [4] Result: the surviving index (e.g. `infra-ci:5319f03e`) is
 								              intact on disk, its children's blob data files are intact on
 								              disk, but HEAD `/v2/infra-ci/manifests/sha256:98f718c8…`
 								              returns 404. The registry has the bytes, but cannot find them
 								              through the API because the per-repo link bridge is gone.
-												[registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery

Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source.
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-19 17:08:28 +00:00
 								[pull] containerd resolves `infra-ci:latest`
 								         │
 								         ├─> GET /v2/infra-ci/manifests/latest → 200 OK, returns the index
 								         │
 								         └─> GET /v2/infra-ci/manifests/sha256:98f718c8… → 404 Not Found
 								              └─> containerd fails the pull with "manifest unknown"
 								                    └─> woodpecker exit 126
 								```
-												[registry] fix-broken-blobs.sh — check revision-link, not blob data

The original index-child scan checked if the child's blob data file
existed under /blobs/sha256/<child>/data. That's wrong in a subtle
way: registry:2 serves a per-repo manifest via the link file at
<repo>/_manifests/revisions/sha256/<child-digest>/link, NOT by blob
presence. When cleanup-tags.sh rmtrees a tag, the per-repo revision
links for its index's children also disappear — but the blob data
survives (GC owns that, and runs weekly). Result: blob present,
link absent, API 404 on HEAD — the exact 2026-04-19 failure mode.

Live proof: the registry-integrity-probe CronJob just found 38 real
orphan children (including 98f718c8 from the original incident) while
the previous fix-broken-blobs.sh scan reported 0. After the fix, both
tools agree. The probe had been authoritative all along; the scan was
a false-negative because it was asking the wrong question.

Post-mortem updated to reflect the true mechanism (link-file absence,
not blob deletion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-19 17:43:35 +00:00
+								> **Detection-gotcha** uncovered 2026-04-19 while implementing
 								> `fix-broken-blobs.sh`: a scan that checks `/blobs/sha256/<child>/data` for
 								> presence is NOT equivalent to "can the registry serve this child?" The
 								> authoritative check is whether
 								> `<repo>/_manifests/revisions/sha256/<child>/link` exists. The script
 								> was rewritten to check the per-repo link file after the HTTP probe
 								> caught 38 real orphans the filesystem scan had reported clean.
-												[registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery

Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source.
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-19 17:08:28 +00:00
+								## Why Existing Remediation Missed It
 . **`fix-broken-blobs.sh` only scans layer links.** The existing cron
 								   walks `_layers/sha256/` and removes link files whose blob `data` is
 								   missing. It does NOT inspect `_manifests/revisions/sha256/` to see
 								   whether an image-index's referenced children still exist. That's
 								   exactly the class of orphan this incident represents.
 . **`registry:2` image tag was floating.** `docker-compose.yml` pinned
 								   only to `registry:2`. Whatever Docker Inc. last rebuilt as
 								   "v2-current" was running, with no version pin. Any regression in
 								   the upstream walker would silently swap in.
 . **No integrity monitoring.** Prometheus alerted on cache hit rate
 								   and registry-down, but nothing probes "are the manifests the registry
 								   advertises actually fetchable?"
 . **CI pipeline didn't verify its own push.** `buildx --push` returns
 								   success as soon as it uploads. If a child blob upload 0-byted or
 								   the client disconnected mid-push (distinct from the GC mode but the
 								   same on-disk symptom), nothing would notice until the next pull.
 								## Permanent Fix — Three Phases
 								### Phase 1 — Detection (ship today)
 . **Post-push integrity check** in `.woodpecker/build-ci-image.yml`.
 								   After `build-and-push`, a new step walks the just-pushed manifest
 								   (and every child of an image index) and HEADs every referenced blob.
 								   Any non-200 fails the pipeline immediately, catching broken pushes at
 								   the source rather than leaking them to consumers.
 . **Prometheus alert `RegistryManifestIntegrityFailure`.** A new
 								   CronJob (`registry-integrity-probe`, every 15m, in the `monitoring`
 								   namespace) walks the private registry's catalog, HEADs every tag's
 								   manifest, follows each image index's children, and pushes
 								   `registry_manifest_integrity_failures` to Pushgateway. Accompanying
 								   alerts: `RegistryIntegrityProbeStale`, `RegistryCatalogInaccessible`.
 . **Post-mortem** — this document. Linked from
 								   `.claude/reference/service-catalog.md` via the new runbook.
 								### Phase 2 — Prevention
 . **Pin `registry:2` → `registry:2.8.3`** in
 								   `modules/docker-registry/docker-compose.yml` (all six registry
 								   services). Removes the floating-tag footgun.
 . **Extend `fix-broken-blobs.sh`** to scan every
 								   `_manifests/revisions/sha256/<digest>` that is an image index and
 								   flag children whose blob `data` file is missing. The script prints a
 								   loud WARNING per orphan; it does not auto-delete the index, because
 								   deleting a published image is a conscious decision, not an automated
 								   repair.
 								### Phase 3 — Recovery tooling
 . **Manual event trigger** on `build-ci-image.yml`. Rebuilds no longer
 								   need a cosmetic Dockerfile edit — POST to the Woodpecker API or
 								   click "Run manually" in the UI.
 . **Runbook** `docs/runbooks/registry-rebuild-image.md` — exact
 								   command sequence for the next time this happens, plus fallback paths.
 								## Out of Scope
 								- **Pull-through caches.** The DockerHub / GHCR mirrors on
 								  `:5000` / `:5010` are healthy (74.5% cache hit rate, no 404s). The
 								  orphan problem is private-registry-only. No changes to nginx or
 								  containerd `hosts.toml`.
 								- **Registry HA / replication.** Single-VM SPOF is a known
 								  architectural choice. Harbor or a replicated registry would solve
 								  more than this incident requires, at multi-day cost. Synology offsite
 								  snapshots already give RPO < 1 day.
 								- **Disabling `cleanup-tags.sh`.** Keeping storage bounded is still
 								  necessary; the fix is detection + rebuild, not "stop cleaning up".
 								## Lessons
 								- **Repeat incidents deserve root-cause work, not a third hot-fix.** The
 -04-13 incident was closed when CI turned green. Without a probe
 								  and without a scan for orphan indexes, the next incident was
 								  inevitable — and it happened six days later against a different image.
 								- **"No alert fired, so it wasn't detected" is a monitoring gap, not an
 								  outage feature.** The registry was serving 404s for 2+ hours before
 								  anyone noticed, because our only signal was "pipeline failures" and
 								  our eyes were elsewhere. The new probe closes that gap.
 								- **CI pipelines should verify their own output.** The `buildx --push`
 								  "success" exit code is not a guarantee of pulled-back integrity — as
 								  this incident proves. A 30-second post-push HEAD walk is cheap
 								  insurance.
 								## Related
 								- **Prior incident (same failure mode, different image)**: memory `709`
 								  / `710` — 2026-04-13.
 								- **Runbook**: `docs/runbooks/registry-rebuild-image.md` (new).
 								- **Hot-fix commits**: `a05d63ee`, `6371e75e`, `c113be4d`.
 								- **Upstream bug class**: `distribution/distribution#3324`.
-												[registry] bulk-clean 34 orphan manifests + beads-server image bump

Registry integrity probe surfaced 38 broken manifest references
(34 unique repo:tag pairs, same OCI-index orphan pattern as the 04-19
infra-ci incident). Deleted all via registry HTTP API + ran GC;
reclaimed ~3GB blob storage.

beads-server CronJobs were stuck ImagePullBackOff on
claude-agent-service:0c24c9b6 for >6h — bumped variable default to
2fd7670d (canonical tag in claude-agent-service stack, already healthy
in registry) so new ticks can fire.

Rebuilt in-use broken tags: freedify:{latest,c803de02} and
beadboard:{17a38e43,latest} on registry VM; priority-pass via
Woodpecker pipeline #8. wealthfolio-sync:latest deferred (monthly
CronJob, next run 2026-05-01).

Probe now reports 0/39 failures. RegistryManifestIntegrityFailure
alert cleared.

Closes: code-8hk
Closes: code-jh3c

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

											
										
										
											2026-04-19 23:16:34 +00:00
 								## 2026-04-19 — Bulk cleanup sweep (beads code-8hk + code-jh3c)
 								Same failure class, broader scope. The `registry-integrity-probe`
 								surfaced 38 broken manifest references persisting after the 04-19
 								infra-ci fix. `beads-dispatcher` + `beads-reaper` CronJobs were stuck
 								`ImagePullBackOff` on `claude-agent-service:0c24c9b6` for >6h. All 34
 								affected `repo:tag` pairs were OCI indexes whose `linux/amd64` child
 								manifests were absent from blob storage (same orphan pattern).
 								**Action taken**:
 . Bumped `beads-server/main.tf` var default `claude_agent_service_image_tag`
 								   from `0c24c9b6` → `2fd7670d` (the canonical tag in
 								   `claude-agent-service/main.tf`), reused — same image already healthy
 								   on the registry. `scripts/tg apply` on `beads-server`. Deleted the
 								   stuck Jobs so new CronJob ticks could fire.
 . Enumerated 34 broken `(repo, tag, parent_digest)` triples via HTTP
 								   probe using `registry-probe-credentials` K8s Secret. Deleted each
 								   via `DELETE /v2/<repo>/manifests/<digest>` (33× 202, 1× 404 —
 								   claude-agent-service:latest pointed at an already-deleted digest).
 . Ran `docker exec registry-private /bin/registry garbage-collect
 								   /etc/docker/registry/config.yml` — reclaimed ~3GB of orphan blob
 								   storage.
 . Rebuilt the 3 in-use broken tags (all 3 OCI-index parents pointed
 								   at missing children, so no cached copies would survive pod
 								   reschedule):
 								   - `freedify:latest` / `freedify:c803de02` — built on registry VM
 								     directly (no CI pipeline exists for this image; python FastAPI).
 								   - `beadboard:17a38e43` / `beadboard:latest` — GHA
 								     `workflow_dispatch` failed at registry login (missing
 								     `REGISTRY_USERNAME`/`REGISTRY_PASSWORD` GH secrets). Built on
 								     registry VM directly as the fallback. GitHub secret gap is a
 								     follow-up — beads `code-8hk` notes it.
 								   - `priority-pass-backend:ae1420a0` / `priority-pass-frontend:ae1420a0`
 								     — Woodpecker pipeline #8 on repo 81. Pipeline `kubectl set image`'d
 								     the Deployment to `ae1420a0` (drift vs TF `v5`/`v8` defaults, but
 								     that drift is pre-existing, not introduced by this cleanup).
 								   - `wealthfolio-sync:latest` — **not rebuilt**. Monthly CronJob (next
 								     run 2026-05-01), no source tree or CI pipeline available in the
 								     monorepo; deferred for separate follow-up.
 								**Post-cleanup state**:
 								- Probe: 39 tags, 0 failures. `registry_manifest_integrity_failures{} = 0`.
 								- Alert `RegistryManifestIntegrityFailure` cleared (was firing for
 h 32m).
 								- No `ImagePullBackOff` pods anywhere in the cluster.
 								- 28 of 34 deleted manifests were **dangling tags not referenced by any
 								  workload** — old `382d6b1*`, `v2`-`v7`, `yt-fallback`, etc. Safe
 								  deletes, no rebuilds needed.
 								**Permanent fix still in flight**: Phase 2/3 of this post-mortem
 								(post-push verification in CI, atomic `cleanup-tags.sh`) — not
 								addressed by this cleanup. The probe continues to be the
 								authoritative detector.