From ac695dea38d3d148eea6eab92272c6b4a83bddee Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sun, 19 Apr 2026 23:16:34 +0000
Subject: [PATCH] [registry] bulk-clean 34 orphan manifests + beads-server
 image bump
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Registry integrity probe surfaced 38 broken manifest references (34
unique repo:tag pairs, same OCI-index orphan pattern as the 04-19
infra-ci incident). Deleted all via registry HTTP API + ran GC;
reclaimed ~3GB blob storage.

beads-server CronJobs were stuck ImagePullBackOff on
claude-agent-service:0c24c9b6 for >6h — bumped variable default to
2fd7670d (canonical tag in claude-agent-service stack, already healthy
in registry) so new ticks can fire.

Rebuilt in-use broken tags: freedify:{latest,c803de02} and
beadboard:{17a38e43,latest} on registry VM; priority-pass via
Woodpecker pipeline #8. wealthfolio-sync:latest deferred (monthly
CronJob, next run 2026-05-01).

Probe now reports 0/39 failures. RegistryManifestIntegrityFailure
alert cleared.

Closes: code-8hk
Closes: code-jh3c

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../2026-04-19-registry-orphan-index.md | 54 +++++++++++++++++++
 stacks/beads-server/main.tf             |  2 +-
 2 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/docs/post-mortems/2026-04-19-registry-orphan-index.md b/docs/post-mortems/2026-04-19-registry-orphan-index.md
index da883760..ee596c63 100644
--- a/docs/post-mortems/2026-04-19-registry-orphan-index.md
+++ b/docs/post-mortems/2026-04-19-registry-orphan-index.md
@@ -190,3 +190,57 @@ unaddressed.
 - **Runbook**: `docs/runbooks/registry-rebuild-image.md` (new).
 - **Hot-fix commits**: `a05d63ee`, `6371e75e`, `c113be4d`.
 - **Upstream bug class**: `distribution/distribution#3324`.
+
+## 2026-04-19 — Bulk cleanup sweep (beads code-8hk + code-jh3c)
+
+Same failure class, broader scope. The `registry-integrity-probe`
+surfaced 38 broken manifest references persisting after the 04-19
+infra-ci fix.
+`beads-dispatcher` + `beads-reaper` CronJobs were stuck
+`ImagePullBackOff` on `claude-agent-service:0c24c9b6` for >6h. All 34
+affected `repo:tag` pairs were OCI indexes whose `linux/amd64` child
+manifests were absent from blob storage (same orphan pattern).
+
+**Action taken**:
+1. Bumped `beads-server/main.tf` var default `claude_agent_service_image_tag`
+   from `0c24c9b6` → `2fd7670d` (the canonical tag in
+   `claude-agent-service/main.tf`, reused as-is — the same image is
+   already healthy on the registry). Ran `scripts/tg apply` on
+   `beads-server`, then deleted the stuck Jobs so new CronJob ticks
+   could fire.
+2. Enumerated 34 broken `(repo, tag, parent_digest)` triples via HTTP
+   probe using the `registry-probe-credentials` K8s Secret. Deleted each
+   via `DELETE /v2/<repo>/manifests/<digest>` (33× 202, 1× 404 —
+   `claude-agent-service:latest` pointed at an already-deleted digest).
+3. Ran `docker exec registry-private /bin/registry garbage-collect
+   /etc/docker/registry/config.yml` — reclaimed ~3GB of orphaned blob
+   storage.
+4. Rebuilt the 3 in-use broken tags (all 3 OCI-index parents pointed
+   at missing children, so no cached copies would survive a pod
+   reschedule):
+   - `freedify:latest` / `freedify:c803de02` — built on the registry VM
+     directly (no CI pipeline exists for this image; Python FastAPI).
+   - `beadboard:17a38e43` / `beadboard:latest` — GHA
+     `workflow_dispatch` failed at registry login (missing
+     `REGISTRY_USERNAME`/`REGISTRY_PASSWORD` GH secrets). Built on the
+     registry VM directly as the fallback. The GitHub secret gap is a
+     follow-up — beads `code-8hk` notes it.
+   - `priority-pass-backend:ae1420a0` / `priority-pass-frontend:ae1420a0`
+     — Woodpecker pipeline #8 on repo 81. The pipeline `kubectl set
+     image`'d the Deployment to `ae1420a0` (drift vs the TF `v5`/`v8`
+     defaults, but that drift is pre-existing, not introduced by this
+     cleanup).
+   - `wealthfolio-sync:latest` — **not rebuilt**.
+     Monthly CronJob (next run 2026-05-01), no source tree or CI
+     pipeline available in the monorepo; deferred for a separate
+     follow-up.
+
+**Post-cleanup state**:
+- Probe: 39 tags, 0 failures. `registry_manifest_integrity_failures{} = 0`.
+- Alert `RegistryManifestIntegrityFailure` cleared (was firing for
+  5h 32m).
+- No `ImagePullBackOff` pods anywhere in the cluster.
+- 28 of 34 deleted manifests were **dangling tags not referenced by any
+  workload** — old `382d6b1*`, `v2`-`v7`, `yt-fallback`, etc. Safe
+  deletes, no rebuilds needed.
+
+**Permanent fix still in flight**: Phase 2/3 of this post-mortem
+(post-push verification in CI, atomic `cleanup-tags.sh`) — not
+addressed by this cleanup. The probe remains the authoritative
+detector.

diff --git a/stacks/beads-server/main.tf b/stacks/beads-server/main.tf
index 01f75ff4..e11b0ac7 100644
--- a/stacks/beads-server/main.tf
+++ b/stacks/beads-server/main.tf
@@ -14,7 +14,7 @@ variable "beadboard_image_tag" {
 # already ships.
 variable "claude_agent_service_image_tag" {
   type    = string
-  default = "0c24c9b6"
+  default = "2fd7670d"
 }
 
 # Kill switch for auto-dispatch. When false, both CronJobs are suspended. The
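
The failure predicate this patch's post-mortem describes — an OCI image index whose referenced child manifests are gone from blob storage — can be sketched in a few lines. This is a hypothetical illustration of the check, not the repo's actual `registry-integrity-probe` code; the `find_orphans` name and the digest values are invented for the example.

```python
def find_orphans(index: dict, present_digests: set) -> list:
    """Return child-manifest digests referenced by an OCI image index
    but absent from the registry's blob storage (the orphan pattern)."""
    return [
        m["digest"]
        for m in index.get("manifests", [])
        if m["digest"] not in present_digests
    ]

# Invented data mirroring the incident shape: the linux/amd64 child was
# garbage-collected away while the arm64 child survived.
index = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {"digest": "sha256:aaa", "platform": {"architecture": "amd64", "os": "linux"}},
        {"digest": "sha256:bbb", "platform": {"architecture": "arm64", "os": "linux"}},
    ],
}
present = {"sha256:bbb"}
print(find_orphans(index, present))  # ['sha256:aaa'] -> index is broken
```

A tag is "safe to delete without rebuild" when, additionally, no running workload references it — which is how 28 of the 34 pairs were classified.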