forgejo retention: revert to DRY_RUN — first live run orphaned OCI indexes [ci skip]
The keep-set (newest 10 versions + latest + *cache* tags) treats multi-arch/attestation index CHILDREN — separate untagged sha256 versions — as deletable: for images not rebuilt recently they sort outside the newest-10 window and were pruned while their kept parent index survived. kms-website :latest and :dfc83fb children 404'd (RegistryManifestIntegrityFailure, caught by forgejo-integrity-probe within hours; deployed tag a794d1a unaffected). Healed: :latest re-pointed at the intact a794d1a index (also the newest commit), corrupt :dfc83fb version deleted, probe re-run clean (0 failures / 22 repos / 63 tags / 59 indexes). DRY_RUN=true applied live. Re-enable only with a container-aware keep-set — options in the post-mortem. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
parent
e49c91e60c
commit
a1b7b0ca53
3 changed files with 84 additions and 7 deletions
|
|
@ -0,0 +1,67 @@
|
|||
# 2026-06-10 — forgejo retention orphaned OCI index children (kms-website)
|
||||
|
||||
## Impact
|
||||
|
||||
- `viktor/kms-website:latest` and `:dfc83fb` unpullable (index children
|
||||
HTTP 404). No runtime impact — the deployed tag `:a794d1a` was intact
|
||||
and `imagePullPolicy: IfNotPresent` kept running pods unaffected.
|
||||
- `RegistryManifestIntegrityFailure` firing from ~08:30 EEST;
|
||||
`forgejo-integrity-probe` reported 4 failures across 60 indexes.
|
||||
|
||||
## Root cause
|
||||
|
||||
The `forgejo-cleanup` retention CronJob (live since 2026-06-09, first
|
||||
deleting run 2026-06-10 04:00) computes its keep-set over package
|
||||
**versions**: newest `KEEP_LAST_N=10` + tag `latest` + `*cache*` tags.
|
||||
Forgejo's container registry stores multi-arch / buildx-attestation
|
||||
**index children as separate untagged sha256 versions**. For images not
|
||||
rebuilt recently, those children sort *older* than the newest-10 window
|
||||
and were deleted while their parent index (a kept tag) survived →
|
||||
orphaned indexes, children 404.
|
||||
|
||||
The 2026-06-09 go-live verification ("0 running images on the delete
|
||||
set") checked running **pods** against the delete list — it could not
|
||||
see index→child references, so the corruption class passed review.
|
||||
|
||||
Detection worked as designed: `forgejo-integrity-probe` (15-min catalog
|
||||
walk + manifest HEAD) caught it the same morning. Two probe-run quirks
|
||||
slowed diagnosis: runs occasionally die at startup (`apk add` during
|
||||
transient DNS blips at cron ticks, `set -eu`), so the alert's
|
||||
active-since (08:29:52) lagged the 04:00 corruption.
|
||||
|
||||
## Fix applied (2026-06-10)
|
||||
|
||||
1. `forgejo_cleanup_dry_run = true` (stacks/forgejo/cleanup.tf, applied)
|
||||
— retention logs but deletes nothing until the keep-set is
|
||||
container-aware.
|
||||
2. `:latest` re-pointed at the intact `:a794d1a` index (registry
|
||||
manifest PUT — `a794d1a` is also the newest commit of the repo, so
|
||||
content is correct).
|
||||
3. Corrupt, obsolete `:dfc83fb` package version deleted.
|
||||
4. Probe re-run: **0 failures across 22 repos / 63 tags / 59 indexes**.
|
||||
|
||||
## Follow-up (required before re-enabling deletes)
|
||||
|
||||
Pick one:
|
||||
- (a) keep-set expansion: for every kept tagged version, resolve the
|
||||
manifest via the registry API; if it is an index, add all child
|
||||
digests to the keep set;
|
||||
- (b) never delete untagged sha256 versions (simpler, but untagged
|
||||
garbage accumulates and the PVC pressure that motivated retention
|
||||
returns — registry PVC sits at its 50Gi ceiling on the HDD,
|
||||
see beads code-oflt);
|
||||
- (c) replace the custom script with Forgejo's native per-owner package
|
||||
cleanup rules, which are container-aware.
|
||||
|
||||
Also worth probing beyond `TAGS_PER_REPO=5`: older tags of any
|
||||
multi-arch image may already be orphaned (only newest-5 per repo are
|
||||
verified). Harmless until someone pulls an old tag.
|
||||
|
||||
## Lessons
|
||||
|
||||
- "No running pod uses it" is not a safe deletion predicate for OCI
|
||||
artifacts — reference graphs (index → child manifests) must be
|
||||
resolved at the registry level.
|
||||
- A `set -eu` probe whose first statement is a network package install
|
||||
conflates "registry broken" with "apk blip"; pre-bake the image or
|
||||
tolerate install retries.
|
||||
Loading…
Add table
Add a link
Reference in a new issue