forgejo retention: revert to DRY_RUN — first live run orphaned OCI indexes [ci skip]

The keep-set (newest 10 versions + latest + *cache* tags) treats multi-arch/attestation index CHILDREN (separate untagged sha256 versions) as deletable: for images not rebuilt recently they sort outside the newest-10 window and were pruned while their kept parent index survived.

kms-website :latest and :dfc83fb children 404'd (RegistryManifestIntegrityFailure, caught by forgejo-integrity-probe within hours; deployed tag a794d1a unaffected).

Healed: :latest re-pointed at the intact a794d1a index (also the newest commit), corrupt :dfc83fb version deleted, probe re-run clean (0 failures / 22 repos / 63 tags / 59 indexes). DRY_RUN=true applied live. Re-enable only with a container-aware keep-set; options are in the post-mortem.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 09:22:47 +00:00
commit a1b7b0ca53
parent e49c91e60c
3 changed files with 84 additions and 7 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -38,7 +38,7 @@ Violations cause state drift, which causes future applies to break or silently r
  - **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`. Smoke-test target: `echo.viktorbarzin.me` (auth=public, header-reflecting backend).
 - **Anubis PoW challenge** (`modules/kubernetes/anubis_instance/`): per-site reverse proxy that issues a 30-day JWT cookie after a tiny PoW solve. Use for **public, content-bearing sites without app-level auth** (blog, docs, wikis, static landing pages). Pattern: declare `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://<backend>.<ns>.svc.cluster.local" }`, then in `ingress_factory` set `service_name = module.anubis.service_name`, `port = module.anubis.service_port`, `anti_ai_scraping = false`. Shared ed25519 key in Vault `secret/viktor` -> `anubis_ed25519_key`; cookie scoped to `viktorbarzin.me` so one solve covers all Anubis-fronted subdomains. **DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints** — clients without JS can't solve PoW. **Replicas default to 1** because Anubis stores in-flight challenges in process memory; a challenge issued by pod A and solved against pod B errors with `store: key not found` (HTTP 500). Bumping replicas requires wiring a shared Redis store (TODO). For path-level carve-outs (e.g. wrongmove has `/` behind Anubis but `/api` direct, blog has `/net-diag.sh` direct), declare a second `ingress_factory` with `ingress_path = ["/<path>"]` pointing at the bare backend service. Active on: blog (except `/net-diag.sh`), www, kms, travel, f1, cc, json, pb (privatebin), home (homepage), wrongmove (UI only). See `.claude/reference/patterns.md` "Anti-AI Scraping" for full layering.
 - **Docker images**: Always build for `linux/amd64`. SHA-tag rule is being phased out — see `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`. New model: CI pushes `:latest` (optionally also `:<8-char-sha>` for traceability), Keel polls and triggers rollouts. Cache-staleness concern from the old rule is resolved at the nginx layer (URL-split — manifests pass through, blobs cached). Until Phase 1 of the migration completes (per the plan), follow the SHA-tag rule for new services to match existing pattern.
- **Private registry**: `forgejo.viktorbarzin.me/viktor/<name>` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. **Kubelet pulls** are kept off the hairpin **at the resolver, with zero node-side DNS config**: pfSense Unbound carries a domain override forwarding the whole `viktorbarzin.me` zone to Technitium (added 2026-06-10, `docs/runbooks/pfsense-unbound.md`), whose split-horizon zone CNAMEs every ingress host (auto-synced hourly by `technitium-ingress-dns-sync`) to the zone apex whose A record tracks the **live** Traefik LB IP (canary: `viktorbarzin-apex-probe`, alerts ViktorBarzinApexDrift). Nodes are stock — link DNS `10.0.20.1 94.140.14.14` via `qm set --nameserver`, no `/etc/hosts` pins, no resolved drop-ins (two same-day interim approaches on 2026-06-10 were removed the same day). The containerd `hosts.toml` mirror (`[host."https://10.0.20.203"]`, `skip_verify = true`) still exists but is **vestigial** — it can NOT keep pulls internal on its own: Traefik routes by Host/SNI and 404s the mirror's bare-IP requests, and the registry's Bearer auth realm is the absolute `https://forgejo.viktorbarzin.me/v2/token` URL fetched outside the mirror — without internal DNS every fresh pull degrades to public DNS → hairpin → intermittent `dial tcp 176.12.22.76:443: i/o timeout` ImagePullBackOff (tuya-bridge 7.5h outage 2026-06-10, tripit 2026-06-09; see `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md`). **In-cluster pods are deliberately carved out** — the Traefik LB IP is ETP=Local and unreachable from pods, so CoreDNS has a dedicated `viktorbarzin.me:53` block (Corefile in `stacks/technitium/modules/technitium/main.tf`): forgejo pinned to Traefik's **ClusterIP** (TF-interpolated from the live Service; replaces the old `rewrite ... 
traefik.traefik.svc.cluster.local` in `.:53`), all other `.me` names forwarded to `8.8.8.8/1.1.1.1` (pods keep public answers; beads code-yh33). Do NOT remove that block while the pfSense override exists. **Was `.200` until 2026-06-01** — Traefik's 2026-05-30 move to its dedicated `.203` left the mirror pointing at the now-dead `.200:443`, silently breaking every *fresh* forgejo pull; a future LB renumber is now handled by DNS (apex record + drift probe) — only the vestigial hosts.toml literal would go stale. Mirror source lives in `modules/create-template-vm/k8s-node-containerd-setup.sh` (new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing nodes; also cleans up the legacy 2026-06-10 node-DNS customization). Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest` + any buildkit `*cache*` tag (so `--cache-from`/`--cache-to` refs survive retention — added 2026-06-09); **went live (DRY_RUN=false) 2026-06-09** after verifying 0 running images on the delete set — the registry PVC is at its 50Gi autoresize ceiling on the HDD (we did NOT move it to SSD, see beads code-oflt), so live retention is what keeps it from filling. Integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07.
+- **Private registry**: `forgejo.viktorbarzin.me/viktor/<name>` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. **Kubelet pulls** are kept off the hairpin **at the resolver, with zero node-side DNS config**: pfSense Unbound carries a domain override forwarding the whole `viktorbarzin.me` zone to Technitium (added 2026-06-10, `docs/runbooks/pfsense-unbound.md`), whose split-horizon zone CNAMEs every ingress host (auto-synced hourly by `technitium-ingress-dns-sync`) to the zone apex whose A record tracks the **live** Traefik LB IP (canary: `viktorbarzin-apex-probe`, alerts ViktorBarzinApexDrift). Nodes are stock — link DNS `10.0.20.1 94.140.14.14` via `qm set --nameserver`, no `/etc/hosts` pins, no resolved drop-ins (two same-day interim approaches on 2026-06-10 were removed the same day). The containerd `hosts.toml` mirror (`[host."https://10.0.20.203"]`, `skip_verify = true`) still exists but is **vestigial** — it can NOT keep pulls internal on its own: Traefik routes by Host/SNI and 404s the mirror's bare-IP requests, and the registry's Bearer auth realm is the absolute `https://forgejo.viktorbarzin.me/v2/token` URL fetched outside the mirror — without internal DNS every fresh pull degrades to public DNS → hairpin → intermittent `dial tcp 176.12.22.76:443: i/o timeout` ImagePullBackOff (tuya-bridge 7.5h outage 2026-06-10, tripit 2026-06-09; see `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md`). **In-cluster pods are deliberately carved out** — the Traefik LB IP is ETP=Local and unreachable from pods, so CoreDNS has a dedicated `viktorbarzin.me:53` block (Corefile in `stacks/technitium/modules/technitium/main.tf`): forgejo pinned to Traefik's **ClusterIP** (TF-interpolated from the live Service; replaces the old `rewrite ... 
traefik.traefik.svc.cluster.local` in `.:53`), all other `.me` names forwarded to `8.8.8.8/1.1.1.1` (pods keep public answers; beads code-yh33). Do NOT remove that block while the pfSense override exists. **Was `.200` until 2026-06-01** — Traefik's 2026-05-30 move to its dedicated `.203` left the mirror pointing at the now-dead `.200:443`, silently breaking every *fresh* forgejo pull; a future LB renumber is now handled by DNS (apex record + drift probe) — only the vestigial hosts.toml literal would go stale. Mirror source lives in `modules/create-template-vm/k8s-node-containerd-setup.sh` (new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing nodes; also cleans up the legacy 2026-06-10 node-DNS customization). Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest` + any buildkit `*cache*` tag — **REVERTED to DRY_RUN 2026-06-10 after its first live run orphaned OCI index children** (multi-arch/attestation children are separate *untagged* sha256 versions that sort outside the newest-10 window while their parent index is kept; broke `kms-website:latest`+`:dfc83fb`, caught by the integrity probe, healed by re-tagging latest→a794d1a + deleting the corrupt version; see `docs/post-mortems/2026-06-10-forgejo-retention-orphaned-indexes.md`). Do NOT re-enable deletes until the keep-set resolves kept indexes' child digests (or skips untagged versions, or moves to Forgejo's native container-aware cleanup rules). The registry PVC remains at its 50Gi autoresize ceiling on the HDD (we did NOT move it to SSD, see beads code-oflt), so a container-aware retention is still needed. 
Integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07.
 - **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts.
 - **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi.
 - **Node OS disk tuning** (in `stacks/infra/main.tf`): kubelet `imageGCHighThresholdPercent=70` (was 85), `imageGCLowThresholdPercent=60` (was 80), ext4 `commit=60` in fstab (was default 5s), journald `SystemMaxUse=200M` + `MaxRetentionSec=3day`.
--- a/docs/post-mortems/2026-06-10-forgejo-retention-orphaned-indexes.md
+++ b/docs/post-mortems/2026-06-10-forgejo-retention-orphaned-indexes.md
@@ -0,0 +1,67 @@
+# 2026-06-10 — forgejo retention orphaned OCI index children (kms-website)
+
+## Impact
+
+- `viktor/kms-website:latest` and `:dfc83fb` unpullable (index children
+  HTTP 404). No runtime impact — the deployed tag `:a794d1a` was intact
+  and `imagePullPolicy: IfNotPresent` kept running pods unaffected.
+- `RegistryManifestIntegrityFailure` firing from ~08:30 EEST;
+  `forgejo-integrity-probe` reported 4 failures across 60 indexes.
+
+## Root cause
+
+The `forgejo-cleanup` retention CronJob (live since 2026-06-09, first
+deleting run 2026-06-10 04:00) computes its keep-set over package
+**versions**: newest `KEEP_LAST_N=10` + tag `latest` + `*cache*` tags.
+Forgejo's container registry stores multi-arch / buildx-attestation
+**index children as separate untagged sha256 versions**. For images not
+rebuilt recently, those children sort *older* than the newest-10 window
+and were deleted while their parent index (a kept tag) survived →
+orphaned indexes, children 404.
+
+The 2026-06-09 go-live verification ("0 running images on the delete
+set") checked running **pods** against the delete list — it could not
+see index→child references, so the corruption class passed review.
+
+Detection worked as designed: `forgejo-integrity-probe` (15-min catalog
+walk + manifest HEAD) caught it the same morning. One probe-run quirk
+slowed diagnosis: runs occasionally die at startup (`apk add` fails
+during transient DNS blips at cron ticks and `set -eu` aborts the run),
+so the alert's active-since (08:29:52) lagged the 04:00 corruption.
+
+## Fix applied (2026-06-10)
+
+1. `forgejo_cleanup_dry_run = true` (stacks/forgejo/cleanup.tf, applied)
+   — retention logs but deletes nothing until the keep-set is
+   container-aware.
+2. `:latest` re-pointed at the intact `:a794d1a` index (registry
+   manifest PUT — `a794d1a` is also the newest commit of the repo, so
+   content is correct).
+3. Corrupt, obsolete `:dfc83fb` package version deleted.
+4. Probe re-run: **0 failures across 22 repos / 63 tags / 59 indexes**.
+
+## Follow-up (required before re-enabling deletes)
+
+Pick one:
+- (a) keep-set expansion: for every kept tagged version, resolve the
+  manifest via the registry API; if it is an index, add all child
+  digests to the keep set;
+- (b) never delete untagged sha256 versions (simpler, but untagged
+  garbage accumulates and the PVC pressure that motivated retention
+  returns — registry PVC sits at its 50Gi ceiling on the HDD,
+  see beads code-oflt);
+- (c) replace the custom script with Forgejo's native per-owner package
+  cleanup rules, which are container-aware.
+
+Also worth probing beyond `TAGS_PER_REPO=5`: older tags of any
+multi-arch image may already be orphaned (only newest-5 per repo are
+verified). Harmless until someone pulls an old tag.
+
+## Lessons
+
+- "No running pod uses it" is not a safe deletion predicate for OCI
+  artifacts — reference graphs (index → child manifests) must be
+  resolved at the registry level.
+- A `set -eu` probe whose first statement is a network package install
+  conflates "registry broken" with "apk blip"; pre-bake the image or
+  tolerate install retries.
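
Follow-up option (a) in the post-mortem above can be sketched as a small graph walk. This is a minimal illustration in Python, not the `forgejo-cleanup` script (which this commit does not show): `expand_keep_set` and the `fetch_manifest` callable are hypothetical names, and the sketch assumes the fetcher returns the parsed manifest JSON for a tag or digest. Children are read from the index's `manifests[].digest` entries, and nested indexes are followed as well.

```python
from typing import Callable, Iterable

# Standard media types for an OCI image index and a Docker manifest list.
INDEX_MEDIA_TYPES = {
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
}


def expand_keep_set(
    kept_refs: Iterable[str],
    fetch_manifest: Callable[[str], dict],
) -> set[str]:
    """Return kept_refs plus every digest reachable from a kept index.

    fetch_manifest is a hypothetical callable: given a tag or digest it
    returns the parsed manifest JSON as a dict.
    """
    keep: set[str] = set()
    pending = list(kept_refs)
    while pending:
        ref = pending.pop()
        if ref in keep:
            continue  # already visited; also guards against cycles
        keep.add(ref)
        manifest = fetch_manifest(ref)
        # Treat the manifest as an index if its media type says so, or if
        # it carries a "manifests" list (mediaType may be absent in the body).
        is_index = (
            manifest.get("mediaType") in INDEX_MEDIA_TYPES
            or "manifests" in manifest
        )
        if is_index:
            pending.extend(child["digest"] for child in manifest.get("manifests", []))
    return keep
```

Passing the fetcher in as a parameter keeps the walk testable against an in-memory dict. A real fetcher would presumably call the registry's `GET /v2/<name>/manifests/<reference>` endpoint with an `Accept` header listing the index media types; that wiring is left out here because it depends on how the script authenticates.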
--- a/stacks/forgejo/cleanup.tf
+++ b/stacks/forgejo/cleanup.tf
@@ -22,12 +22,22 @@ data "vault_kv_secret_v2" "forgejo_viktor" {
 }

 locals {
-  # Activated 2026-06-09 after verifying a dry-run delete list against all
-  # running viktor/* images cluster-wide: 0 running images on the delete set
-  # (would prune 317 stale versions, keeping newest 10 + latest + cache tags).
-  # Live retention is what keeps the registry PVC from filling on the HDD
-  # (we deliberately did NOT move Forgejo to SSD — see beads code-oflt).
-  forgejo_cleanup_dry_run = false
+  # REVERTED TO DRY-RUN 2026-06-10: the first live run ORPHANED OCI indexes.
+  # The keep-set is computed over package VERSIONS (newest 10 + tag "latest"
+  # + *cache* tags), but multi-arch/attestation index CHILDREN are separate
+  # UNTAGGED sha256 versions — for images not rebuilt recently they fall
+  # outside the newest-10 window and get deleted while their parent index is
+  # kept. Result: index children 404 (viktor/kms-website :latest + :dfc83fb,
+  # caught by forgejo-integrity-probe / RegistryManifestIntegrityFailure,
+  # 2026-06-10). Do NOT re-enable until the script either (a) resolves each
+  # kept index's child digests via the registry API and adds them to the
+  # keep set, or (b) skips untagged sha256 versions entirely, or (c) is
+  # replaced by Forgejo's native per-owner package cleanup rules (container-
+  # aware). The 2026-06-09 "0 running images on the delete set" verification
+  # checked running PODS, not index child references — insufficient.
+  # History: activated 2026-06-09 (would prune 317 stale versions); registry
+  # PVC pressure concern remains (HDD, no SSD move — see beads code-oflt).
+  forgejo_cleanup_dry_run = true
 }

 resource "kubernetes_config_map" "forgejo_cleanup_script" {
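
The failure mode described in the `locals` comment can be reproduced with a toy model. The sketch below is an illustration under stated assumptions, not the real cleanup script: the `Version` record, `naive_delete_set`, and the sample data are all hypothetical, and `keep_last_n` is shrunk to 3 so the example stays short.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Version:
    name: str     # a tag, or "sha256:..." for an untagged version
    created: int  # larger means newer (stand-in for a timestamp)


def naive_delete_set(versions: list[Version], keep_last_n: int = 10) -> set[str]:
    """Version-level keep-set: newest N + tag 'latest' + '*cache*' tags.

    It never looks inside a kept index, so untagged children are
    ordinary deletion candidates.
    """
    newest = sorted(versions, key=lambda v: v.created, reverse=True)[:keep_last_n]
    keep = {v.name for v in newest}
    keep |= {
        v.name for v in versions
        if v.name == "latest" or fnmatch(v.name, "*cache*")
    }
    return {v.name for v in versions} - keep


# Hypothetical package: an old multi-arch build tagged "latest", its two
# untagged children, a buildkit cache tag, and three newer tagged builds.
versions = [
    Version("buildcache", 0),
    Version("latest", 1),
    Version("sha256:child-amd64", 1),
    Version("sha256:child-attest", 1),
    Version("b1", 2),
    Version("b2", 3),
    Version("b3", 4),
]

deleted = naive_delete_set(versions, keep_last_n=3)
# The index tagged "latest" is kept by name, but both of its children
# fall outside the newest-3 window and land in the delete set.
```

In this model the parent survives only because of its tag name, which is why a pod-level check of the delete list cannot reveal the problem: nothing references the child digests by name.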