Viktor Barzin 7cb44d7264 [registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery

Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, and registry:2's GC (Sunday 03:25) walks OCI index children
imperfectly (distribution/distribution#3324 class). Nothing verified
pushes end-to-end; nothing probed the registry for fetchability; nothing
caught orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source.
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-19 17:08:28 +00:00


Runbook: Registry VM (docker-registry, 10.0.20.10)

Last updated: 2026-04-19

The registry VM hosts registry.viktorbarzin.me (private Docker registry, htpasswd-auth, NGINX → registry:2). It is an Ubuntu 24.04 VM on the cluster LAN subnet 10.0.20.0/24, with a static netplan config (no DHCP). Because it sits on a subnet that only has pfSense as its gateway, its DNS must be statically configured.

DNS configuration

Ubuntu ships systemd-resolved and uses netplan to declare per-link nameservers. Netplan writes systemd-networkd or NetworkManager configs that resolved reads at runtime. There is no automatic merging of netplan DNS with the [Resolve] section of /etc/systemd/resolved.conf — per-link settings override the global ones. So both layers must be in sync:

Layer	File	Role
Netplan	`/etc/netplan/50-cloud-init.yaml`	Per-link DNS servers that resolved reports on `Link 2 (eth0)`
Resolved global	`/etc/systemd/resolved.conf.d/10-internal-dns.conf`	`Global` scope `DNS=` / `FallbackDNS=` — also shown in `resolvectl status`
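
Because the two layers are not merged automatically, a small check can confirm that they agree. The sketch below is a hypothetical helper, not something that exists in the repo: the function names are invented for illustration, and the netplan parser only understands the block-list layout shown in this runbook (it is not a general YAML parser).

```shell
# Print the value of DNS= from a systemd-resolved drop-in file.
# FallbackDNS= is deliberately not matched.
resolved_dns() {
  sed -n 's/^DNS=//p' "$1"
}

# Print the nameserver addresses from a netplan file, one per line.
# Assumption: addresses are written as "- 10.0.20.1" items under
# "addresses:" inside a "nameservers:" block, as in this runbook.
netplan_dns() {
  awk '
    /^[ \t]*nameservers:/ { in_ns = 1; in_addr = 0; next }
    in_ns && /^[ \t]*addresses:/ { in_addr = 1; next }
    in_addr && /^[ \t]*-[ \t]*/ { sub(/^[ \t]*-[ \t]*/, ""); print; next }
    in_addr { in_addr = 0; in_ns = 0 }
  ' "$1"
}

# Return 0 when every server listed in DNS= also appears in netplan.
dns_layers_in_sync() {
  local resolved_file="$1" netplan_file="$2" server
  for server in $(resolved_dns "$resolved_file"); do
    netplan_dns "$netplan_file" | grep -qxF -- "$server" || return 1
  done
}

# Example, run on the VM:
#   dns_layers_in_sync /etc/systemd/resolved.conf.d/10-internal-dns.conf \
#       /etc/netplan/50-cloud-init.yaml && echo "DNS layers agree"
```

The check is one-directional on purpose: netplan may list extra servers (here the public fallback) that the global DNS= line does not.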

Current state

/etc/systemd/resolved.conf.d/10-internal-dns.conf:

[Resolve]
DNS=10.0.20.1
FallbackDNS=94.140.14.14
Domains=viktorbarzin.lan

/etc/netplan/50-cloud-init.yaml (eth0 block, simplified):

nameservers:
  addresses:
  - 10.0.20.1
  - 94.140.14.14
  search:
  - viktorbarzin.lan

resolvectl status output after the change:

Global
  resolv.conf mode: stub
  Current DNS Server: 10.0.20.1
  DNS Servers: 10.0.20.1
  Fallback DNS Servers: 94.140.14.14
  DNS Domain: viktorbarzin.lan

Link 2 (eth0)
  Current Scopes: DNS
  Current DNS Server: 10.0.20.1
  DNS Servers: 10.0.20.1 94.140.14.14
  DNS Domain: viktorbarzin.lan

Field	Value	Purpose
Primary	`10.0.20.1`	pfSense OPT1 interface (dnsmasq forwarder → Technitium LB) — resolves `.viktorbarzin.lan`
Fallback	`94.140.14.14`	AdGuard public DNS — used if pfSense unreachable (e.g., OPT1 flap)
Search	`viktorbarzin.lan`	Unqualified names resolve against the internal zone
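
The resolvectl status output shown above can also be checked mechanically. This is a minimal sketch, assuming each section prints its servers on a single "DNS Servers:" line as in the sample; the function name section_dns_servers is hypothetical.

```shell
# Read `resolvectl status` text on stdin and print the "DNS Servers:" value
# of one section. $1 is the section header prefix, e.g. "Global" or
# "Link 2 (eth0)". Section headers are the lines that start in column 1.
section_dns_servers() {
  awk -v section="$1" '
    /^[^ \t]/ { in_section = (index($0, section) == 1) }
    in_section && /^[ \t]*DNS Servers:/ {
      sub(/^[ \t]*DNS Servers:[ \t]*/, "")
      print
      exit
    }
  '
}

# Example:
#   ssh root@10.0.20.10 resolvectl status | section_dns_servers "Link 2 (eth0)"
```

Anchoring the match at the start of the indented line keeps "Current DNS Server:" and "Fallback DNS Servers:" from being picked up by mistake.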

Why this matters for the registry

Container builds on this VM reference .lan hostnames (Technitium, NFS, etc.) and external hostnames (Docker Hub, GHCR). Before the hardening the netplan had 1.1.1.1 / 8.8.8.8 only, which meant:

Internal hostname lookups silently failed (slow timeout) — the VM could not resolve idrac.viktorbarzin.lan or any internal helper.
If Cloudflare's 1.1.1.1 had an outage, the VM would lose DNS entirely.

With the new config the VM can resolve both zones, and external lookups keep working if the primary DNS server is unreachable.

Apply / re-apply

ssh root@10.0.20.10 '
  netplan generate
  netplan apply
  systemctl restart systemd-resolved
  resolvectl status | head -20
'

netplan apply is not disruptive when only nameservers change — it does not bounce the link.

Verification

ssh root@10.0.20.10 '
  dig +short idrac.viktorbarzin.lan       # 192.168.1.4
  dig +short github.com                   # GitHub A record
  dig +short registry.viktorbarzin.me     # 10.0.20.10 + external A
'
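
The dig checks above rely on reading the output by eye. A hedged sketch of a pass/fail variant follows; expect_answer and the LOOKUP override are hypothetical names introduced here so the helper can be exercised without a live resolver.

```shell
# Succeed when resolving $1 returns $2 as one of its answers.
# LOOKUP is a hypothetical override for the lookup command; it defaults to
# `dig +short`, which prints one answer per line.
expect_answer() {
  local name="$1" expected="$2" answers
  answers=$(${LOOKUP:-dig +short} "$name") || return 2
  printf '%s\n' "$answers" | grep -qxF -- "$expected"
}

# Example, run on the VM:
#   expect_answer idrac.viktorbarzin.lan 192.168.1.4 && echo "internal zone OK"
#   expect_answer registry.viktorbarzin.me 10.0.20.10 && echo "registry OK"
```

Matching whole lines with grep -x avoids a false pass when the expected address is only a prefix of a longer one.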

Fallback test — blackhole the primary and confirm external lookups still succeed through 94.140.14.14:

ssh root@10.0.20.10 '
  ip route add blackhole 10.0.20.1
  dig +short +time=5 +tries=2 github.com   # should still answer
  ip route del blackhole 10.0.20.1
'

Internal lookups do fail during the blackhole (the fallback is a public resolver and does not know about the internal zone), which is expected — the fallback buys availability for external pulls, not internal hostnames.
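
If the fallback test is run often, wrapping it so the blackhole route is always removed and the lookup's exit status is preserved may be worthwhile. The following is a sketch under that assumption; with_blackhole and the IP_CMD override are hypothetical names, not existing tooling.

```shell
# Run a command while $1 is blackholed, then remove the route again and
# return the command's exit status.
# IP_CMD is a hypothetical override (e.g. IP_CMD=echo for a dry run).
with_blackhole() {
  local target="$1" status=0
  shift
  ${IP_CMD:-ip} route add blackhole "$target" || return 1
  "$@" || status=$?
  ${IP_CMD:-ip} route del blackhole "$target"
  return "$status"
}

# Example, run as root on the VM:
#   with_blackhole 10.0.20.1 dig +short +time=5 +tries=2 github.com
# Dry run that only prints the route commands:
#   IP_CMD=echo with_blackhole 10.0.20.1 true
```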

Rollback

A pre-change backup of /etc/resolv.conf, /etc/systemd/resolved.conf, and /etc/netplan/ lives at /root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz on the VM. To roll back:

ssh root@10.0.20.10 '
  BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
  tar -xzf "$BACKUP" -C /
  rm -f /etc/systemd/resolved.conf.d/10-internal-dns.conf
  netplan apply
  systemctl restart systemd-resolved
  resolvectl status | head -10
'
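
Since the rollback extracts over /, it can help to confirm which archive ls -t selects and what it contains before running it. A minimal sketch, assuming the backup naming shown above; latest_backup and preview_backup are hypothetical helper names.

```shell
# Print the most recently modified backup archive in directory $1,
# or nothing if there is none.
latest_backup() {
  ls -t "$1"/dns-config-backup-*.tar.gz 2>/dev/null | head -n 1
}

# List the files an archive would write, without extracting anything.
preview_backup() {
  tar -tzf "$1"
}

# Example, run on the VM:
#   BACKUP=$(latest_backup /root/dns-backups)
#   [ -n "$BACKUP" ] && preview_backup "$BACKUP"
```

Selection is by modification time, matching the ls -t call in the rollback command, so a backup that was touched or copied later sorts first regardless of the timestamp in its name.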

Related

docs/architecture/dns.md: resolver IP assignments per subnet.
.claude/CLAUDE.md (at repo root): notes on the private registry and containerd hosts.toml redirects.
docs/runbooks/registry-rebuild-image.md: rebuild an image after an orphan OCI-index incident (different class of problem than DNS).
docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause and detection gaps behind the recurring missing-blob incidents.
