# Runbook: Registry VM (docker-registry, 10.0.20.10)

Last updated: 2026-05-07

The registry VM is an Ubuntu 24.04 VM on the cluster LAN subnet 10.0.20.0/24, with a static netplan config (no DHCP). Because it sits on a subnet that only has pfSense as its gateway, its DNS must be statically configured.

As of Phase 4 of forgejo-registry-consolidation (2026-05-07), the VM no longer hosts the private read/write registry. It hosts pull-through caches only:

| Port | Upstream |
|------|----------|
| 5000 | docker.io (Docker Hub) — auth via `dockerhub_registry_password` |
| 5010 | ghcr.io |
| 5020 | quay.io |
| 5030 | registry.k8s.io |
| 5040 | reg.kyverno.io |

The private registry formerly on port 5050 has been decommissioned; private images are now hosted on Forgejo at forgejo.viktorbarzin.me/viktor/<image>. See docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md for the migration. Break-glass tarballs of infra-ci are still produced on each build to /opt/registry/data/private/_breakglass/ — see docs/runbooks/forgejo-registry-breakglass.md.
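Cluster nodes reach these caches through containerd registry mirrors. As a sketch (illustrative values only — the live files were rolled out by `setup-forgejo-containerd-mirror.sh` and the cloud-init template, and serving the cache over plain HTTP here is an assumption), a Docker Hub redirect looks roughly like:

```toml
# /etc/containerd/certs.d/docker.io/hosts.toml — sketch, not the live file.
server = "https://registry-1.docker.io"

# Try the pull-through cache on the registry VM first; containerd falls
# back to the upstream `server` if the cache is unavailable.
[host."http://10.0.20.10:5000"]
  capabilities = ["pull", "resolve"]
```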

## DNS configuration

Ubuntu ships systemd-resolved and uses netplan to declare per-link nameservers. Netplan writes systemd-networkd or NetworkManager configs that resolved reads at runtime. There is no automatic merging of netplan DNS with the [Resolve] section of /etc/systemd/resolved.conf — per-link settings override the global ones. So both layers must be in sync:

| Layer | File | Role |
|-------|------|------|
| Netplan | `/etc/netplan/50-cloud-init.yaml` | Per-link DNS servers that resolved reports on Link 2 (eth0) |
| Resolved global | `/etc/systemd/resolved.conf.d/10-internal-dns.conf` | Global-scope `DNS=` / `FallbackDNS=` — also shown in `resolvectl status` |

### Current state

`/etc/systemd/resolved.conf.d/10-internal-dns.conf`:

```ini
[Resolve]
DNS=10.0.20.1
FallbackDNS=94.140.14.14
Domains=viktorbarzin.lan
```

`/etc/netplan/50-cloud-init.yaml` (eth0 block, simplified):

```yaml
nameservers:
  addresses:
  - 10.0.20.1
  - 94.140.14.14
  search:
  - viktorbarzin.lan
```

resolvectl status output after the change:

```text
Global
  resolv.conf mode: stub
  Current DNS Server: 10.0.20.1
  DNS Servers: 10.0.20.1
  Fallback DNS Servers: 94.140.14.14
  DNS Domain: viktorbarzin.lan

Link 2 (eth0)
  Current Scopes: DNS
  Current DNS Server: 10.0.20.1
  DNS Servers: 10.0.20.1 94.140.14.14
  DNS Domain: viktorbarzin.lan
```

| Field | Value | Purpose |
|-------|-------|---------|
| Primary | 10.0.20.1 | pfSense OPT1 interface (dnsmasq forwarder → Technitium LB) — resolves .viktorbarzin.lan |
| Fallback | 94.140.14.14 | AdGuard public DNS — used if pfSense is unreachable (e.g., OPT1 flap) |
| Search | viktorbarzin.lan | Unqualified names resolve against the internal zone |

### Why this matters for the registry

Container builds on this VM reference .lan hostnames (Technitium, NFS, etc.) and external hostnames (Docker Hub, GHCR). Before the hardening, the netplan config listed only 1.1.1.1 / 8.8.8.8, which meant:

  1. Internal hostname lookups silently failed (slow timeout) — the VM could not resolve idrac.viktorbarzin.lan or any internal helper.
  2. If Cloudflare's 1.1.1.1 had an outage, the VM would lose DNS entirely.

With the new config the VM can resolve both zones and keeps working if the primary DNS server is unreachable.

### Apply / re-apply

```shell
ssh root@10.0.20.10 '
  netplan generate
  netplan apply
  systemctl restart systemd-resolved
  resolvectl status | head -20
'
```

netplan apply is not disruptive when only nameservers change — it does not bounce the link.

### Verification

```shell
ssh root@10.0.20.10 '
  dig +short idrac.viktorbarzin.lan       # 192.168.1.4
  dig +short github.com                   # GitHub A record
  dig +short registry.viktorbarzin.me     # 10.0.20.10 + external A
'
```

Fallback test — blackhole the primary and confirm external lookups still succeed through 94.140.14.14:

```shell
ssh root@10.0.20.10 '
  ip route add blackhole 10.0.20.1
  dig +short +time=5 +tries=2 github.com   # should still answer
  ip route del blackhole 10.0.20.1
'
```

Internal lookups do fail during the blackhole (the fallback is a public resolver and does not know about the internal zone), which is expected — the fallback buys availability for external pulls, not internal hostnames.

### Rollback

A pre-change backup of /etc/resolv.conf, /etc/systemd/resolved.conf, and /etc/netplan/ lives at /root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz on the VM. To roll back:

```shell
ssh root@10.0.20.10 '
  BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
  tar -xzf "$BACKUP" -C /
  rm -f /etc/systemd/resolved.conf.d/10-internal-dns.conf
  netplan apply
  systemctl restart systemd-resolved
  resolvectl status | head -10
'
```

## Auto-sync pipeline

Changes to modules/docker-registry/{docker-compose.yml, fix-broken-blobs.sh, cleanup-tags.sh, nginx_registry.conf, config-private.yml} deploy automatically via .woodpecker/registry-config-sync.yml:

  • Fires on push to master touching any of those paths, or via manual event (Woodpecker UI / API).
  • SCPs every managed file to /opt/registry/ on 10.0.20.10.
  • Bounces containers + nginx when a compose-visible file changed; leaves them alone when only scripts changed (cron picks up automatically).
  • Runs a dry-run fix-broken-blobs.sh at the end to verify the registry is still coherent.
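The bounce-vs-skip decision above can be sketched as a small helper. This is hypothetical (the real logic lives in `.woodpecker/registry-config-sync.yml`), and the file classification here is an assumption based on the managed-files list:

```shell
#!/bin/sh
# Hypothetical sketch of the sync pipeline's restart decision:
# compose-visible files require bouncing containers + nginx, while
# script-only changes are picked up by cron without a restart.
needs_bounce() {
  case "$1" in
    docker-compose.yml|nginx_registry.conf|config-private.yml)
      return 0 ;;  # compose-visible: bounce containers + nginx
    *)
      return 1 ;;  # scripts only: no restart needed
  esac
}

for f in docker-compose.yml cleanup-tags.sh; do
  if needs_bounce "$f"; then
    echo "$f -> bounce"
  else
    echo "$f -> no restart"
  fi
done
# prints:
# docker-compose.yml -> bounce
# cleanup-tags.sh -> no restart
```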

SSH credentials: Woodpecker repo-secret registry_ssh_key (ed25519, provisioned 2026-04-19). Public key at /root/.ssh/authorized_keys on 10.0.20.10. Private key mirrored at secret/woodpecker/registry_ssh_key in Vault (subkeys private_key / public_key / known_hosts_entry).

Manual override if you need to sync right now:

```shell
curl -sf -X POST \
  -H "Authorization: Bearer $WOODPECKER_TOKEN" \
  "https://ci.viktorbarzin.me/api/repos/1/pipelines" \
  -d '{"branch":"master"}' | jq .number
```

## Bouncing registry containers — the nginx DNS trap

docker compose up -d on /opt/registry/docker-compose.yml recreates registry-* containers when their image tag changes, which assigns them new IPs on the registry bridge network. registry-nginx resolves its upstream DNS names (registry-dockerhub and the other pull-through caches; formerly also registry-private) ONCE at startup and caches the results — it does not re-resolve after a recreate.

Symptom if you forget (observed pre-consolidation, while the private registry was still on :5050): /v2/_catalog on :5050 returned {"repositories": []}, /v2/ returned 200 without auth, and pulls returned the wrong image. nginx was forwarding to a stale IP that now belonged to a different registry-* backend (commonly the pull-through ghcr or dockerhub cache, which have empty catalogs from the htpasswd-auth user's perspective). The same stale-IP failure mode still applies to the remaining cache ports.

Always follow a registry-* container bounce with `docker restart registry-nginx`, or prevent the problem entirely by setting a `resolver` directive in nginx_registry.conf so upstream names are re-resolved per request.
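A minimal sketch of that `resolver` approach (assumptions: Docker's embedded DNS at 127.0.0.11, and a `registry-dockerhub` upstream listening on 5000 — adapt to the real nginx_registry.conf):

```nginx
# Sketch, not the live config. A resolver plus a variable in
# proxy_pass makes nginx re-resolve the upstream name per request
# instead of caching its IP once at startup.
resolver 127.0.0.11 valid=10s ipv6=off;

server {
    listen 5000;
    location /v2/ {
        set $upstream "http://registry-dockerhub:5000";
        proxy_pass $upstream;
    }
}
```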

```shell
ssh root@10.0.20.10 '
  cd /opt/registry && docker compose up -d
  docker restart registry-nginx
  sleep 3
  docker ps --format "{{.Names}}\t{{.Image}}\t{{.Status}}" \
    | grep -E "registry-"
'
```
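After a bounce, a quick smoke test (a sketch to run on the VM; the port list mirrors the cache table above) confirms each cache port answers again:

```shell
# Probe each pull-through cache port's /v2/ endpoint.
# A healthy cache answers HTTP 200; 000 means nothing is listening
# on that port (or nginx is down).
for port in 5000 5010 5020 5030 5040; do
  code=$(curl -s -o /dev/null -m 3 -w '%{http_code}' "http://localhost:$port/v2/" || true)
  echo "port $port -> HTTP ${code:-000}"
done
```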
## See also

  • docs/architecture/dns.md — resolver IP assignments per subnet.
  • .claude/CLAUDE.md (at repo root) — notes on the private registry and containerd hosts.toml redirects.
  • docs/runbooks/registry-rebuild-image.md — rebuild an image after an orphan OCI-index incident (a different class of problem than DNS).
  • docs/post-mortems/2026-04-19-registry-orphan-index.md — root cause and detection gaps behind the recurring missing-blob incidents.