Runbook: Registry VM (docker-registry, 10.0.20.10)
Last updated: 2026-05-07
The registry VM is an Ubuntu 24.04 VM on the cluster LAN subnet
10.0.20.0/24, with a static netplan config (no DHCP). Because it
sits on a subnet that only has pfSense as its gateway, its DNS must
be statically configured.
As of Phase 4 of forgejo-registry-consolidation (2026-05-07), the VM no longer hosts the private R/W registry. It hosts pull-through caches only:
| Port | Upstream |
|---|---|
| 5000 | docker.io (Docker Hub) — auth via dockerhub_registry_password |
| 5010 | ghcr.io |
| 5020 | quay.io |
| 5030 | registry.k8s.io |
| 5040 | reg.kyverno.io |
The private registry formerly served on port 5050 has been
decommissioned on this VM and now lives on Forgejo at
forgejo.viktorbarzin.me/viktor/<image>. See
docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md for the
migration. Break-glass tarballs of infra-ci are still produced on
each build to /opt/registry/data/private/_breakglass/; see
docs/runbooks/forgejo-registry-breakglass.md.
DNS configuration
Ubuntu ships systemd-resolved and uses netplan to declare per-link
nameservers. Netplan writes systemd-networkd or NetworkManager
configs that resolved reads at runtime. There is no automatic
merging of netplan DNS with the [Resolve] section of
/etc/systemd/resolved.conf — per-link settings override the global
ones. So both layers must be in sync:
| Layer | File | Role |
|---|---|---|
| Netplan | /etc/netplan/50-cloud-init.yaml | Per-link DNS servers that resolved reports on Link 2 (eth0) |
| Resolved global | /etc/systemd/resolved.conf.d/10-internal-dns.conf | Global scope DNS= / FallbackDNS=, also shown in resolvectl status |
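Because the two layers have to agree, a small consistency check can catch drift between them. The following is a minimal sketch, not part of the existing tooling: the function names are hypothetical, and the awk parsing assumes a block-style YAML list under nameservers (an inline [a, b] list would not be matched).

```shell
# Sketch only: compare the global resolved DNS= value with the first
# netplan nameserver. Function names are illustrative assumptions.

# Print the first server listed on the DNS= line of a resolved drop-in.
resolved_dns() {
    awk -F= '$1 == "DNS" { split($2, a, " "); print a[1]; exit }' "$1"
}

# Print the first "- <addr>" entry under nameservers: -> addresses:.
# Interface-level "addresses:" blocks before nameservers: are ignored.
netplan_first_dns() {
    awk '
        /^[ \t]*nameservers:/            { in_ns = 1; next }
        in_ns && /^[ \t]*addresses:/     { in_addr = 1; next }
        in_ns && in_addr && /^[ \t]*-/ {
            sub(/^[ \t]*-[ \t]*/, "")
            sub(/[ \t]+$/, "")
            print
            exit
        }
    ' "$1"
}

# Succeed only when both files name the same primary DNS server.
dns_layers_in_sync() {
    sync_r=$(resolved_dns "$1")
    sync_n=$(netplan_first_dns "$2")
    [ -n "$sync_r" ] && [ "$sync_r" = "$sync_n" ]
}

# Example (run on the VM):
#   dns_layers_in_sync /etc/systemd/resolved.conf.d/10-internal-dns.conf \
#       /etc/netplan/50-cloud-init.yaml || echo "DNS layers differ"
```

The check only compares the primary server; the fallback and search domain would need the same treatment if they are expected to match too.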
Current state
/etc/systemd/resolved.conf.d/10-internal-dns.conf:
[Resolve]
DNS=10.0.20.1
FallbackDNS=94.140.14.14
Domains=viktorbarzin.lan
/etc/netplan/50-cloud-init.yaml (eth0 block, simplified):
nameservers:
addresses:
- 10.0.20.1
- 94.140.14.14
search:
- viktorbarzin.lan
resolvectl status output after the change:
Global
resolv.conf mode: stub
Current DNS Server: 10.0.20.1
DNS Servers: 10.0.20.1
Fallback DNS Servers: 94.140.14.14
DNS Domain: viktorbarzin.lan
Link 2 (eth0)
Current Scopes: DNS
Current DNS Server: 10.0.20.1
DNS Servers: 10.0.20.1 94.140.14.14
DNS Domain: viktorbarzin.lan
| Field | Value | Purpose |
|---|---|---|
| Primary | 10.0.20.1 | pfSense OPT1 interface (dnsmasq forwarder → Technitium LB); resolves .viktorbarzin.lan |
| Fallback | 94.140.14.14 | AdGuard public DNS; used if pfSense is unreachable (e.g., OPT1 flap) |
| Search | viktorbarzin.lan | Unqualified names resolve against the internal zone |
Why this matters for the registry
Container builds on this VM reference .lan hostnames (Technitium,
NFS, etc.) and external hostnames (Docker Hub, GHCR). Before the
hardening the netplan had 1.1.1.1 / 8.8.8.8 only, which meant:
- Internal hostname lookups silently failed (slow timeout): the VM could not resolve idrac.viktorbarzin.lan or any internal helper.
- If Cloudflare's 1.1.1.1 had an outage, the VM would lose DNS entirely.
With the new config the VM can resolve both zones and keeps working if the primary DNS server is unreachable.
Apply / re-apply
ssh root@10.0.20.10 '
netplan generate
netplan apply
systemctl restart systemd-resolved
resolvectl status | head -20
'
netplan apply is not disruptive when only nameservers change — it
does not bounce the link.
Verification
ssh root@10.0.20.10 '
dig +short idrac.viktorbarzin.lan # 192.168.1.4
dig +short github.com # GitHub A record
dig +short registry.viktorbarzin.me # 10.0.20.10 + external A
'
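The three lookups above can also be wrapped in a loop that reports a pass or fail per hostname. This is a hedged sketch: check_resolves and the RESOLVE_CMD override are hypothetical names introduced here so the loop can be exercised without real DNS, and the default resolver command is assumed to be dig +short.

```shell
# Sketch only: report which hostnames resolve. RESOLVE_CMD is a
# hypothetical override hook; by default the loop calls "dig +short".
check_resolves() {
    cr_rc=0
    for cr_host in "$@"; do
        # dig prints diagnostics such as timeouts as lines starting
        # with ";", so those are filtered out before testing for an answer.
        cr_answer=$(${RESOLVE_CMD:-dig +short} "$cr_host" 2>/dev/null \
            | grep -v '^;' || true)
        if [ -n "$cr_answer" ]; then
            echo "ok   $cr_host"
        else
            echo "FAIL $cr_host"
            cr_rc=1
        fi
    done
    return "$cr_rc"
}

# Example (run on the VM):
#   check_resolves idrac.viktorbarzin.lan github.com registry.viktorbarzin.me
```

The function returns non-zero if any name fails, which makes it usable as a gate in a script.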
Fallback test — blackhole the primary and confirm external lookups still succeed through 94.140.14.14:
ssh root@10.0.20.10 '
ip route add blackhole 10.0.20.1
dig +short +time=5 +tries=2 github.com # should still answer
ip route del blackhole 10.0.20.1
'
Internal lookups do fail during the blackhole (the fallback is a public resolver and does not know about the internal zone), which is expected — the fallback buys availability for external pulls, not internal hostnames.
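The fallback test can be sketched as a single function that always removes the blackhole route and returns a status based on the external lookup. The names fallback_test, IP_CMD and RESOLVE_CMD are hypothetical and exist only so the sequence can be dry-run; the internal lookup is reported for information, on the assumption stated above that it is expected to fail.

```shell
# Sketch only: blackhole the primary resolver, test lookups, clean up.
# IP_CMD and RESOLVE_CMD are hypothetical override hooks for dry runs.
fallback_test() {
    ft_primary=$1
    ft_external=$2
    ft_internal=$3

    ${IP_CMD:-ip} route add blackhole "$ft_primary" || return 2

    ft_ext=$(${RESOLVE_CMD:-dig +short +time=5 +tries=2} "$ft_external" \
        2>/dev/null | grep -v '^;' || true)
    ft_int=$(${RESOLVE_CMD:-dig +short +time=5 +tries=2} "$ft_internal" \
        2>/dev/null | grep -v '^;' || true)

    # Remove the blackhole whether or not the lookups succeeded.
    ${IP_CMD:-ip} route del blackhole "$ft_primary"

    echo "external ($ft_external): ${ft_ext:-no answer}"
    echo "internal ($ft_internal): ${ft_int:-no answer (expected)}"
    [ -n "$ft_ext" ]
}

# Example (run as root on the VM):
#   fallback_test 10.0.20.1 github.com idrac.viktorbarzin.lan
```

Capturing the answers before deleting the route means a failed lookup cannot leave the blackhole in place.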
Rollback
A pre-change backup of /etc/resolv.conf, /etc/systemd/resolved.conf,
and /etc/netplan/ lives at
/root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz on the
VM. To roll back:
ssh root@10.0.20.10 '
BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
tar -xzf "$BACKUP" -C /
rm -f /etc/systemd/resolved.conf.d/10-internal-dns.conf
netplan apply
systemctl restart systemd-resolved
resolvectl status | head -10
'
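Since the rollback extracts straight into /, it may be worth listing the archive contents first. The helpers below are a minimal sketch with hypothetical names (latest_backup, preview_backup); they only select and list the archive and do not extract anything.

```shell
# Sketch only: pick the newest backup and list its contents before
# extracting it over /. Function names are illustrative assumptions.

# Print the most recently modified backup tarball in the given directory.
latest_backup() {
    ls -t "$1"/dns-config-backup-*.tar.gz 2>/dev/null | head -n 1
}

# List the files inside a backup without extracting them.
preview_backup() {
    tar -tzf "$1"
}

# Example (run on the VM):
#   BACKUP=$(latest_backup /root/dns-backups)
#   preview_backup "$BACKUP"
```

If the listing does not show the expected files under etc/, stop before running the extraction step.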
Auto-sync pipeline
Changes to modules/docker-registry/{docker-compose.yml, fix-broken-blobs.sh, cleanup-tags.sh, nginx_registry.conf, config-private.yml} deploy
automatically via .woodpecker/registry-config-sync.yml:
- Fires on push to master touching any of those paths, or via manual event (Woodpecker UI / API).
- SCPs every managed file to /opt/registry/ on 10.0.20.10.
- Bounces containers + nginx when a compose-visible file changed; leaves them alone when only scripts changed (cron picks up automatically).
- Runs a dry-run fix-broken-blobs.sh at the end to verify the registry is still coherent.
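The bounce decision in the third step can be illustrated with a small function. This is a sketch, not the pipeline's actual code: needs_bounce is a hypothetical name, and the split of managed files into compose-visible ones (docker-compose.yml, nginx_registry.conf, config-private.yml) versus scripts is an assumption based on the file names.

```shell
# Sketch only: decide whether a set of changed paths requires bouncing
# containers + nginx. The list of "compose-visible" files is an assumption.
needs_bounce() {
    for nb_path in "$@"; do
        # Compare on the file name only, ignoring the directory part.
        case "${nb_path##*/}" in
            docker-compose.yml|nginx_registry.conf|config-private.yml)
                return 0 ;;
        esac
    done
    return 1
}

# Example:
#   needs_bounce modules/docker-registry/cleanup-tags.sh || echo "scripts only"
```

A change set containing only the two scripts returns non-zero, which corresponds to the case where the containers are left alone.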
SSH credentials: Woodpecker repo-secret registry_ssh_key (ed25519,
provisioned 2026-04-19). Public key at /root/.ssh/authorized_keys on
10.0.20.10. Private key mirrored at secret/woodpecker/registry_ssh_key
in Vault (subkeys private_key / public_key / known_hosts_entry).
Manual override if you need to sync right now:
curl -sf -X POST \
-H "Authorization: Bearer $WOODPECKER_TOKEN" \
"https://ci.viktorbarzin.me/api/repos/1/pipelines" \
-d '{"branch":"master"}' | jq .number
Bouncing registry containers — the nginx DNS trap
docker compose up -d on /opt/registry/docker-compose.yml recreates
registry-* containers when their image tag changes, which assigns them
new IPs on the registry bridge network. registry-nginx resolves its
upstream DNS names (registry-private, registry-dockerhub, …) ONCE at
startup and caches the results — it does not re-resolve after a
recreate.
Symptom if you forget: /v2/_catalog on :5050 returns
{"repositories": []}, /v2/ returns 200 without auth, pulls return
the wrong image. nginx is forwarding to a stale IP that now belongs to a
different registry-* backend (commonly the pull-through ghcr or
dockerhub cache, which have empty catalogs from the htpasswd-auth user's
perspective).
Always follow a registry-* bounce with docker restart registry-nginx.
Or prevent the problem by setting a resolver directive in
nginx_registry.conf so upstream names are re-resolved per request.
ssh root@10.0.20.10 '
cd /opt/registry && docker compose up -d
docker restart registry-nginx
sleep 3
docker ps --format "{{.Names}}\t{{.Image}}\t{{.Status}}" \
| grep -E "registry-"
'
Related docs
- docs/architecture/dns.md: resolver IP assignments per subnet.
- .claude/CLAUDE.md (at repo root): notes on the private registry and containerd hosts.toml redirects.
- docs/runbooks/registry-rebuild-image.md: rebuild an image after an orphan OCI-index incident (different class of problem than DNS).
- docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause and detection gaps behind the recurring missing-blob incidents.