infra/docs/runbooks/registry-vm.md
Viktor Barzin 34ee282d88 [ci] Auto-sync modules/docker-registry/* to registry VM + runbook docs
Replaces the manual scp+bounce sequence that landed registry:2.8.3 on
10.0.20.10 today (see commit 7cb44d72 + nginx-DNS-trap in runbook).
Addresses the "no repeat manual fixes" preference — future changes to
docker-compose.yml / fix-broken-blobs.sh / nginx_registry.conf /
config-private.yml / cleanup-tags.sh now deploy through CI.

Pipeline (.woodpecker/registry-config-sync.yml) mirrors
pve-nfs-exports-sync.yml: ssh-keyscan pin, scp the whole managed set,
bounce compose only when compose-visible files changed, always restart
nginx after a compose bounce (critical — nginx caches upstream DNS), end
with a dry-run fix-broken-blobs.sh to catch regressions.

Credentials:
 - Woodpecker repo-secret `registry_ssh_key` (events: push, manual)
 - Mirror at Vault `secret/woodpecker/registry_ssh_key`
   (private_key / public_key / known_hosts_entry)
 - Public key on /root/.ssh/authorized_keys on 10.0.20.10
 - Key label: woodpecker-registry-config-sync

Runbook updated with "Auto-sync pipeline" section pointing at the new
flow + manual override command.

Closes: code-3vl

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:32:12 +00:00


# Runbook: Registry VM (docker-registry, 10.0.20.10)
Last updated: 2026-04-19
The registry VM hosts `registry.viktorbarzin.me` (private Docker
registry, htpasswd-auth, NGINX → registry:2). It is an Ubuntu 24.04
VM on the cluster LAN subnet `10.0.20.0/24`, with a static netplan
config (no DHCP). Because it sits on a subnet that only has pfSense
as its gateway, its DNS must be statically configured.
## DNS configuration
Ubuntu ships `systemd-resolved` and uses netplan to declare per-link
`nameservers`. Netplan writes systemd-networkd or NetworkManager
configs that resolved reads at runtime. There is **no automatic
merging** of netplan DNS with the `[Resolve]` section of
`/etc/systemd/resolved.conf` — per-link settings override the global
ones. So both layers must be in sync:
| Layer | File | Role |
|---|---|---|
| Netplan | `/etc/netplan/50-cloud-init.yaml` | Per-link DNS servers that resolved reports on `Link 2 (eth0)` |
| Resolved global | `/etc/systemd/resolved.conf.d/10-internal-dns.conf` | `Global` scope `DNS=` / `FallbackDNS=` — also shown in `resolvectl status` |
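
Because the two layers are never merged, drift between them is easy to miss. A minimal sketch of a drift check follows, assuming input in the `Scope: server server` shape that `resolvectl dns` prints. `primaries_in_sync` is a hypothetical helper name, not something that exists in the repo.

```sh
# Hypothetical helper: reads `resolvectl dns`-style lines on stdin, e.g.
#   Global: 10.0.20.1
#   Link 2 (eth0): 10.0.20.1 94.140.14.14
# and succeeds only when every scope that lists servers has the same
# first (primary) server. Scopes without servers are ignored.
primaries_in_sync() {
  awk -F': ' '
    NF >= 2 && $2 != "" { split($2, servers, " "); seen[servers[1]] = 1 }
    END {
      n = 0
      for (ip in seen) n++
      exit (n == 1 ? 0 : 1)
    }'
}
```

One possible use is `ssh root@10.0.20.10 resolvectl dns | primaries_in_sync || echo "DNS layers drifted"`. The check only compares primaries, so differences in the fallback server lists go unnoticed.
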
### Current state
`/etc/systemd/resolved.conf.d/10-internal-dns.conf`:
```ini
[Resolve]
DNS=10.0.20.1
FallbackDNS=94.140.14.14
Domains=viktorbarzin.lan
```
`/etc/netplan/50-cloud-init.yaml` (eth0 block, simplified):
```yaml
nameservers:
  addresses:
    - 10.0.20.1
    - 94.140.14.14
  search:
    - viktorbarzin.lan
```
`resolvectl status` output after the change:
```
Global
resolv.conf mode: stub
Current DNS Server: 10.0.20.1
DNS Servers: 10.0.20.1
Fallback DNS Servers: 94.140.14.14
DNS Domain: viktorbarzin.lan
Link 2 (eth0)
Current Scopes: DNS
Current DNS Server: 10.0.20.1
DNS Servers: 10.0.20.1 94.140.14.14
DNS Domain: viktorbarzin.lan
```
| Field | Value | Purpose |
|---|---|---|
| Primary | `10.0.20.1` | pfSense OPT1 interface (dnsmasq forwarder → Technitium LB) — resolves `.viktorbarzin.lan` |
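| Fallback | `94.140.14.14` | AdGuard public DNS, used if pfSense is unreachable (e.g., OPT1 flap). Failover comes from the per-link server list; resolved consults `FallbackDNS=` only when no other servers are configured |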
| Search | `viktorbarzin.lan` | Unqualified names resolve against the internal zone |
### Why this matters for the registry
Container builds on this VM reference `.lan` hostnames (Technitium,
NFS, etc.) and external hostnames (Docker Hub, GHCR). Before the
hardening the netplan had `1.1.1.1` / `8.8.8.8` only, which meant:
1. Internal hostname lookups silently failed (slow timeout) — the
VM could not resolve `idrac.viktorbarzin.lan` or any internal
helper.
2. If Cloudflare's 1.1.1.1 had an outage, the VM would lose DNS
entirely.
With the new config the VM can resolve both zones and keeps working
if the primary DNS server is unreachable.
## Apply / re-apply
```sh
ssh root@10.0.20.10 '
netplan generate
netplan apply
systemctl restart systemd-resolved
resolvectl status | head -20
'
```
`netplan apply` is not disruptive when only `nameservers` change — it
does not bounce the link.
## Verification
```sh
ssh root@10.0.20.10 '
dig +short idrac.viktorbarzin.lan # 192.168.1.4
dig +short github.com # GitHub A record
dig +short registry.viktorbarzin.me # 10.0.20.10 + external A
'
```
Fallback test — blackhole the primary and confirm external lookups
still succeed through 94.140.14.14:
```sh
ssh root@10.0.20.10 '
ip route add blackhole 10.0.20.1
dig +short +time=5 +tries=2 github.com # should still answer
ip route del blackhole 10.0.20.1
'
```
Internal lookups do fail during the blackhole (the fallback is a
public resolver and does not know about the internal zone), which is
expected — the fallback buys availability for external pulls, not
internal hostnames.
## Rollback
A pre-change backup of `/etc/resolv.conf`, `/etc/systemd/resolved.conf`,
and `/etc/netplan/` lives at
`/root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz` on the
VM. To roll back:
```sh
ssh root@10.0.20.10 '
BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
tar -xzf "$BACKUP" -C /
rm -f /etc/systemd/resolved.conf.d/10-internal-dns.conf
netplan apply
systemctl restart systemd-resolved
resolvectl status | head -10
'
```
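
The rollback depends on a backup already being present. Before the next DNS change, a fresh one could be taken along these lines. This is a sketch only: `make_dns_backup` is a hypothetical helper, and the archive name simply follows the `dns-config-backup-YYYYMMDD-HHMMSS.tar.gz` pattern described above.

```sh
# Hypothetical helper: archive the given paths into DEST_DIR using the
# dns-config-backup-YYYYMMDD-HHMMSS.tar.gz naming the rollback expects,
# and print the archive path on success.
make_dns_backup() {
  dest_dir=$1
  shift
  mkdir -p "$dest_dir" || return 1
  archive="$dest_dir/dns-config-backup-$(date +%Y%m%d-%H%M%S).tar.gz"
  # GNU tar stores absolute paths without the leading "/", so a later
  # `tar -xzf ... -C /` puts the files back where they came from.
  tar -czf "$archive" "$@" && printf '%s\n' "$archive"
}
```

On the VM this would be called as `make_dns_backup /root/dns-backups /etc/resolv.conf /etc/systemd/resolved.conf /etc/netplan`, which matches the file set the rollback restores.
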
## Auto-sync pipeline
Changes to `modules/docker-registry/{docker-compose.yml, fix-broken-blobs.sh,
cleanup-tags.sh, nginx_registry.conf, config-private.yml}` deploy
automatically via `.woodpecker/registry-config-sync.yml`:
- Fires on `push` to master touching any of those paths, or via `manual`
event (Woodpecker UI / API).
- SCPs every managed file to `/opt/registry/` on `10.0.20.10`.
- Bounces containers + nginx when a compose-visible file changed; leaves
them alone when only scripts changed (cron picks up the new versions on its next run).
- Runs a dry-run `fix-broken-blobs.sh` at the end to verify the registry
is still coherent.
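
The bounce decision in the list above can be sketched as a small shell function. This is not the pipeline's actual code. `needs_bounce` is a hypothetical name, and which files count as "compose-visible" is an assumption: everything in the managed set except the two cron-driven scripts.

```sh
# Hypothetical helper: given the paths changed in a push, succeed (exit 0)
# when a compose bounce is needed. ASSUMPTION: docker-compose.yml,
# nginx_registry.conf and config-private.yml are the compose-visible files;
# fix-broken-blobs.sh and cleanup-tags.sh are scripts that need no bounce.
needs_bounce() {
  for path in "$@"; do
    # Compare on the file name only, so any directory prefix is accepted.
    case "${path##*/}" in
      docker-compose.yml|nginx_registry.conf|config-private.yml) return 0 ;;
    esac
  done
  return 1
}
```

A caller would run `docker compose up -d` followed by `docker restart registry-nginx` only when `needs_bounce` succeeds for the changed paths, and skip both otherwise.
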
SSH credentials: Woodpecker repo-secret `registry_ssh_key` (ed25519,
provisioned 2026-04-19). Public key at `/root/.ssh/authorized_keys` on
`10.0.20.10`. Private key mirrored at `secret/woodpecker/registry_ssh_key`
in Vault (subkeys `private_key` / `public_key` / `known_hosts_entry`).
Manual override if you need to sync right now:
```sh
curl -sf -X POST \
  -H "Authorization: Bearer $WOODPECKER_TOKEN" \
  -H "Content-Type: application/json" \
  "https://ci.viktorbarzin.me/api/repos/1/pipelines" \
  -d '{"branch":"master"}' | jq .number
## Bouncing registry containers — the nginx DNS trap
`docker compose up -d` on `/opt/registry/docker-compose.yml` recreates
`registry-*` containers when their image tag changes, which assigns them
new IPs on the `registry` bridge network. **`registry-nginx` resolves its
upstream DNS names (`registry-private`, `registry-dockerhub`, …) ONCE at
startup and caches the results** — it does not re-resolve after a
recreate.
Symptom if you forget: `/v2/_catalog` on `:5050` returns
`{"repositories": []}`, `/v2/` returns 200 without auth, pulls return
the wrong image. nginx is forwarding to a stale IP that now belongs to a
different registry-* backend (commonly the pull-through ghcr or
dockerhub cache, which have empty catalogs from the htpasswd-auth user's
perspective).
**Always follow a registry-* bounce with `docker restart registry-nginx`.**
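Or prevent the problem by setting a `resolver` directive in
`nginx_registry.conf` (Docker's embedded DNS listens on `127.0.0.11`) and
passing the upstream to `proxy_pass` through a variable, so names are
re-resolved once the cached answer expires instead of only at startup.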
```sh
ssh root@10.0.20.10 '
cd /opt/registry && docker compose up -d
docker restart registry-nginx
sleep 3
docker ps --format "{{.Names}}\t{{.Image}}\t{{.Status}}" \
| grep -E "registry-"
'
```
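
The "`/v2/` returns 200 without auth" symptom is a cheap post-bounce smoke test. The sketch below classifies the HTTP status of an unauthenticated `/v2/` request. `classify_v2_status` is a hypothetical helper, and treating `401` as the healthy answer is an assumption based on the registry being htpasswd-protected.

```sh
# Hypothetical helper: interpret the HTTP status of an unauthenticated
# GET /v2/ against the private registry.
# ASSUMPTION: a healthy htpasswd-protected registry answers 401.
classify_v2_status() {
  case "$1" in
    401) echo "ok: registry is asking for credentials"; return 0 ;;
    200) echo "suspect: /v2/ answered without auth; restart registry-nginx"; return 1 ;;
    *)   echo "unknown: HTTP $1"; return 2 ;;
  esac
}
```

It could be fed from curl, for example `classify_v2_status "$(curl -s -o /dev/null -w '%{http_code}' "$REGISTRY_URL/v2/")"`, where `REGISTRY_URL` is a placeholder for whichever endpoint fronts the private registry.
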
## Related docs
- `docs/architecture/dns.md` — resolver IP assignments per subnet.
- `.claude/CLAUDE.md` (at repo root) — notes on the private registry
and `containerd` `hosts.toml` redirects.
- `docs/runbooks/registry-rebuild-image.md` — rebuild an image after an
orphan OCI-index incident (different class of problem than DNS).
- `docs/post-mortems/2026-04-19-registry-orphan-index.md` — root cause
+ detection gaps behind the recurring missing-blob incidents.