infra/docs/runbooks/registry-vm.md

# Runbook: Registry VM (docker-registry, 10.0.20.10)

Last updated: 2026-04-19

The registry VM hosts `registry.viktorbarzin.me` (private Docker
registry, htpasswd-auth, NGINX → registry:2). It is an Ubuntu 24.04
VM on the cluster LAN subnet `10.0.20.0/24`, with a static netplan
config (no DHCP). Because it sits on a subnet that only has pfSense
as its gateway, its DNS must be statically configured.

## DNS configuration

Ubuntu ships `systemd-resolved` and uses netplan to declare per-link
`nameservers`. Netplan renders systemd-networkd or NetworkManager
configs, and resolved picks up the per-link servers from there at
runtime. There is **no automatic merging** of netplan DNS with the
`[Resolve]` section of `/etc/systemd/resolved.conf`: resolved tracks
per-link and global servers as separate scopes, so changing one leaves
the other untouched. Both layers must therefore be kept in sync:

| Layer | File | Role |
|---|---|---|
| Netplan | `/etc/netplan/50-cloud-init.yaml` | Per-link DNS servers that resolved reports on `Link 2 (eth0)` |
| Resolved global | `/etc/systemd/resolved.conf.d/10-internal-dns.conf` | `Global` scope `DNS=` / `FallbackDNS=` — also shown in `resolvectl status` |

### Current state

`/etc/systemd/resolved.conf.d/10-internal-dns.conf`:

```ini
[Resolve]
DNS=10.0.20.1
FallbackDNS=94.140.14.14
Domains=viktorbarzin.lan
```

`/etc/netplan/50-cloud-init.yaml` (eth0 block, simplified):

```yaml
nameservers:
  addresses:
  - 10.0.20.1
  - 94.140.14.14
  search:
  - viktorbarzin.lan
```

`resolvectl status` output after the change:

```
Global
  resolv.conf mode: stub
  Current DNS Server: 10.0.20.1
  DNS Servers: 10.0.20.1
  Fallback DNS Servers: 94.140.14.14
  DNS Domain: viktorbarzin.lan

Link 2 (eth0)
  Current Scopes: DNS
  Current DNS Server: 10.0.20.1
  DNS Servers: 10.0.20.1 94.140.14.14
  DNS Domain: viktorbarzin.lan
```

| Field | Value | Purpose |
|---|---|---|
| Primary | `10.0.20.1` | pfSense OPT1 interface (dnsmasq forwarder → Technitium LB) — resolves `.viktorbarzin.lan` |
| Fallback | `94.140.14.14` | AdGuard public DNS — used if pfSense unreachable (e.g., OPT1 flap) |
| Search | `viktorbarzin.lan` | Unqualified names resolve against the internal zone |
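
Because the two layers are configured separately, it can help to check mechanically that both still list the same primary resolver. The following is a minimal sketch, not part of the deployed tooling: the function names `dns_servers_by_section` and `layers_in_sync` are hypothetical, and it assumes `resolvectl status` output shaped like the sample above (unindented section titles, indented `DNS Servers:` lines).

```sh
#!/bin/sh
# Sketch: confirm every resolved scope lists the expected primary first.
# Reads `resolvectl status` text on stdin. Function names are hypothetical.

PRIMARY="${PRIMARY:-10.0.20.1}"

# Print "<section>: <servers>" for every "DNS Servers:" line.
# A section starts at an unindented line ("Global", "Link 2 (eth0)").
# "Fallback DNS Servers:" and "Current DNS Server:" lines are skipped.
dns_servers_by_section() {
  awk '
    /^[^ \t]/ { section = $0 }
    /^[ \t]+DNS Servers:/ {
      sub(/^[ \t]+DNS Servers:[ \t]*/, "")
      print section ": " $0
    }
  '
}

# Succeed only if at least one section was found and each one
# lists PRIMARY as its first server.
layers_in_sync() {
  dns_servers_by_section | awk -v primary="$PRIMARY" '
    {
      n++
      split($0, parts, ": ")
      split(parts[2], servers, " ")
      if (servers[1] != primary) bad = 1
    }
    END { exit (n == 0 || bad) ? 1 : 0 }
  '
}
```

One way to use it from a workstation is `ssh root@10.0.20.10 resolvectl status | layers_in_sync && echo "in sync"`.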

### Why this matters for the registry

Container builds on this VM reference `.lan` hostnames (Technitium,
NFS, etc.) and external hostnames (Docker Hub, GHCR). Before the
hardening, netplan listed only `1.1.1.1` / `8.8.8.8`, which meant:

1. Internal hostname lookups silently failed (slow timeout) — the
   VM could not resolve `idrac.viktorbarzin.lan` or any internal
   helper.
2. If Cloudflare's 1.1.1.1 had an outage, the VM would lose DNS
   entirely.

With the new config the VM can resolve both zones, and external
lookups keep working if the primary DNS server is unreachable.

## Apply / re-apply

```sh
ssh root@10.0.20.10 '
  netplan generate
  netplan apply
  systemctl restart systemd-resolved
  resolvectl status | head -20
'
```

`netplan apply` is not disruptive when only `nameservers` change — it
does not bounce the link.

## Verification

```sh
ssh root@10.0.20.10 '
  dig +short idrac.viktorbarzin.lan       # 192.168.1.4
  dig +short github.com                   # GitHub A record
  dig +short registry.viktorbarzin.me     # 10.0.20.10 + external A
'
```
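
The manual checks above can also be turned into pass/fail assertions. This is a sketch under stated assumptions: the helper names (`resolves_to`, `resolves`, `run_checks`) are hypothetical, the expected address for `idrac.viktorbarzin.lan` is the one noted in the comment above, and the script is meant to run on the VM itself so that `dig` goes through the local stub resolver.

```sh
#!/bin/sh
# Sketch: assert on dig answers instead of eyeballing them.
# Helper names are hypothetical.

# Succeed if `dig +short NAME` returns EXPECTED as one of its answer lines.
resolves_to() {
  name="$1"; expected="$2"
  dig +short "$name" | grep -qxF "$expected"
}

# Succeed if NAME resolves to at least one answer of any value.
resolves() {
  [ -n "$(dig +short "$1")" ]
}

# Report each zone that does not resolve as expected.
run_checks() {
  resolves_to idrac.viktorbarzin.lan 192.168.1.4 || echo "FAIL: internal zone"
  resolves github.com                            || echo "FAIL: external zone"
  resolves registry.viktorbarzin.me              || echo "FAIL: registry name"
}
```

Calling `run_checks` prints nothing when all three lookups behave as expected, which keeps it easy to wire into a cron job or a CI step later.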

Fallback test — blackhole the primary and confirm external lookups
still succeed through 94.140.14.14:

```sh
ssh root@10.0.20.10 '
  # Remove the blackhole on exit, even if dig fails.
  trap "ip route del blackhole 10.0.20.1" EXIT
  ip route add blackhole 10.0.20.1
  dig +short +time=5 +tries=2 github.com   # should still answer
'
```

Internal lookups do fail during the blackhole (the fallback is a
public resolver and does not know about the internal zone), which is
expected — the fallback buys availability for external pulls, not
internal hostnames.

## Rollback

A pre-change backup of `/etc/resolv.conf`, `/etc/systemd/resolved.conf`,
and `/etc/netplan/` lives at
`/root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz` on the
VM. To roll back:

```sh
ssh root@10.0.20.10 '
  BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
  tar -xzf "$BACKUP" -C /
  rm -f /etc/systemd/resolved.conf.d/10-internal-dns.conf
  netplan apply
  systemctl restart systemd-resolved
  resolvectl status | head -10
'
```
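
The rollback relies on an archive whose paths are relative to `/`. The command that produced the existing backup is not recorded here, so the following is only a sketch of how a fresh archive in the same layout could be taken before a future change. The function names `backup_path` and `take_dns_backup` are hypothetical.

```sh
#!/bin/sh
# Sketch: build a timestamped archive in the layout the rollback expects.
# Function names are hypothetical.

BACKUP_DIR="${BACKUP_DIR:-/root/dns-backups}"

# Print the archive path for the current time (YYYYMMDD-HHMMSS).
backup_path() {
  printf '%s/dns-config-backup-%s.tar.gz\n' \
    "$BACKUP_DIR" "$(date +%Y%m%d-%H%M%S)"
}

# Archive the three locations named above with paths relative to /,
# so that `tar -xzf "$BACKUP" -C /` restores them in place.
# Prints the archive path on success.
take_dns_backup() {
  mkdir -p "$BACKUP_DIR" || return 1
  archive=$(backup_path)
  tar -czf "$archive" -C / \
    etc/resolv.conf etc/systemd/resolved.conf etc/netplan &&
    printf '%s\n' "$archive"
}
```

The drop-in under `/etc/systemd/resolved.conf.d/` is deliberately not part of a pre-change archive, which is why the rollback removes it explicitly.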

## Auto-sync pipeline

Changes to `modules/docker-registry/{docker-compose.yml, fix-broken-blobs.sh,
cleanup-tags.sh, nginx_registry.conf, config-private.yml}` deploy
automatically via `.woodpecker/registry-config-sync.yml`:

- Fires on `push` to master touching any of those paths, or via `manual`
  event (Woodpecker UI / API).
- SCPs every managed file to `/opt/registry/` on `10.0.20.10`.
- Bounces containers + nginx when a compose-visible file changed; leaves
  them alone when only scripts changed (cron runs the updated scripts on
  its next schedule).
- Runs a dry-run `fix-broken-blobs.sh` at the end to verify the registry
  is still coherent.
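
The bounce decision in the list above can be sketched as a small shell function. This is an illustration, not the contents of `.woodpecker/registry-config-sync.yml`: the function names are hypothetical, and the split into "compose-visible" files (`docker-compose.yml`, `nginx_registry.conf`, `config-private.yml`) versus scripts is an assumption based on the file names.

```sh
#!/bin/sh
# Sketch: decide whether a set of changed files requires a bounce.
# Which files count as compose-visible is an assumption for illustration.

# Succeed (exit 0) if any path given as an argument is compose-visible.
needs_bounce() {
  for path in "$@"; do
    case "${path##*/}" in
      docker-compose.yml|nginx_registry.conf|config-private.yml) return 0 ;;
    esac
  done
  return 1
}

# Print the action that would follow for the given changed files.
plan_for() {
  if needs_bounce "$@"; then
    echo "docker compose up -d && docker restart registry-nginx"
  else
    echo "no bounce"
  fi
}
```

For example, `plan_for modules/docker-registry/cleanup-tags.sh` prints `no bounce`, while any change set that includes `docker-compose.yml` prints the bounce command with the nginx restart attached.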

SSH credentials: Woodpecker repo-secret `registry_ssh_key` (ed25519,
provisioned 2026-04-19). Public key at `/root/.ssh/authorized_keys` on
`10.0.20.10`. Private key mirrored at `secret/woodpecker/registry_ssh_key`
in Vault (subkeys `private_key` / `public_key` / `known_hosts_entry`).

Manual override if you need to sync right now:

```sh
curl -sf -X POST \
  -H "Authorization: Bearer $WOODPECKER_TOKEN" \
  "https://ci.viktorbarzin.me/api/repos/1/pipelines" \
  -d '{"branch":"master"}' | jq .number
```

## Bouncing registry containers — the nginx DNS trap

`docker compose up -d` on `/opt/registry/docker-compose.yml` recreates
`registry-*` containers when their image tag changes, which assigns them
new IPs on the `registry` bridge network. **`registry-nginx` resolves its
upstream DNS names (`registry-private`, `registry-dockerhub`, …) ONCE at
startup and caches the results** — it does not re-resolve after a
recreate.

Symptom if you forget: `/v2/_catalog` on `:5050` returns
`{"repositories": []}`, `/v2/` returns 200 without auth, pulls return
the wrong image. nginx is forwarding to a stale IP that now belongs to a
different registry-* backend (commonly the pull-through ghcr or
dockerhub cache, which have empty catalogs from the htpasswd-auth user's
perspective).

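**Always follow a registry-* bounce with `docker restart registry-nginx`.**
Alternatively, prevent the problem in `nginx_registry.conf`. A `resolver`
directive on its own only affects names nginx resolves at runtime, so pair
it with a variable in `proxy_pass` (Docker's embedded DNS listens on
`127.0.0.11`). nginx then re-resolves the upstream name when its cached
answer expires instead of only at startup.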

```sh
ssh root@10.0.20.10 '
  cd /opt/registry && docker compose up -d
  docker restart registry-nginx
  sleep 3
  docker ps --format "{{.Names}}\t{{.Image}}\t{{.Status}}" \
    | grep -E "registry-"
'
```
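
The symptom described above (`/v2/` answering 200 without credentials) lends itself to an automated check after each bounce. The following is a minimal sketch: the function names are hypothetical, and the default `BASE_URL` (scheme and host for port `5050`) is an assumption to adjust for the actual listener.

```sh
#!/bin/sh
# Sketch: detect the stale-upstream symptom after a bounce.
# Function names and the default BASE_URL are assumptions.

# Map the HTTP status of an unauthenticated GET /v2/ to a verdict.
# 401 is what an htpasswd-protected registry answers without credentials;
# 200 matches the symptom described in this section.
v2_verdict() {
  case "$1" in
    401) echo "ok" ;;
    200) echo "stale-upstream-suspected" ;;
    *)   echo "unknown" ;;
  esac
}

# Probe the registry front end without credentials and print the verdict.
probe_v2() {
  base_url="${BASE_URL:-https://localhost:5050}"
  code=$(curl -sk -o /dev/null -w '%{http_code}' "$base_url/v2/")
  v2_verdict "$code"
}
```

Running `probe_v2` on the VM right after `docker restart registry-nginx` should print `ok`; a `stale-upstream-suspected` verdict suggests the restart was skipped or did not take effect.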

## Related docs

- `docs/architecture/dns.md` — resolver IP assignments per subnet.
- `.claude/CLAUDE.md` (at repo root) — notes on the private registry
  and `containerd` `hosts.toml` redirects.
- `docs/runbooks/registry-rebuild-image.md` — rebuild an image after an
  orphan OCI-index incident (different class of problem than DNS).
- `docs/post-mortems/2026-04-19-registry-orphan-index.md` — root cause
  + detection gaps behind the recurring missing-blob incidents.