End of forgejo-registry-consolidation. After Phase 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.

Phase 3: image= flips

  * infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
    fire-planner,freedify/factory,chrome-service,beads-server}/main.tf:
    image= now points to forgejo.viktorbarzin.me/viktor/<name>.
  * infra/stacks/claude-memory/main.tf: also moved off DockerHub
    (viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
  * infra/.woodpecker/{default,drift-detection}.yml: infra-ci is now
    pulled from Forgejo. build-ci-image.yml still dual-pushes until the
    next build cycle confirms Forgejo as canonical.
  * /home/wizard/code/CLAUDE.md: claude-memory-mcp install URL updated.

Phase 4: decommission registry-private

  * registry-credentials Secret: dropped registry.viktorbarzin.me /
    registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
    Forgejo entry is the only one left.
  * infra/stacks/infra/main.tf cloud-init: dropped containerd
    hosts.toml entries for registry.viktorbarzin.me +
    10.0.20.10:5050. (Existing nodes already had the file removed
    manually by `setup-forgejo-containerd-mirror.sh` rollout; the
    cloud-init template only fires on new VM provision.)
  * infra/modules/docker-registry/docker-compose.yml: registry-private
    service block removed; nginx 5050 port mapping dropped. Pull-through
    caches for upstream registries (5000/5010/5020/5030/5040) stay on
    the VM permanently.
  * infra/modules/docker-registry/nginx_registry.conf: upstream
    `private` block + port 5050 server block removed.
  * infra/stacks/monitoring/modules/monitoring/main.tf:
    registry_integrity_probe + registry_probe_credentials resources
    stripped. forgejo_integrity_probe is the only manifest probe now.

Phase 5: final docs sweep

  * infra/docs/runbooks/registry-vm.md: VM scope reduced to
    pull-through caches; forgejo-registry-breakglass.md cross-ref added.
  * infra/docs/architecture/ci-cd.md: registry component table +
    diagram now reflect Forgejo. Pre-migration root-cause sentence
    preserved as historical context with a pointer to the design doc.
  * infra/docs/architecture/monitoring.md: Registry Integrity Probe
    row updated to point at the Forgejo probe.
  * infra/.claude/CLAUDE.md: Private registry section rewritten
    end-to-end (auth, retention, integrity, where the bake came from).
  * prometheus_chart_values.tpl: RegistryManifestIntegrityFailure
    alert annotation simplified now that only one registry is in
    scope.

Operational follow-up (cannot be done from a TF apply):

  1. ssh root@10.0.20.10, edit /opt/registry/docker-compose.yml to
     match the new template, AND run `docker compose up -d --remove-orphans`
     to actually stop the registry-private container. Memory id=1078
     confirms cloud-init won't redeploy on TF apply alone.
  2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
     on the VM (~2.6GB freed).
  3. Open the dual-push step in build-ci-image.yml and drop
     registry.viktorbarzin.me:5050 from the `repo:` list. At that
     point the post-push integrity check at lines 33-107 also needs
     to be repointed at Forgejo or removed (the per-build verify is
     redundant with the every-15min Forgejo probe).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
# Runbook: Registry VM (docker-registry, 10.0.20.10)

Last updated: 2026-05-07

The registry VM is an Ubuntu 24.04 VM on the cluster LAN subnet
`10.0.20.0/24`, with a static netplan config (no DHCP). Because it
sits on a subnet that only has pfSense as its gateway, its DNS must
be statically configured.

**As of Phase 4 of forgejo-registry-consolidation (2026-05-07)** the VM
no longer hosts the private R/W registry. It hosts pull-through
caches only:

| Port | Upstream |
|---|---|
| 5000 | docker.io (Docker Hub), auth via `dockerhub_registry_password` |
| 5010 | ghcr.io |
| 5020 | quay.io |
| 5030 | registry.k8s.io |
| 5040 | reg.kyverno.io |

The private registry that used to listen on port 5050 has been
decommissioned; its images are now hosted on Forgejo at
`forgejo.viktorbarzin.me/viktor/<image>`. See
`docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md` for the
migration. Break-glass tarballs of `infra-ci` are still produced on
each build to `/opt/registry/data/private/_breakglass/`; see
`docs/runbooks/forgejo-registry-breakglass.md`.

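The port table above can be turned into a quick liveness sweep. The sketch below only builds the probe URLs; the helper names `cache_table` and `cache_probe_urls` are hypothetical, and the `http` scheme is an assumption (this runbook does not say whether nginx terminates TLS on these ports). `/v2/` is the standard registry API base path.

```sh
# Ports and upstreams copied from the table above.
cache_table() {
  cat <<'EOF'
5000 docker.io
5010 ghcr.io
5020 quay.io
5030 registry.k8s.io
5040 reg.kyverno.io
EOF
}

# Print one probe URL per cache, with the upstream name as a trailing note.
# Scheme defaults to http, which is an assumption; pass "https" as the
# second argument if the listeners use TLS.
cache_probe_urls() {
  cp_host="${1:-10.0.20.10}"
  cp_scheme="${2:-http}"
  cache_table | while read -r cp_port cp_upstream; do
    printf '%s://%s:%s/v2/  # %s\n' "$cp_scheme" "$cp_host" "$cp_port" "$cp_upstream"
  done
}
```

A possible way to use it from a machine on the cluster LAN: `cache_probe_urls | while read -r url _; do curl -s -o /dev/null -w "%{http_code} $url\n" "$url"; done`. A registry that is up normally answers `/v2/` with 200 or 401, so either code suggests the listener is alive; a connection error or 502 points at the container or at nginx.
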

## DNS configuration

Ubuntu ships `systemd-resolved` and uses netplan to declare per-link
`nameservers`. Netplan writes systemd-networkd or NetworkManager
configs that resolved reads at runtime. There is **no automatic
merging** of netplan DNS with the `[Resolve]` section of
`/etc/systemd/resolved.conf`: per-link settings override the global
ones. So both layers must be in sync:

| Layer | File | Role |
|---|---|---|
| Netplan | `/etc/netplan/50-cloud-init.yaml` | Per-link DNS servers that resolved reports on `Link 2 (eth0)` |
| Resolved global | `/etc/systemd/resolved.conf.d/10-internal-dns.conf` | `Global` scope `DNS=` / `FallbackDNS=`, also shown in `resolvectl status` |

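Because the two layers are maintained by hand, a small consistency check can catch drift between them. The following is a minimal sketch, not part of the existing tooling: the function names are hypothetical, it assumes `DNS=` and `FallbackDNS=` each hold a single address, and it matches netplan list items as plain text rather than parsing YAML.

```sh
# Print the value of a key (e.g. DNS, FallbackDNS) from a resolved.conf-style file.
resolved_value() {
  rv_key="$1"; rv_file="$2"
  sed -n "s/^${rv_key}=//p" "$rv_file" | tail -n 1
}

# Succeed if the address appears as a "- <addr>" list item in a netplan file.
# Text match only: it cannot tell which interface or key the item belongs to.
netplan_lists_address() {
  nl_addr_re=$(printf '%s' "$1" | sed 's/\./\\./g')   # escape dots for grep -E
  grep -Eq "^[[:space:]]*-[[:space:]]*${nl_addr_re}[[:space:]]*$" "$2"
}

# Succeed only if the drop-in's primary and fallback servers are both
# listed in the netplan file.
dns_layers_in_sync() {
  ds_dropin="$1"; ds_netplan="$2"
  for ds_key in DNS FallbackDNS; do
    ds_addr=$(resolved_value "$ds_key" "$ds_dropin")
    [ -n "$ds_addr" ] && netplan_lists_address "$ds_addr" "$ds_netplan" || return 1
  done
}
```

On the VM this could be called as `dns_layers_in_sync /etc/systemd/resolved.conf.d/10-internal-dns.conf /etc/netplan/50-cloud-init.yaml && echo in-sync`.
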

### Current state

`/etc/systemd/resolved.conf.d/10-internal-dns.conf`:

```ini
[Resolve]
DNS=10.0.20.1
FallbackDNS=94.140.14.14
Domains=viktorbarzin.lan
```

`/etc/netplan/50-cloud-init.yaml` (eth0 block, simplified):

```yaml
nameservers:
  addresses:
    - 10.0.20.1
    - 94.140.14.14
  search:
    - viktorbarzin.lan
```

`resolvectl status` output after the change:

```
Global
    resolv.conf mode: stub
    Current DNS Server: 10.0.20.1
    DNS Servers: 10.0.20.1
    Fallback DNS Servers: 94.140.14.14
    DNS Domain: viktorbarzin.lan

Link 2 (eth0)
    Current Scopes: DNS
    Current DNS Server: 10.0.20.1
    DNS Servers: 10.0.20.1 94.140.14.14
    DNS Domain: viktorbarzin.lan
```

| Field | Value | Purpose |
|---|---|---|
| Primary | `10.0.20.1` | pfSense OPT1 interface (dnsmasq forwarder → Technitium LB); resolves `.viktorbarzin.lan` |
| Fallback | `94.140.14.14` | AdGuard public DNS; used if pfSense unreachable (e.g., OPT1 flap) |
| Search | `viktorbarzin.lan` | Unqualified names resolve against the internal zone |

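To compare a live VM against the expected state without reading the output by eye, the `DNS Servers:` line of one scope can be extracted with `awk`. This is a sketch under two assumptions: scope headers (`Global`, `Link 2 (eth0)`) start in column 0 while their fields are indented, and the server list fits on one line. The function name `scope_dns_servers` is hypothetical.

```sh
# Read `resolvectl status` text on stdin and print the "DNS Servers:" list
# of the scope whose header line starts with the given prefix.
# Note: the prefix match is literal, so "Link 2" would also match "Link 20".
scope_dns_servers() {
  awk -v scope="$1" '
    # An unindented, non-empty line starts a new scope.
    /^[^[:space:]]/ { in_scope = (index($0, scope) == 1) }
    # "Fallback DNS Servers:" does not match because of the leading anchor.
    in_scope && /^[[:space:]]*DNS Servers:/ {
      sub(/^[[:space:]]*DNS Servers:[[:space:]]*/, "")
      print
    }
  '
}
```

Example: `resolvectl status | scope_dns_servers "Link 2"` should list both the primary and the fallback if the netplan layer is applied as described here.
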

### Why this matters for the registry

Container builds on this VM reference `.lan` hostnames (Technitium,
NFS, etc.) and external hostnames (Docker Hub, GHCR). Before the
hardening, netplan listed only `1.1.1.1` / `8.8.8.8`, which meant:

1. Internal hostname lookups silently failed (slow timeout): the
   VM could not resolve `idrac.viktorbarzin.lan` or any internal
   helper.
2. If Cloudflare's 1.1.1.1 had an outage, the VM would lose DNS
   entirely.

With the new config the VM can resolve both zones and keeps working
if the primary DNS server is unreachable.


## Apply / re-apply

```sh
ssh root@10.0.20.10 '
  netplan generate
  netplan apply
  systemctl restart systemd-resolved
  resolvectl status | head -20
'
```

`netplan apply` is not disruptive when only `nameservers` change; it
does not bounce the link.


## Verification

```sh
ssh root@10.0.20.10 '
  dig +short idrac.viktorbarzin.lan     # 192.168.1.4
  dig +short github.com                 # GitHub A record
  dig +short registry.viktorbarzin.me   # 10.0.20.10 + external A
'
```

Fallback test: blackhole the primary and confirm external lookups
still succeed through 94.140.14.14:

```sh
ssh root@10.0.20.10 '
  ip route add blackhole 10.0.20.1
  dig +short +time=5 +tries=2 github.com   # should still answer
  ip route del blackhole 10.0.20.1
'
```

Internal lookups do fail during the blackhole (the fallback is a
public resolver and does not know about the internal zone), which is
expected: the fallback buys availability for external pulls, not
internal hostnames.

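The fallback test leaves the VM without its primary resolver if the session drops between the `add` and the `del`. One way to tighten it is a single function that always removes the route before reporting the result. This is a sketch meant to run on the VM as root; `fallback_test` is a hypothetical name, and an interrupt (Ctrl-C) during the lookup can still leave the route in place.

```sh
# Blackhole the primary resolver, attempt one external lookup, then remove
# the route again whether or not the lookup succeeded.
# Returns 0 if the lookup produced an answer, 1 if it did not, and 2 if the
# blackhole route could not be installed in the first place.
fallback_test() {
  ft_primary="${1:-10.0.20.1}"
  ft_name="${2:-github.com}"
  ip route add blackhole "$ft_primary" || return 2
  # "|| true" keeps a failed lookup from aborting callers that use `set -e`.
  ft_answer=$(dig +short +time=5 +tries=2 "$ft_name" || true)
  ip route del blackhole "$ft_primary"
  [ -n "$ft_answer" ]
}
```

Usage on the VM: `fallback_test && echo "fallback OK" || echo "fallback FAILED"`. Afterwards, `ip route show type blackhole` should print nothing.
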

## Rollback

A pre-change backup of `/etc/resolv.conf`, `/etc/systemd/resolved.conf`,
and `/etc/netplan/` lives at
`/root/dns-backups/dns-config-backup-YYYYMMDD-HHMMSS.tar.gz` on the
VM. To roll back:

```sh
ssh root@10.0.20.10 '
  BACKUP=$(ls -t /root/dns-backups/dns-config-backup-*.tar.gz | head -1)
  tar -xzf "$BACKUP" -C /
  rm -f /etc/systemd/resolved.conf.d/10-internal-dns.conf
  netplan apply
  systemctl restart systemd-resolved
  resolvectl status | head -10
'
```

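The rollback above deletes the drop-in even when no backup archive was found, which would leave the VM with neither the old nor the new config. A guard like the following sketch could run first; `latest_dns_backup` is a hypothetical helper, and it relies on the `YYYYMMDD-HHMMSS` timestamp in the file name sorting lexically rather than on file mtimes.

```sh
# Print the newest backup archive in a directory, or fail if there is none.
# Shell globs expand in sorted order, so the last existing match carries the
# latest timestamp.
latest_dns_backup() {
  lb_dir="${1:-/root/dns-backups}"
  lb_latest=""
  for lb_file in "$lb_dir"/dns-config-backup-*.tar.gz; do
    # With no matches the pattern stays literal, so test that the file exists.
    [ -e "$lb_file" ] && lb_latest="$lb_file"
  done
  [ -n "$lb_latest" ] && printf '%s\n' "$lb_latest"
}
```

In the rollback it would replace the `ls -t | head -1` line, for example `BACKUP=$(latest_dns_backup) || { echo "no backup found, aborting" >&2; exit 1; }`, so that nothing is removed unless an archive exists.
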

## Auto-sync pipeline

Changes to `modules/docker-registry/{docker-compose.yml, fix-broken-blobs.sh,
cleanup-tags.sh, nginx_registry.conf, config-private.yml}` deploy
automatically via `.woodpecker/registry-config-sync.yml`:

- Fires on `push` to master touching any of those paths, or via `manual`
  event (Woodpecker UI / API).
- SCPs every managed file to `/opt/registry/` on `10.0.20.10`.
- Bounces containers + nginx when a compose-visible file changed; leaves
  them alone when only scripts changed (cron picks up automatically).
- Runs a dry-run `fix-broken-blobs.sh` at the end to verify the registry
  is still coherent.

SSH credentials: Woodpecker repo-secret `registry_ssh_key` (ed25519,
provisioned 2026-04-19). Public key at `/root/.ssh/authorized_keys` on
`10.0.20.10`. Private key mirrored at `secret/woodpecker/registry_ssh_key`
in Vault (subkeys `private_key` / `public_key` / `known_hosts_entry`).

Manual override if you need to sync right now:

```sh
curl -sf -X POST \
  -H "Authorization: Bearer $WOODPECKER_TOKEN" \
  "https://ci.viktorbarzin.me/api/repos/1/pipelines" \
  -d '{"branch":"master"}' | jq .number
```

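The "bounce only when a compose-visible file changed" rule from the list above can be pictured as a small classifier over the changed paths. This is an illustration, not the pipeline's actual code: `needs_bounce` is a hypothetical name, and which files count as compose-visible (`docker-compose.yml`, `nginx_registry.conf`, `config-private.yml`) is an assumption based on the managed-file list.

```sh
# Read changed paths on stdin, one per line.
# Exit 0 if any of them requires bouncing containers + nginx, 1 otherwise.
needs_bounce() {
  while read -r nb_path; do
    case "${nb_path##*/}" in          # compare on the file name only
      docker-compose.yml|nginx_registry.conf|config-private.yml) return 0 ;;
    esac
  done
  return 1                            # only scripts (or nothing) changed
}
```

For example, `git diff --name-only HEAD~1 HEAD -- modules/docker-registry | needs_bounce && echo "bounce required"` would stay silent for a commit that only touches `cleanup-tags.sh`.
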

## Bouncing registry containers: the nginx DNS trap

`docker compose up -d` on `/opt/registry/docker-compose.yml` recreates
`registry-*` containers when their image tag changes, which assigns them
new IPs on the `registry` bridge network. **`registry-nginx` resolves its
upstream DNS names (`registry-private`, `registry-dockerhub`, …) ONCE at
startup and caches the results**; it does not re-resolve after a
recreate.

Symptom if you forget: `/v2/_catalog` on `:5050` returns
`{"repositories": []}`, `/v2/` returns 200 without auth, pulls return
the wrong image. nginx is forwarding to a stale IP that now belongs to a
different registry-* backend (commonly the pull-through ghcr or
dockerhub cache, which have empty catalogs from the htpasswd-auth user's
perspective).

**Always follow a registry-* bounce with `docker restart registry-nginx`.**
Alternatively, prevent the problem by setting a `resolver` directive in
`nginx_registry.conf` so upstream names are re-resolved at runtime
rather than once at startup.

```sh
ssh root@10.0.20.10 '
  cd /opt/registry && docker compose up -d
  docker restart registry-nginx
  sleep 3
  docker ps --format "{{.Names}}\t{{.Image}}\t{{.Status}}" \
    | grep -E "registry-"
'
```


## Related docs

- `docs/architecture/dns.md`: resolver IP assignments per subnet.
- `.claude/CLAUDE.md` (at repo root): notes on the private registry
  and `containerd` `hosts.toml` redirects.
- `docs/runbooks/registry-rebuild-image.md`: rebuild an image after an
  orphan OCI-index incident (different class of problem than DNS).
- `docs/post-mortems/2026-04-19-registry-orphan-index.md`: root cause
  and detection gaps behind the recurring missing-blob incidents.