[ci skip] remove low-traffic pull-through caches (registry.k8s.io, quay.io, reg.kyverno.io)

Pull-through cache at 10.0.20.10 was serving corrupted/truncated images
for low-traffic registries, breaking VPA certgen (ImagePullBackOff) and
previously causing Kyverno image pull failures.

Kept: docker.io (port 5000) and ghcr.io (port 5010) — high traffic,
Docker Hub rate limits make caching essential.

Removed from cloud-init template and all 5 live nodes:
- registry.k8s.io (port 5030) — 14 system images, very low churn
- quay.io (port 5020) — 11 images
- reg.kyverno.io (port 5040) — 5 images

The registry containers on the 10.0.20.10 VM still run but nodes no
longer route to them. They can be stopped/removed from the VM later.
This commit is contained in:
Viktor Barzin 2026-03-01 21:46:41 +00:00
parent 53be356f41
commit de598996f1
2 changed files with 11 additions and 18 deletions

View file

@ -122,7 +122,7 @@ terraform fmt -recursive # Format all
## Infrastructure
- Proxmox hypervisor (192.168.1.127) — see `.claude/reference/proxmox-inventory.md` for full VM table
- Kubernetes cluster: 5 nodes (k8s-master + k8s-node1-4, v1.34.2), GPU on node1 (Tesla T4)
- Docker registry pull-through cache at `10.0.20.10` (ports 5000/5010/5020/5030/5040)
- Docker registry pull-through cache at `10.0.20.10` — only docker.io (port 5000) and ghcr.io (port 5010) are active. quay.io/registry.k8s.io/reg.kyverno.io caches disabled (caused corrupted images).
- GPU workloads need: `node_selector = { "gpu": "true" }` + `toleration { key = "nvidia.com/gpu", value = "true", effect = "NoSchedule" }`
### Node Rebuild Procedure

View file

@ -66,25 +66,17 @@ module "k8s-node-template" {
# Set up config_path for per-registry mirror configuration
sed -i 's|config_path = ""|config_path = "/etc/containerd/certs.d"|' /etc/containerd/config.toml
# Create hosts.toml for docker.io (Docker Hub)
# Create hosts.toml for docker.io (Docker Hub) high traffic, rate-limited
mkdir -p /etc/containerd/certs.d/docker.io
printf 'server = "https://registry-1.docker.io"\n\n[host."http://10.0.20.10:5000"]\n capabilities = ["pull", "resolve"]\n' > /etc/containerd/certs.d/docker.io/hosts.toml
# Create hosts.toml for ghcr.io
# Create hosts.toml for ghcr.io medium traffic
mkdir -p /etc/containerd/certs.d/ghcr.io
printf 'server = "https://ghcr.io"\n\n[host."http://10.0.20.10:5010"]\n capabilities = ["pull", "resolve"]\n' > /etc/containerd/certs.d/ghcr.io/hosts.toml
# Create hosts.toml for quay.io
mkdir -p /etc/containerd/certs.d/quay.io
printf 'server = "https://quay.io"\n\n[host."http://10.0.20.10:5020"]\n capabilities = ["pull", "resolve"]\n' > /etc/containerd/certs.d/quay.io/hosts.toml
# Create hosts.toml for registry.k8s.io
mkdir -p /etc/containerd/certs.d/registry.k8s.io
printf 'server = "https://registry.k8s.io"\n\n[host."http://10.0.20.10:5030"]\n capabilities = ["pull", "resolve"]\n' > /etc/containerd/certs.d/registry.k8s.io/hosts.toml
# Create hosts.toml for reg.kyverno.io
mkdir -p /etc/containerd/certs.d/reg.kyverno.io
printf 'server = "https://reg.kyverno.io"\n\n[host."http://10.0.20.10:5040"]\n capabilities = ["pull", "resolve"]\n' > /etc/containerd/certs.d/reg.kyverno.io/hosts.toml
# Low-traffic registries (registry.k8s.io, quay.io, reg.kyverno.io) pull directly.
# Pull-through cache removed: caused corrupted images (truncated downloads)
# breaking VPA certgen and Kyverno image pulls.
sed -i 's/.*max_concurrent_downloads = 3/max_concurrent_downloads = 20/g' /etc/containerd/config.toml # Enable multiple concurrent downloads
sudo sed -i '/serializeImagePulls:/d' /var/lib/kubelet/config.yaml && \
@ -275,13 +267,14 @@ module "docker-registry-vm" {
bridge = "vmbr1"
vlan_tag = "20"
ipconfig0 = "ip=10.0.20.10/24,gw=10.0.20.1"
# All ports go through nginx for request serialization (proxy_cache_lock):
# Active pull-through caches (docker.io + ghcr.io only):
# 5000 -> nginx -> registry-dockerhub (docker.io proxy)
# 5001 -> registry-dockerhub direct (Prometheus metrics)
# 5010 -> nginx -> registry-ghcr (ghcr.io proxy)
# 5020 -> nginx -> registry-quay (quay.io proxy)
# 5030 -> nginx -> registry-k8s (registry.k8s.io proxy)
# 5040 -> nginx -> registry-kyverno (reg.kyverno.io proxy)
# Disabled caches (low-traffic, caused corrupted images):
# 5020 -> registry-quay (quay.io) DISABLED
# 5030 -> registry-k8s (registry.k8s.io) DISABLED, broke VPA certgen
# 5040 -> registry-kyverno (reg.kyverno.io) DISABLED
# 5050 -> nginx -> registry-private (R/W registry for CI build cache)
# 8080 -> registry-ui (joxit/docker-registry-ui)
}