Blob caching (content-addressed by SHA256) is unaffected — only manifest
re-validation changes. Every pull now checks upstream for the current
manifest digest, eliminating stale :latest tag issues.
- Add proxy_intercept_errors + error_page for 502/503/504 on blob locations
to prevent caching truncated upstream responses (root cause of repeated
ImagePullBackOff across services)
- Reduce proxy_cache_lock_timeout from 15m to 5m — fail fast, let containerd
retry instead of all concurrent pulls waiting on a failed first download
- Add proxy_cache_valid any 0 — never cache error responses
- Add /healthz endpoints on Docker Hub and GHCR servers
- Add draintimeout and proxy.ttl to registry proxy configs
Root cause: storage.filesystem.maxsize (5GiB) caused Docker Registry to
delete blob data while keeping metadata. Registry then served 200 OK with
correct Content-Length but 0 bytes body. nginx cached these broken responses.
Fixes:
- Remove maxsize from dockerhub/ghcr proxy configs (rely on weekly GC)
- nginx: don't cache 206 responses, require 2 requests before caching
- Wiped corrupted cache on registry VM and fixed corrupted pause container
blobs on node3/node4