From 0da86577fb3c95372c979151d45a37e6ad2eb9dd Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sun, 15 Feb 2026 14:36:50 +0000
Subject: [PATCH] [ci skip] Add skills: containerd-multi-registry-pull-through-cache, traefik-plugin-download-failure-404

---
 .../SKILL.md | 138 ++++++++++++++++++
 .../SKILL.md | 98 +++++++++++++
 2 files changed, 236 insertions(+)
 create mode 100644 .claude/skills/containerd-multi-registry-pull-through-cache/SKILL.md
 create mode 100644 .claude/skills/traefik-plugin-download-failure-404/SKILL.md

diff --git a/.claude/skills/containerd-multi-registry-pull-through-cache/SKILL.md b/.claude/skills/containerd-multi-registry-pull-through-cache/SKILL.md
new file mode 100644
index 00000000..7b519b48
--- /dev/null
+++ b/.claude/skills/containerd-multi-registry-pull-through-cache/SKILL.md
@@ -0,0 +1,138 @@
+---
+name: containerd-multi-registry-pull-through-cache
+description: |
+  Set up pull-through caches for multiple container registries (ghcr.io, quay.io,
+  registry.k8s.io, reg.kyverno.io) using Docker Registry v2 instances. Use when:
+  (1) ImagePullBackOff for non-Docker-Hub images routed through a wildcard mirror,
+  (2) containerd has a deprecated `registry.mirrors."*"` entry catching all image pulls,
+  (3) you need to add a pull-through cache for a new upstream registry,
+  (4) containerd reports "mirrors cannot be set when config_path is provided",
+  (5) containerd 1.6.x vs 1.7.x config_path compatibility issues.
+  Docker Registry v2 can only proxy ONE upstream per instance, so multiple
+  containers are needed for multiple registries.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-14
+---
+
+# Containerd Multi-Registry Pull-Through Cache
+
+## Problem
+
+Docker Registry v2 can only proxy **one upstream registry per instance**. A common
+misconfiguration is using a containerd wildcard mirror (`registry.mirrors."*"`) pointing
+to a single Docker Hub proxy, which breaks pulls from ghcr.io, quay.io, registry.k8s.io,
+and other registries: they get routed to the Docker Hub proxy, which can't serve them,
+causing `ImagePullBackOff`.
+
+## Context / Trigger Conditions
+
+- `ImagePullBackOff` for images from ghcr.io, quay.io, registry.k8s.io, or other non-Docker-Hub registries
+- Containerd config has deprecated `[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]`
+- Error: `failed to load plugin io.containerd.grpc.v1.cri: invalid plugin config: mirrors cannot be set when config_path is provided`
+- Need to migrate from deprecated wildcard mirrors to modern `config_path` approach
+
+## Solution
+
+### 1. Run one Registry v2 container per upstream
+
+Each upstream needs its own Docker Registry v2 instance on a different port:
+
+| Port | Registry | Container Name |
+|------|----------|---------------|
+| 5000 | docker.io | registry |
+| 5010 | ghcr.io | registry-ghcr |
+| 5020 | quay.io | registry-quay |
+| 5030 | registry.k8s.io | registry-k8s |
+| 5040 | reg.kyverno.io | registry-kyverno |
+
+Config for non-Docker-Hub proxies (no auth needed because these upstreams are public):
+
+```yaml
+version: 0.1
+storage:
+  cache:
+    blobdescriptor: inmemory
+  filesystem:
+    rootdirectory: /var/lib/registry
+http:
+  addr: :5000
+proxy:
+  remoteurl: https://ghcr.io  # change per registry
+```
+
+```bash
+docker run -p 5010:5000 -d --restart always --name registry-ghcr \
+  -v /etc/docker-registry/ghcr/config.yml:/etc/docker/registry/config.yml registry:2
+```
+
+### 2. Replace deprecated wildcard mirror with `config_path`
+
+Instead of:
+```toml
+# DEPRECATED - breaks non-Docker-Hub registries
+[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]
+  endpoint = ["http://10.0.20.10:5000"]
+```
+
+Use the modern `config_path` approach:
+```toml
+[plugins."io.containerd.grpc.v1.cri".registry]
+  config_path = "/etc/containerd/certs.d"
+```
+
+Then create per-registry `hosts.toml` files:
+```bash
+mkdir -p /etc/containerd/certs.d/docker.io
+cat > /etc/containerd/certs.d/docker.io/hosts.toml <<'EOF'
+server = "https://registry-1.docker.io"
+
+[host."http://10.0.20.10:5000"]
+  capabilities = ["pull", "resolve"]
+EOF
+```
+
+Registries without a `hosts.toml` entry **fall through to direct pull** (no breakage).
+
+### 3. Critical: `config_path` and `mirrors` cannot coexist
+
+Containerd will **refuse to start the CRI plugin** if both `config_path` and any
+`mirrors` entries exist in `config.toml`. You must remove ALL `mirrors` entries
+(including the `[plugins."...registry.mirrors"]` parent section) before setting
+`config_path`.
+
+This is especially dangerous on containerd 1.6.x (used on older nodes like k8s-master)
+where the config format is slightly different. If unsure, either:
+- Don't use config_path on that node (skip the pull-through cache)
+- Remove the entire `mirrors` section first, then add `config_path`
+
+### 4. Static IP for registry VM
+
+If the registry VM uses DHCP and its IP changes, all mirrors break. Use a static IP
+via cloud-init `ipconfig0 = "ip=10.0.20.10/24,gw=10.0.20.1"` instead of DHCP.
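
The per-registry `hosts.toml` files from step 2 follow the same pattern for every proxy in the port table, so they can be generated in one pass. The sketch below is a minimal illustration, not a tested deployment script. The function names `write_hosts_toml` and `generate_all_hosts` are hypothetical, and the upstream `server` URLs for ghcr.io, quay.io, registry.k8s.io, and reg.kyverno.io are assumed to be `https://<registry host>`, mirroring the `remoteurl` pattern in step 1. Only the docker.io values come from the example above.

```shell
#!/bin/sh
# Sketch: generate one hosts.toml per upstream registry.
# Function names are hypothetical; ports follow the table in step 1.

# write_hosts_toml CERTS_DIR REGISTRY SERVER_URL MIRROR_URL
# Writes CERTS_DIR/REGISTRY/hosts.toml pointing at the local proxy.
write_hosts_toml() {
  mkdir -p "$1/$2" || return 1
  cat > "$1/$2/hosts.toml" <<EOF
server = "$3"

[host."$4"]
  capabilities = ["pull", "resolve"]
EOF
}

# generate_all_hosts CERTS_DIR MIRROR_IP
# Columns below: registry, upstream server URL, local proxy port.
# Upstream URLs other than docker.io are assumptions.
generate_all_hosts() {
  certs_dir=$1
  mirror_ip=$2
  while read -r registry server port; do
    write_hosts_toml "$certs_dir" "$registry" "$server" \
      "http://$mirror_ip:$port" || return 1
  done <<EOF
docker.io https://registry-1.docker.io 5000
ghcr.io https://ghcr.io 5010
quay.io https://quay.io 5020
registry.k8s.io https://registry.k8s.io 5030
reg.kyverno.io https://reg.kyverno.io 5040
EOF
}
```

On a node this would be invoked as root with `generate_all_hosts /etc/containerd/certs.d 10.0.20.10`, once `config_path` is set as in step 2. Keeping the registry-to-port mapping in one place means adding a new upstream is a one-line change here plus one more Registry v2 container.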
+
+## Verification
+
+```bash
+# Test each proxy responds
+for port in 5000 5010 5020 5030 5040; do
+  curl -s http://10.0.20.10:$port/v2/_catalog
+done
+
+# Test containerd can pull through cache
+crictl pull ghcr.io/some/image:tag
+
+# Check containerd logs for mirror usage
+journalctl -u containerd --since "5 minutes ago" | grep -i "mirror\|registry"
+```
+
+## Notes
+
+- **Fallback behavior**: If the local mirror is unreachable, containerd falls through to
+  direct pull from the upstream `server` URL. This provides graceful degradation.
+- **GC crontabs**: Add weekly garbage collection for each registry container, staggered
+  to avoid I/O spikes.
+- **Hourly restart**: Registry v2 has known memory leak issues; an hourly restart mitigates them.
+- **Cache is ephemeral**: VM recreation clears the cache. Images re-cache on demand.
+
+See also: `k8s-docker-registry-cache-bypass` (for stale cached image issues)
diff --git a/.claude/skills/traefik-plugin-download-failure-404/SKILL.md b/.claude/skills/traefik-plugin-download-failure-404/SKILL.md
new file mode 100644
index 00000000..94df4d88
--- /dev/null
+++ b/.claude/skills/traefik-plugin-download-failure-404/SKILL.md
@@ -0,0 +1,98 @@
+---
+name: traefik-plugin-download-failure-404
+description: |
+  Fix for Traefik returning 404 on ALL routes after a restart or pod recreation.
+  Use when: (1) all Traefik-managed Ingresses suddenly return 404,
+  (2) Traefik logs show "Plugins are disabled because an error has occurred",
+  (3) plugin download fails with "context deadline exceeded" for crowdsec-bouncer
+  or rewrite-body plugins, (4) Traefik pods started while outbound internet was
+  unreachable (e.g. during containerd restart, network disruption, DNS outage),
+  (5) services were working before a node maintenance operation but now all return 404.
+  Root cause: Traefik downloads plugins on startup; if download fails, ALL plugins
+  are disabled, and any middleware referencing a plugin causes its route to 404.
+author: Claude Code
+version: 1.0.0
+date: 2026-02-14
+---
+
+# Traefik Plugin Download Failure Causing Global 404
+
+## Problem
+
+After a node maintenance operation (containerd restart, node drain/uncordon, etc.),
+all Traefik-managed routes return 404. Services, Ingresses, and Middlewares all exist
+and look correct, making this extremely confusing to debug.
+
+## Context / Trigger Conditions
+
+- ALL Traefik routes return 404 simultaneously (not just one service)
+- Traefik pods are Running and Ready
+- Ingress resources exist with correct annotations
+- Middlewares exist in the correct namespaces
+- TLS secrets exist
+- Traefik startup logs contain: `Plugins are disabled because an error has occurred`
+- Plugin download error: `unable to download plugin ... context deadline exceeded`
+- Happened after a node restart, containerd restart, or network disruption
+
+## Root Cause
+
+Traefik downloads plugins (crowdsec-bouncer, rewrite-body, etc.) from
+`plugins.traefik.io` on **every pod startup**. If the download fails (network
+unreachable, DNS not ready, timeout), Traefik **disables ALL plugins entirely**.
+
+Since the `crowdsec` middleware is a plugin-based middleware referenced in virtually
+every Ingress annotation (`traefik-crowdsec@kubernetescrd`), Traefik treats the
+missing plugin middleware as a fatal routing error and returns 404 for every route
+that references it, which is typically all of them.
+
+## Solution
+
+```bash
+# 1. Confirm the diagnosis - check Traefik startup logs (--tail=-1: with -l, kubectl shows only the last 10 lines by default)
+kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=-1 | head -20
+# Look for: "Plugins are disabled because an error has occurred"
+
+# 2. Verify outbound connectivity is restored
+kubectl exec -n traefik $(kubectl get pods -n traefik -l app.kubernetes.io/name=traefik \
+  -o jsonpath='{.items[0].metadata.name}') -- wget -q -O- --timeout=5 https://plugins.traefik.io
+
+# 3. Rollout restart to retry plugin download
+kubectl rollout restart deployment -n traefik traefik
+
+# 4. Verify plugins loaded
+kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=-1 | grep "Plugins"
+# Should show: "Plugins loaded."
+
+# 5. Verify routes work
+curl -s -o /dev/null -w "%{http_code}" -H "Host: viktorbarzin.me" https://10.0.20.202 -k
+# Should return 200 instead of 404
+```
+
+## Verification
+
+- Traefik logs show `Plugins loaded.` (not `Plugins are disabled`)
+- Routes return expected HTTP status codes (200, 302, etc.) instead of 404
+- `kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=-1 | grep "does not exist"` shows no middleware errors
+
+## Why This Is Hard to Debug
+
+1. **Traefik pods show Running/Ready**: health checks pass even without plugins
+2. **All Kubernetes resources look correct**: Ingresses, Services, Middlewares all exist
+3. **The error is in startup logs only**: not in per-request logs (requests just get 404)
+4. **The 404 is Traefik's default**: same as "no route matched", not a backend error
+5. **The middleware error is logged once at startup**: easy to miss in a stream of logs
+
+## Prevention
+
+- During planned maintenance (node drain, containerd restart), restart Traefik pods
+  AFTER network connectivity is confirmed restored
+- Consider pre-caching Traefik plugins in the container image or using an init container
+- Monitor for the `Plugins are disabled` log message in your alerting system
+
+## Notes
+
+- This affects ALL plugin-based middlewares, not just crowdsec
+- The `rewrite-body` plugin (used for rybbit analytics injection) is also affected
+- Traefik v3.x downloads plugins on every startup; there is no persistent cache
+- If only some routes return 404, the problem is likely different (missing middleware
+  or TLS secret, not a plugin issue)
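
The monitoring suggestion under Prevention can be sketched as a small log check. This is a hedged illustration rather than a production alert: the function name `traefik_plugins_disabled` is hypothetical, and it matches only the log message quoted in this skill. It reads log text on stdin so that it can be fed from `kubectl logs` or any other log pipeline.

```shell
#!/bin/sh
# Sketch: detect the plugin-disabled startup message in Traefik log text.
# The function name is hypothetical; the matched string is the one quoted
# in this skill's trigger conditions.

# Reads log lines on stdin. Prints matching lines and returns 0 if the
# message is present; returns 1 if it is absent.
traefik_plugins_disabled() {
  grep -F "Plugins are disabled because an error has occurred"
}
```

A possible use is to pipe `kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=-1` into `traefik_plugins_disabled` from a cron job or alerting hook, and to trigger the rollout restart from the Solution section only after outbound connectivity has been confirmed. Matching a fixed string with `grep -F` avoids regex surprises if the surrounding log format changes.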