When the pull-through proxy (10.0.20.10) is down, containerd now falls
back to the official upstream registries (registry-1.docker.io, ghcr.io)
instead of failing. Also cleans up stale disabled registry mirror
directories and removes an unnecessary containerd restart from the rollout script.
Blob caching (content-addressed by SHA256) is unaffected — only manifest
re-validation changes. Every pull now checks upstream for the current
manifest digest, eliminating stale :latest tag issues.
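For reference, a minimal sketch of the hosts.toml pattern that produces this
fallback (the mirror port and file paths are assumptions, not the exact values
from this change; ghcr.io gets an analogous file):

```bash
# Sketch: with a `server` line, containerd falls back to the upstream registry
# when every mirror host fails. The mirror port (5000) is an assumption.
mkdir -p /etc/containerd/certs.d/docker.io
cat > /etc/containerd/certs.d/docker.io/hosts.toml <<'EOF'
server = "https://registry-1.docker.io"   # upstream fallback when the proxy is down

[host."http://10.0.20.10:5000"]
  capabilities = ["pull", "resolve"]      # try the pull-through proxy first
EOF
```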
The cleanup-tags.sh + garbage-collect cycle can delete blob data while
leaving _layers/ link files intact. The registry then returns HTTP 200
with 0 bytes for those layers, causing "unexpected EOF" on image pulls.
fix-broken-blobs.sh walks all repositories, checks each layer link
against actual blob data, and removes orphaned links so the registry
re-fetches from upstream on the next pull.
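A minimal sketch of the approach, assuming the standard distribution-registry
v2 on-disk layout (the registry root path is an assumption):

```bash
#!/usr/bin/env bash
# Sketch of the fix-broken-blobs.sh approach: remove _layers link files whose
# backing blob is missing or empty. The registry root path is an assumption.
set -euo pipefail
root=/var/lib/registry/docker/registry/v2

find "$root/repositories" -path '*/_layers/sha256/*/link' | while read -r link; do
  digest=$(cat "$link")                     # e.g. "sha256:abcd..."
  hash=${digest#sha256:}
  blob="$root/blobs/sha256/${hash:0:2}/${hash}/data"
  if [ ! -s "$blob" ]; then                 # missing or zero-byte blob data
    echo "removing orphaned link: $link"
    rm -f "$link"
  fi
done
```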
Schedule: daily at 2:30am (after tag cleanup) and Sunday 3:30am
(after garbage collection). The first run found that 2335/2556 (91%)
of layer links were orphaned.
localhost resolves to IPv6 ::1, but the containers bind to 0.0.0.0 (IPv4
only), so wget health checks fail with "Connection refused". The nginx
proxy accumulated 18,462 consecutive health check failures because of this.
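A sketch of the kind of health-check change that avoids this, pinning the
check to IPv4 instead of `localhost` (the port is an assumption):

```bash
# Target 127.0.0.1 explicitly; `localhost` resolves to ::1 first here, and the
# container only listens on IPv4. Port 80 is an assumption.
wget -q --spider http://127.0.0.1:80/ || exit 1
```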
Also cleared corrupted pull-through cache for mghee/novelapp — the
registry had layer link files pointing to non-existent blob data,
causing containerd to get 200 responses with 0 bytes (unexpected EOF).
The Caretta eBPF DaemonSet was using 600Mi x 5 nodes = 3Gi in total for
non-critical network topology visualization. Removing it frees memory
for novelapp and aiostreams, which were stuck in Pending.
Expanded cloud sync excludes to reduce sync time and Synology disk usage.
All excluded data is either regenerable or low-value.
TrueNAS Task 1 and the incremental script have already been updated live.
CWA NETWORK_SHARE_MODE=true skips the post-import chown, leaving files owned
by root. book-search now mounts the library so it can periodically fix permissions
on recently imported books.
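A sketch of that periodic fix; the mount path, ownership, and intervals are
assumptions:

```bash
# Periodically chown books recently imported by CWA and left owned by root.
# /library, the 1000:1000 ownership, and the timings are assumptions.
while true; do
  find /library -user root -mmin -30 -exec chown 1000:1000 {} +
  sleep 300
done
```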
Adds a stacks-config volume mount to the book-search pod so it can delete
Stacks history entries and force re-downloads when a book was consumed
by CWA but failed to import.
The init container was cloning the dotfiles repo via git on every pod
start, causing 200+ small NFS writes that were amplified through ZFS.
Dotfiles already exist on NFS from a previous clone — no need to
re-clone on every restart. To update dotfiles, run git pull manually.
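A sketch of the idempotent init-container command (the repo URL and target
path are assumptions):

```bash
# Clone only if the dotfiles are not already on the NFS volume; otherwise the
# existing checkout is reused. Repo URL and path are assumptions.
if [ ! -d /home/user/dotfiles/.git ]; then
  git clone https://github.com/example/dotfiles.git /home/user/dotfiles
fi
# Updates are manual: cd /home/user/dotfiles && git pull
```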
Also cleaned up stale Uptime Kuma files (1.6GB old SQLite DB + 289MB
error log left over from migration to MariaDB).
- API_KEY env var from calibre-secrets for /api/download-url auth
- SHORTCUT_ICLOUD_URL env var for /shortcut redirect
- Separate ingress for /api/download-url and /shortcut (bypasses Authentik)
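A sketch of that bypass ingress, assuming Traefik IngressRoute CRDs; the
hostname, resource name, service name, and port are assumptions:

```bash
kubectl apply -f - <<'EOF'
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: calibre-public        # hypothetical name
spec:
  entryPoints: [websecure]
  routes:
    - match: Host(`books.example.com`) && (PathPrefix(`/api/download-url`) || PathPrefix(`/shortcut`))
      kind: Rule
      # no Authentik forward-auth middleware attached on this route
      services:
        - name: calibre-web   # hypothetical service/port
          port: 8083
EOF
```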
The HA frontend loads 30-50 JS bundles on page load, exhausting the
rate-limit burst. iOS Companion app reconnections also trigger bursts.
172 rate-limited (429) requests were found in the Traefik logs, causing
intermittent connectivity failures for the ha-sofia iOS app.
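Sketched below is the shape of the fix: a Traefik rateLimit middleware with a
burst large enough for a full page load. The numbers are assumptions, not the
values actually chosen:

```bash
kubectl apply -f - <<'EOF'
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: rate-limit
spec:
  rateLimit:
    average: 50   # sustained requests/s per client (assumed value)
    burst: 100    # headroom for the 30-50 bundle page-load burst (assumed)
EOF
```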
- Remove viktorbarzin.me from split DNS (same IPs as public DNS,
was adding unnecessary tunnel overhead for every DNS query)
- Narrow reverse DNS split scope from 10.0.0.0/8 → 10.0.20.0/24
and 10.0.10.0/24 only; 192.168.0.0/16 → 192.168.1.0/24 only
- Add extra_records for key internal services (technitium, k8s-master)
  for instant MagicDNS resolution without a tunnel roundtrip (see the
  config sketch after this list)
- Replace full Tailscale DERP map (29 regions) with curated set:
home + 8 European + 5 global fallback DERPs (14 total)
- Add custom derp.yaml to ConfigMap, sourced from Vault
Port 80 DERP dropped — Traefik's global HTTP→HTTPS redirect
prevents non-TLS DERP upgrades on the web entrypoint.
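A sketch of the relevant headscale config fragments (key names as in recent
headscale releases; the MagicDNS hostnames and the k8s-master address are
assumptions, the Technitium address matches this change):

```bash
cat >> headscale-config.yaml <<'EOF'
dns:
  extra_records:
    - name: technitium.internal     # hypothetical MagicDNS name
      type: A
      value: 10.0.20.200
    - name: k8s-master.internal     # hypothetical name; address assumed
      type: A
      value: 10.0.20.2
derp:
  urls: []                          # drop the full 29-region Tailscale map
  paths:
    - /etc/headscale/derp.yaml      # curated 14-region map from the ConfigMap
EOF
```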
- linkwarden: add Reloader match annotation to the DB secret so pods
  auto-restart on Vault credential rotation (stale credentials were
  causing 100% 5xx responses)
- authentik: increase memory limits (server 1Gi→1.5Gi, worker 896Mi→1Gi)
to prevent OOM kills
- prometheus: drop 113k high-cardinality series to reduce the HDD write
  rate from ~8.8 to ~6.0 MB/s (31% reduction; relabel sketch after this list):
- drop all traefik/apiserver/etcd histogram bucket metrics
- drop goflow2_flow_process_nf_templates_total (9.3k series)
- drop container_tasks_state and container_memory_failures_total
- rewrite HighServiceLatency alert to use avg latency (_sum/_count)
- update cluster_health dashboard to match
- raise KubeletRuntimeOperationsLatency threshold from 30s to 60s
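A sketch of those drop rules as Prometheus metric_relabel_configs, appended to
the relevant scrape jobs; the regexes are illustrative of the series named above:

```bash
cat <<'EOF'
metric_relabel_configs:
  - source_labels: [__name__]       # all traefik/apiserver/etcd histogram buckets
    regex: '(traefik|apiserver|etcd)_.+_bucket'
    action: drop
  - source_labels: [__name__]
    regex: 'goflow2_flow_process_nf_templates_total'
    action: drop
  - source_labels: [__name__]
    regex: 'container_tasks_state|container_memory_failures_total'
    action: drop
EOF
```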
- Add SQLite backup CronJob (every 6h to NFS for cloud sync pickup;
  sketch after this list)
- Move headscale-ui secrets (COOKIE_SECRET, ROOT_API_KEY) from hardcoded
values to Vault-managed secrets
- Add DERP IPv6 address (2001:470:6e:43d::2) for IPv6-capable clients
- Clean up stale test nodes, duplicate users, rename "localhost" nodes
Also updated headscale_config in Vault to include DERP ipv6 field
and headscale_ui_cookie_secret/headscale_ui_api_key secrets.
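A sketch of such a backup CronJob; names, image, and paths are assumptions.
sqlite3's `.backup` uses the online backup API, so the copy stays consistent
even while headscale is writing:

```bash
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: headscale-db-backup      # hypothetical name
spec:
  schedule: "0 */6 * * *"        # every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: alpine:3.20
              command: ["/bin/sh", "-c"]
              args:
                - apk add --no-cache sqlite &&
                  sqlite3 /data/db.sqlite ".backup /backup/headscale-$(date +%F-%H).sqlite"
              volumeMounts:
                - { name: data, mountPath: /data }
                - { name: backup, mountPath: /backup }
          volumes:
            - name: data
              persistentVolumeClaim: { claimName: headscale-data }      # assumed PVC
            - name: backup
              nfs: { server: nas.example.com, path: /volume1/backups }  # assumed export
EOF
```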
The CrowdSec, rate-limiting, anti-AI, and error-page middlewares were
interfering with the Upgrade: DERP protocol handshake. Also updated
Headscale ACL in Vault to allow tailnet DNS traffic to Technitium
(10.0.20.200:53).
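A sketch of a middleware-free DERP route, assuming Traefik IngressRoute CRDs
and no entrypoint-level default middlewares; host, service, and port are
assumptions:

```bash
kubectl apply -f - <<'EOF'
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: headscale-derp
spec:
  entryPoints: [websecure]
  routes:
    - match: Host(`derp.example.com`)
      kind: Rule
      middlewares: []       # deliberately none: CrowdSec/rate-limit/error-page
                            # chains break the `Upgrade: DERP` handshake
      services:
        - name: headscale   # embedded DERP server (assumed service/port)
          port: 443
EOF
```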
NFS PVs report the entire NFS server filesystem usage (e.g., navidrome-music
shows the 5.3 TiB Synology volume at 97%), not PVC-specific usage. Filter out
PVs with >1 TiB capacity (these are always NFS mounts; iSCSI PVCs are 10-50Gi).
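A sketch of that capacity filter as PromQL over the standard kubelet volume
metrics; the exact alert expression used here is an assumption:

```bash
# Only evaluate PVCs whose reported capacity is under 1 TiB (iSCSI); NFS-backed
# PVs report the whole server filesystem and always exceed that.
promql='100 * kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes
  and on(namespace, persistentvolumeclaim)
  kubelet_volume_stats_capacity_bytes < 1099511627776'   # 1 TiB in bytes
curl -sG http://prometheus:9090/api/v1/query --data-urlencode "query=${promql}"
```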