infra/stacks
Viktor Barzin cdbb418f45 monitoring: alert when cluster can't tolerate losing a non-GPU worker
ClusterCannotTolerateNonGpuNodeLoss fires when the most heavily reserved
non-GPU worker (k8s-node2/3/4) has more memory requests pinned to it
than the rest of the workers (incl. node1 GPU node) currently have free.
If that node went down, its pods would not fit elsewhere and would stay
Pending — exactly what happened today (2026-05-26) with node4 NotReady:
4 kyverno pods + woodpecker PVCs + several deployments stuck Pending
because node2/node3 were at 99% memory-request saturation.

Math: max(R(node X) for X in non-GPU workers) > sum(clamp_min(A(n) - R(n), 0))
over Ready workers. node1 included on the right because its taint is
PreferNoSchedule (soft) so it does absorb non-GPU pods under pressure.

Currently fires with a 33.96 GiB shortage. Remediation: right-size top
reservers via Goldilocks (immich-server 8Gi, frigate 5Gi, prometheus
4.4Gi, pg-cluster 3Gi each, paperless 2Gi) or bump VM RAM on
k8s-node2/k8s-node3 from 32GB → 48GB to match node1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 02:34:13 +00:00
..
_template ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-10 18:53:49 +00:00
actualbudget recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
affine recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
authentik authentik: worker replicas 3 -> 2 2026-05-21 09:14:35 +00:00
beads-server beads-server: codify Keel annotations on Dolt deployment (drift cleanup) 2026-05-17 22:22:40 +00:00
blog nfs-mirror: append transferred files to offsite-sync manifest 2026-05-24 15:32:22 +00:00
broker-sync broker-sync(imap): fix command name + add fsGroup for sync.db writes 2026-05-22 14:41:54 +00:00
calico security(wave1): W1.6 expand observation from recruiter-responder pilot → tier 3+4 (82 namespaces) 2026-05-19 22:14:16 +00:00
changedetection enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
chrome-service recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
city-guesser enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
claude-agent-service claude-agent-service: cut memory request 2Gi → 1Gi (limit 4Gi → 2Gi) 2026-05-23 10:03:42 +00:00
claude-memory recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
cloudflared mailserver: decommission SendGrid 2026-05-22 20:08:38 +00:00
cnpg cnpg: bump webhook-cert renewal threshold 7d -> 30d 2026-05-22 15:00:41 +00:00
coturn enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
crowdsec crowdsec: pin image to v1.7.8 + remove ENROLL_KEY, CAPI restored 2026-05-24 11:11:29 +00:00
cyberchef final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
dashy enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
dawarich enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
dbaas dbaas: opt MySQL out of Keel + add do-not-bump warning 2026-05-19 13:21:03 +00:00
descheduler keel: enroll 15 critical-path namespaces for digest-only auto-update 2026-05-17 12:13:22 +00:00
diun enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
ebook2audiobook enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
ebooks enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
echo enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
excalidraw excalidraw: migrate PVC from proxmox-lvm to NFS 2026-05-26 02:33:41 +00:00
external-secrets recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
f1-stream f1-stream: verifier — wrap m3u8 fetches through /proxy 2026-05-24 22:26:56 +00:00
fire-planner fire-planner: COL refresh CronJob + Grafana Cost-of-Living dashboard 2026-05-22 14:15:38 +00:00
foolery recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
forgejo Woodpecker CI deploy [CI SKIP] 2026-05-24 22:07:58 +00:00
freedify recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
freshrss infra: add kubectl + authentik providers across 6 stacks 2026-05-21 08:07:22 +00:00
frigate ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
grampsweb ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
hackmd ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
headscale keel: enroll 15 critical-path namespaces for digest-only auto-update 2026-05-17 12:13:22 +00:00
health ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
hermes-agent ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
homepage final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
immich immich: harden against bulk-import load (memory + probe + Job retries) 2026-05-24 22:14:05 +00:00
infra [forgejo] Phases 3+4+5: cutover, decommission, docs sweep 2026-05-07 18:30:02 +00:00
infra-maintenance [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] 2026-04-18 21:19:48 +00:00
insta2spotify ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
instagram-poster Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) 2026-05-16 23:10:38 +00:00
isponsorblocktv ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
job-hunter ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
jsoncrack final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
k8s-dashboard final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
k8s-portal Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) 2026-05-16 23:10:38 +00:00
k8s-version-upgrade k8s-version-upgrade: ignore IngressTTFBCritical in halt-on-alert check 2026-05-24 01:10:44 +00:00
keel upgrade-state: skill + script + Keel scrape for periodic three-pipeline audit 2026-05-18 10:50:43 +00:00
kms final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
kured ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
kyverno kyverno: allowlist woodpeckerci/* for CI step pods 2026-05-23 08:52:48 +00:00
linkwarden infra: add kubectl + authentik providers across 6 stacks 2026-05-21 08:07:22 +00:00
llama-cpp llama-cpp: ignore_changes for keel/k8s-managed annotations 2026-05-24 09:01:17 +00:00
local-path final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
mailserver mailserver: decommission SendGrid 2026-05-22 20:08:38 +00:00
matrix ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
meshcentral ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
metallb keel: enroll 11 more namespaces (operators + critical infra) 2026-05-17 20:59:14 +00:00
metrics-server keel: enroll 15 critical-path namespaces for digest-only auto-update 2026-05-17 12:13:22 +00:00
monitoring monitoring: alert when cluster can't tolerate losing a non-GPU worker 2026-05-26 02:34:13 +00:00
n8n nfs-mirror: append transferred files to offsite-sync manifest 2026-05-24 15:32:22 +00:00
navidrome infra: add kubectl + authentik providers across 6 stacks 2026-05-21 08:07:22 +00:00
netbox ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
networking-toolbox ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
nextcloud nextcloud(external_storage): add per-mount enableSharing option 2026-05-24 11:39:16 +00:00
nfs-csi keel: enroll 11 more namespaces (operators + critical infra) 2026-05-17 20:59:14 +00:00
nodelocal-dns [dns] NodeLocal DNSCache — deploy DaemonSet to all nodes (WS C) 2026-04-19 15:46:41 +00:00
novelapp Woodpecker CI deploy [CI SKIP] 2026-05-16 23:17:44 +00:00
ntfy ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
nvidia nvidia: fix driver install deadlock + extend startup probe 2026-05-25 11:53:44 +00:00
onlyoffice ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
openclaw nfs-mirror: append transferred files to offsite-sync manifest 2026-05-24 15:32:22 +00:00
osm_routing final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
owntracks ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
paperless-mcp paperless-mcp: deploy MCP for AI document search 2026-05-17 11:14:35 +00:00
paperless-ngx ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
payslip-ingest ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
phpipam ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
platform [infra] Add Cloudflare provider to all stack lock files and generated providers 2026-04-16 16:31:36 +00:00
plotting-book Woodpecker CI deploy [CI SKIP] 2026-05-16 23:17:44 +00:00
poison-fountain ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
postiz postiz: bump memory request 512Mi → 2Gi, limit 4Gi → 3Gi (right-size for next deploy) 2026-05-24 01:11:25 +00:00
priority-pass ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
privatebin ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
proxmox-csi proxmox-csi/node: bump memory request 64Mi → 1Gi (LUKS unlock reservation) 2026-05-24 01:10:44 +00:00
pvc-autoresizer [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
rbac [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
real-estate-crawler realestate-crawler: dockerhub pull-secret + lift image-pin on ui/api 2026-05-18 19:11:43 +00:00
recruiter-responder nfs-mirror: append transferred files to offsite-sync manifest 2026-05-24 15:32:22 +00:00
redis keel: enroll 15 critical-path namespaces for digest-only auto-update 2026-05-17 12:13:22 +00:00
reloader keel: enroll 15 critical-path namespaces for digest-only auto-update 2026-05-17 12:13:22 +00:00
resume ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
reverse-proxy keel: enroll 15 critical-path namespaces for digest-only auto-update 2026-05-17 12:13:22 +00:00
rybbit ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
sealed-secrets keel: enroll 11 more namespaces (operators + critical infra) 2026-05-17 20:59:14 +00:00
send ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
servarr keel: enroll 11 more namespaces (operators + critical infra) 2026-05-17 20:59:14 +00:00
shadowsocks ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
speedtest ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
status-page [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
stirling-pdf ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
tandoor infra: add kubectl + authentik providers across 6 stacks 2026-05-21 08:07:22 +00:00
technitium technitium: cut memory — primary 2Gi → 1Gi, secondary+tertiary 2Gi → 512Mi 2026-05-23 10:03:51 +00:00
terminal terminal: probe + alerts after Traefik replica routing-table skew 2026-05-17 10:04:26 +00:00
tor-proxy ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-16 14:06:39 +00:00
trading-bot trading-bot: add kevin_signal_bridge container (kill-switch OFF for Phase 1) 2026-05-24 01:22:53 +00:00
traefik traefik: bump auth-proxy nginx header buffers to handle Authentik cookie pile 2026-05-23 08:34:33 +00:00
travel_blog final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
tuya-bridge ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-16 14:06:39 +00:00
uptime-kuma Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) 2026-05-16 23:10:38 +00:00
url nfs-mirror: append transferred files to offsite-sync manifest 2026-05-24 15:32:22 +00:00
vault trading-bot: revive K8s stack + add meet-kevin-watcher 2026-05-22 11:23:30 +00:00
vaultwarden Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) 2026-05-16 23:10:38 +00:00
vpa keel: enroll 11 more namespaces (operators + critical infra) 2026-05-17 20:59:14 +00:00
wealthfolio Woodpecker CI deploy [CI SKIP] 2026-05-16 13:45:45 +00:00
webhook_handler final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
whisper ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-16 14:06:39 +00:00
wireguard keel: enroll 15 critical-path namespaces for digest-only auto-update 2026-05-17 12:13:22 +00:00
woodpecker ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-16 14:06:39 +00:00
xray xray: drop dead vless ingress + pin Service target_port 2026-05-24 01:13:54 +00:00
ytdlp ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-16 14:06:39 +00:00