infra/stacks/nvidia
Viktor Barzin 15c88bc683
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
keel: belt-and-suspenders opt-out for mysql/redis/nvidia-exporter
After re-enabling Keel with `policy: patch` (commit f325b949), 3 of the
60 first-hour bumps broke things and need explicit cluster-wide opt-out
so future Kyverno reconciles can't put them back under auto-update:

- `dbaas/mysql-standalone`: patch-bumped `mysql:8.4.8 → :8.4.9` and the
  DD upgrade stalled (we explicitly track that as beads `code-963q` —
  the 8.4.9 jump needs a wipe+reinit, not a rolling upgrade). The
  StatefulSet already had `annotation=never` from TF but was missing the
  LABEL — Kyverno's selector exclude reads the LABEL, so a reconcile
  that dropped the annotation could resume auto-update. Added the LABEL.

- `redis/redis-v2`: patch-bumped `redis:8-alpine → :8.0.6-alpine` and
  the new image rejected the `aof-load-corrupt-tail-max-size` directive
  from commit 1eee56d0 → redis-v2-2 CrashLoopBackOff. Plus :8.0.6 is
  semantically older than :8-alpine (which resolves to :8.6.2) — same
  Keel tag-picking pathology as the 2026-05-26 morning incident, just
  in a different shape. LABEL + ANNOTATION both added.

- `nvidia/nvidia-exporter`: Keel rewrote `:latest → :4.5.2-4.8.1-ubuntu22.04`
  and the new dcgm-exporter OOMKilled at the 192Mi memory limit
  (4 restarts before I caught it). Added LABEL + ANNOTATION for opt-out,
  AND bumped memory request/limit 192Mi → 256Mi/512Mi so the bumped image
  doesn't OOM (older versions fit in 192Mi; the bumped one needs ~250Mi
  steady-state).

The 56 other Keel bumps in that 10-minute window (coredns 1.12.1→1.12.4,
kyverno 1.16.1→1.16.4, nextcloud 32.0.3→32.0.9, grafana 12.3.1→12.3.6,
cnpg, mailserver, csi-nfs, metrics-server, etc.) landed cleanly — the
`patch` policy is the right default. Per-workload `never` opt-out is
the maintenance cost.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 21:53:10 +00:00
..
modules/nvidia keel: belt-and-suspenders opt-out for mysql/redis/nvidia-exporter 2026-05-26 21:53:10 +00:00
main.tf extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
secrets extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
terragrunt.hcl extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00