From bb9d8f1b3862653c385020792d783199fbfa5003 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Tue, 26 May 2026 09:04:51 +0000
Subject: [PATCH] =?UTF-8?q?kyverno:=20GPU=20priority=20mutate=20uses=20add?=
 =?UTF-8?q?=20(was=20replace)=20=E2=80=94=20fixes=20silent=20skip?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The Layer 5 ClusterPolicy inject-gpu-workload-priority used JSON6902
op=replace on /spec/priorityClassName. Incoming pods (e.g. frigate) have
no priorityClassName field at all — replace requires the path to exist,
so the patch fails with "doc is missing key: /spec/priorityClassName"
and the whole mutation chain aborts BEFORE Layer 4
(inject-priority-class-from-tier) gets a chance to add the field.

Result: GPU pods never got priorityClassName set, sat at priority=0,
and could not preempt lower-tier pods on the GPU node. Observed today
on frigate post-node4-recovery — pod stayed Pending with "Preemption is
not helpful" while 3 pg-cluster pods (tier-1-cluster, priority 800000)
occupied node1's memory budget.

Fix: op=add for all three paths. add works whether or not the key is
present, so the policy is robust to the upstream pod shape.

Co-Authored-By: Claude Opus 4.7
---
 stacks/kyverno/modules/kyverno/resource-governance.tf | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/stacks/kyverno/modules/kyverno/resource-governance.tf b/stacks/kyverno/modules/kyverno/resource-governance.tf
index c044b389..855128f1 100644
--- a/stacks/kyverno/modules/kyverno/resource-governance.tf
+++ b/stacks/kyverno/modules/kyverno/resource-governance.tf
@@ -925,19 +925,24 @@ resource "kubectl_manifest" "mutate_gpu_priority" {
             ]
           }
           mutate = {
+            # `op=add` (not replace) — incoming pods often lack the
+            # `/spec/priorityClassName` key entirely; replace fails with
+            # "doc is missing key" and aborts the mutation chain BEFORE
+            # Layer 4 (tier injection) can fall back. add works whether
+            # the path exists or not. Verified 2026-05-26 on frigate.
             patchesJson6902 = yamlencode([
               {
-                op = "replace"
+                op = "add"
                 path = "/spec/priorityClassName"
                 value = "gpu-workload"
               },
               {
-                op = "replace"
+                op = "add"
                 path = "/spec/priority"
                 value = 1200000
               },
               {
-                op = "replace"
+                op = "add"
                 path = "/spec/preemptionPolicy"
                 value = "PreemptLowerPriority"
               }
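
The add/replace difference described in the commit message can be illustrated with a minimal Python sketch of the two RFC 6902 operations on object members. This is not Kyverno's implementation: `apply_op`, `apply_patch`, `gpu_ops`, and the sample `pod` are hypothetical names, JSON Pointer escaping and array indices are ignored, and the error text only mirrors the message quoted in the commit.

```python
import copy


def apply_op(doc, op):
    """Apply one RFC 6902 'add' or 'replace' op to a nested dict.

    Simplified sketch: object members only, no '~0'/'~1' escaping,
    no array indices.
    """
    *parents, key = op["path"].lstrip("/").split("/")
    target = doc
    for part in parents:
        target = target[part]  # the parent object must exist for both ops
    if op["op"] == "replace" and key not in target:
        # RFC 6902: the target location of 'replace' must already exist.
        raise KeyError(f"doc is missing key: {op['path']}")
    # 'add' creates the member or overwrites an existing one.
    target[key] = op["value"]
    return doc


def apply_patch(doc, ops):
    """Apply ops in order to a copy; the first failing op aborts the patch."""
    out = copy.deepcopy(doc)
    for op in ops:
        apply_op(out, op)
    return out


def gpu_ops(op_name):
    """The three paths touched by the policy, with a selectable op (hypothetical helper)."""
    return [
        {"op": op_name, "path": "/spec/priorityClassName", "value": "gpu-workload"},
        {"op": op_name, "path": "/spec/priority", "value": 1200000},
        {"op": op_name, "path": "/spec/preemptionPolicy", "value": "PreemptLowerPriority"},
    ]


# Hypothetical incoming pod with none of the three fields set.
pod = {"spec": {"containers": [{"name": "frigate"}]}}
```

Under these assumptions, `apply_patch(pod, gpu_ops("add"))` returns a copy with all three fields set, while `apply_patch(pod, gpu_ops("replace"))` raises `KeyError` on the first op because `/spec/priorityClassName` is absent.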