Compute & Resource Management

Overview

The infrastructure runs on a single Dell R730 server with Proxmox VE, hosting a 5-node Kubernetes cluster. Compute resources are managed through a combination of Vertical Pod Autoscaler (VPA) recommendations, tier-based LimitRange defaults, and ResourceQuota enforcement. The cluster employs a no-CPU-limits policy to avoid CFS throttling while using memory requests=limits for stability. GPU workloads run on a dedicated node with Tesla T4 passthrough.

Architecture Diagram

graph TB
    subgraph Physical["Dell R730 Physical Host"]
        CPU["1x Xeon E5-2699 v4<br/>22c/44t<br/>CPU2 unpopulated"]
        RAM["272GB DDR4-2400 ECC"]
        GPU["NVIDIA Tesla T4<br/>PCIe 0000:06:00.0"]
        DISK["1.1TB SSD<br/>931GB SSD<br/>10.7TB HDD"]
    end

    subgraph Proxmox["Proxmox VE"]
        direction TB
        MASTER["VM 200: k8s-master<br/>8c / 32GB<br/>10.0.20.100"]
        NODE1["VM 201: k8s-node1<br/>16c / 32GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:PreferNoSchedule"]
        NODE2["VM 202: k8s-node2<br/>8c / 32GB"]
        NODE3["VM 203: k8s-node3<br/>8c / 32GB"]
        NODE4["VM 204: k8s-node4<br/>8c / 32GB"]
    end

    subgraph K8s["Kubernetes Cluster v1.34.2"]
        direction TB

        subgraph VPA["VPA (Goldilocks - Initial Mode)"]
            RECOMMEND["Quarterly Review:<br/>upperBound x1.2 (stable)<br/>upperBound x1.3 (GPU/volatile)"]
        end

        subgraph LimitRange["LimitRange per Tier"]
            TIER0_LR["0-core: 512Mi-8Gi mem<br/>500m-4 cpu"]
            TIER1_LR["1-cluster: 512Mi-4Gi mem<br/>500m-2 cpu"]
            TIER2_LR["2-gpu: 2Gi-16Gi mem<br/>1-8 cpu"]
            TIER34_LR["3-edge/4-aux: 256Mi-4Gi mem<br/>250m-2 cpu"]
        end

        subgraph ResourceQuota["ResourceQuota per Tier"]
            TIER0_RQ["0-core: 32 cpu / 64Gi mem / 100 pods"]
            TIER1_RQ["1-cluster: 16 cpu / 32Gi mem / 30 pods"]
            TIER2_RQ["2-gpu: 48 cpu / 96Gi mem / 40 pods"]
            TIER34_RQ["3-edge/4-aux: 8-16 cpu / 16-32Gi mem / 20-30 pods"]
        end
    end

    Physical --> Proxmox
    GPU -.->|Passthrough| NODE1
    Proxmox --> K8s
    VPA --> LimitRange
    LimitRange --> ResourceQuota

Components

Proxmox Host

| Component | Specification |
| --- | --- |
| Model | Dell PowerEdge R730 |
| CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) |
| Total Cores/Threads | 22 cores / 44 threads |
| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix); VMs use ~160GB total (5 K8s VMs x 32GB) |
| GPU | NVIDIA Tesla T4 (16GB GDDR6, PCIe 0000:06:00.0) |
| Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD |
| Hypervisor | Proxmox VE |

Kubernetes Nodes

| VM | VMID | vCPUs | RAM | Network | Role | Taints |
| --- | --- | --- | --- | --- | --- | --- |
| k8s-master | 200 | 8 | 32GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | node-role.kubernetes.io/control-plane:NoSchedule |
| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | nvidia.com/gpu=true:PreferNoSchedule (applied dynamically to whichever node carries the GPU) |
| k8s-node2 | 202 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node3 | 203 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node4 | 204 | 8 | 32GB | vmbr1:vlan20 | Worker | None |

Total Cluster Resources: 48 vCPUs, ~160GB RAM (5 nodes x 32GB)

GPU Passthrough

| Parameter | Value |
| --- | --- |
| Device | NVIDIA Tesla T4 (16GB GDDR6) |
| PCIe Address | 0000:06:00.0 |
| Assigned VM | VMID 201 (k8s-node1); physical location only, no Terraform pin |
| Node Label | nvidia.com/gpu.present=true (auto-applied by gpu-feature-discovery; also feature.node.kubernetes.io/pci-10de.present=true from NFD) |
| Node Taint | nvidia.com/gpu=true:PreferNoSchedule (applied by null_resource.gpu_node_config to every NFD-tagged GPU node) |
| Driver | NVIDIA GPU Operator |
| Resource Name | nvidia.com/gpu |

Resource Management Stack

| Component | Version/Mode | Purpose |
| --- | --- | --- |
| VPA | Goldilocks "Initial" mode | Resource recommendation (not auto-scaling) |
| Kyverno | Policy engine | Auto-generate LimitRange + ResourceQuota per tier |
| PriorityClass | Per tier (200K-900K) | Pod preemption during resource pressure |
| QoS Class | Guaranteed (0-2), Burstable (3-4) | Eviction order |

How It Works

CPU Resource Management

Policy: No CPU limits cluster-wide, only CPU requests.

Rationale: Linux CFS (Completely Fair Scheduler) throttles containers to their exact CPU limit even when the CPU is idle, causing artificial performance degradation. By setting only CPU requests, containers can burst to unused CPU capacity.

Implementation:

  • All pods set resources.requests.cpu (reserves capacity)
  • No pods set resources.limits.cpu
  • Scheduler uses CPU requests for bin-packing
  • Kernel CFS shares unused CPU proportionally by requests

Example:

resources:
  requests:
    cpu: "500m"
  # No limits.cpu - can burst to idle CPU
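
A quick way to confirm the policy actually holds is to audit the cluster for any container that still defines a CPU limit. A minimal sketch using kubectl and jq; the output should be empty:

# List any containers that still carry a CPU limit (expected: no output)
kubectl get pods -A -o json \
  | jq -r '.items[]
      | .metadata.namespace as $ns | .metadata.name as $pod
      | .spec.containers[]
      | select(.resources.limits.cpu != null)
      | "\($ns)/\($pod)/\(.name): \(.resources.limits.cpu)"'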

Memory Resource Management

Policy: Memory requests = limits for stability.

Rationale: Memory is not compressible like CPU. A pod that exceeds its memory request can be OOMKilled unpredictably. Setting requests=limits ensures:

  • Predictable memory allocation
  • QoS class "Guaranteed" (tiers 0-2) or "Burstable" (tiers 3-4)
  • No surprise OOMKills during memory pressure

Implementation:

  • Tier 0-2: requests.memory = limits.memory (Guaranteed QoS)
  • Tier 3-4: requests.memory < limits.memory (Burstable QoS, reduces scheduler pressure)
  • Values based on VPA upperBound x1.2 (stable) or x1.3 (GPU/volatile)

Example:

# Tier 0-2 (Guaranteed)
resources:
  requests:
    memory: "2Gi"
  limits:
    memory: "2Gi"

# Tier 3-4 (Burstable)
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"

Vertical Pod Autoscaler (VPA)

Mode: Goldilocks in "Initial" mode (recommend-only, not auto-scaling).

Why not Auto mode?

  • VPA Auto mode directly updates Deployment specs, creating drift from Terraform state
  • Terraform manages all resources declaratively, so VPA changes would be reverted
  • Quarterly review process maintains control and aligns with planned maintenance windows

Workflow:

  1. VPA monitors pod resource usage over time
  2. Goldilocks dashboard shows recommendations (lowerBound, target, upperBound)
  3. Quarterly review: Engineer reviews VPA recommendations in Goldilocks UI
  4. Apply sizing: Update Terraform with memory: <upperBound> * 1.2 (stable) or * 1.3 (GPU/volatile)
  5. Terragrunt apply updates Deployment specs
  6. Pods restart with new resource allocations

Stability Multipliers:

  • x1.2: Stable services (databases, monitoring, core services)
  • x1.3: GPU workloads or volatile services (user-facing apps, ML inference)
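
For steps 3-4, the upperBound can also be read directly from the VPA object instead of the Goldilocks UI. A sketch, with <vpa-name> and <namespace> as placeholders:

# Read the memory upperBound recommendation for the first container
kubectl get vpa <vpa-name> -n <namespace> \
  -o jsonpath='{.status.recommendation.containerRecommendations[0].upperBound.memory}'

# Example sizing: upperBound 1536Mi, stable service -> 1536 * 1.2 ~= 1844Mi, round up to 2Gi in Terraform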

Tier-Based LimitRange

Kyverno automatically creates a LimitRange in each namespace based on its tier prefix.

| Tier | Default Memory | Max Memory | Default CPU | Max CPU |
| --- | --- | --- | --- | --- |
| 0-core | 512Mi | 8Gi | 500m | 4 |
| 1-cluster | 512Mi | 4Gi | 500m | 2 |
| 2-gpu | 2Gi | 16Gi | 1 | 8 |
| 3-edge | 256Mi | 4Gi | 250m | 2 |
| 4-aux | 256Mi | 4Gi | 250m | 2 |

Purpose:

  • Prevents pods without explicit resources from requesting unlimited resources
  • Sets sensible defaults for sidecars and init containers
  • Enforces maximum per-container limits

Example: A pod in 4-aux-vaultwarden without explicit resources gets:

resources:
  requests:
    memory: 256Mi
    cpu: 250m
  limits:
    memory: 4Gi
    cpu: 2  # (ignored due to no-CPU-limits policy)
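
To confirm which defaults the admission controller actually injected, inspect the running pod's spec. A sketch with <pod-name> as a placeholder:

# Show the effective resources for every container in the pod
kubectl get pod <pod-name> -n 4-aux-vaultwarden -o json \
  | jq '.spec.containers[] | {name, resources}'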

Tier-Based ResourceQuota

Kyverno automatically creates a ResourceQuota in each namespace based on its tier.

| Tier | CPU Limit | Memory Limit | Max Pods |
| --- | --- | --- | --- |
| 0-core | 32 | 64Gi | 100 |
| 1-cluster | 16 | 32Gi | 30 |
| 2-gpu | 48 | 96Gi | 40 |
| 3-edge | 16 | 32Gi | 30 |
| 4-aux | 8 | 16Gi | 20 |

Purpose:

  • Prevents a single namespace from monopolizing cluster resources
  • Enforces tier-appropriate resource allocation
  • Protects critical services from lower-tier resource exhaustion

Quota Exhaustion: If a namespace exceeds its quota, new pods are rejected with Forbidden: exceeded quota.
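
Current consumption against each quota can be checked before deploying into a tier; <namespace> is a placeholder:

# Hard limits vs. current usage for one namespace
kubectl describe resourcequota -n <namespace>

# Quick overview across all tiers
kubectl get resourcequota -A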

QoS Classes and Eviction

Kubernetes assigns QoS classes based on resource configuration:

| QoS Class | Condition | Eviction Priority | Tiers |
| --- | --- | --- | --- |
| Guaranteed | requests = limits (both CPU & memory) | Last | 0-core, 1-cluster, 2-gpu |
| Burstable | requests < limits | Middle | 3-edge, 4-aux |
| BestEffort | No requests or limits | First | None (not used) |

Eviction Order during Memory Pressure:

  1. BestEffort pods (none in cluster)
  2. Burstable pods (tier 3-4), lowest priority first
  3. Guaranteed pods (tier 0-2), lowest priority first
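
The class Kubernetes actually assigned is recorded in pod status, which is a quick way to verify where a pod landed (note that Kubernetes only grants Guaranteed when every container sets both CPU and memory requests equal to limits). Pod and namespace below are placeholders:

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.qosClass}'

# Cluster-wide overview
kubectl get pods -A -o custom-columns=NS:.metadata.namespace,POD:.metadata.name,QOS:.status.qosClass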

Priority Classes:

  • 0-core: 900000
  • 1-cluster: 700000
  • 2-gpu: 500000
  • 3-edge: 300000
  • 4-aux: 200000

During resource pressure, tier 4 pods are evicted before tier 3, tier 3 before tier 2, etc.
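
The tier values can be verified against the PriorityClass objects in the cluster, and each pod records which class it was admitted with:

# List PriorityClasses and their numeric values
kubectl get priorityclass -o custom-columns=NAME:.metadata.name,VALUE:.value

# Which class a given pod ended up with (<pod-name>/<namespace> are placeholders)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.priorityClassName}'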

Democratic-CSI Sidecar Resources

Problem: Democratic-CSI injects 3-4 sidecar containers per pod with PVCs:

  • csi-driver-registrar
  • csi-provisioner
  • csi-attacher
  • csi-resizer

Without explicit resources, each defaults to LimitRange default (256Mi), consuming 768Mi-1Gi per pod.

Solution: Explicitly set sidecar resources in Terraform:

resources {
  requests = {
    memory = "32Mi"
    cpu    = "10m"
  }
  limits = {
    memory = "80Mi"
  }
}

Result: 17 CSI sidecars go from 4.3GB (17 * 256Mi) to 544Mi (17 * 32Mi), freeing 3.7GB.
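
The saving can be sanity-checked by summing the memory requests of all csi-* sidecars cluster-wide. A rough sketch that assumes all such requests are expressed in Mi:

# Total memory requested by csi-* sidecar containers, in Mi
kubectl get pods -A -o json \
  | jq '[.items[].spec.containers[]
         | select(.name | startswith("csi-"))
         | .resources.requests.memory // "0Mi"
         | rtrimstr("Mi") | tonumber] | add'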

GPU Resource Management

Node Selection: GPU pods must:

  1. Tolerate nvidia.com/gpu=true:PreferNoSchedule taint
  2. Select nvidia.com/gpu.present=true label (auto-applied by gpu-feature-discovery wherever the card is)
  3. Request nvidia.com/gpu: 1 resource

Example:

spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: PreferNoSchedule
  nodeSelector:
    nvidia.com/gpu.present: "true"
  containers:
  - name: app
    resources:
      limits:
        nvidia.com/gpu: 1

Portability: No Terraform code references a specific hostname for GPU scheduling. If the GPU card is physically moved to a different node, gpu-feature-discovery moves the nvidia.com/gpu.present=true label with it, and null_resource.gpu_node_config re-applies the nvidia.com/gpu=true:PreferNoSchedule taint to the new host on the next apply (discovery keyed on feature.node.kubernetes.io/pci-10de.present=true).
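
After a card move (or a fresh apply), label and taint placement can be confirmed without knowing the hostname in advance:

# Which node currently carries the GPU label
kubectl get nodes -l nvidia.com/gpu.present=true

# Confirm the PreferNoSchedule taint landed on that node
kubectl get nodes -l nvidia.com/gpu.present=true \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.taints}{"\n"}{end}'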

GPU Workloads:

  • Ollama (LLM inference)
  • ComfyUI (Stable Diffusion workflows)
  • Stable Diffusion WebUI

Configuration

Key Files

| Path | Purpose |
| --- | --- |
| modules/namespace_config/ | Kyverno policies for LimitRange + ResourceQuota generation |
| modules/k8s_app/main.tf | Default resource templates for apps |
| stacks/<service>/terragrunt.hcl | Per-service resource overrides |
| modules/gpu_app/ | GPU-specific resource templates |

Terraform Resource Configuration

Standard App (no PVC):

module "app" {
  source = "../../modules/k8s_app"

  resources = {
    requests = {
      memory = "1Gi"      # VPA upperBound * 1.2
      cpu    = "500m"
    }
    limits = {
      memory = "1Gi"      # Same as request
      # No CPU limit
    }
  }
}

App with Democratic-CSI PVC:

module "app" {
  source = "../../modules/k8s_app"

  resources = {
    requests = {
      memory = "2Gi"
      cpu    = "500m"
    }
    limits = {
      memory = "2Gi"
    }
  }

  sidecar_resources = {
    requests = {
      memory = "32Mi"
      cpu    = "10m"
    }
    limits = {
      memory = "80Mi"
    }
  }
}

GPU App:

module "gpu_app" {
  source = "../../modules/gpu_app"

  gpu_count = 1

  resources = {
    requests = {
      memory = "8Gi"      # VPA upperBound * 1.3
      cpu    = "2"
    }
    limits = {
      memory = "8Gi"
      "nvidia.com/gpu" = 1  # keys containing dots must be quoted in HCL
    }
  }
}

Kyverno Policies

LimitRange Generation (modules/namespace_config/limitrange-policy.yaml):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-limitrange
spec:
  rules:
  - name: generate-limitrange-0-core
    match:
      resources:
        kinds:
        - Namespace
        name: "0-core-*"
    generate:
      kind: LimitRange
      data:
        spec:
          limits:
          - default:
              memory: 512Mi
              cpu: 500m
            defaultRequest:
              memory: 512Mi
              cpu: 500m
            max:
              memory: 8Gi
              cpu: 4
            type: Container

ResourceQuota Generation (modules/namespace_config/resourcequota-policy.yaml):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-resourcequota
spec:
  rules:
  - name: generate-quota-0-core
    match:
      resources:
        kinds:
        - Namespace
        name: "0-core-*"
    generate:
      kind: ResourceQuota
      data:
        spec:
          hard:
            requests.cpu: "32"
            requests.memory: 64Gi
            pods: "100"

Decisions & Rationale

Why no CPU limits?

Decision: Set CPU requests but never set CPU limits.

Rationale:

  • CFS Throttling: Linux Completely Fair Scheduler throttles containers to their exact CPU limit, even when CPU is idle. This causes artificial performance degradation.
  • Burstability: Services can burst to unused CPU during low-load periods, improving response times.
  • Memory headroom: With 272GB physical host RAM (~160GB allocated to K8s VMs), memory is no longer the primary constraint; ~112GB of headroom remains for new VMs.

Tradeoff: A runaway process could monopolize CPU. Mitigated by CPU requests reserving capacity and PriorityClass preemption.

Evidence: After removing CPU limits cluster-wide, p95 latency dropped 40% for API services during load tests.
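
Throttling itself is visible in the cgroup statistics on a node: a growing nr_throttled counter means the CFS limit is biting. A sketch for cgroup v2; the exact slice path depends on the runtime and pod UID:

# On the node: inspect cpu.stat for a pod's cgroup (path is illustrative)
cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/<pod-slice>/cpu.stat
# nr_periods 1200
# nr_throttled 0        <- should stay at 0 with no CPU limits
# throttled_usec 0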

Why Goldilocks in Initial mode instead of Auto?

Decision: Use VPA in "Initial" (recommend-only) mode rather than "Auto" (update pods automatically).

Rationale:

  • Terraform State Drift: VPA Auto mode directly mutates Deployment specs, creating drift from Terraform-managed state. Next Terraform apply reverts VPA changes.
  • Declarative Workflow: Terraform is the source of truth. VPA recommendations are reviewed and applied via Terraform, maintaining declarative infrastructure.
  • Controlled Changes: Quarterly review ensures resource changes align with capacity planning and cluster upgrades.
  • Avoid Thrashing: VPA Auto can restart pods frequently during volatile workloads. Manual application reduces churn.

Tradeoff: Requires quarterly manual review. Accepted because homelab prioritizes stability over auto-optimization.

Why memory requests = limits for tiers 0-2?

Decision: Set memory requests equal to limits for core and cluster services (tiers 0-2).

Rationale:

  • Guaranteed QoS: Ensures pods are last to be evicted during memory pressure.
  • Predictable OOM: Pods are OOMKilled only when exceeding their own limit, not due to other pods' usage.
  • Stability: Critical services (traefik, authentik, vault) must not be evicted unexpectedly.

Tradeoff: Cannot burst above limit. Accepted because critical services are right-sized via VPA.

Why Burstable QoS for tiers 3-4?

Decision: Set memory requests < limits for edge and auxiliary services (tiers 3-4).

Rationale:

  • Reduced Scheduler Pressure: Lower memory requests allow more pods to fit on nodes.
  • Acceptable Eviction: Tier 3-4 services are non-critical (freshrss, vaultwarden) and tolerate occasional eviction.
  • Cost Efficiency: Allows oversubscription of memory for bursty workloads.

Tradeoff: Pods may be evicted during memory pressure. Accepted because tier 3-4 services have PriorityClass 200K-300K.

Why VPA upperBound * 1.2 (or 1.3)?

Decision: Set memory limits to VPA upperBound * 1.2 for stable services, * 1.3 for GPU/volatile services.

Rationale:

  • Headroom: VPA upperBound is the observed maximum usage. Adding 20-30% headroom prevents OOMKills during traffic spikes.
  • Growth Buffer: Services grow over time (more users, more data). Headroom delays the need for manual intervention.
  • GPU Volatility: GPU workloads (ML inference) have unpredictable memory usage. 30% headroom reduces OOMKills.

Tradeoff: Slightly higher memory allocation. Accepted because 272GB RAM provides ample capacity.

Troubleshooting

Pods stuck in Pending state

Symptom: Pod shows status: Pending with event FailedScheduling.

Diagnosis:

kubectl describe pod <pod-name> -n <namespace>

Common Causes:

  1. ResourceQuota exceeded:

    Error: exceeded quota: <namespace>-quota, requested: requests.memory=2Gi, used: requests.memory=14Gi, limited: requests.memory=16Gi
    

    Fix: Increase ResourceQuota in modules/namespace_config/ for that tier, or reduce other pods' requests.

  2. LimitRange default too high:

    0/5 nodes are available: 5 Insufficient memory.
    

    Fix: Override pod resources explicitly in Terraform (defaults come from LimitRange).

  3. GPU taint not tolerated:

    0/5 nodes are available: 1 node(s) had untolerated taint {nvidia.com/gpu: true}, 4 Insufficient nvidia.com/gpu.
    

    Fix: Add toleration and nodeSelector for GPU pods.

  4. No nodes with GPU:

    0/5 nodes are available: 5 Insufficient nvidia.com/gpu.
    

    Fix: Verify the GPU-carrying node is Ready and has the nvidia.com/gpu.present=true label. Check kubectl get nodes -l nvidia.com/gpu.present=true — if empty, gpu-feature-discovery hasn't labeled any node (operator not running, driver not loaded, or PCI passthrough broken).

Pods OOMKilled repeatedly

Symptom: Container is repeatedly terminated with reason OOMKilled (visible in kubectl describe pod), causing frequent restarts.

Diagnosis:

kubectl describe pod <pod-name> -n <namespace>
kubectl top pod <pod-name> -n <namespace>  # Current usage
kubectl get limitrange -n <namespace> -o yaml  # Check defaults

Common Causes:

  1. Using LimitRange default (256Mi or 512Mi): Fix: Set explicit memory request/limit in Terraform based on actual usage.

  2. Memory limit too low: Fix: Check Goldilocks VPA recommendation, set memory = upperBound * 1.2.

  3. Memory leak: Fix: Investigate application code, check Grafana memory usage trends.

Democratic-CSI sidecars consuming excessive memory

Symptom: Pods with PVCs have 3-4 sidecar containers, each using 256Mi (LimitRange default).

Diagnosis:

kubectl get pods -A -o json | jq '.items[] | select(any(.spec.containers[]; .name | contains("csi"))) | {name: .metadata.name, namespace: .metadata.namespace}'
kubectl top pod <pod-name> -n <namespace> --containers

Fix: Update Terraform to override sidecar resources:

sidecar_resources = {
  requests = {
    memory = "32Mi"
    cpu    = "10m"
  }
  limits = {
    memory = "80Mi"
  }
}

Tier 3-4 pods evicted during resource pressure

Symptom: Lower-tier pods show status: Evicted with reason The node was low on resource: memory.

Diagnosis:

kubectl get events --sort-by='.lastTimestamp' | grep Evicted
kubectl top nodes  # Check node memory usage

Expected Behavior: This is normal. Tier 3-4 use Burstable QoS and priority 200K-300K, making them first eviction candidates.

Fix:

  • If evictions are frequent: Increase node memory or reduce tier 3-4 memory limits
  • If evicted service is critical: Promote to tier 1 or 2
  • If node is overloaded: Check for memory leaks in tier 0-2 services

GPU pods not scheduling on GPU node

Symptom: GPU pod stuck in Pending with event 0/5 nodes are available: 1 node(s) had untolerated taint.

Diagnosis:

kubectl describe node $(kubectl get nodes -l nvidia.com/gpu.present=true -o name) | grep Taints
kubectl describe pod <pod-name> -n <namespace> | grep -A5 Tolerations

Fix: Add GPU toleration and selector to pod spec:

spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: PreferNoSchedule
  nodeSelector:
    nvidia.com/gpu.present: "true"
  containers:
  - name: app
    resources:
      limits:
        nvidia.com/gpu: 1

Node out of memory despite low pod usage

Symptom: Node shows memory pressure, but kubectl top pods shows low usage.

Diagnosis:

# SSH to node
ssh k8s-node2
free -h
ps aux --sort=-%mem | head -20

Common Causes:

  1. Kernel memory: Page cache, slab allocator not shown in kubectl top
  2. System services: kubelet, containerd, systemd-journald
  3. Zombie containers: Old containers not cleaned up

Fix:

# Drop page cache, dentries and inodes (generally safe, expect a brief performance dip)
echo 3 > /proc/sys/vm/drop_caches

# Cleanup stopped containers
crictl rm $(crictl ps -a --state exited -q)

# Restart kubelet (forces cleanup)
systemctl restart kubelet

VPA recommendations not appearing in Goldilocks

Symptom: Goldilocks dashboard shows no recommendations for a service.

Diagnosis:

kubectl get vpa -n <namespace>
kubectl describe vpa <vpa-name> -n <namespace>

Common Causes:

  1. VPA not created: Terraform module missing VPA resource
  2. Insufficient data: VPA needs 24h of metrics before recommending
  3. VPA pod not running: VPA controller/recommender crashed

Fix:

# Check VPA pods
kubectl get pods -n kube-system | grep vpa

# Check VPA logs
kubectl logs -n kube-system deployment/vpa-recommender

# Restart VPA if needed
kubectl rollout restart -n kube-system deployment/vpa-recommender