Compute & Resource Management

Overview

The infrastructure runs on a single Dell R730 server with Proxmox VE, hosting a 5-node Kubernetes cluster. Compute resources are managed through a combination of Vertical Pod Autoscaler (VPA) recommendations, tier-based LimitRange defaults, and ResourceQuota enforcement. The cluster employs a no-CPU-limits policy to avoid CFS throttling while using memory requests=limits for stability. GPU workloads run on a dedicated node with Tesla T4 passthrough.

Architecture Diagram

graph TB
    subgraph Physical["Dell R730 Physical Host"]
        CPU["1x Xeon E5-2699 v4<br/>22c/44t<br/>CPU2 unpopulated"]
        RAM["272GB DDR4-2400 ECC"]
        GPU["NVIDIA Tesla T4<br/>PCIe 0000:06:00.0"]
        DISK["1.1TB SSD<br/>931GB SSD<br/>10.7TB HDD"]
    end

    subgraph Proxmox["Proxmox VE"]
        direction TB
        MASTER["VM 200: k8s-master<br/>8c / 32GB<br/>10.0.20.100"]
        NODE1["VM 201: k8s-node1<br/>16c / 32GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:PreferNoSchedule"]
        NODE2["VM 202: k8s-node2<br/>8c / 32GB"]
        NODE3["VM 203: k8s-node3<br/>8c / 32GB"]
        NODE4["VM 204: k8s-node4<br/>8c / 32GB"]
    end

    subgraph K8s["Kubernetes Cluster v1.34.2"]
        direction TB

        subgraph VPA["VPA (Goldilocks - Initial Mode)"]
            RECOMMEND["Quarterly Review:<br/>upperBound x1.2 (stable)<br/>upperBound x1.3 (GPU/volatile)"]
        end

        subgraph LimitRange["LimitRange per Tier"]
            TIER0_LR["0-core: 512Mi-8Gi mem<br/>500m-4 cpu"]
            TIER1_LR["1-cluster: 512Mi-4Gi mem<br/>500m-2 cpu"]
            TIER2_LR["2-gpu: 2Gi-16Gi mem<br/>1-8 cpu"]
            TIER34_LR["3-edge/4-aux: 256Mi-4Gi mem<br/>250m-2 cpu"]
        end

        subgraph ResourceQuota["ResourceQuota per Tier"]
            TIER0_RQ["0-core: 32 cpu / 64Gi mem / 100 pods"]
            TIER1_RQ["1-cluster: 16 cpu / 32Gi mem / 30 pods"]
            TIER2_RQ["2-gpu: 48 cpu / 96Gi mem / 40 pods"]
            TIER34_RQ["3-edge/4-aux: 8-16 cpu / 16-32Gi mem / 20-30 pods"]
        end
    end

    Physical --> Proxmox
    GPU -.->|Passthrough| NODE1
    Proxmox --> K8s
    VPA --> LimitRange
    LimitRange --> ResourceQuota

Components

Proxmox Host

| Component | Specification |
| --- | --- |
| Model | Dell PowerEdge R730 |
| CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) |
| Total Cores/Threads | 22 cores / 44 threads |
| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix); VMs use ~160GB total (5 K8s VMs x 32GB) |
| GPU | NVIDIA Tesla T4 (16GB GDDR6, PCIe 0000:06:00.0) |
| Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD |
| Hypervisor | Proxmox VE |

Kubernetes Nodes

| VM | VMID | vCPUs | RAM | Network | Role | Taints |
| --- | --- | --- | --- | --- | --- | --- |
| k8s-master | 200 | 8 | 32GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | node-role.kubernetes.io/control-plane:NoSchedule |
| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | nvidia.com/gpu=true:PreferNoSchedule (applied dynamically to whichever node carries the GPU) |
| k8s-node2 | 202 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node3 | 203 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node4 | 204 | 8 | 32GB | vmbr1:vlan20 | Worker | None |

Total Cluster Resources: 48 vCPUs, ~160GB RAM (5 nodes x 32GB)

GPU Passthrough

| Parameter | Value |
| --- | --- |
| Device | NVIDIA Tesla T4 (16GB GDDR6) |
| PCIe Address | 0000:06:00.0 |
| Assigned VM | VMID 201 (k8s-node1); physical location only, no Terraform pin |
| Node Label | nvidia.com/gpu.present=true (auto-applied by gpu-feature-discovery; also feature.node.kubernetes.io/pci-10de.present=true from NFD) |
| Node Taint | nvidia.com/gpu=true:PreferNoSchedule (applied by null_resource.gpu_node_config to every NFD-tagged GPU node) |
| Driver | NVIDIA GPU Operator |
| Resource Name | nvidia.com/gpu |

Resource Management Stack

| Component | Version/Mode | Purpose |
| --- | --- | --- |
| VPA | Goldilocks "Initial" mode | Resource recommendation (not auto-scaling) |
| Kyverno | Policy engine | Auto-generate LimitRange + ResourceQuota per tier |
| PriorityClass | Per tier (200K-900K) | Pod preemption during resource pressure |
| QoS Class | Guaranteed (0-2), Burstable (3-4) | Eviction order |

How It Works

CPU Resource Management

Policy: No CPU limits cluster-wide, only CPU requests.

Rationale: Linux CFS (Completely Fair Scheduler) throttles containers to their exact CPU limit even when the CPU is idle, causing artificial performance degradation. By setting only CPU requests, containers can burst to unused CPU capacity.

Implementation:

  • All pods set resources.requests.cpu (reserves capacity)
  • No pods set resources.limits.cpu
  • Scheduler uses CPU requests for bin-packing
  • Kernel CFS shares unused CPU proportionally by requests

Example:

resources:
  requests:
    cpu: "500m"
  # No limits.cpu - can burst to idle CPU
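
A quick way to confirm the policy actually holds is to audit the cluster for any container that still defines a CPU limit. A minimal sketch using kubectl and jq; the output should be empty:

# List any containers that still carry a CPU limit (expected: no output)
kubectl get pods -A -o json \
  | jq -r '.items[]
      | .metadata.namespace as $ns | .metadata.name as $pod
      | .spec.containers[]
      | select(.resources.limits.cpu != null)
      | "\($ns)/\($pod)/\(.name): \(.resources.limits.cpu)"'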

Memory Resource Management

Policy: Memory requests = limits for stability.

Rationale: Memory is not compressible like CPU. A pod that exceeds its memory request can be OOMKilled unpredictably. Setting requests=limits ensures:

  • Predictable memory allocation
  • QoS class "Guaranteed" (tiers 0-2) or "Burstable" (tiers 3-4)
  • No surprise OOMKills during memory pressure

Implementation:

  • Tier 0-2: requests.memory = limits.memory (Guaranteed QoS)
  • Tier 3-4: requests.memory < limits.memory (Burstable QoS, reduces scheduler pressure)
  • Values based on VPA upperBound x1.2 (stable) or x1.3 (GPU/volatile)

Example:

# Tier 0-2 (Guaranteed)
resources:
  requests:
    memory: "2Gi"
  limits:
    memory: "2Gi"

# Tier 3-4 (Burstable)
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"

Vertical Pod Autoscaler (VPA)

Mode: Goldilocks in "Initial" mode (recommend-only, not auto-scaling).

Why not Auto mode?

  • VPA Auto mode directly updates Deployment specs, creating drift from Terraform state
  • Terraform manages all resources declaratively, so VPA changes would be reverted
  • Quarterly review process maintains control and aligns with planned maintenance windows

Workflow:

  1. VPA monitors pod resource usage over time
  2. Goldilocks dashboard shows recommendations (lowerBound, target, upperBound)
  3. Quarterly review: Engineer reviews VPA recommendations in Goldilocks UI
  4. Apply sizing: Update Terraform with memory: <upperBound> * 1.2 (stable) or * 1.3 (GPU/volatile)
  5. Terragrunt apply updates Deployment specs
  6. Pods restart with new resource allocations

Stability Multipliers:

  • x1.2: Stable services (databases, monitoring, core services)
  • x1.3: GPU workloads or volatile services (user-facing apps, ML inference)
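
For steps 3-4, the upperBound can also be read directly from the VPA object instead of the Goldilocks UI. A sketch, with <vpa-name> and <namespace> as placeholders:

# Read the memory upperBound recommendation for the first container
kubectl get vpa <vpa-name> -n <namespace> \
  -o jsonpath='{.status.recommendation.containerRecommendations[0].upperBound.memory}'

# Example sizing: upperBound 1536Mi, stable service -> 1536 * 1.2 ~= 1844Mi, round up to 2Gi in Terraform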

Tier-Based LimitRange

Kyverno automatically creates a LimitRange in each namespace based on its tier prefix.

| Tier | Default Memory | Max Memory | Default CPU | Max CPU |
| --- | --- | --- | --- | --- |
| 0-core | 512Mi | 8Gi | 500m | 4 |
| 1-cluster | 512Mi | 4Gi | 500m | 2 |
| 2-gpu | 2Gi | 16Gi | 1 | 8 |
| 3-edge | 256Mi | 4Gi | 250m | 2 |
| 4-aux | 256Mi | 4Gi | 250m | 2 |

Purpose:

  • Prevents pods without explicit resources from requesting unlimited resources
  • Sets sensible defaults for sidecars and init containers
  • Enforces maximum per-container limits

Example: A pod in 4-aux-vaultwarden without explicit resources gets:

resources:
  requests:
    memory: 256Mi
    cpu: 250m
  limits:
    memory: 4Gi
    cpu: 2  # (ignored due to no-CPU-limits policy)
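
To confirm which defaults the admission controller actually injected, inspect the running pod's spec. A sketch with <pod-name> as a placeholder:

# Show the effective resources for every container in the pod
kubectl get pod <pod-name> -n 4-aux-vaultwarden -o json \
  | jq '.spec.containers[] | {name, resources}'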

Tier-Based ResourceQuota

Kyverno automatically creates a ResourceQuota in each namespace based on its tier.

| Tier | CPU Limit | Memory Limit | Max Pods |
| --- | --- | --- | --- |
| 0-core | 32 | 64Gi | 100 |
| 1-cluster | 16 | 32Gi | 30 |
| 2-gpu | 48 | 96Gi | 40 |
| 3-edge | 16 | 32Gi | 30 |
| 4-aux | 8 | 16Gi | 20 |

Purpose:

  • Prevents a single namespace from monopolizing cluster resources
  • Enforces tier-appropriate resource allocation
  • Protects critical services from lower-tier resource exhaustion

Quota Exhaustion: If a namespace exceeds its quota, new pods are rejected with Forbidden: exceeded quota.
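
Current consumption against each quota can be checked before deploying into a tier; <namespace> is a placeholder:

# Hard limits vs. current usage for one namespace
kubectl describe resourcequota -n <namespace>

# Quick overview across all tiers
kubectl get resourcequota -A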

QoS Classes and Eviction

Kubernetes assigns QoS classes based on resource configuration:

| QoS Class | Condition | Eviction Priority | Tiers |
| --- | --- | --- | --- |
| Guaranteed | requests = limits (both CPU & memory) | Last | 0-core, 1-cluster, 2-gpu |
| Burstable | requests < limits | Middle | 3-edge, 4-aux |
| BestEffort | No requests or limits | First | None (not used) |

Eviction Order during Memory Pressure:

  1. BestEffort pods (none in cluster)
  2. Burstable pods (tier 3-4), lowest priority first
  3. Guaranteed pods (tier 0-2), lowest priority first
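
The class Kubernetes actually assigned is recorded in pod status, which is a quick way to verify where a pod landed (note that Kubernetes only grants Guaranteed when every container sets both CPU and memory requests equal to limits). Pod and namespace below are placeholders:

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.qosClass}'

# Cluster-wide overview
kubectl get pods -A -o custom-columns=NS:.metadata.namespace,POD:.metadata.name,QOS:.status.qosClass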

Priority Classes:

  • 0-core: 900000
  • 1-cluster: 700000
  • 2-gpu: 500000
  • 3-edge: 300000
  • 4-aux: 200000

During resource pressure, tier 4 pods are evicted before tier 3, tier 3 before tier 2, etc.
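
The tier values can be verified against the PriorityClass objects in the cluster, and each pod records which class it was admitted with:

# List PriorityClasses and their numeric values
kubectl get priorityclass -o custom-columns=NAME:.metadata.name,VALUE:.value

# Which class a given pod ended up with (<pod-name>/<namespace> are placeholders)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.priorityClassName}'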

Democratic-CSI Sidecar Resources

Problem: Democratic-CSI injects 3-4 sidecar containers per pod with PVCs:

  • csi-driver-registrar
  • csi-provisioner
  • csi-attacher
  • csi-resizer

Without explicit resources, each defaults to LimitRange default (256Mi), consuming 768Mi-1Gi per pod.

Solution: Explicitly set sidecar resources in Terraform:

resources {
  requests = {
    memory = "32Mi"
    cpu    = "10m"
  }
  limits = {
    memory = "80Mi"
  }
}

Result: 17 CSI sidecars go from 4.3GB (17 * 256Mi) to 544Mi (17 * 32Mi), freeing 3.7GB.
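
The saving can be sanity-checked by summing the memory requests of all csi-* sidecars cluster-wide. A rough sketch that assumes all such requests are expressed in Mi:

# Total memory requested by csi-* sidecar containers, in Mi
kubectl get pods -A -o json \
  | jq '[.items[].spec.containers[]
         | select(.name | startswith("csi-"))
         | .resources.requests.memory // "0Mi"
         | rtrimstr("Mi") | tonumber] | add'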

GPU Resource Management

Node Selection: GPU pods must:

  1. Tolerate nvidia.com/gpu=true:PreferNoSchedule taint
  2. Select nvidia.com/gpu.present=true label (auto-applied by gpu-feature-discovery wherever the card is)
  3. Request nvidia.com/gpu: 1 resource

Example:

spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: PreferNoSchedule
  nodeSelector:
    nvidia.com/gpu.present: "true"
  containers:
  - name: app
    resources:
      limits:
        nvidia.com/gpu: 1

Portability: No Terraform code references a specific hostname for GPU scheduling. If the GPU card is physically moved to a different node, gpu-feature-discovery moves the nvidia.com/gpu.present=true label with it, and null_resource.gpu_node_config re-applies the nvidia.com/gpu=true:PreferNoSchedule taint to the new host on the next apply (discovery keyed on feature.node.kubernetes.io/pci-10de.present=true).
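
After a card move (or a fresh apply), label and taint placement can be confirmed without knowing the hostname in advance:

# Which node currently carries the GPU label
kubectl get nodes -l nvidia.com/gpu.present=true

# Confirm the PreferNoSchedule taint landed on that node
kubectl get nodes -l nvidia.com/gpu.present=true \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.taints}{"\n"}{end}'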

GPU Workloads:

  • Ollama (LLM inference)
  • ComfyUI (Stable Diffusion workflows)
  • Stable Diffusion WebUI

Configuration

Key Files

| Path | Purpose |
| --- | --- |
| modules/namespace_config/ | Kyverno policies for LimitRange + ResourceQuota generation |
| modules/k8s_app/main.tf | Default resource templates for apps |
| stacks/<service>/terragrunt.hcl | Per-service resource overrides |
| modules/gpu_app/ | GPU-specific resource templates |

Terraform Resource Configuration

Standard App (no PVC):

module "app" {
  source = "../../modules/k8s_app"

  resources = {
    requests = {
      memory = "1Gi"      # VPA upperBound * 1.2
      cpu    = "500m"
    }
    limits = {
      memory = "1Gi"      # Same as request
      # No CPU limit
    }
  }
}

App with Democratic-CSI PVC:

module "app" {
  source = "../../modules/k8s_app"

  resources = {
    requests = {
      memory = "2Gi"
      cpu    = "500m"
    }
    limits = {
      memory = "2Gi"
    }
  }

  sidecar_resources = {
    requests = {
      memory = "32Mi"
      cpu    = "10m"
    }
    limits = {
      memory = "80Mi"
    }
  }
}

GPU App:

module "gpu_app" {
  source = "../../modules/gpu_app"

  gpu_count = 1

  resources = {
    requests = {
      memory = "8Gi"      # VPA upperBound * 1.3
      cpu    = "2"
    }
    limits = {
      memory = "8Gi"
      "nvidia.com/gpu" = 1  # keys containing dots must be quoted in HCL
    }
  }
}

Kyverno Policies

LimitRange Generation (modules/namespace_config/limitrange-policy.yaml):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-limitrange
spec:
  rules:
  - name: generate-limitrange-0-core
    match:
      resources:
        kinds:
        - Namespace
        name: "0-core-*"
    generate:
      kind: LimitRange
      data:
        spec:
          limits:
          - default:
              memory: 512Mi
              cpu: 500m
            defaultRequest:
              memory: 512Mi
              cpu: 500m
            max:
              memory: 8Gi
              cpu: 4
            type: Container

ResourceQuota Generation (modules/namespace_config/resourcequota-policy.yaml):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-resourcequota
spec:
  rules:
  - name: generate-quota-0-core
    match:
      resources:
        kinds:
        - Namespace
        name: "0-core-*"
    generate:
      kind: ResourceQuota
      data:
        spec:
          hard:
            requests.cpu: "32"
            requests.memory: 64Gi
            pods: "100"

Decisions & Rationale

Why no CPU limits?

Decision: Set CPU requests but never set CPU limits.

Rationale:

  • CFS Throttling: Linux Completely Fair Scheduler throttles containers to their exact CPU limit, even when CPU is idle. This causes artificial performance degradation.
  • Burstability: Services can burst to unused CPU during low-load periods, improving response times.
  • Memory headroom: With 272GB physical host RAM (~160GB allocated to K8s VMs), memory is no longer the primary constraint; ~112GB of headroom remains for new VMs.

Tradeoff: A runaway process could monopolize CPU. Mitigated by CPU requests reserving capacity and PriorityClass preemption.

Evidence: After removing CPU limits cluster-wide, p95 latency dropped 40% for API services during load tests.
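
Throttling itself is visible in the cgroup statistics on a node: a growing nr_throttled counter means the CFS limit is biting. A sketch for cgroup v2; the exact slice path depends on the runtime and pod UID:

# On the node: inspect cpu.stat for a pod's cgroup (path is illustrative)
cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/<pod-slice>/cpu.stat
# nr_periods 1200
# nr_throttled 0        <- should stay at 0 with no CPU limits
# throttled_usec 0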

Why Goldilocks in Initial mode instead of Auto?

Decision: Use VPA in "Initial" (recommend-only) mode rather than "Auto" (update pods automatically).

Rationale:

  • Terraform State Drift: VPA Auto mode directly mutates Deployment specs, creating drift from Terraform-managed state. Next Terraform apply reverts VPA changes.
  • Declarative Workflow: Terraform is the source of truth. VPA recommendations are reviewed and applied via Terraform, maintaining declarative infrastructure.
  • Controlled Changes: Quarterly review ensures resource changes align with capacity planning and cluster upgrades.
  • Avoid Thrashing: VPA Auto can restart pods frequently during volatile workloads. Manual application reduces churn.

Tradeoff: Requires quarterly manual review. Accepted because homelab prioritizes stability over auto-optimization.

Why memory requests = limits for tiers 0-2?

Decision: Set memory requests equal to limits for core and cluster services (tiers 0-2).

Rationale:

  • Guaranteed QoS: Ensures pods are last to be evicted during memory pressure.
  • Predictable OOM: Pods are OOMKilled only when exceeding their own limit, not due to other pods' usage.
  • Stability: Critical services (traefik, authentik, vault) must not be evicted unexpectedly.

Tradeoff: Cannot burst above limit. Accepted because critical services are right-sized via VPA.

Why Burstable QoS for tiers 3-4?

Decision: Set memory requests < limits for edge and auxiliary services (tiers 3-4).

Rationale:

  • Reduced Scheduler Pressure: Lower memory requests allow more pods to fit on nodes.
  • Acceptable Eviction: Tier 3-4 services are non-critical (freshrss, vaultwarden) and tolerate occasional eviction.
  • Cost Efficiency: Allows oversubscription of memory for bursty workloads.

Tradeoff: Pods may be evicted during memory pressure. Accepted because tier 3-4 services have PriorityClass 200K-300K.

Why VPA upperBound * 1.2 (or 1.3)?

Decision: Set memory limits to VPA upperBound * 1.2 for stable services, * 1.3 for GPU/volatile services.

Rationale:

  • Headroom: VPA upperBound is the observed maximum usage. Adding 20-30% headroom prevents OOMKills during traffic spikes.
  • Growth Buffer: Services grow over time (more users, more data). Headroom delays the need for manual intervention.
  • GPU Volatility: GPU workloads (ML inference) have unpredictable memory usage. 30% headroom reduces OOMKills.

Tradeoff: Slightly higher memory allocation. Accepted because 272GB RAM provides ample capacity.

Troubleshooting

Pods stuck in Pending state

Symptom: Pod shows status: Pending with event FailedScheduling.

Diagnosis:

kubectl describe pod <pod-name> -n <namespace>

Common Causes:

  1. ResourceQuota exceeded:

    Error: exceeded quota: <namespace>-quota, requested: requests.memory=2Gi, used: requests.memory=14Gi, limited: requests.memory=16Gi
    

    Fix: Increase ResourceQuota in modules/namespace_config/ for that tier, or reduce other pods' requests.

  2. LimitRange default too high:

    0/5 nodes are available: 5 Insufficient memory.
    

    Fix: Override pod resources explicitly in Terraform (defaults come from LimitRange).

  3. GPU taint not tolerated:

    0/5 nodes are available: 1 node(s) had untolerated taint {nvidia.com/gpu: true}, 4 Insufficient nvidia.com/gpu.
    

    Fix: Add toleration and nodeSelector for GPU pods.

  4. No nodes with GPU:

    0/5 nodes are available: 5 Insufficient nvidia.com/gpu.
    

    Fix: Verify the GPU-carrying node is Ready and has the nvidia.com/gpu.present=true label. Check kubectl get nodes -l nvidia.com/gpu.present=true — if empty, gpu-feature-discovery hasn't labeled any node (operator not running, driver not loaded, or PCI passthrough broken).

Pods OOMKilled repeatedly

Symptom: Container is repeatedly terminated with reason OOMKilled (visible in kubectl describe pod), causing frequent restarts.

Diagnosis:

kubectl describe pod <pod-name> -n <namespace>
kubectl top pod <pod-name> -n <namespace>  # Current usage
kubectl get limitrange -n <namespace> -o yaml  # Check defaults

Common Causes:

  1. Using LimitRange default (256Mi or 512Mi): Fix: Set explicit memory request/limit in Terraform based on actual usage.

  2. Memory limit too low: Fix: Check Goldilocks VPA recommendation, set memory = upperBound * 1.2.

  3. Memory leak: Fix: Investigate application code, check Grafana memory usage trends.

Democratic-CSI sidecars consuming excessive memory

Symptom: Pods with PVCs have 3-4 sidecar containers, each using 256Mi (LimitRange default).

Diagnosis:

kubectl get pods -A -o json | jq '.items[] | select(any(.spec.containers[]; .name | contains("csi"))) | {name: .metadata.name, namespace: .metadata.namespace}'
kubectl top pod <pod-name> -n <namespace> --containers

Fix: Update Terraform to override sidecar resources:

sidecar_resources = {
  requests = {
    memory = "32Mi"
    cpu    = "10m"
  }
  limits = {
    memory = "80Mi"
  }
}

Tier 3-4 pods evicted during resource pressure

Symptom: Lower-tier pods show status: Evicted with reason The node was low on resource: memory.

Diagnosis:

kubectl get events --sort-by='.lastTimestamp' | grep Evicted
kubectl top nodes  # Check node memory usage

Expected Behavior: This is normal. Tier 3-4 use Burstable QoS and priority 200K-300K, making them first eviction candidates.

Fix:

  • If evictions are frequent: Increase node memory or reduce tier 3-4 memory limits
  • If evicted service is critical: Promote to tier 1 or 2
  • If node is overloaded: Check for memory leaks in tier 0-2 services

GPU pods not scheduling on GPU node

Symptom: GPU pod stuck in Pending with event 0/5 nodes are available: 1 node(s) had untolerated taint.

Diagnosis:

kubectl describe node $(kubectl get nodes -l nvidia.com/gpu.present=true -o name) | grep Taints
kubectl describe pod <pod-name> -n <namespace> | grep -A5 Tolerations

Fix: Add GPU toleration and selector to pod spec:

spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: PreferNoSchedule
  nodeSelector:
    nvidia.com/gpu.present: "true"
  containers:
  - name: app
    resources:
      limits:
        nvidia.com/gpu: 1

Node out of memory despite low pod usage

Symptom: Node shows memory pressure, but kubectl top pods shows low usage.

Diagnosis:

# SSH to node
ssh k8s-node2
free -h
ps aux --sort=-%mem | head -20

Common Causes:

  1. Kernel memory: Page cache, slab allocator not shown in kubectl top
  2. System services: kubelet, containerd, systemd-journald
  3. Zombie containers: Old containers not cleaned up

Fix:

# Drop page cache, dentries and inodes (generally safe, expect a brief performance dip)
echo 3 > /proc/sys/vm/drop_caches

# Cleanup stopped containers
crictl rm $(crictl ps -a --state exited -q)

# Restart kubelet (forces cleanup)
systemctl restart kubelet

VPA recommendations not appearing in Goldilocks

Symptom: Goldilocks dashboard shows no recommendations for a service.

Diagnosis:

kubectl get vpa -n <namespace>
kubectl describe vpa <vpa-name> -n <namespace>

Common Causes:

  1. VPA not created: Terraform module missing VPA resource
  2. Insufficient data: VPA needs 24h of metrics before recommending
  3. VPA pod not running: VPA controller/recommender crashed

Fix:

# Check VPA pods
kubectl get pods -n kube-system | grep vpa

# Check VPA logs
kubectl logs -n kube-system deployment/vpa-recommender

# Restart VPA if needed
kubectl rollout restart -n kube-system deployment/vpa-recommender