The infrastructure runs on a single Dell R730 server with Proxmox VE, hosting a 7-node Kubernetes cluster. Compute resources are managed through a combination of Vertical Pod Autoscaler (VPA) recommendations, tier-based LimitRange defaults, and ResourceQuota enforcement. The cluster employs a no-CPU-limits policy to avoid CFS throttling while using memory requests=limits for stability. GPU workloads run on a dedicated node with Tesla T4 passthrough.
> Re-adoption into TF (via the `bpg/proxmox` provider, which models
> dynamic disks correctly) is possible but not scheduled — the
> cloud-init template above already captures the
> bootstrap-reproducibility goal.
### GPU Passthrough
| Parameter | Value |
|-----------|-------|
| Device | NVIDIA Tesla T4 (16GB GDDR6) |
| PCIe Address | 0000:06:00.0 |
| Assigned VM | VMID 201 (k8s-node1) — physical location only, no Terraform pin |
| Node Label | `nvidia.com/gpu.present=true` (auto-applied by gpu-feature-discovery; also `feature.node.kubernetes.io/pci-10de.present=true` from NFD) |
| Node Taint | `nvidia.com/gpu=true:PreferNoSchedule` (applied by `null_resource.gpu_node_config` to every NFD-tagged GPU node) |

| Mechanism | Setting | Purpose |
|-----------|---------|---------|
| PriorityClass | Per tier (200K-900K) | Pod preemption during resource pressure |
| QoS Class | Guaranteed (0-2), Burstable (3-4) | Eviction order |
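
The per-tier PriorityClass row can be illustrated with a minimal sketch of one such object. The name `tier-0-core` is hypothetical, and mapping the top of the 200K-900K range to tier 0 is an assumption; the actual per-tier values are not listed here.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tier-0-core          # hypothetical name
value: 900000                # assumed: top of the 200K-900K range for tier 0
globalDefault: false         # pods must opt in explicitly
description: "Core services; preempt lower tiers under resource pressure."
```

A pod would opt in by setting `priorityClassName: tier-0-core` in its spec.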
## How It Works
### CPU Resource Management
**Policy**: No CPU limits cluster-wide, only CPU requests.
**Rationale**: With a CPU limit set, Linux CFS (Completely Fair Scheduler) bandwidth control throttles a container as soon as it uses up its quota for the current period, even when the node has idle CPU, causing artificial performance degradation. By setting only CPU requests, containers can burst into unused CPU capacity.
**Implementation**:
- All pods set `resources.requests.cpu` (reserves capacity)
- No pods set `resources.limits.cpu`
- Scheduler uses CPU requests for bin-packing
- Under contention, the kernel CFS divides CPU time in proportion to requests; idle CPU is available to any container
**Example**:
```yaml
resources:
  requests:
    cpu: "500m"
  # No limits.cpu: the container can burst into idle CPU
```
### Memory Resource Management
**Policy**: Memory requests = limits for stability.
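**Rationale**: Memory is not compressible like CPU: a container that overshoots cannot be throttled, only killed. A pod using more than its memory request is an early candidate for eviction under node memory pressure, which makes OOM behavior unpredictable. Setting requests=limits ensures: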
- Predictable memory allocation
- QoS class "Guaranteed" (tiers 0-2) or "Burstable" (tiers 3-4)
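
A minimal sketch of the memory policy combined with the CPU policy above; the values are illustrative, not taken from any real service in this cluster.

```yaml
resources:
  requests:
    cpu: "250m"        # illustrative value
    memory: "512Mi"    # illustrative value
  limits:
    memory: "512Mi"    # equal to the memory request; no limits.cpu
```

Kubernetes records the class it actually assigned in `status.qosClass`, which can be read with `kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.qosClass}'`. Upstream Kubernetes assigns Guaranteed only when CPU requests and limits are also set and equal, so it is worth confirming the reported class rather than assuming it.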
### Why no CPU limits?
**Decision**: Set CPU requests but never set CPU limits.
**Rationale**:
- **CFS Throttling**: With a CPU limit set, the Linux Completely Fair Scheduler throttles a container once it exhausts its quota for the period, even when CPU is idle. This causes artificial performance degradation.
- **Burstability**: Services can burst to unused CPU during low-load periods, improving response times.
- **Memory-bound**: The host has 272GB of physical RAM, of which ~160GB is allocated to K8s VMs, so memory is no longer the primary constraint. About 112GB of headroom remains for new VMs.
**Tradeoff**: A runaway process could monopolize CPU. Mitigated by CPU requests reserving capacity and PriorityClass preemption.
**Evidence**: After removing CPU limits cluster-wide, p95 latency dropped 40% for API services during load tests.
### Why Goldilocks in Initial mode instead of Auto?
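**Decision**: Use VPA in "Initial" mode, treating its output as recommendations to review, rather than "Auto" (update pods automatically). Note that upstream VPA applies "Initial" recommendations when a pod is created and never touches running pods; "Off" is the strictly recommend-only mode.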
**Rationale**:
- **Terraform State Drift**: VPA Auto mode changes resource requests on live workloads outside Terraform, creating drift from Terraform-managed state. The next Terraform apply reverts VPA changes.
- **Declarative Workflow**: Terraform is the source of truth. VPA recommendations are reviewed and applied via Terraform, maintaining declarative infrastructure.
- **Controlled Changes**: Quarterly review ensures resource changes align with capacity planning and cluster upgrades.
- **Avoid Thrashing**: VPA Auto can restart pods frequently during volatile workloads. Manual application reduces churn.
**Tradeoff**: Requires quarterly manual review. Accepted because the homelab prioritizes stability over auto-optimization.
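
For reference, a VPA object in this mode might look like the following minimal sketch. The `example-app` names are hypothetical, and Goldilocks typically creates VPA objects itself for labeled namespaces, so this is illustrative rather than something to apply by hand.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-app            # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app          # hypothetical Deployment
  updatePolicy:
    updateMode: "Initial"      # never evicts running pods; "Off" only records recommendations
```

The recommendation, including `upperBound`, appears under `status.recommendation.containerRecommendations` and can be inspected with `kubectl describe vpa example-app -n <namespace>`.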
### Why memory requests = limits for tiers 0-2?
**Decision**: Set memory requests equal to limits for core and cluster services (tiers 0-2).
**Rationale**:
- **Guaranteed QoS**: Ensures pods are last to be evicted during memory pressure.
- **Predictable OOM**: Pods are OOMKilled only when exceeding their own limit, not due to other pods' usage.
- **Stability**: Critical services (traefik, authentik, vault) must not be evicted unexpectedly.
**Tradeoff**: Cannot burst above limit. Accepted because critical services are right-sized via VPA.
### Why Burstable QoS for tiers 3-4?
**Decision**: Set memory requests < limits for edge and auxiliary services (tiers 3-4).
**Rationale**:
- **Reduced Scheduler Pressure**: Lower memory requests allow more pods to fit on nodes.
- **Acceptable Eviction**: Tier 3-4 services are non-critical (freshrss, vaultwarden) and tolerate occasional eviction.
- **Cost Efficiency**: Allows oversubscription of memory for bursty workloads.
**Tradeoff**: Pods may be evicted during memory pressure. Accepted because tier 3-4 services have PriorityClass 200K-300K.
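
A minimal sketch of what a tier 3-4 container spec could look like under this decision; the numbers are illustrative only.

```yaml
resources:
  requests:
    cpu: "50m"         # illustrative value
    memory: "128Mi"    # what the scheduler reserves on the node
  limits:
    memory: "256Mi"    # ceiling; exceeding it gets the container OOMKilled
```

The gap between request and limit is the oversubscribed portion: the scheduler only accounts for the 128Mi request, so usage above it is what puts the pod at risk of eviction under memory pressure.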
### Why VPA upperBound * 1.2 (or 1.3)?
**Decision**: Set memory limits to VPA upperBound * 1.2 for stable services, * 1.3 for GPU/volatile services.
**Rationale**:
- **Headroom**: VPA upperBound is the recommender's high-end estimate of what the container needs, based on observed usage. Adding 20-30% headroom helps prevent OOMKills during traffic spikes.
- **Growth Buffer**: Services grow over time (more users, more data). Headroom delays the need for manual intervention.
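
As a worked example of the multiplier, assume a hypothetical stable service whose VPA recommendation reports a memory `upperBound` of 400Mi (an illustrative figure, not a measured one). Applying the 1.2 factor gives 480Mi; a volatile or GPU service with the same `upperBound` would get 400Mi * 1.3 = 520Mi.

```yaml
# Hypothetical stable service; VPA memory upperBound assumed to be 400Mi
resources:
  requests:
    cpu: "100m"        # illustrative; no limits.cpu per the CPU policy
    memory: "480Mi"    # 400Mi * 1.2
  limits:
    memory: "480Mi"    # equals the request (tiers 0-2)
```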
**Fix**: Increase ResourceQuota in `modules/namespace_config/` for that tier, or reduce other pods' requests.
2. **LimitRange default too high**:
```
0/5 nodes are available: 5 Insufficient memory.
```
**Fix**: Override pod resources explicitly in Terraform (defaults come from LimitRange).
3. **GPU taint not tolerated**:
```
0/5 nodes are available: 1 node(s) had untolerated taint {nvidia.com/gpu: true}, 4 Insufficient nvidia.com/gpu.
```
**Fix**: Add toleration and nodeSelector for GPU pods.
4. **No nodes with GPU**:
```
0/5 nodes are available: 5 Insufficient nvidia.com/gpu.
```
**Fix**: Verify the GPU-carrying node is Ready and has the `nvidia.com/gpu.present=true` label. Check `kubectl get nodes -l nvidia.com/gpu.present=true` — if empty, gpu-feature-discovery hasn't labeled any node (operator not running, driver not loaded, or PCI passthrough broken).
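
The toleration and nodeSelector fix for GPU pods can be sketched as a complete Pod manifest. The pod name, namespace, image, and resource values are hypothetical; the taint key and node label are the ones listed in the GPU Passthrough table.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example                  # hypothetical name
  namespace: default                 # assumption; use the workload's namespace
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"   # label applied by gpu-feature-discovery
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "true"                  # effect omitted, so any effect on this key is tolerated
  containers:
    - name: app
      image: registry.example.com/gpu-app:latest   # hypothetical image
      resources:
        requests:
          cpu: "500m"
          memory: "2Gi"
        limits:
          memory: "2Gi"
          nvidia.com/gpu: 1          # extended resource; the request defaults to the limit
```

Leaving `effect` unset is a deliberate choice in this sketch: the toleration then matches the taint whether it is applied as `PreferNoSchedule` or `NoSchedule`.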
### Pods OOMKilled repeatedly
**Symptom**: Pod shows `OOMKilled` as the container's last termination reason (visible in `kubectl describe pod`) and restarts frequently.
**Diagnosis**:
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl top pod <pod-name> -n <namespace> # Current usage
kubectl get limitrange -n <namespace> -o yaml # Check defaults
```
**Common Causes**:
1. **Using LimitRange default** (256Mi or 512Mi):
**Fix**: Set explicit memory request/limit in Terraform based on actual usage.