resource quota review: fix OOM risks, close quota gaps, add HA protections

Phase 1 - OOM fixes:
- dashy: increase memory limit 512Mi→1Gi (was at 99% utilization)
- caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%)
- mysql-operator: add Helm resource values 256Mi/512Mi, create namespace
  with tier label (was at 92% of LimitRange default)
- prowlarr, flaresolverr, annas-archive-stacks: add explicit resources
  (outgrowing 256Mi LimitRange defaults)
- real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no
  explicit resources)

Phase 2 - Close quota gaps:
- nvidia, real-estate-crawler, trading-bot: remove custom-quota=true
  labels so Kyverno generates tier-appropriate quotas
- descheduler: add tier=1-cluster label for proper classification

Phase 3 - Reduce excessive quotas:
- monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64
- woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16
- GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16

Phase 4 - Kubelet protection:
- Add cpu: 200m to systemReserved and kubeReserved in kubelet template

Phase 5 - HA improvements:
- cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1)
- grafana: add topology spread + PDB via Helm values
- crowdsec LAPI: add topology spread + PDB via Helm values
- authentik server: add topology spread via Helm values
- authentik worker: add topology spread + PDB via Helm values

2026-03-08 18:17:46 +00:00


Proxmox Inventory & Infrastructure

Static reference for VMs, hardware, and network topology.

Proxmox Host Hardware

CPU: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket)
Server: Dell R730
RAM: 142 GB
GPU: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1)
Disks: 1.1TB + 931GB + 10.7TB (local storage)
Proxmox access: ssh root@192.168.1.127

Network Topology

10.0.10.0/24 - Management: Wizard (10.0.10.10), TrueNAS NFS (10.0.10.15)
10.0.20.0/24 - Kubernetes: pfSense GW (10.0.20.1), Registry (10.0.20.10),
               k8s-master (10.0.20.100), DNS (10.0.20.101), MetalLB (10.0.20.102-200)
192.168.1.0/24 - Physical: Proxmox (192.168.1.127)
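
As a quick sanity check on the addressing above, the following sketch uses Python's standard `ipaddress` module to confirm that each listed host falls inside its subnet and to count the addresses in the MetalLB pool. The dictionary layout and the function names are illustrative only and are not part of the repo.

```python
import ipaddress

# Subnets and hosts copied from the topology list above.
SUBNETS = {
    "management": ("10.0.10.0/24", {
        "Wizard": "10.0.10.10",
        "TrueNAS NFS": "10.0.10.15",
    }),
    "kubernetes": ("10.0.20.0/24", {
        "pfSense GW": "10.0.20.1",
        "Registry": "10.0.20.10",
        "k8s-master": "10.0.20.100",
        "DNS": "10.0.20.101",
    }),
    "physical": ("192.168.1.0/24", {"Proxmox": "192.168.1.127"}),
}


def misplaced_hosts(subnets):
    """Return (name, ip, cidr) for every host outside its declared subnet."""
    bad = []
    for cidr, hosts in subnets.values():
        net = ipaddress.ip_network(cidr)
        for name, ip in hosts.items():
            if ipaddress.ip_address(ip) not in net:
                bad.append((name, ip, cidr))
    return bad


def pool_size(first, last):
    """Number of addresses in an inclusive range such as the MetalLB pool."""
    return int(ipaddress.ip_address(last)) - int(ipaddress.ip_address(first)) + 1


metallb_count = pool_size("10.0.20.102", "10.0.20.200")
print(misplaced_hosts(SUBNETS))  # an empty list means every host is in range
print(metallb_count)
```

The same check can be rerun whenever an address in this list changes, which helps keep the static reference from drifting.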

Network Bridges

vmbr0: Physical bridge on eno1, IP 192.168.1.127/24 — physical/home network
vmbr1: Internal-only bridge, VLAN-aware — VLAN 10 (management) and VLAN 20 (kubernetes)

VM Inventory

VMID	Name	Status	CPUs	RAM	Network	Disk	Notes
101	pfsense	running	8	16GB	vmbr0, vmbr1:vlan10, vmbr1:vlan20	32G	Gateway/firewall
102	devvm	running	16	8GB	vmbr1:vlan10	100G	Development VM
103	home-assistant	running	8	8GB	vmbr0	64G	HA Sofia, net0(vlan10) disabled, SSH: vbarzin@192.168.1.8
105	pbs	stopped	16	8GB	vmbr1:vlan10	32G	Proxmox Backup (unused)
200	k8s-master	running	8	8GB*	vmbr1:vlan20	64G	Control plane (10.0.20.100). *Verify via `qm config 200`
201	k8s-node1	running	16	16GB*	vmbr1:vlan20	256G	GPU node, Tesla T4. *Verify via `qm config 201`
202	k8s-node2	running	8	24GB*	vmbr1:vlan20	256G	Worker. *Inferred from k8s allocatable (~22 GiB)
203	k8s-node3	running	8	24GB*	vmbr1:vlan20	256G	Worker. *Inferred from k8s allocatable (~22 GiB)
204	k8s-node4	running	8	24GB*	vmbr1:vlan20	256G	Worker. *Inferred from k8s allocatable (~22 GiB)
220	docker-registry	running	4	4GB	vmbr1:vlan20	64G	MAC DE:AD:BE:EF:22:22 (10.0.20.10)
300	Windows10	running	16	8GB	vmbr0	100G	Windows VM
9000	truenas	running	16	16GB	vmbr1:vlan10	32G+7x256G+1T	NFS (10.0.10.15)
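
To see how the assigned resources compare with the host hardware listed earlier, the table can be totaled with a few lines of Python. This is only a sketch: the rows are copied from the table as written, the RAM figures for VMIDs 200-204 are the starred (unverified or inferred) values, and the tuple layout and names are hypothetical.

```python
# (vmid, name, status, vcpus, ram_gb) copied from the inventory table.
# RAM for VMIDs 200-204 is starred in the table, i.e. not yet verified.
VMS = [
    (101, "pfsense", "running", 8, 16),
    (102, "devvm", "running", 16, 8),
    (103, "home-assistant", "running", 8, 8),
    (105, "pbs", "stopped", 16, 8),
    (200, "k8s-master", "running", 8, 8),
    (201, "k8s-node1", "running", 16, 16),
    (202, "k8s-node2", "running", 8, 24),
    (203, "k8s-node3", "running", 8, 24),
    (204, "k8s-node4", "running", 8, 24),
    (220, "docker-registry", "running", 4, 4),
    (300, "Windows10", "running", 16, 8),
    (9000, "truenas", "running", 16, 16),
]

HOST_THREADS = 44   # from "Proxmox Host Hardware"
HOST_RAM_GB = 142   # from "Proxmox Host Hardware"


def totals(vms, only_running=True):
    """Sum vCPUs and RAM, optionally skipping VMs that are not running."""
    rows = [v for v in vms if v[2] == "running" or not only_running]
    return sum(v[3] for v in rows), sum(v[4] for v in rows)


vcpus, ram_gb = totals(VMS)
print(f"vCPU: {vcpus} assigned vs {HOST_THREADS} host threads")
print(f"RAM:  {ram_gb} GB assigned vs {HOST_RAM_GB} GB host RAM")
```

If the table values are accurate, the running VMs are assigned more memory than the host physically has, so the starred entries are worth confirming with `qm config`.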

VM Templates

VMID	Name	Purpose
1000	ubuntu-2404-cloudinit-non-k8s-template	Base for non-K8s VMs
1001	docker-registry-template	Docker registry VM
2000	ubuntu-2404-cloudinit-k8s-template	Base for K8s nodes

GPU Node (k8s-node1)

VMID: 201, PCIe: 0000:06:00.0 (NVIDIA Tesla T4)
Taint: nvidia.com/gpu=true:NoSchedule, Label: gpu=true
GPU workloads need both: node_selector = { "gpu": "true" } and a toleration for the nvidia.com/gpu taint
Taint applied via null_resource.gpu_node_taint in modules/kubernetes/nvidia/main.tf
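
The two requirements above (node selector plus toleration) can be checked against a pod spec with a small script. The sketch below is a simplified reimplementation of the Kubernetes taint/toleration matching rule, not code from this repo; the function names are hypothetical, and the toleration shown is assumed to mirror the taint exactly, since the document does not spell it out.

```python
# Taint as documented for k8s-node1.
TAINT = {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}


def tolerates(toleration, taint):
    """Simplified Kubernetes rule for whether a toleration matches a taint."""
    op = toleration.get("operator", "Equal")  # Equal is the default operator
    key = toleration.get("key", "")
    # A toleration with no effect matches taints of any effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if op == "Exists":
        # An empty key with Exists tolerates every taint.
        return key == "" or key == taint["key"]
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def meets_gpu_requirements(pod_spec):
    """True when the spec has both the gpu=true selector and a matching toleration."""
    selects = pod_spec.get("nodeSelector", {}).get("gpu") == "true"
    tolerated = any(tolerates(t, TAINT) for t in pod_spec.get("tolerations", []))
    return selects and tolerated


# Assumed shape of a GPU workload's pod spec (illustrative only).
gpu_pod = {
    "nodeSelector": {"gpu": "true"},
    "tolerations": [
        {"key": "nvidia.com/gpu", "operator": "Equal",
         "value": "true", "effect": "NoSchedule"},
    ],
}

print(meets_gpu_requirements(gpu_pod))
```

A spec with only the selector stays Pending because the taint repels it, while a spec with only the toleration may land on a non-GPU node, which is why both are listed as required.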
