infra/.claude/reference/proxmox-inventory.md
Viktor Barzin 407b33abd6
resource quota review: fix OOM risks, close quota gaps, add HA protections
Phase 1 - OOM fixes:
- dashy: increase memory limit 512Mi→1Gi (was at 99% utilization)
- caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%)
- mysql-operator: add Helm resource values 256Mi/512Mi, create namespace
  with tier label (was at 92% of LimitRange default)
- prowlarr, flaresolverr, annas-archive-stacks: add explicit resources
  (outgrowing 256Mi LimitRange defaults)
- real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no
  explicit resources)

Phase 2 - Close quota gaps:
- nvidia, real-estate-crawler, trading-bot: remove custom-quota=true
  labels so Kyverno generates tier-appropriate quotas
- descheduler: add tier=1-cluster label for proper classification

Phase 3 - Reduce excessive quotas:
- monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64
- woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16
- GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16

Phase 4 - Kubelet protection:
- Add cpu: 200m to systemReserved and kubeReserved in kubelet template

Phase 5 - HA improvements:
- cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1)
- grafana: add topology spread + PDB via Helm values
- crowdsec LAPI: add topology spread + PDB via Helm values
- authentik server: add topology spread via Helm values
- authentik worker: add topology spread + PDB via Helm values
2026-03-08 18:17:46 +00:00


Proxmox Inventory & Infrastructure

Static reference for VMs, hardware, and network topology.

Proxmox Host Hardware

  • CPU: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket)
  • RAM: 142 GB (Dell R730 server)
  • GPU: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1)
  • Disks: 1.1TB + 931GB + 10.7TB (local storage)
  • Proxmox access: ssh root@192.168.1.127

Network Topology

10.0.10.0/24 - Management: Wizard (10.0.10.10), TrueNAS NFS (10.0.10.15)
10.0.20.0/24 - Kubernetes: pfSense GW (10.0.20.1), Registry (10.0.20.10),
               k8s-master (10.0.20.100), DNS (10.0.20.101), MetalLB (10.0.20.102-200)
192.168.1.0/24 - Physical: Proxmox (192.168.1.127)
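
As a quick sanity check on the layout above, the following minimal sketch classifies an address into one of the three listed subnets and tests whether it falls inside the MetalLB range. The function names `classify` and `in_metallb_pool` are illustrative and not part of the repository; only the CIDRs and the 10.0.20.102-200 range come from this reference.

```python
import ipaddress

# Subnets and MetalLB range copied from the Network Topology listing.
SUBNETS = {
    "management": ipaddress.ip_network("10.0.10.0/24"),
    "kubernetes": ipaddress.ip_network("10.0.20.0/24"),
    "physical": ipaddress.ip_network("192.168.1.0/24"),
}
METALLB_FIRST = ipaddress.ip_address("10.0.20.102")
METALLB_LAST = ipaddress.ip_address("10.0.20.200")


def classify(ip: str) -> str:
    """Return the name of the listed subnet containing ip, or 'unknown'."""
    addr = ipaddress.ip_address(ip)
    for name, net in SUBNETS.items():
        if addr in net:
            return name
    return "unknown"


def in_metallb_pool(ip: str) -> bool:
    """True if ip lies inside the MetalLB range (inclusive on both ends)."""
    addr = ipaddress.ip_address(ip)
    return METALLB_FIRST <= addr <= METALLB_LAST


if __name__ == "__main__":
    for ip in ("10.0.10.15", "10.0.20.100", "10.0.20.150", "192.168.1.127"):
        print(ip, classify(ip), in_metallb_pool(ip))
```

A check like this can help catch a static IP (for example the DNS address 10.0.20.101) being assigned inside the MetalLB pool by mistake.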

Network Bridges

  • vmbr0: Physical bridge on eno1, IP 192.168.1.127/24 — physical/home network
  • vmbr1: Internal-only bridge, VLAN-aware — VLAN 10 (management) and VLAN 20 (kubernetes)

VM Inventory

| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
|------|------|--------|------|-----|---------|------|-------|
| 101 | pfsense | running | 8 | 16GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall |
| 102 | devvm | running | 16 | 8GB | vmbr1:vlan10 | 100G | Development VM |
| 103 | home-assistant | running | 8 | 8GB | vmbr0 | 64G | HA Sofia, net0(vlan10) disabled, SSH: vbarzin@192.168.1.8 |
| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup (unused) |
| 200 | k8s-master | running | 8 | 8GB* | vmbr1:vlan20 | 64G | Control plane (10.0.20.100). *Verify via qm config 200 |
| 201 | k8s-node1 | running | 16 | 16GB* | vmbr1:vlan20 | 256G | GPU node, Tesla T4. *Verify via qm config 201 |
| 202 | k8s-node2 | running | 8 | 24GB* | vmbr1:vlan20 | 256G | Worker. *Inferred from k8s allocatable (~22 GiB) |
| 203 | k8s-node3 | running | 8 | 24GB* | vmbr1:vlan20 | 256G | Worker. *Inferred from k8s allocatable (~22 GiB) |
| 204 | k8s-node4 | running | 8 | 24GB* | vmbr1:vlan20 | 256G | Worker. *Inferred from k8s allocatable (~22 GiB) |
| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM |
| 9000 | truenas | running | 16 | 16GB | vmbr1:vlan10 | 32G+7x256G+1T | NFS (10.0.10.15) |
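
To put the inventory in context against the host hardware, the sketch below sums the vCPU and RAM assignments of the running VMs and compares them with the host's 44 threads and 142 GB. The tuples are transcribed from the table above, including the asterisked RAM values that are unverified or inferred, so the totals are only as reliable as those entries. The helper name `totals` is illustrative.

```python
# (vmid, name, status, vcpus, ram_gb) transcribed from the VM Inventory table.
# RAM for VMIDs 200-204 is marked unverified/inferred in the table.
VMS = [
    (101, "pfsense", "running", 8, 16),
    (102, "devvm", "running", 16, 8),
    (103, "home-assistant", "running", 8, 8),
    (105, "pbs", "stopped", 16, 8),
    (200, "k8s-master", "running", 8, 8),
    (201, "k8s-node1", "running", 16, 16),
    (202, "k8s-node2", "running", 8, 24),
    (203, "k8s-node3", "running", 8, 24),
    (204, "k8s-node4", "running", 8, 24),
    (220, "docker-registry", "running", 4, 4),
    (300, "Windows10", "running", 16, 8),
    (9000, "truenas", "running", 16, 16),
]

HOST_THREADS = 44   # from Proxmox Host Hardware
HOST_RAM_GB = 142   # from Proxmox Host Hardware


def totals(vms, status="running"):
    """Sum (vcpus, ram_gb) over VMs with the given status."""
    selected = [vm for vm in vms if vm[2] == status]
    return sum(vm[3] for vm in selected), sum(vm[4] for vm in selected)


if __name__ == "__main__":
    vcpus, ram_gb = totals(VMS)
    print(f"running VMs: {vcpus} vCPUs, {ram_gb} GB RAM assigned")
    print(f"vCPU overcommit: {vcpus / HOST_THREADS:.2f}x of {HOST_THREADS} threads")
    print(f"RAM assigned vs physical: {ram_gb} GB / {HOST_RAM_GB} GB")
```

With the figures as listed, the running VMs add up to 116 vCPUs and 156 GB of RAM, which is more RAM than the 142 GB the host reports. That may reflect memory overcommit or inaccurate asterisked values, which is one more reason to confirm them with qm config.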

VM Templates

| VMID | Name | Purpose |
|------|------|---------|
| 1000 | ubuntu-2404-cloudinit-non-k8s-template | Base for non-K8s VMs |
| 1001 | docker-registry-template | Docker registry VM |
| 2000 | ubuntu-2404-cloudinit-k8s-template | Base for K8s nodes |

GPU Node (k8s-node1)

  • VMID: 201, PCIe: 0000:06:00.0 (NVIDIA Tesla T4)
  • Taint: nvidia.com/gpu=true:NoSchedule, Label: gpu=true
  • GPU workloads need both node_selector = { "gpu": "true" } and a toleration for the nvidia.com/gpu taint
  • Taint applied via null_resource.gpu_node_taint in modules/kubernetes/nvidia/main.tf
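
The scheduling requirements above can be sketched as a pod spec fragment. This is a minimal illustration written as plain Python dictionaries rather than the repository's Terraform. It assumes the toleration mirrors the taint exactly (key nvidia.com/gpu, value true, effect NoSchedule); the actual toleration used in the modules may differ. The helper `add_gpu_scheduling`, the container name, and the image are hypothetical.

```python
import json

# Label and taint values taken from the GPU Node section.
GPU_NODE_SELECTOR = {"gpu": "true"}

# Assumption: the toleration matches the taint nvidia.com/gpu=true:NoSchedule exactly.
GPU_TOLERATION = {
    "key": "nvidia.com/gpu",
    "operator": "Equal",
    "value": "true",
    "effect": "NoSchedule",
}


def add_gpu_scheduling(pod_spec: dict) -> dict:
    """Return a copy of pod_spec with the GPU node selector and toleration added."""
    spec = dict(pod_spec)
    spec["nodeSelector"] = {**pod_spec.get("nodeSelector", {}), **GPU_NODE_SELECTOR}
    tolerations = list(pod_spec.get("tolerations", []))
    if GPU_TOLERATION not in tolerations:  # avoid duplicates on repeated calls
        tolerations.append(GPU_TOLERATION)
    spec["tolerations"] = tolerations
    return spec


if __name__ == "__main__":
    # Hypothetical container, for illustration only.
    base = {"containers": [{"name": "cuda-job", "image": "example/cuda:latest"}]}
    print(json.dumps(add_gpu_scheduling(base), indent=2))
```

Both parts are needed: the node selector steers the pod to k8s-node1, while the toleration lets it pass the NoSchedule taint that keeps non-GPU workloads off that node.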