Phase 1 - OOM fixes:
- dashy: increase memory limit 512Mi→1Gi (was at 99% utilization)
- caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%)
- mysql-operator: add Helm resource values 256Mi/512Mi, create namespace with tier label (was at 92% of LimitRange default)
- prowlarr, flaresolverr, annas-archive-stacks: add explicit resources (outgrowing 256Mi LimitRange defaults)
- real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no explicit resources)

Phase 2 - Close quota gaps:
- nvidia, real-estate-crawler, trading-bot: remove custom-quota=true labels so Kyverno generates tier-appropriate quotas
- descheduler: add tier=1-cluster label for proper classification

Phase 3 - Reduce excessive quotas:
- monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64
- woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16
- GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16

Phase 4 - Kubelet protection:
- Add cpu: 200m to systemReserved and kubeReserved in kubelet template

Phase 5 - HA improvements:
- cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1)
- grafana: add topology spread + PDB via Helm values
- crowdsec LAPI: add topology spread + PDB via Helm values
- authentik server: add topology spread via Helm values
- authentik worker: add topology spread + PDB via Helm values

# Proxmox Inventory & Infrastructure

> Static reference for VMs, hardware, and network topology.

## Proxmox Host Hardware
- **CPU**: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket)
- **RAM**: 142 GB (Dell R730 server)
- **GPU**: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1)
- **Disks**: 1.1TB + 931GB + 10.7TB (local storage)
- **Proxmox access**: `ssh root@192.168.1.127`

## Network Topology
```
10.0.10.0/24 - Management: Wizard (10.0.10.10), TrueNAS NFS (10.0.10.15)
10.0.20.0/24 - Kubernetes: pfSense GW (10.0.20.1), Registry (10.0.20.10),
               k8s-master (10.0.20.100), DNS (10.0.20.101), MetalLB (10.0.20.102-200)
192.168.1.0/24 - Physical: Proxmox (192.168.1.127)
```

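
The subnet layout in the Network Topology listing can be checked programmatically. The following is a minimal sketch, not part of the repository: the `SUBNETS` dictionary and the helper names `classify` and `metallb_pool_size` are made up for illustration, and only the addresses come from the listing.

```python
import ipaddress

# Subnets copied from the Network Topology listing; the dictionary layout
# and the short names are assumptions made for this example.
SUBNETS = {
    "management": ipaddress.ip_network("10.0.10.0/24"),
    "kubernetes": ipaddress.ip_network("10.0.20.0/24"),
    "physical": ipaddress.ip_network("192.168.1.0/24"),
}


def classify(ip: str) -> str:
    """Return the name of the subnet that contains `ip`, or "unknown"."""
    addr = ipaddress.ip_address(ip)
    for name, net in SUBNETS.items():
        if addr in net:
            return name
    return "unknown"


def metallb_pool_size(first: str = "10.0.20.102", last: str = "10.0.20.200") -> int:
    """Count the addresses in the inclusive MetalLB range."""
    return int(ipaddress.ip_address(last)) - int(ipaddress.ip_address(first)) + 1


if __name__ == "__main__":
    for ip in ("10.0.10.15", "10.0.20.100", "192.168.1.127"):
        print(ip, "->", classify(ip))
    print("MetalLB pool size:", metallb_pool_size())
```

A helper like this can be handy when adding a new static IP, to confirm it falls in the intended subnet and outside the MetalLB range.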
## Network Bridges
- **vmbr0**: Physical bridge on `eno1`, IP `192.168.1.127/24` (physical/home network)
- **vmbr1**: Internal-only bridge, VLAN-aware; carries VLAN 10 (management) and VLAN 20 (kubernetes)

## VM Inventory

| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
|------|------|--------|------|-----|---------|------|-------|
| 101 | pfsense | running | 8 | 16GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall |
| 102 | devvm | running | 16 | 8GB | vmbr1:vlan10 | 100G | Development VM |
| 103 | home-assistant | running | 8 | 8GB | vmbr0 | 64G | HA Sofia, net0(vlan10) disabled, SSH: vbarzin@192.168.1.8 |
| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup (unused) |
| 200 | k8s-master | running | 8 | 8GB* | vmbr1:vlan20 | 64G | Control plane (10.0.20.100). *Verify via `qm config 200` |
| 201 | k8s-node1 | running | 16 | 16GB* | vmbr1:vlan20 | 256G | GPU node, Tesla T4. *Verify via `qm config 201` |
| 202 | k8s-node2 | running | 8 | 24GB* | vmbr1:vlan20 | 256G | Worker. *Inferred from k8s allocatable (~22 GiB) |
| 203 | k8s-node3 | running | 8 | 24GB* | vmbr1:vlan20 | 256G | Worker. *Inferred from k8s allocatable (~22 GiB) |
| 204 | k8s-node4 | running | 8 | 24GB* | vmbr1:vlan20 | 256G | Worker. *Inferred from k8s allocatable (~22 GiB) |
| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM |
| 9000 | truenas | running | 16 | 16GB | vmbr1:vlan10 | 32G+7x256G+1T | NFS (10.0.10.15) |

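
To see how the inventory compares with the host (44 threads, 142 GB RAM), the per-VM figures can be summed. This is a rough sketch: the tuple layout and the `totals` helper are assumptions made for illustration, and because several RAM values in the table are marked with `*` as unverified or inferred, the result is an estimate only.

```python
# (vmid, name, status, vcpus, ram_gb) copied from the VM Inventory table.
# RAM values marked with * in the table are unverified, so totals are estimates.
VMS = [
    (101, "pfsense", "running", 8, 16),
    (102, "devvm", "running", 16, 8),
    (103, "home-assistant", "running", 8, 8),
    (105, "pbs", "stopped", 16, 8),
    (200, "k8s-master", "running", 8, 8),
    (201, "k8s-node1", "running", 16, 16),
    (202, "k8s-node2", "running", 8, 24),
    (203, "k8s-node3", "running", 8, 24),
    (204, "k8s-node4", "running", 8, 24),
    (220, "docker-registry", "running", 4, 4),
    (300, "Windows10", "running", 16, 8),
    (9000, "truenas", "running", 16, 16),
]

# Host figures from the "Proxmox Host Hardware" section.
HOST_THREADS = 44
HOST_RAM_GB = 142


def totals(vms, status="running"):
    """Sum vCPUs and RAM (GB) over all VMs with the given status."""
    selected = [vm for vm in vms if vm[2] == status]
    vcpus = sum(vm[3] for vm in selected)
    ram_gb = sum(vm[4] for vm in selected)
    return vcpus, ram_gb


if __name__ == "__main__":
    vcpus, ram_gb = totals(VMS)
    print(f"vCPUs on running VMs: {vcpus} ({vcpus / HOST_THREADS:.2f}x host threads)")
    print(f"RAM on running VMs: {ram_gb} GB (host has {HOST_RAM_GB} GB)")
```

With the table as written, running VMs add up to 116 vCPUs and 156 GB of RAM. The RAM sum exceeds the 142 GB listed for the host, which may simply reflect the unverified `*` entries; checking them with `qm config <vmid>` would settle it.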
## VM Templates
| VMID | Name | Purpose |
|------|------|---------|
| 1000 | ubuntu-2404-cloudinit-non-k8s-template | Base for non-K8s VMs |
| 1001 | docker-registry-template | Docker registry VM |
| 2000 | ubuntu-2404-cloudinit-k8s-template | Base for K8s nodes |

## GPU Node (k8s-node1)
- **VMID**: 201, **PCIe**: `0000:06:00.0` (NVIDIA Tesla T4)
- **Taint**: `nvidia.com/gpu=true:NoSchedule`, **Label**: `gpu=true`
- GPU workloads need: `node_selector = { "gpu": "true" }` + nvidia toleration
- Taint applied via `null_resource.gpu_node_taint` in `modules/kubernetes/nvidia/main.tf`

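
The scheduling requirements for GPU workloads can be sketched as plain pod-spec fields. The toleration below is an assumption derived from the taint `nvidia.com/gpu=true:NoSchedule`; the repository's actual toleration may be written differently (for example with `operator: Exists`). The function names are hypothetical, and `tolerates` is a simplified version of the Kubernetes matching rule for a single taint and toleration.

```python
def gpu_scheduling_fields() -> dict:
    """Pod-spec fields intended to place a workload on the tainted GPU node."""
    return {
        # Matches the node label gpu=true.
        "nodeSelector": {"gpu": "true"},
        # Assumed toleration for the taint nvidia.com/gpu=true:NoSchedule.
        "tolerations": [
            {
                "key": "nvidia.com/gpu",
                "operator": "Equal",
                "value": "true",
                "effect": "NoSchedule",
            }
        ],
    }


def tolerates(taint: dict, toleration: dict) -> bool:
    """Simplified check of whether one toleration matches one taint."""
    # An empty or missing effect on the toleration matches any taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An empty key with Exists matches every taint key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value") == taint.get("value")
    )


GPU_TAINT = {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}

if __name__ == "__main__":
    fields = gpu_scheduling_fields()
    ok = any(tolerates(GPU_TAINT, t) for t in fields["tolerations"])
    print("GPU taint tolerated:", ok)
```

A pod without a matching toleration is kept off k8s-node1 by the taint, while a pod with the toleration but no node selector may still land on a non-GPU worker, which is why both fields are needed.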