From 2d8aa5ed890889351840f7b01c8881f6c4893b0e Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Thu, 2 Apr 2026 00:48:13 +0300 Subject: [PATCH] docs: update hardware inventory for R730 RAM upgrade to 272GB Upgraded from 144GB (4x32G + 2x8G) to 272GB (8x32G + 2x8G) DDR4-2400. Added physical DIMM slot diagram, channel layout, and BIOS speed override notes. Updated compute architecture with correct CPU (single socket), VM memory values, and capacity figures. --- .claude/reference/proxmox-inventory.md | 87 +++++++++-- docs/architecture/compute.md | 24 +-- docs/runbooks/r730-ram-upgrade-272gb.md | 188 ++++++++++++++++++++++++ 3 files changed, 277 insertions(+), 22 deletions(-) create mode 100644 docs/runbooks/r730-ram-upgrade-272gb.md diff --git a/.claude/reference/proxmox-inventory.md b/.claude/reference/proxmox-inventory.md index 46caee0c..8eeb56ea 100644 --- a/.claude/reference/proxmox-inventory.md +++ b/.claude/reference/proxmox-inventory.md @@ -3,12 +3,77 @@ > Static reference for VMs, hardware, and network topology. 
## Proxmox Host Hardware -- **CPU**: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket) -- **RAM**: 142 GB (Dell R730 server) +- **Model**: Dell R730 +- **CPU**: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket, CPU2 unpopulated) +- **RAM**: 272 GB DDR4-2400 ECC RDIMM (10 DIMMs, see Memory Layout below) - **GPU**: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1) -- **Disks**: 1.1TB + 931GB + 10.7TB (local storage) +- **iDRAC**: 192.168.1.4 (root/calvin) +- **Disks**: 1.1TB RAID1 SAS (unused) + 931GB Samsung SSD + 10.7TB RAID1 HDD - **Proxmox access**: `ssh root@192.168.1.127` +## Memory Layout (updated 2026-04-01) + +### Physical DIMM Slot Map + +``` +╔══════════════════════════════════════════════════════════════════════════════╗ +║ CPU1 DIMM SLOTS ║ +║ ║ +║ ┌─── WHITE (1st per channel) ───┐ ║ +║ │ │ ║ +║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║ +║ │ │ A1 │ │ A2 │ │ A3 │ │ A4 │ ║ +║ │ │ 32G │ │ 32G │ │ 32G │ │ 32G │ Samsung M393A4K40BB1-CRC (2R) ║ +║ │ │██████│ │██████│ │██████│ │██████│ ║ +║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║ +║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║ +║ └────────────────────────────────┘ ║ +║ ║ +║ ┌─── BLACK (2nd per channel) ───┐ ║ +║ │ │ ║ +║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║ +║ │ │ A5 │ │ A6 │ │ A7 │ │ A8 │ ║ +║ │ │ 32G │ │ 32G │ │ 32G │ │ 32G │ Samsung M393A4K40CB1-CRC (2R) ║ +║ │ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ ║ +║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║ +║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║ +║ └────────────────────────────────┘ ║ +║ ║ +║ ┌─── GREEN (3rd per channel) ───┐ ║ +║ │ │ ║ +║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║ +║ │ │ A9 │ │ A10 │ │ A11 │ │ A12 │ ║ +║ │ │ │ │ │ │ 8G │ │ 8G │ SK Hynix HMA81GR7AFR8N-UH (1R) ║ +║ │ │ empty│ │ empty│ │░░░░░░│ │░░░░░░│ ║ +║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║ +║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║ +║ └────────────────────────────────┘ ║ +║ ║ +║ B1-B12: All empty (requires CPU2) ║ +║ ║ +║ Legend: ██ = Samsung BB1 32G ▓▓ = Samsung CB1 32G ░░ = Hynix 8G ║ 
+╚══════════════════════════════════════════════════════════════════════════════╝ +``` + +### Channel Summary + +``` +Channel 0: A1 [32G] ──── A5 [32G] ──── A9 [ ] = 64 GB ✓ matched +Channel 1: A2 [32G] ──── A6 [32G] ──── A10[ ] = 64 GB ✓ matched +Channel 2: A3 [32G] ──── A7 [32G] ──── A11[ 8G ] = 72 GB ~ +8G bonus +Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB ~ +8G bonus + ───────── ───────── ────────── + WHITE BLACK GREEN TOTAL: 272 GB +``` + +### DIMM Details + +- **A1-A4**: Samsung M393A4K40BB1-CRC 32GB DDR4-2400 ECC RDIMM (2-rank, original) +- **A5-A8**: Samsung M393A4K40CB1-CRC 32GB DDR4-2400 ECC RDIMM (2-rank, added 2026-04-01) +- **A11-A12**: SK Hynix HMA81GR7AFR8N-UH 8GB DDR4-2400 ECC RDIMM (1-rank, relocated from A5/A6) +- **A9-A10, B1-B12**: Empty (B-side requires CPU2) +- **Speed**: 2400 MHz (BIOS override — 3 DPC defaults to 1866 MHz, forced to 2400 via System BIOS > Memory Settings > Memory Frequency) + ## Network Topology ``` 10.0.10.0/24 - Management: Wizard (10.0.10.10), TrueNAS NFS (10.0.10.15) @@ -25,18 +90,20 @@ | VMID | Name | Status | CPUs | RAM | Network | Disk | Notes | |------|------|--------|------|-----|---------|------|-------| -| 101 | pfsense | running | 8 | 16GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall | +| 101 | pfsense | running | 8 | 4GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall | | 102 | devvm | running | 16 | 8GB | vmbr1:vlan10 | 100G | Development VM | | 103 | home-assistant | running | 8 | 8GB | vmbr0 | 64G | HA Sofia, net0(vlan10) disabled, SSH: vbarzin@192.168.1.8 | | 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup (unused) | -| 200 | k8s-master | running | 8 | 8GB* | vmbr1:vlan20 | 64G | Control plane (10.0.20.100). *Verify via `qm config 200` | -| 201 | k8s-node1 | running | 16 | 16GB* | vmbr1:vlan20 | 256G | GPU node, Tesla T4. *Verify via `qm config 201` | -| 202 | k8s-node2 | running | 8 | 24GB* | vmbr1:vlan20 | 256G | Worker. 
*Inferred from k8s allocatable (~22 GiB) |
-| 203 | k8s-node3 | running | 8 | 24GB* | vmbr1:vlan20 | 256G | Worker. *Inferred from k8s allocatable (~22 GiB) |
-| 204 | k8s-node4 | running | 8 | 24GB* | vmbr1:vlan20 | 256G | Worker. *Inferred from k8s allocatable (~22 GiB) |
+| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
+| 201 | k8s-node1 | running | 16 | 32GB | vmbr1:vlan20 | 256G | GPU node, Tesla T4 |
+| 202 | k8s-node2 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
+| 203 | k8s-node3 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
+| 204 | k8s-node4 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
 | 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
 | 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM |
-| 9000 | truenas | running | 16 | 16GB | vmbr1:vlan10 | 32G+7x256G+1T | NFS (10.0.10.15) |
+| 9000 | truenas | running | 16 | 8GB | vmbr1:vlan10 | 32G+7x256G+1T | NFS (10.0.10.15) |
+
+**Total VM RAM allocated**: 168 GB of 272 GB (62%, including stopped pbs) — 104 GB free for future VMs
 
 ## VM Templates
 | VMID | Name | Purpose |
diff --git a/docs/architecture/compute.md b/docs/architecture/compute.md
index 6992cc55..50090c04 100644
--- a/docs/architecture/compute.md
+++ b/docs/architecture/compute.md
@@ -9,16 +9,16 @@ The infrastructure runs on a single Dell R730 server with Proxmox VE, hosting a
 ```mermaid
 graph TB
     subgraph Physical["Dell R730 Physical Host"]
-        CPU["2x Xeon E5-2699 v4
22c/44t each
44c/88t total"] - RAM["142GB DDR4 ECC"] + CPU["1x Xeon E5-2699 v4
22c/44t
CPU2 unpopulated"] + RAM["272GB DDR4-2400 ECC"] GPU["NVIDIA Tesla T4
PCIe 0000:06:00.0"] DISK["1.1TB SSD
931GB SSD
10.7TB HDD"] end subgraph Proxmox["Proxmox VE"] direction TB - MASTER["VM 200: k8s-master
8c / 8GB
10.0.20.100"] - NODE1["VM 201: k8s-node1
16c / 16GB
GPU Passthrough
nvidia.com/gpu=true:NoSchedule"] + MASTER["VM 200: k8s-master
8c / 16GB
10.0.20.100"] + NODE1["VM 201: k8s-node1
16c / 32GB
GPU Passthrough
nvidia.com/gpu=true:NoSchedule"] NODE2["VM 202: k8s-node2
8c / 24GB"] NODE3["VM 203: k8s-node3
8c / 24GB"] NODE4["VM 204: k8s-node4
8c / 24GB"]
@@ -60,9 +60,9 @@ graph TB
 
 | Component | Specification |
 |-----------|---------------|
 | Model | Dell PowerEdge R730 |
-| CPU | 2x Intel Xeon E5-2699 v4 (22 cores / 44 threads each) |
-| Total Cores/Threads | 44 cores / 88 threads |
-| RAM | 142GB DDR4 ECC |
+| CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) |
+| Total Cores/Threads | 22 cores / 44 threads |
+| RAM | 272GB DDR4-2400 ECC RDIMM (10 DIMMs: 8x32G Samsung + 2x8G Hynix) |
 | GPU | NVIDIA Tesla T4 (16GB GDDR6, PCIe 0000:06:00.0) |
 | Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD |
 | Hypervisor | Proxmox VE |
@@ -71,13 +71,13 @@ graph TB
 
 | VM | VMID | vCPUs | RAM | Network | Role | Taints |
 |----|------|-------|-----|---------|------|--------|
-| k8s-master | 200 | 8 | 8GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | `node-role.kubernetes.io/control-plane:NoSchedule` |
-| k8s-node1 | 201 | 16 | 16GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:NoSchedule` |
+| k8s-master | 200 | 8 | 16GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | `node-role.kubernetes.io/control-plane:NoSchedule` |
+| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:NoSchedule` |
 | k8s-node2 | 202 | 8 | 24GB | vmbr1:vlan20 | Worker | None |
 | k8s-node3 | 203 | 8 | 24GB | vmbr1:vlan20 | Worker | None |
 | k8s-node4 | 204 | 8 | 24GB | vmbr1:vlan20 | Worker | None |
 
-**Total Cluster Resources**: 48 vCPUs, 104GB RAM (excluding control plane)
+**Total Cluster Resources**: 48 vCPUs, 120GB RAM (including control plane; 40 vCPUs / 104GB on workers)
 
 ### GPU Passthrough
@@ -443,7 +443,7 @@ spec:
 **Rationale**:
 - **CFS Throttling**: Linux Completely Fair Scheduler throttles containers to their exact CPU limit, even when CPU is idle. This causes artificial performance degradation.
 - **Burstability**: Services can burst to unused CPU during low-load periods, improving response times.
-- **Memory-bound**: With 142GB RAM across 48 vCPUs, memory exhaustion occurs before CPU saturation. 
Memory is the constraining resource.
+- **Memory-bound**: With 272GB host RAM (168GB allocated to VMs), memory is no longer the primary constraint. 104GB of headroom is available for new VMs.
 
 **Tradeoff**: A runaway process could monopolize CPU. Mitigated by CPU requests reserving capacity and PriorityClass preemption.
 
@@ -492,7 +492,7 @@ spec:
 - **Growth Buffer**: Services grow over time (more users, more data). Headroom delays the need for manual intervention.
 - **GPU Volatility**: GPU workloads (ML inference) have unpredictable memory usage. 30% headroom reduces OOMKills.
 
-**Tradeoff**: Slightly higher memory allocation. Accepted because 142GB RAM provides ample capacity.
+**Tradeoff**: Slightly higher memory allocation. Accepted because 272GB RAM provides ample capacity.
 
 ## Troubleshooting
 
diff --git a/docs/runbooks/r730-ram-upgrade-272gb.md b/docs/runbooks/r730-ram-upgrade-272gb.md
new file mode 100644
index 00000000..bd0f9a33
--- /dev/null
+++ b/docs/runbooks/r730-ram-upgrade-272gb.md
@@ -0,0 +1,188 @@
+# RAM Upgrade — Dell R730 Proxmox Host (Completed 2026-04-01)
+
+**Host**: Dell R730 @ 192.168.1.127 (Proxmox)
+**CPU**: Single Xeon E5-2699 v4 (CPU2 unpopulated — B-side slots unavailable)
+**Before**: 144 GB (4x32G Samsung BB1 + 2x8G SK Hynix) @ 2400 MHz
+**After**: 272 GB (4x32G Samsung BB1 + 4x32G Samsung CB1 + 2x8G SK Hynix) @ 2400 MHz
+
+## Lessons Learned
+
+1. **3 DPC downclock**: Adding DIMMs to the 3rd slot per channel (A11/A12) caused automatic downclocking to 1866 MHz. Dell R730 BIOS allows manual override back to 2400 MHz via **System BIOS > Memory Settings > Memory Frequency > Max Performance**.
+2. **MySQL InnoDB Cluster CR recreation**: Deleting and recreating the InnoDBCluster CR generates new admin secrets that don't match the existing data on PVCs. Fix: manually create the new admin user in MySQL and configure GR recovery channel credentials.
+3. 
**CNPG primary label**: After restarting the CNPG operator, it may not immediately label the primary pod with `role=primary`. Deleting the pod forces the operator to recreate it with the correct labels.
+4. **LimitRange blocks MySQL**: The `dbaas` namespace LimitRange (4Gi max) blocks MySQL pods that need 5Gi. Kyverno policy resets LimitRange patches. Fix: reduce MySQL memory limit in CR to 4Gi.
+
+## Physical DIMM Slot Map (looking down at motherboard, front of server at bottom)
+
+```
+╔══════════════════════════════════════════════════════════════════════════════╗
+║                               CPU1 DIMM SLOTS                                ║
+║                                                                              ║
+║  ┌─── WHITE (1st per channel) ───┐                                           ║
+║  │                               │                                           ║
+║  │  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐                                      ║
+║  │  │  A1  │ │  A2  │ │  A3  │ │  A4  │                                      ║
+║  │  │ 32G  │ │ 32G  │ │ 32G  │ │ 32G  │  ◄── KEEP (existing Samsung 32G)     ║
+║  │  │██████│ │██████│ │██████│ │██████│                                      ║
+║  │  └──────┘ └──────┘ └──────┘ └──────┘                                      ║
+║  │    Ch 0     Ch 1     Ch 2     Ch 3                                        ║
+║  └────────────────────────────────┘                                          ║
+║                                                                              ║
+║  ┌─── BLACK (2nd per channel) ───┐                                           ║
+║  │                               │                                           ║
+║  │  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐                                      ║
+║  │  │  A5  │ │  A6  │ │  A7  │ │  A8  │                                      ║
+║  │  │ 32G  │ │ 32G  │ │ 32G  │ │ 32G  │  ◄── INSTALL NEW 32G Samsung         ║
+║  │  │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│      (remove old 8G from A5/A6)     ║
+║  │  └──────┘ └──────┘ └──────┘ └──────┘                                      ║
+║  │    Ch 0     Ch 1     Ch 2     Ch 3                                        ║
+║  └────────────────────────────────┘                                          ║
+║                                                                              ║
+║  ┌─── GREEN (3rd per channel) ───┐                                           ║
+║  │                               │                                           ║
+║  │  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐                                      ║
+║  │  │  A9  │ │ A10  │ │ A11  │ │ A12  │                                      ║
+║  │  │      │ │      │ │  8G  │ │  8G  │  ◄── MOVE old 8G Hynix here          ║
+║  │  │ empty│ │ empty│ │░░░░░░│ │░░░░░░│      (from A5 → A11, A6 → A12)       ║
+║  │  └──────┘ └──────┘ └──────┘ └──────┘                                      ║
+║  │    Ch 0     Ch 1     Ch 2     Ch 3                                        ║
+║  └────────────────────────────────┘                                          ║
+║                                                                              ║
+║  Legend: ██ = existing 32G (keep in place)                                   ║
+║          ▓▓ = NEW 32G Samsung M393A4K40CB1-CRC (install)                     ║
+║          ░░ = relocated 8G SK Hynix HMA81GR7AFR8N-UH (moved from A5/A6)      ║
+║                                                                              ║
+╚══════════════════════════════════════════════════════════════════════════════╝
+```
+
+## Channel Summary After Install
+
+``` +Channel 0: A1 [32G] ──── A5 [32G] ──── A9 [ ] = 64 GB ✓ matched +Channel 1: A2 [32G] ──── A6 [32G] ──── A10[ ] = 64 GB ✓ matched +Channel 2: A3 [32G] ──── A7 [32G] ──── A11[ 8G ] = 72 GB ~ +8G bonus +Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB ~ +8G bonus + ───────── ───────── ────────── + WHITE BLACK GREEN TOTAL: 272 GB + (keep) (new 32G) (moved 8G) +``` + +**Performance**: ~1-2% bandwidth penalty on Ch2/Ch3 due to mixed DIMM sizes. Ch0/Ch1 fully matched. + +## Shutdown Sequence + +### Phase 0: Gracefully Stop Stateful Services + +Scale down databases, caches, and secrets engines before draining nodes to ensure clean shutdown with no data loss. + +```bash +export KUBECONFIG=/path/to/config + +# 1. Vault — seal all instances (flushes WAL, closes connections) +kubectl -n vault exec vault-0 -- vault operator step-down 2>/dev/null +kubectl -n vault exec vault-0 -- vault operator seal +kubectl -n vault exec vault-1 -- vault operator seal +kubectl -n vault exec vault-2 -- vault operator seal + +# 2. MySQL InnoDB Cluster — set super_read_only, scale router to 0 +kubectl -n dbaas scale deploy mysql-cluster-router --replicas=0 +kubectl -n dbaas exec mysql-cluster-0 -- mysql -e "SET GLOBAL innodb_fast_shutdown=0; SET GLOBAL super_read_only=ON;" +kubectl -n dbaas exec mysql-cluster-1 -- mysql -e "SET GLOBAL innodb_fast_shutdown=0; SET GLOBAL super_read_only=ON;" +kubectl -n dbaas exec mysql-cluster-2 -- mysql -e "SET GLOBAL innodb_fast_shutdown=0; SET GLOBAL super_read_only=ON;" +# innodb_fast_shutdown=0 forces full purge + change buffer merge on stop + +# 3. PostgreSQL CNPG — trigger checkpoint on primaries +kubectl -n dbaas exec pg-cluster-2 -- psql -U postgres -c "CHECKPOINT;" +kubectl -n dbaas exec pg-cluster-4 -- psql -U postgres -c "CHECKPOINT;" +kubectl -n immich exec deploy/immich-postgresql -- psql -U postgres -c "CHECKPOINT;" + +# 4. 
Redis — trigger BGSAVE then scale down +kubectl -n redis exec redis-node-0 -- redis-cli BGSAVE +kubectl -n redis exec redis-node-1 -- redis-cli BGSAVE +sleep 5 # wait for RDB flush +kubectl -n redis scale deploy redis-haproxy --replicas=0 + +# 5. ClickHouse — flush +kubectl -n rybbit exec deploy/clickhouse -- clickhouse-client --query "SYSTEM FLUSH LOGS" + +# 6. Scale down stateful workloads +kubectl -n dbaas scale sts mysql-cluster --replicas=0 +kubectl -n redis scale sts redis-node --replicas=0 +kubectl -n vault scale sts vault --replicas=0 + +# 7. Verify all stateful pods terminated +kubectl get pods -A | grep -iE 'mysql-cluster-[0-9]|pg-cluster|redis-node|vault-[0-9]|clickhouse' +``` + +### Phase 1: Drain K8s Nodes + +```bash +# Drain workers (reverse order) +kubectl drain k8s-node4 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s +kubectl drain k8s-node3 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s +kubectl drain k8s-node2 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s +kubectl drain k8s-node1 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s + +# Cordon master +kubectl cordon k8s-master +``` + +### Phase 2: Shutdown VMs (via Proxmox) + +```bash +ssh root@192.168.1.127 + +# K8s workers +for VMID in 201 202 203 204; do qm shutdown $VMID && echo "Shutdown VMID $VMID"; done +sleep 30 + +# K8s master +qm shutdown 200; sleep 15 + +# Docker registry +qm shutdown 220; sleep 10 + +# Secondary VMs +for VMID in 102 300 103; do qm shutdown $VMID; done +sleep 20 + +# TrueNAS (wait for ZFS flush) +qm shutdown 9000; sleep 60 + +# pfSense (last — network gateway) +qm shutdown 101; sleep 15 + +# Verify all VMs stopped +qm list +``` + +### Phase 3: Shutdown Proxmox Host + +```bash +shutdown -h now +``` + +## Physical RAM Installation + +| Step | Action | Slot(s) | DIMM | +|------|--------|---------|------| +| 1 | Power off host | — | Completed via Phase 3 above | +| 2 | **Remove** | A5 (black clip) | Take 
out 8G Hynix, set aside | +| 3 | **Remove** | A6 (black clip) | Take out 8G Hynix, set aside | +| 4 | **Install NEW** | A5 (black clip) | Insert 32G Samsung | +| 5 | **Install NEW** | A6 (black clip) | Insert 32G Samsung | +| 6 | **Install NEW** | A7 (black clip) | Insert 32G Samsung | +| 7 | **Install NEW** | A8 (black clip) | Insert 32G Samsung | +| 8 | **Install MOVED** | A11 (green clip) | Insert 8G Hynix (was in A5) | +| 9 | **Install MOVED** | A12 (green clip) | Insert 8G Hynix (was in A6) | +| 10 | Power on | — | — | + +## Post-Boot Verification + +```bash +# Verify all 10 DIMMs detected +ssh root@192.168.1.127 'dmidecode -t memory | grep -E "Locator:|Size:" | grep -v Bank' + +# Verify total RAM (~268 GiB usable) +ssh root@192.168.1.127 'free -h' +```
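
The two checks above confirm presence and total size; it is also worth confirming that the BIOS frequency override held and every DIMM reports 2400 MT/s rather than the 3 DPC fallback of 1866 MT/s. A sketch of that parsing step follows — `parse_dimms` is an illustrative helper (not part of this repo), and the awk field positions assume typical `dmidecode` 3.x output ("Size: N GB", "Configured Memory Speed: N MT/s"); older dmidecode prints "Configured Clock Speed" instead, which the pattern also accepts.

```shell
# Summarize dmidecode memory output: populated DIMM count, total GB,
# and a tally of configured speeds. Empty slots report
# "Size: No Module Installed" and are skipped by the Size pattern.
parse_dimms() {
  awk '
    /^[ \t]*Size: [0-9]+ GB/                        { total += $2; count++ }
    /Configured (Memory|Clock) Speed: [0-9]+ M/      { speeds[$4]++ }
    END {
      printf "DIMMs=%d TotalGB=%d", count, total
      for (s in speeds) printf " Speed=%s(x%d)", s, speeds[s]
      print ""
    }'
}

# Real check against the host from this runbook (should report 10 DIMMs,
# 272 GB total, all at 2400):
#   ssh root@192.168.1.127 'dmidecode -t memory' | parse_dimms

# Demo against canned output:
printf '%s\n' \
  "        Size: 32 GB" \
  "        Configured Memory Speed: 2400 MT/s" \
  "        Size: 8 GB" \
  "        Configured Memory Speed: 2400 MT/s" | parse_dimms
# → DIMMs=2 TotalGB=40 Speed=2400(x2)
```

If any DIMM shows 1866, re-check **System BIOS > Memory Settings > Memory Frequency** — the override can revert after an NVRAM reset.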