docs: update hardware inventory for R730 RAM upgrade to 272GB

Upgraded from 144GB (4x32G + 2x8G) to 272GB (8x32G + 2x8G) DDR4-2400.
Added physical DIMM slot diagram, channel layout, and BIOS speed override
notes. Updated compute architecture with correct CPU (single socket),
VM memory values, and capacity figures.
Viktor Barzin 2026-04-02 00:48:13 +03:00
parent 87c858f026
commit 2d8aa5ed89
3 changed files with 277 additions and 22 deletions


@@ -9,16 +9,16 @@ The infrastructure runs on a single Dell R730 server with Proxmox VE, hosting a
```mermaid
graph TB
subgraph Physical["Dell R730 Physical Host"]
CPU["2x Xeon E5-2699 v4<br/>22c/44t each<br/>44c/88t total"]
RAM["142GB DDR4 ECC"]
CPU["1x Xeon E5-2699 v4<br/>22c/44t<br/>CPU2 unpopulated"]
RAM["272GB DDR4-2400 ECC"]
GPU["NVIDIA Tesla T4<br/>PCIe 0000:06:00.0"]
DISK["1.1TB SSD<br/>931GB SSD<br/>10.7TB HDD"]
end
subgraph Proxmox["Proxmox VE"]
direction TB
MASTER["VM 200: k8s-master<br/>8c / 8GB<br/>10.0.20.100"]
NODE1["VM 201: k8s-node1<br/>16c / 16GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:NoSchedule"]
MASTER["VM 200: k8s-master<br/>8c / 16GB<br/>10.0.20.100"]
NODE1["VM 201: k8s-node1<br/>16c / 32GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:NoSchedule"]
NODE2["VM 202: k8s-node2<br/>8c / 24GB"]
NODE3["VM 203: k8s-node3<br/>8c / 24GB"]
NODE4["VM 204: k8s-node4<br/>8c / 24GB"]
@@ -60,9 +60,9 @@ graph TB
| Component | Specification |
|-----------|---------------|
| Model | Dell PowerEdge R730 |
| CPU | 2x Intel Xeon E5-2699 v4 (22 cores / 44 threads each) |
| Total Cores/Threads | 44 cores / 88 threads |
| RAM | 142GB DDR4 ECC |
| CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) |
| Total Cores/Threads | 22 cores / 44 threads |
| RAM | 272GB DDR4-2400 ECC RDIMM (10 DIMMs: 8x32G Samsung + 2x8G Hynix) |
| GPU | NVIDIA Tesla T4 (16GB GDDR6, PCIe 0000:06:00.0) |
| Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD |
| Hypervisor | Proxmox VE |
@@ -71,13 +71,13 @@ graph TB
| VM | VMID | vCPUs | RAM | Network | Role | Taints |
|----|------|-------|-----|---------|------|--------|
| k8s-master | 200 | 8 | 8GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | `node-role.kubernetes.io/control-plane:NoSchedule` |
| k8s-node1 | 201 | 16 | 16GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:NoSchedule` |
| k8s-master | 200 | 8 | 16GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | `node-role.kubernetes.io/control-plane:NoSchedule` |
| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:NoSchedule` |
| k8s-node2 | 202 | 8 | 24GB | vmbr1:vlan20 | Worker | None |
| k8s-node3 | 203 | 8 | 24GB | vmbr1:vlan20 | Worker | None |
| k8s-node4 | 204 | 8 | 24GB | vmbr1:vlan20 | Worker | None |
**Total Cluster Resources**: 48 vCPUs, 104GB RAM (excluding control plane)
**Total Cluster Resources**: 48 vCPUs, 120GB RAM (including the control plane)
### GPU Passthrough
@@ -443,7 +443,7 @@ spec:
**Rationale**:
- **CFS Throttling**: Linux Completely Fair Scheduler throttles containers to their exact CPU limit, even when CPU is idle. This causes artificial performance degradation.
- **Burstability**: Services can burst to unused CPU during low-load periods, improving response times.
- **Memory-bound**: With 142GB RAM across 48 vCPUs, memory exhaustion occurs before CPU saturation. Memory is the constraining resource.
- **Memory headroom**: With 272GB host RAM (180GB allocated to VMs), memory is no longer the constraining resource; roughly 92GB of headroom remains for new VMs.
**Tradeoff**: A runaway process could monopolize CPU. Mitigated by CPU requests reserving capacity and PriorityClass preemption.
@@ -492,7 +492,7 @@ spec:
- **Growth Buffer**: Services grow over time (more users, more data). Headroom delays the need for manual intervention.
- **GPU Volatility**: GPU workloads (ML inference) have unpredictable memory usage. 30% headroom reduces OOMKills.
**Tradeoff**: Slightly higher memory allocation. Accepted because 142GB RAM provides ample capacity.
**Tradeoff**: Slightly higher memory allocation. Accepted because 272GB RAM provides ample capacity.
## Troubleshooting


@@ -0,0 +1,188 @@
# RAM Upgrade — Dell R730 Proxmox Host (Completed 2026-04-01)
**Host**: Dell R730 @ 192.168.1.127 (Proxmox)
**CPU**: Single Xeon E5-2699 v4 (CPU2 unpopulated — B-side slots unavailable)
**Before**: 144 GB (4x32G Samsung BB1 + 2x8G SK Hynix) @ 2400 MHz
**After**: 272 GB (4x32G Samsung BB1 + 4x32G Samsung CB1 + 2x8G SK Hynix) @ 2400 MHz
## Lessons Learned
1. **3 DPC downclock**: Adding DIMMs to the 3rd slot per channel (A11/A12) caused automatic downclocking to 1866 MHz. Dell R730 BIOS allows manual override back to 2400 MHz via **System BIOS > Memory Settings > Memory Frequency > Max Performance**.
2. **MySQL InnoDB Cluster CR recreation**: Deleting and recreating the InnoDBCluster CR generates new admin secrets that don't match the existing data on the PVCs. Fix: manually create the new admin user in MySQL and point the GR recovery channel at it (sketched after this list).
3. **CNPG primary label**: After restarting the CNPG operator, it may not immediately label the primary pod with `role=primary`. Deleting the pod forces the operator to recreate it with the correct labels (see the sketch after this list).
4. **LimitRange blocks MySQL**: The `dbaas` namespace LimitRange (4Gi max) blocks MySQL pods that need 5Gi. Kyverno policy resets LimitRange patches. Fix: reduce MySQL memory limit in CR to 4Gi.
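For lesson 2, the shape of the fix as a rough sketch: the `clusteradmin` user name and the placeholder password are illustrative, take the real values from the admin secret the recreated CR generated.
```bash
# Create the admin user the recreated InnoDBCluster expects, then repoint the
# group replication recovery channel at it (repeat the CHANGE REPLICATION SOURCE
# statement on each member). User name and password are placeholders.
kubectl -n dbaas exec -it mysql-cluster-0 -- mysql -e "
CREATE USER IF NOT EXISTS 'clusteradmin'@'%' IDENTIFIED BY '<password-from-new-secret>';
GRANT ALL PRIVILEGES ON *.* TO 'clusteradmin'@'%' WITH GRANT OPTION;
CHANGE REPLICATION SOURCE TO SOURCE_USER='clusteradmin', SOURCE_PASSWORD='<password-from-new-secret>' FOR CHANNEL 'group_replication_recovery';"
```
For lesson 3, forcing CNPG to relabel the primary; the cluster and pod names below match the CHECKPOINT step in Phase 0 and may differ for other clusters.
```bash
# Check which pod carries role=primary; if none does, delete the primary pod
# so the operator recreates it with the correct labels
kubectl -n dbaas get pods -l cnpg.io/cluster=pg-cluster --show-labels
kubectl -n dbaas delete pod pg-cluster-2
kubectl -n dbaas get pods -l role=primary
```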
## Physical DIMM Slot Map (looking down at motherboard, front of server at bottom)
```
╔══════════════════════════════════════════════════════════════════════════════╗
║ CPU1 DIMM SLOTS ║
║ ║
║ ┌─── WHITE (1st per channel) ───┐ ║
║ │ │ ║
║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
║ │ │ A1 │ │ A2 │ │ A3 │ │ A4 │ ║
║ │ │ 32G │ │ 32G │ │ 32G │ │ 32G │ ◄── KEEP (existing Samsung 32G) ║
║ │ │██████│ │██████│ │██████│ │██████│ ║
║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
║ └────────────────────────────────┘ ║
║ ║
║ ┌─── BLACK (2nd per channel) ───┐ ║
║ │ │ ║
║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
║ │ │ A5 │ │ A6 │ │ A7 │ │ A8 │ ║
║ │ │ 32G │ │ 32G │ │ 32G │ │ 32G │ ◄── INSTALL NEW 32G Samsung ║
║ │ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ (remove old 8G from A5/A6) ║
║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
║ └────────────────────────────────┘ ║
║ ║
║ ┌─── GREEN (3rd per channel) ───┐ ║
║ │ │ ║
║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
║ │ │ A9 │ │ A10 │ │ A11 │ │ A12 │ ║
║ │ │ │ │ │ │ 8G │ │ 8G │ ◄── MOVE old 8G Hynix here ║
║ │ │ empty│ │ empty│ │░░░░░░│ │░░░░░░│ (from A5 → A11, A6 → A12) ║
║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
║ └────────────────────────────────┘ ║
║ ║
║ Legend: ██ = existing 32G (keep in place) ║
║ ▓▓ = NEW 32G Samsung M393A4K40BB1-CRC (install) ║
║ ░░ = relocated 8G SK Hynix HMA81GR7AFR8N-UH (moved from A5/A6) ║
║ ║
╚══════════════════════════════════════════════════════════════════════════════╝
```
## Channel Summary After Install
```
Channel 0: A1 [32G] ──── A5 [32G] ──── A9 [ ] = 64 GB ✓ matched
Channel 1: A2 [32G] ──── A6 [32G] ──── A10[ ] = 64 GB ✓ matched
Channel 2: A3 [32G] ──── A7 [32G] ──── A11[ 8G ] = 72 GB ~ +8G bonus
Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB ~ +8G bonus
───────── ───────── ──────────
WHITE BLACK GREEN TOTAL: 272 GB
(keep) (new 32G) (moved 8G)
```
**Performance**: ~1-2% bandwidth penalty on Ch2/Ch3 due to mixed DIMM sizes. Ch0/Ch1 fully matched.
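To put a number on the penalty, a rough spot check with `mbw` (a small memcpy benchmark; assuming the Debian `mbw` package is acceptable on the Proxmox host), run before and after the change:
```bash
# Rough memory bandwidth spot check; compare AVG MiB/s before/after the upgrade
# and after the BIOS frequency override. Package availability is an assumption.
apt install -y mbw
mbw -n 10 1024   # 10 iterations over a 1024 MiB working set
```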
## Shutdown Sequence
### Phase 0: Gracefully Stop Stateful Services
Scale down databases, caches, and secrets engines before draining nodes to ensure clean shutdown with no data loss.
```bash
export KUBECONFIG=/path/to/config
# 1. Vault — seal all instances (flushes WAL, closes connections)
kubectl -n vault exec vault-0 -- vault operator step-down 2>/dev/null
kubectl -n vault exec vault-0 -- vault operator seal
kubectl -n vault exec vault-1 -- vault operator seal
kubectl -n vault exec vault-2 -- vault operator seal
# 2. MySQL InnoDB Cluster — set super_read_only, scale router to 0
kubectl -n dbaas scale deploy mysql-cluster-router --replicas=0
kubectl -n dbaas exec mysql-cluster-0 -- mysql -e "SET GLOBAL innodb_fast_shutdown=0; SET GLOBAL super_read_only=ON;"
kubectl -n dbaas exec mysql-cluster-1 -- mysql -e "SET GLOBAL innodb_fast_shutdown=0; SET GLOBAL super_read_only=ON;"
kubectl -n dbaas exec mysql-cluster-2 -- mysql -e "SET GLOBAL innodb_fast_shutdown=0; SET GLOBAL super_read_only=ON;"
# innodb_fast_shutdown=0 forces full purge + change buffer merge on stop
# 3. PostgreSQL CNPG — trigger checkpoint on primaries
kubectl -n dbaas exec pg-cluster-2 -- psql -U postgres -c "CHECKPOINT;"
kubectl -n dbaas exec pg-cluster-4 -- psql -U postgres -c "CHECKPOINT;"
kubectl -n immich exec deploy/immich-postgresql -- psql -U postgres -c "CHECKPOINT;"
# 4. Redis — trigger BGSAVE then scale down
kubectl -n redis exec redis-node-0 -- redis-cli BGSAVE
kubectl -n redis exec redis-node-1 -- redis-cli BGSAVE
sleep 5 # wait for RDB flush
kubectl -n redis scale deploy redis-haproxy --replicas=0
# 5. ClickHouse — flush
kubectl -n rybbit exec deploy/clickhouse -- clickhouse-client --query "SYSTEM FLUSH LOGS"
# 6. Scale down stateful workloads
kubectl -n dbaas scale sts mysql-cluster --replicas=0
kubectl -n redis scale sts redis-node --replicas=0
kubectl -n vault scale sts vault --replicas=0
# 7. Verify all stateful pods terminated
kubectl get pods -A | grep -iE 'mysql-cluster-[0-9]|pg-cluster|redis-node|vault-[0-9]|clickhouse'
```
### Phase 1: Drain K8s Nodes
```bash
# Drain workers (reverse order)
kubectl drain k8s-node4 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
kubectl drain k8s-node3 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
kubectl drain k8s-node2 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
kubectl drain k8s-node1 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
# Cordon master
kubectl cordon k8s-master
```
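A quick sanity check before powering anything off: every node should report `SchedulingDisabled`, and the workers should only be running DaemonSet pods.
```bash
# Confirm all nodes are cordoned/drained before shutting VMs down
kubectl get nodes
# Spot-check a worker: only DaemonSet pods (CNI, node-exporter, etc.) should remain
kubectl get pods -A -o wide --field-selector spec.nodeName=k8s-node1
```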
### Phase 2: Shutdown VMs (via Proxmox)
```bash
ssh root@192.168.1.127
# K8s workers
for VMID in 201 202 203 204; do qm shutdown $VMID && echo "Shutdown VMID $VMID"; done
sleep 30
# K8s master
qm shutdown 200; sleep 15
# Docker registry
qm shutdown 220; sleep 10
# Secondary VMs
for VMID in 102 300 103; do qm shutdown $VMID; done
sleep 20
# TrueNAS (wait for ZFS flush)
qm shutdown 9000; sleep 60
# pfSense (last — network gateway)
qm shutdown 101; sleep 15
# Verify all VMs stopped
qm list
```
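The fixed `sleep`s above are conservative guesses; an alternative sketch is to poll `qm list` until nothing is left running before moving on to the next batch.
```bash
# Poll until no VM reports "running" (STATUS is the 3rd column of qm list output)
while qm list | awk 'NR>1 {print $3}' | grep -q running; do
  echo "waiting for VMs to stop..."
  sleep 5
done
qm list
```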
### Phase 3: Shutdown Proxmox Host
```bash
shutdown -h now
```
## Physical RAM Installation
| Step | Action | Slot(s) | DIMM |
|------|--------|---------|------|
| 1 | Power off host | — | Completed via Phase 3 above |
| 2 | **Remove** | A5 (black clip) | Take out 8G Hynix, set aside |
| 3 | **Remove** | A6 (black clip) | Take out 8G Hynix, set aside |
| 4 | **Install NEW** | A5 (black clip) | Insert 32G Samsung |
| 5 | **Install NEW** | A6 (black clip) | Insert 32G Samsung |
| 6 | **Install NEW** | A7 (black clip) | Insert 32G Samsung |
| 7 | **Install NEW** | A8 (black clip) | Insert 32G Samsung |
| 8 | **Install MOVED** | A11 (green clip) | Insert 8G Hynix (was in A5) |
| 9 | **Install MOVED** | A12 (green clip) | Insert 8G Hynix (was in A6) |
| 10 | Power on | — | — |
## Post-Boot Verification
```bash
# Verify all 10 DIMMs detected
ssh root@192.168.1.127 'dmidecode -t memory | grep -E "Locator:|Size:" | grep -v Bank'
# Verify total RAM (~268 GiB usable)
ssh root@192.168.1.127 'free -h'
```
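After re-applying the BIOS Memory Frequency override (lesson 1), the negotiated DIMM speed can be confirmed from the OS as well; the exact field name varies between dmidecode versions ("Configured Memory Speed" vs "Configured Clock Speed").
```bash
# All populated DIMMs should report 2400 MT/s after the BIOS override
ssh root@192.168.1.127 'dmidecode -t memory | grep "Speed:" | sort | uniq -c'
```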