docs: update hardware inventory for R730 RAM upgrade to 272GB
Upgraded from 144GB (4x32G + 2x8G) to 272GB (8x32G + 2x8G) DDR4-2400. Added physical DIMM slot diagram, channel layout, and BIOS speed override notes. Updated compute architecture with correct CPU (single socket), VM memory values, and capacity figures.
This commit is contained in:
parent
87c858f026
commit
2d8aa5ed89
3 changed files with 277 additions and 22 deletions
@@ -9,16 +9,16 @@ The infrastructure runs on a single Dell R730 server with Proxmox VE, hosting a
 ```mermaid
 graph TB
     subgraph Physical["Dell R730 Physical Host"]
-        CPU["2x Xeon E5-2699 v4<br/>22c/44t each<br/>44c/88t total"]
-        RAM["142GB DDR4 ECC"]
+        CPU["1x Xeon E5-2699 v4<br/>22c/44t<br/>CPU2 unpopulated"]
+        RAM["272GB DDR4-2400 ECC"]
         GPU["NVIDIA Tesla T4<br/>PCIe 0000:06:00.0"]
         DISK["1.1TB SSD<br/>931GB SSD<br/>10.7TB HDD"]
     end

     subgraph Proxmox["Proxmox VE"]
         direction TB
-        MASTER["VM 200: k8s-master<br/>8c / 8GB<br/>10.0.20.100"]
-        NODE1["VM 201: k8s-node1<br/>16c / 16GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:NoSchedule"]
+        MASTER["VM 200: k8s-master<br/>8c / 16GB<br/>10.0.20.100"]
+        NODE1["VM 201: k8s-node1<br/>16c / 32GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:NoSchedule"]
         NODE2["VM 202: k8s-node2<br/>8c / 24GB"]
         NODE3["VM 203: k8s-node3<br/>8c / 24GB"]
         NODE4["VM 204: k8s-node4<br/>8c / 24GB"]
@@ -60,9 +60,9 @@ graph TB
 | Component | Specification |
 |-----------|---------------|
 | Model | Dell PowerEdge R730 |
-| CPU | 2x Intel Xeon E5-2699 v4 (22 cores / 44 threads each) |
-| Total Cores/Threads | 44 cores / 88 threads |
-| RAM | 142GB DDR4 ECC |
+| CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) |
+| Total Cores/Threads | 22 cores / 44 threads |
+| RAM | 272GB DDR4-2400 ECC RDIMM (10 DIMMs: 8x32G Samsung + 2x8G Hynix) |
 | GPU | NVIDIA Tesla T4 (16GB GDDR6, PCIe 0000:06:00.0) |
 | Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD |
 | Hypervisor | Proxmox VE |
@@ -71,13 +71,13 @@ graph TB
 | VM | VMID | vCPUs | RAM | Network | Role | Taints |
 |----|------|-------|-----|---------|------|--------|
-| k8s-master | 200 | 8 | 8GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | `node-role.kubernetes.io/control-plane:NoSchedule` |
-| k8s-node1 | 201 | 16 | 16GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:NoSchedule` |
+| k8s-master | 200 | 8 | 16GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | `node-role.kubernetes.io/control-plane:NoSchedule` |
+| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:NoSchedule` |
 | k8s-node2 | 202 | 8 | 24GB | vmbr1:vlan20 | Worker | None |
 | k8s-node3 | 203 | 8 | 24GB | vmbr1:vlan20 | Worker | None |
 | k8s-node4 | 204 | 8 | 24GB | vmbr1:vlan20 | Worker | None |

-**Total Cluster Resources**: 48 vCPUs, 104GB RAM (excluding control plane)
+**Total Cluster Resources**: 48 vCPUs, 120GB RAM (including control plane)

 ### GPU Passthrough
@@ -443,7 +443,7 @@ spec:
 **Rationale**:
 - **CFS Throttling**: Linux Completely Fair Scheduler throttles containers to their exact CPU limit, even when CPU is idle. This causes artificial performance degradation.
 - **Burstability**: Services can burst to unused CPU during low-load periods, improving response times.
-- **Memory-bound**: With 142GB RAM across 48 vCPUs, memory exhaustion occurs before CPU saturation. Memory is the constraining resource.
+- **Memory-bound**: With 272GB host RAM (180GB allocated to VMs), memory is no longer the primary constraint. 92GB headroom available for new VMs.

 **Tradeoff**: A runaway process could monopolize CPU. Mitigated by CPU requests reserving capacity and PriorityClass preemption.
@@ -492,7 +492,7 @@ spec:
 - **Growth Buffer**: Services grow over time (more users, more data). Headroom delays the need for manual intervention.
 - **GPU Volatility**: GPU workloads (ML inference) have unpredictable memory usage. 30% headroom reduces OOMKills.

-**Tradeoff**: Slightly higher memory allocation. Accepted because 142GB RAM provides ample capacity.
+**Tradeoff**: Slightly higher memory allocation. Accepted because 272GB RAM provides ample capacity.

 ## Troubleshooting
188 docs/runbooks/r730-ram-upgrade-272gb.md (new file)

@@ -0,0 +1,188 @@
# RAM Upgrade — Dell R730 Proxmox Host (Completed 2026-04-01)

**Host**: Dell R730 @ 192.168.1.127 (Proxmox)
**CPU**: Single Xeon E5-2699 v4 (CPU2 unpopulated — B-side slots unavailable)
**Before**: 144 GB (4x32G Samsung BB1 + 2x8G SK Hynix) @ 2400 MHz
**After**: 272 GB (4x32G Samsung BB1 + 4x32G Samsung CB1 + 2x8G SK Hynix) @ 2400 MHz

## Lessons Learned

1. **3 DPC downclock**: Populating the 3rd slot per channel (A11/A12) caused automatic downclocking to 1866 MHz. The R730 BIOS allows a manual override back to 2400 MHz via **System BIOS > Memory Settings > Memory Frequency > Max Performance**.
2. **MySQL InnoDB Cluster CR recreation**: Deleting and recreating the InnoDBCluster CR generates new admin secrets that don't match the existing data on the PVCs. Fix: manually create the new admin user in MySQL and configure the Group Replication recovery channel credentials.
3. **CNPG primary label**: After restarting the CNPG operator, it may not immediately label the primary pod with `role=primary`. Deleting the pod forces the operator to recreate it with the correct labels.
4. **LimitRange blocks MySQL**: The `dbaas` namespace LimitRange (4Gi max) blocks MySQL pods that request 5Gi, and a Kyverno policy reverts manual patches to the LimitRange. Fix: reduce the MySQL memory limit in the CR to 4Gi.
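The credential repair from lesson 2 looked roughly like the following. This is a hedged sketch: the secret name `mysql-cluster-cluster-secret`, the key `rootPassword`, and the user `mysqladmin` are illustrative placeholders — check what the operator actually generated. `CHANGE REPLICATION SOURCE TO ... FOR CHANNEL 'group_replication_recovery'` is the MySQL 8 syntax for re-pointing the recovery channel.

```shell
# Password the recreated CR generated (secret/key names are assumptions).
NEWPASS=$(kubectl -n dbaas get secret mysql-cluster-cluster-secret \
  -o jsonpath='{.data.rootPassword}' | base64 -d)

# On each member: create the matching user inside the *existing* data set,
# then point the Group Replication recovery channel at it.
kubectl -n dbaas exec mysql-cluster-0 -- mysql -e "
  CREATE USER IF NOT EXISTS 'mysqladmin'@'%' IDENTIFIED BY '${NEWPASS}';
  GRANT ALL ON *.* TO 'mysqladmin'@'%' WITH GRANT OPTION;
  CHANGE REPLICATION SOURCE TO SOURCE_USER='mysqladmin',
    SOURCE_PASSWORD='${NEWPASS}' FOR CHANNEL 'group_replication_recovery';"
```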
## Physical DIMM Slot Map (looking down at motherboard, front of server at bottom)

```
                 CPU1 DIMM SLOTS

  ┌─── WHITE (1st per channel) ───┐
  │  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
  │  │  A1  │ │  A2  │ │  A3  │ │  A4  │
  │  │ 32G  │ │ 32G  │ │ 32G  │ │ 32G  │  ◄── KEEP (existing Samsung 32G)
  │  │██████│ │██████│ │██████│ │██████│
  │  └──────┘ └──────┘ └──────┘ └──────┘
  │    Ch 0     Ch 1     Ch 2     Ch 3
  └───────────────────────────────┘

  ┌─── BLACK (2nd per channel) ───┐
  │  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
  │  │  A5  │ │  A6  │ │  A7  │ │  A8  │
  │  │ 32G  │ │ 32G  │ │ 32G  │ │ 32G  │  ◄── INSTALL NEW 32G Samsung
  │  │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│      (remove old 8G from A5/A6)
  │  └──────┘ └──────┘ └──────┘ └──────┘
  │    Ch 0     Ch 1     Ch 2     Ch 3
  └───────────────────────────────┘

  ┌─── GREEN (3rd per channel) ───┐
  │  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
  │  │  A9  │ │ A10  │ │ A11  │ │ A12  │
  │  │      │ │      │ │  8G  │ │  8G  │  ◄── MOVE old 8G Hynix here
  │  │ empty│ │ empty│ │░░░░░░│ │░░░░░░│      (from A5 → A11, A6 → A12)
  │  └──────┘ └──────┘ └──────┘ └──────┘
  │    Ch 0     Ch 1     Ch 2     Ch 3
  └───────────────────────────────┘

  Legend: ██ = existing 32G (keep in place)
          ▓▓ = NEW 32G Samsung M393A4K40BB1-CRC (install)
          ░░ = relocated 8G SK Hynix HMA81GR7AFR8N-UH (moved from A5/A6)
```

## Channel Summary After Install

```
Channel 0:  A1 [32G] ──── A5 [32G] ──── A9 [    ]  =  64 GB  ✓ matched
Channel 1:  A2 [32G] ──── A6 [32G] ──── A10[    ]  =  64 GB  ✓ matched
Channel 2:  A3 [32G] ──── A7 [32G] ──── A11[ 8G ]  =  72 GB  ~ +8G bonus
Channel 3:  A4 [32G] ──── A8 [32G] ──── A12[ 8G ]  =  72 GB  ~ +8G bonus
            ─────────     ─────────     ─────────
            WHITE         BLACK         GREEN       TOTAL: 272 GB
            (keep)        (new 32G)     (moved 8G)
```
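As a quick sanity check, the totals above can be recomputed from the populated slots (a throwaway sketch, not part of the runbook commands):

```shell
# Recompute total capacity (GB) from the slot:size map above.
total=0
for dimm in A1:32 A2:32 A3:32 A4:32 A5:32 A6:32 A7:32 A8:32 A11:8 A12:8; do
  total=$(( total + ${dimm#*:} ))  # strip "SLOT:" prefix, keep size
done
echo "TOTAL: ${total} GB"   # → TOTAL: 272 GB
```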

**Performance**: ~1-2% bandwidth penalty on Ch2/Ch3 due to mixed DIMM sizes. Ch0/Ch1 fully matched.
## Shutdown Sequence

### Phase 0: Gracefully Stop Stateful Services

Scale down databases, caches, and secrets engines before draining nodes to ensure clean shutdown with no data loss.

```bash
export KUBECONFIG=/path/to/config

# 1. Vault — seal all instances (flushes WAL, closes connections)
kubectl -n vault exec vault-0 -- vault operator step-down 2>/dev/null
kubectl -n vault exec vault-0 -- vault operator seal
kubectl -n vault exec vault-1 -- vault operator seal
kubectl -n vault exec vault-2 -- vault operator seal

# 2. MySQL InnoDB Cluster — set super_read_only, scale router to 0
kubectl -n dbaas scale deploy mysql-cluster-router --replicas=0
kubectl -n dbaas exec mysql-cluster-0 -- mysql -e "SET GLOBAL innodb_fast_shutdown=0; SET GLOBAL super_read_only=ON;"
kubectl -n dbaas exec mysql-cluster-1 -- mysql -e "SET GLOBAL innodb_fast_shutdown=0; SET GLOBAL super_read_only=ON;"
kubectl -n dbaas exec mysql-cluster-2 -- mysql -e "SET GLOBAL innodb_fast_shutdown=0; SET GLOBAL super_read_only=ON;"
# innodb_fast_shutdown=0 forces full purge + change buffer merge on stop

# 3. PostgreSQL CNPG — trigger checkpoint on primaries
kubectl -n dbaas exec pg-cluster-2 -- psql -U postgres -c "CHECKPOINT;"
kubectl -n dbaas exec pg-cluster-4 -- psql -U postgres -c "CHECKPOINT;"
kubectl -n immich exec deploy/immich-postgresql -- psql -U postgres -c "CHECKPOINT;"

# 4. Redis — trigger BGSAVE then scale down
kubectl -n redis exec redis-node-0 -- redis-cli BGSAVE
kubectl -n redis exec redis-node-1 -- redis-cli BGSAVE
sleep 5  # wait for RDB flush
kubectl -n redis scale deploy redis-haproxy --replicas=0

# 5. ClickHouse — flush
kubectl -n rybbit exec deploy/clickhouse -- clickhouse-client --query "SYSTEM FLUSH LOGS"

# 6. Scale down stateful workloads
kubectl -n dbaas scale sts mysql-cluster --replicas=0
kubectl -n redis scale sts redis-node --replicas=0
kubectl -n vault scale sts vault --replicas=0

# 7. Verify all stateful pods terminated
kubectl get pods -A | grep -iE 'mysql-cluster-[0-9]|pg-cluster|redis-node|vault-[0-9]|clickhouse'
```

### Phase 1: Drain K8s Nodes

```bash
# Drain workers (reverse order)
kubectl drain k8s-node4 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
kubectl drain k8s-node3 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
kubectl drain k8s-node2 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
kubectl drain k8s-node1 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s

# Cordon master
kubectl cordon k8s-master
```

### Phase 2: Shutdown VMs (via Proxmox)

```bash
ssh root@192.168.1.127

# K8s workers
for VMID in 201 202 203 204; do qm shutdown $VMID && echo "Shutdown VMID $VMID"; done
sleep 30

# K8s master
qm shutdown 200; sleep 15

# Docker registry
qm shutdown 220; sleep 10

# Secondary VMs
for VMID in 102 300 103; do qm shutdown $VMID; done
sleep 20

# TrueNAS (wait for ZFS flush)
qm shutdown 9000; sleep 60

# pfSense (last — network gateway)
qm shutdown 101; sleep 15

# Verify all VMs stopped
qm list
```
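The fixed `sleep` waits above are conservative. A small helper that polls `qm status` (which prints `status: stopped` once a VM is down) makes the sequence deterministic; this is a sketch, not part of the executed runbook:

```shell
# Poll a VM until Proxmox reports it stopped, instead of sleeping blindly.
wait_stopped() {
  local vmid=$1 timeout=${2:-60}
  while [ "$timeout" -gt 0 ]; do
    qm status "$vmid" | grep -q 'status: stopped' && return 0
    sleep 5
    timeout=$((timeout - 5))
  done
  return 1   # VM still running after the timeout
}

# Usage: qm shutdown 9000 && wait_stopped 9000 120
```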

### Phase 3: Shutdown Proxmox Host

```bash
shutdown -h now
```

## Physical RAM Installation

| Step | Action | Slot(s) | DIMM |
|------|--------|---------|------|
| 1 | Power off host | — | Completed via Phase 3 above |
| 2 | **Remove** | A5 (black clip) | Take out 8G Hynix, set aside |
| 3 | **Remove** | A6 (black clip) | Take out 8G Hynix, set aside |
| 4 | **Install NEW** | A5 (black clip) | Insert 32G Samsung |
| 5 | **Install NEW** | A6 (black clip) | Insert 32G Samsung |
| 6 | **Install NEW** | A7 (black clip) | Insert 32G Samsung |
| 7 | **Install NEW** | A8 (black clip) | Insert 32G Samsung |
| 8 | **Install MOVED** | A11 (green clip) | Insert 8G Hynix (was in A5) |
| 9 | **Install MOVED** | A12 (green clip) | Insert 8G Hynix (was in A6) |
| 10 | Power on | — | — |

## Post-Boot Verification

```bash
# Verify all 10 DIMMs detected
ssh root@192.168.1.127 'dmidecode -t memory | grep -E "Locator:|Size:" | grep -v Bank'

# Verify total RAM (~268 GiB usable)
ssh root@192.168.1.127 'free -h'
```
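Given the 3 DPC downclock noted in Lessons Learned, it is also worth confirming the BIOS frequency override held. A sketch (the dmidecode field is `Configured Memory Speed` on recent versions, `Configured Clock Speed` on older builds, hence the alternation):

```shell
# Every populated DIMM should report 2400 MT/s, not 1866 MT/s
ssh root@192.168.1.127 \
  'dmidecode -t memory | grep -E "Configured (Memory|Clock) Speed" | sort | uniq -c'
```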