diff --git a/.claude/reference/proxmox-inventory.md b/.claude/reference/proxmox-inventory.md
index 46caee0c..8eeb56ea 100644
--- a/.claude/reference/proxmox-inventory.md
+++ b/.claude/reference/proxmox-inventory.md
@@ -3,12 +3,77 @@
> Static reference for VMs, hardware, and network topology.
## Proxmox Host Hardware
-- **CPU**: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket)
-- **RAM**: 142 GB (Dell R730 server)
+- **Model**: Dell R730
+- **CPU**: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket, CPU2 unpopulated)
+- **RAM**: 272 GB DDR4-2400 ECC RDIMM (10 DIMMs, see Memory Layout below)
- **GPU**: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1)
-- **Disks**: 1.1TB + 931GB + 10.7TB (local storage)
+- **iDRAC**: 192.168.1.4 (root/calvin)
+- **Disks**: 1.1TB RAID1 SAS (unused) + 931GB Samsung SSD + 10.7TB RAID1 HDD
- **Proxmox access**: `ssh root@192.168.1.127`
+## Memory Layout (updated 2026-04-01)
+
+### Physical DIMM Slot Map
+
+```
+╔══════════════════════════════════════════════════════════════════════════════╗
+║ CPU1 DIMM SLOTS ║
+║ ║
+║ ┌─── WHITE (1st per channel) ───┐ ║
+║ │ │ ║
+║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
+║ │ │ A1 │ │ A2 │ │ A3 │ │ A4 │ ║
+║ │ │ 32G │ │ 32G │ │ 32G │ │ 32G │ Samsung M393A4K40BB1-CRC (2R) ║
+║ │ │██████│ │██████│ │██████│ │██████│ ║
+║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
+║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
+║ └────────────────────────────────┘ ║
+║ ║
+║ ┌─── BLACK (2nd per channel) ───┐ ║
+║ │ │ ║
+║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
+║ │ │ A5 │ │ A6 │ │ A7 │ │ A8 │ ║
+║ │ │ 32G │ │ 32G │ │ 32G │ │ 32G │ Samsung M393A4K40CB1-CRC (2R) ║
+║ │ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ ║
+║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
+║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
+║ └────────────────────────────────┘ ║
+║ ║
+║ ┌─── GREEN (3rd per channel) ───┐ ║
+║ │ │ ║
+║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
+║ │ │ A9 │ │ A10 │ │ A11 │ │ A12 │ ║
+║ │ │ │ │ │ │ 8G │ │ 8G │ SK Hynix HMA81GR7AFR8N-UH (1R) ║
+║ │ │ empty│ │ empty│ │░░░░░░│ │░░░░░░│ ║
+║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
+║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
+║ └────────────────────────────────┘ ║
+║ ║
+║ B1-B12: All empty (requires CPU2) ║
+║ ║
+║ Legend: ██ = Samsung BB1 32G ▓▓ = Samsung CB1 32G ░░ = Hynix 8G ║
+╚══════════════════════════════════════════════════════════════════════════════╝
+```
+
+### Channel Summary
+
+```
+Channel 0: A1 [32G] ──── A5 [32G] ──── A9 [ ] = 64 GB ✓ matched
+Channel 1: A2 [32G] ──── A6 [32G] ──── A10[ ] = 64 GB ✓ matched
+Channel 2: A3 [32G] ──── A7 [32G] ──── A11[ 8G ] = 72 GB ~ +8G bonus
+Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB ~ +8G bonus
+ ───────── ───────── ──────────
+ WHITE BLACK GREEN TOTAL: 272 GB
+```
+
+### DIMM Details
+
+- **A1-A4**: Samsung M393A4K40BB1-CRC 32GB DDR4-2400 ECC RDIMM (2-rank, original)
+- **A5-A8**: Samsung M393A4K40CB1-CRC 32GB DDR4-2400 ECC RDIMM (2-rank, added 2026-04-01)
+- **A11-A12**: SK Hynix HMA81GR7AFR8N-UH 8GB DDR4-2400 ECC RDIMM (1-rank, relocated from A5/A6)
+- **A9-A10, B1-B12**: Empty (B-side requires CPU2)
+- **Speed**: 2400 MHz (BIOS override — 3 DPC defaults to 1866 MHz, forced to 2400 via System BIOS > Memory Settings > Memory Frequency)
+
## Network Topology
```
10.0.10.0/24 - Management: Wizard (10.0.10.10), TrueNAS NFS (10.0.10.15)
@@ -25,18 +90,20 @@
| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
|------|------|--------|------|-----|---------|------|-------|
-| 101 | pfsense | running | 8 | 16GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall |
+| 101 | pfsense | running | 8 | 4GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall |
| 102 | devvm | running | 16 | 8GB | vmbr1:vlan10 | 100G | Development VM |
| 103 | home-assistant | running | 8 | 8GB | vmbr0 | 64G | HA Sofia, net0(vlan10) disabled, SSH: vbarzin@192.168.1.8 |
| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup (unused) |
-| 200 | k8s-master | running | 8 | 8GB* | vmbr1:vlan20 | 64G | Control plane (10.0.20.100). *Verify via `qm config 200` |
-| 201 | k8s-node1 | running | 16 | 16GB* | vmbr1:vlan20 | 256G | GPU node, Tesla T4. *Verify via `qm config 201` |
-| 202 | k8s-node2 | running | 8 | 24GB* | vmbr1:vlan20 | 256G | Worker. *Inferred from k8s allocatable (~22 GiB) |
-| 203 | k8s-node3 | running | 8 | 24GB* | vmbr1:vlan20 | 256G | Worker. *Inferred from k8s allocatable (~22 GiB) |
-| 204 | k8s-node4 | running | 8 | 24GB* | vmbr1:vlan20 | 256G | Worker. *Inferred from k8s allocatable (~22 GiB) |
+| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
+| 201 | k8s-node1 | running | 16 | 32GB | vmbr1:vlan20 | 256G | GPU node, Tesla T4 |
+| 202 | k8s-node2 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
+| 203 | k8s-node3 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
+| 204 | k8s-node4 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM |
-| 9000 | truenas | running | 16 | 16GB | vmbr1:vlan10 | 32G+7x256G+1T | NFS (10.0.10.15) |
+| 9000 | truenas | running | 16 | 8GB | vmbr1:vlan10 | 32G+7x256G+1T | NFS (10.0.10.15) |
+
+**Total VM RAM allocated**: 180 GB of 272 GB (66%) — 92 GB free for future VMs
## VM Templates
| VMID | Name | Purpose |
diff --git a/docs/architecture/compute.md b/docs/architecture/compute.md
index 6992cc55..50090c04 100644
--- a/docs/architecture/compute.md
+++ b/docs/architecture/compute.md
@@ -9,16 +9,16 @@ The infrastructure runs on a single Dell R730 server with Proxmox VE, hosting a
```mermaid
graph TB
subgraph Physical["Dell R730 Physical Host"]
- CPU["2x Xeon E5-2699 v4
22c/44t each
44c/88t total"]
- RAM["142GB DDR4 ECC"]
+ CPU["1x Xeon E5-2699 v4
22c/44t
CPU2 unpopulated"]
+ RAM["272GB DDR4-2400 ECC"]
GPU["NVIDIA Tesla T4
PCIe 0000:06:00.0"]
DISK["1.1TB SSD
931GB SSD
10.7TB HDD"]
end
subgraph Proxmox["Proxmox VE"]
direction TB
- MASTER["VM 200: k8s-master
8c / 8GB
10.0.20.100"]
- NODE1["VM 201: k8s-node1
16c / 16GB
GPU Passthrough
nvidia.com/gpu=true:NoSchedule"]
+ MASTER["VM 200: k8s-master
8c / 16GB
10.0.20.100"]
+ NODE1["VM 201: k8s-node1
16c / 32GB
GPU Passthrough
nvidia.com/gpu=true:NoSchedule"]
NODE2["VM 202: k8s-node2
8c / 24GB"]
NODE3["VM 203: k8s-node3
8c / 24GB"]
NODE4["VM 204: k8s-node4
8c / 24GB"]
@@ -60,9 +60,9 @@ graph TB
| Component | Specification |
|-----------|---------------|
| Model | Dell PowerEdge R730 |
-| CPU | 2x Intel Xeon E5-2699 v4 (22 cores / 44 threads each) |
-| Total Cores/Threads | 44 cores / 88 threads |
-| RAM | 142GB DDR4 ECC |
+| CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) |
+| Total Cores/Threads | 22 cores / 44 threads |
+| RAM | 272GB DDR4-2400 ECC RDIMM (10 DIMMs: 8x32G Samsung + 2x8G Hynix) |
| GPU | NVIDIA Tesla T4 (16GB GDDR6, PCIe 0000:06:00.0) |
| Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD |
| Hypervisor | Proxmox VE |
@@ -71,13 +71,13 @@ graph TB
| VM | VMID | vCPUs | RAM | Network | Role | Taints |
|----|------|-------|-----|---------|------|--------|
-| k8s-master | 200 | 8 | 8GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | `node-role.kubernetes.io/control-plane:NoSchedule` |
-| k8s-node1 | 201 | 16 | 16GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:NoSchedule` |
+| k8s-master | 200 | 8 | 16GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | `node-role.kubernetes.io/control-plane:NoSchedule` |
+| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:NoSchedule` |
| k8s-node2 | 202 | 8 | 24GB | vmbr1:vlan20 | Worker | None |
| k8s-node3 | 203 | 8 | 24GB | vmbr1:vlan20 | Worker | None |
| k8s-node4 | 204 | 8 | 24GB | vmbr1:vlan20 | Worker | None |
-**Total Cluster Resources**: 48 vCPUs, 104GB RAM (excluding control plane)
+**Total Cluster Resources**: 48 vCPUs, 120GB RAM (excluding control plane)
### GPU Passthrough
@@ -443,7 +443,7 @@ spec:
**Rationale**:
- **CFS Throttling**: Linux Completely Fair Scheduler throttles containers to their exact CPU limit, even when CPU is idle. This causes artificial performance degradation.
- **Burstability**: Services can burst to unused CPU during low-load periods, improving response times.
-- **Memory-bound**: With 142GB RAM across 48 vCPUs, memory exhaustion occurs before CPU saturation. Memory is the constraining resource.
+- **Memory-bound**: With 272GB host RAM (180GB allocated to VMs), memory is no longer the primary constraint. 92GB headroom available for new VMs.
**Tradeoff**: A runaway process could monopolize CPU. Mitigated by CPU requests reserving capacity and PriorityClass preemption.
@@ -492,7 +492,7 @@ spec:
- **Growth Buffer**: Services grow over time (more users, more data). Headroom delays the need for manual intervention.
- **GPU Volatility**: GPU workloads (ML inference) have unpredictable memory usage. 30% headroom reduces OOMKills.
-**Tradeoff**: Slightly higher memory allocation. Accepted because 142GB RAM provides ample capacity.
+**Tradeoff**: Slightly higher memory allocation. Accepted because 272GB RAM provides ample capacity.
## Troubleshooting
diff --git a/docs/runbooks/r730-ram-upgrade-272gb.md b/docs/runbooks/r730-ram-upgrade-272gb.md
new file mode 100644
index 00000000..bd0f9a33
--- /dev/null
+++ b/docs/runbooks/r730-ram-upgrade-272gb.md
@@ -0,0 +1,188 @@
+# RAM Upgrade — Dell R730 Proxmox Host (Completed 2026-04-01)
+
+**Host**: Dell R730 @ 192.168.1.127 (Proxmox)
+**CPU**: Single Xeon E5-2699 v4 (CPU2 unpopulated — B-side slots unavailable)
+**Before**: 144 GB (4x32G Samsung BB1 + 2x8G SK Hynix) @ 2400 MHz
+**After**: 272 GB (4x32G Samsung BB1 + 4x32G Samsung CB1 + 2x8G SK Hynix) @ 2400 MHz
+
+## Lessons Learned
+
+1. **3 DPC downclock**: Adding DIMMs to the 3rd slot per channel (A11/A12) caused automatic downclocking to 1866 MHz. Dell R730 BIOS allows manual override back to 2400 MHz via **System BIOS > Memory Settings > Memory Frequency > Max Performance**.
+2. **MySQL InnoDB Cluster CR recreation**: Deleting and recreating the InnoDBCluster CR generates new admin secrets that don't match the existing data on PVCs. Fix: manually create the new admin user in MySQL and configure GR recovery channel credentials.
+3. **CNPG primary label**: After restarting the CNPG operator, it may not immediately label the primary pod with `role=primary`. Deleting the pod forces the operator to recreate it with the correct labels.
+4. **LimitRange blocks MySQL**: The `dbaas` namespace LimitRange (4Gi max) blocks MySQL pods that need 5Gi. Kyverno policy resets LimitRange patches. Fix: reduce MySQL memory limit in CR to 4Gi.
+
+## Physical DIMM Slot Map (looking down at motherboard, front of server at bottom)
+
+```
+╔══════════════════════════════════════════════════════════════════════════════╗
+║ CPU1 DIMM SLOTS ║
+║ ║
+║ ┌─── WHITE (1st per channel) ───┐ ║
+║ │ │ ║
+║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
+║ │ │ A1 │ │ A2 │ │ A3 │ │ A4 │ ║
+║ │ │ 32G │ │ 32G │ │ 32G │ │ 32G │ ◄── KEEP (existing Samsung 32G) ║
+║ │ │██████│ │██████│ │██████│ │██████│ ║
+║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
+║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
+║ └────────────────────────────────┘ ║
+║ ║
+║ ┌─── BLACK (2nd per channel) ───┐ ║
+║ │ │ ║
+║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
+║ │ │ A5 │ │ A6 │ │ A7 │ │ A8 │ ║
+║ │ │ 32G │ │ 32G │ │ 32G │ │ 32G │ ◄── INSTALL NEW 32G Samsung ║
+║ │ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ (remove old 8G from A5/A6) ║
+║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
+║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
+║ └────────────────────────────────┘ ║
+║ ║
+║ ┌─── GREEN (3rd per channel) ───┐ ║
+║ │ │ ║
+║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
+║ │ │ A9 │ │ A10 │ │ A11 │ │ A12 │ ║
+║ │ │ │ │ │ │ 8G │ │ 8G │ ◄── MOVE old 8G Hynix here ║
+║ │ │ empty│ │ empty│ │░░░░░░│ │░░░░░░│ (from A5 → A11, A6 → A12) ║
+║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
+║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
+║ └────────────────────────────────┘ ║
+║ ║
+║ Legend: ██ = existing 32G (keep in place) ║
+║ ▓▓ = NEW 32G Samsung M393A4K40BB1-CRC (install) ║
+║ ░░ = relocated 8G SK Hynix HMA81GR7AFR8N-UH (moved from A5/A6) ║
+║ ║
+╚══════════════════════════════════════════════════════════════════════════════╝
+```
+
+## Channel Summary After Install
+
+```
+Channel 0: A1 [32G] ──── A5 [32G] ──── A9 [ ] = 64 GB ✓ matched
+Channel 1: A2 [32G] ──── A6 [32G] ──── A10[ ] = 64 GB ✓ matched
+Channel 2: A3 [32G] ──── A7 [32G] ──── A11[ 8G ] = 72 GB ~ +8G bonus
+Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB ~ +8G bonus
+ ───────── ───────── ──────────
+ WHITE BLACK GREEN TOTAL: 272 GB
+ (keep) (new 32G) (moved 8G)
+```
+
+**Performance**: ~1-2% bandwidth penalty on Ch2/Ch3 due to mixed DIMM sizes. Ch0/Ch1 fully matched.
+
+## Shutdown Sequence
+
+### Phase 0: Gracefully Stop Stateful Services
+
+Scale down databases, caches, and secrets engines before draining nodes to ensure clean shutdown with no data loss.
+
+```bash
+export KUBECONFIG=/path/to/config
+
+# 1. Vault — seal all instances (flushes WAL, closes connections)
+kubectl -n vault exec vault-0 -- vault operator step-down 2>/dev/null
+kubectl -n vault exec vault-0 -- vault operator seal
+kubectl -n vault exec vault-1 -- vault operator seal
+kubectl -n vault exec vault-2 -- vault operator seal
+
+# 2. MySQL InnoDB Cluster — set super_read_only, scale router to 0
+kubectl -n dbaas scale deploy mysql-cluster-router --replicas=0
+kubectl -n dbaas exec mysql-cluster-0 -- mysql -e "SET GLOBAL innodb_fast_shutdown=0; SET GLOBAL super_read_only=ON;"
+kubectl -n dbaas exec mysql-cluster-1 -- mysql -e "SET GLOBAL innodb_fast_shutdown=0; SET GLOBAL super_read_only=ON;"
+kubectl -n dbaas exec mysql-cluster-2 -- mysql -e "SET GLOBAL innodb_fast_shutdown=0; SET GLOBAL super_read_only=ON;"
+# innodb_fast_shutdown=0 forces full purge + change buffer merge on stop
+
+# 3. PostgreSQL CNPG — trigger checkpoint on primaries
+kubectl -n dbaas exec pg-cluster-2 -- psql -U postgres -c "CHECKPOINT;"
+kubectl -n dbaas exec pg-cluster-4 -- psql -U postgres -c "CHECKPOINT;"
+kubectl -n immich exec deploy/immich-postgresql -- psql -U postgres -c "CHECKPOINT;"
+
+# 4. Redis — trigger BGSAVE then scale down
+kubectl -n redis exec redis-node-0 -- redis-cli BGSAVE
+kubectl -n redis exec redis-node-1 -- redis-cli BGSAVE
+sleep 5 # wait for RDB flush
+kubectl -n redis scale deploy redis-haproxy --replicas=0
+
+# 5. ClickHouse — flush
+kubectl -n rybbit exec deploy/clickhouse -- clickhouse-client --query "SYSTEM FLUSH LOGS"
+
+# 6. Scale down stateful workloads
+kubectl -n dbaas scale sts mysql-cluster --replicas=0
+kubectl -n redis scale sts redis-node --replicas=0
+kubectl -n vault scale sts vault --replicas=0
+
+# 7. Verify all stateful pods terminated
+kubectl get pods -A | grep -iE 'mysql-cluster-[0-9]|pg-cluster|redis-node|vault-[0-9]|clickhouse'
+```
+
+### Phase 1: Drain K8s Nodes
+
+```bash
+# Drain workers (reverse order)
+kubectl drain k8s-node4 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
+kubectl drain k8s-node3 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
+kubectl drain k8s-node2 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
+kubectl drain k8s-node1 --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
+
+# Cordon master
+kubectl cordon k8s-master
+```
+
+### Phase 2: Shutdown VMs (via Proxmox)
+
+```bash
+ssh root@192.168.1.127
+
+# K8s workers
+for VMID in 201 202 203 204; do qm shutdown $VMID && echo "Shutdown VMID $VMID"; done
+sleep 30
+
+# K8s master
+qm shutdown 200; sleep 15
+
+# Docker registry
+qm shutdown 220; sleep 10
+
+# Secondary VMs
+for VMID in 102 300 103; do qm shutdown $VMID; done
+sleep 20
+
+# TrueNAS (wait for ZFS flush)
+qm shutdown 9000; sleep 60
+
+# pfSense (last — network gateway)
+qm shutdown 101; sleep 15
+
+# Verify all VMs stopped
+qm list
+```
+
+### Phase 3: Shutdown Proxmox Host
+
+```bash
+shutdown -h now
+```
+
+## Physical RAM Installation
+
+| Step | Action | Slot(s) | DIMM |
+|------|--------|---------|------|
+| 1 | Power off host | — | Completed via Phase 3 above |
+| 2 | **Remove** | A5 (black clip) | Take out 8G Hynix, set aside |
+| 3 | **Remove** | A6 (black clip) | Take out 8G Hynix, set aside |
+| 4 | **Install NEW** | A5 (black clip) | Insert 32G Samsung |
+| 5 | **Install NEW** | A6 (black clip) | Insert 32G Samsung |
+| 6 | **Install NEW** | A7 (black clip) | Insert 32G Samsung |
+| 7 | **Install NEW** | A8 (black clip) | Insert 32G Samsung |
+| 8 | **Install MOVED** | A11 (green clip) | Insert 8G Hynix (was in A5) |
+| 9 | **Install MOVED** | A12 (green clip) | Insert 8G Hynix (was in A6) |
+| 10 | Power on | — | — |
+
+## Post-Boot Verification
+
+```bash
+# Verify all 10 DIMMs detected
+ssh root@192.168.1.127 'dmidecode -t memory | grep -E "Locator:|Size:" | grep -v Bank'
+
+# Verify total RAM (~268 GiB usable)
+ssh root@192.168.1.127 'free -h'
+```