add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00
# Infrastructure Overview
## Overview
This homelab infrastructure runs a production-grade Kubernetes cluster on Proxmox, hosting 70+ services including web applications, databases, monitoring, security, and GPU-accelerated workloads. The entire infrastructure is managed declaratively using Terraform and Terragrunt, with automated CI/CD pipelines for continuous deployment. Services are organized into a five-tier system for resource isolation and priority-based scheduling.
## Architecture Diagram
```mermaid
graph TB
subgraph Physical["Physical Hardware"]
R730["Dell R730<br/>22c/44t Xeon E5-2699 v4<br/>~160GB RAM<br/>NVIDIA Tesla T4<br/>1.1TB + 931GB + 10.7TB"]
end
subgraph Proxmox["Proxmox VE"]
direction LR
PF["pfSense<br/>101"]
DEV["devvm<br/>102"]
HA["home-assistant<br/>103"]
MASTER["k8s-master<br/>200"]
NODE1["k8s-node1<br/>201<br/>(GPU)"]
NODE2["k8s-node2<br/>202"]
NODE3["k8s-node3<br/>203"]
NODE4["k8s-node4<br/>204"]
REG["docker-registry<br/>220"]
end
subgraph Network["Network Bridges"]
VMBR0["vmbr0<br/>192.168.1.0/24<br/>Physical"]
VMBR1_10["vmbr1:vlan10<br/>10.0.10.0/24<br/>Management"]
VMBR1_20["vmbr1:vlan20<br/>10.0.20.0/24<br/>Kubernetes"]
end
subgraph K8s["Kubernetes Cluster v1.34.2"]
direction TB
TIER0["Tier 0: Core<br/>traefik, authentik, vault"]
TIER1["Tier 1: Cluster<br/>prometheus, grafana, loki"]
TIER2["Tier 2: GPU<br/>ollama, comfyui"]
TIER3["Tier 3: Edge<br/>cloudflared, headscale"]
TIER4["Tier 4: Auxiliary<br/>vaultwarden, immich"]
end
R730 --> Proxmox
PF --> VMBR0
PF --> VMBR1_10
PF --> VMBR1_20
HA --> VMBR0
DEV --> VMBR1_10
MASTER --> VMBR1_20
NODE1 --> VMBR1_20
NODE2 --> VMBR1_20
NODE3 --> VMBR1_20
NODE4 --> VMBR1_20
REG --> VMBR1_20
VMBR1_20 --> K8s
```
## Components
### Hardware
| Component | Specification |
|-----------|---------------|
| Server | Dell PowerEdge R730 |
| CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) |
| RAM | ~160GB DDR4 ECC |
| GPU | NVIDIA Tesla T4 (16GB, PCIe 0000:06:00.0) |
| Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD |
| Network | eno1 (physical), vmbr0 (physical bridge), vmbr1 (VLAN-aware internal) |
### Network Topology
| Network | VLAN | CIDR | Purpose |
|---------|------|------|---------|
| Physical | - | 192.168.1.0/24 | Physical devices, Proxmox host (192.168.1.127) |
| Management | 10 | 10.0.10.0/24 | Infrastructure VMs, devvm |
| Kubernetes | 20 | 10.0.20.0/24 | K8s cluster nodes and services |
### Virtual Machine Inventory
| VMID | Name | CPUs | RAM | Network | IP Address | Notes |
|------|------|------|-----|---------|------------|-------|
| 101 | pfsense | 8 | 16GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | - | Gateway/firewall routing between VLANs |
| 102 | devvm | 16 | 8GB | vmbr1:vlan10 | - | Development VM |
| 103 | home-assistant | 8 | 8GB | vmbr0 | - | Home Assistant Sofia instance |
| 200 | k8s-master | 8 | 32GB | vmbr1:vlan20 | 10.0.20.100 | Kubernetes control plane |
| 201 | k8s-node1 | 16 | 32GB | vmbr1:vlan20 | - | GPU worker node (Tesla T4 passthrough) |
| 202 | k8s-node2 | 8 | 32GB | vmbr1:vlan20 | - | Worker node |
| 203 | k8s-node3 | 8 | 32GB | vmbr1:vlan20 | - | Worker node |
| 204 | k8s-node4 | 8 | 32GB | vmbr1:vlan20 | - | Worker node |
| 220 | docker-registry | 4 | 4GB | vmbr1:vlan20 | 10.0.20.10 | Private Docker registry |
| ~~9000~~ | ~~truenas~~ | — | — | — | ~~10.0.10.15~~ | **DECOMMISSIONED 2026-04-13** — NFS now served by Proxmox host (192.168.1.127). VM still exists in stopped state on PVE pending user decision on deletion. |
### Kubernetes Cluster
| Component | Details |
|-----------|---------|
| Version | v1.34.2 |
| Nodes | 5 (1 control plane, 4 workers) |
| CNI | Calico |
| Storage | NFS (Proxmox host, nfs-csi) + Proxmox-LVM (Proxmox CSI) |
| Ingress | Traefik v3 |
| Total Services | 70+ services across 5 tiers |
### Service Tier System
The cluster uses a five-tier namespace system managed by Kyverno, which automatically generates LimitRange and ResourceQuota policies per tier:
| Tier | Namespace Pattern | Purpose | Priority Class |
|------|-------------------|---------|----------------|
| 0-core | `0-core-*` | Critical infrastructure (traefik, authentik, vault) | 900000 |
| 1-cluster | `1-cluster-*` | Cluster services (prometheus, grafana, kyverno) | 700000 |
| 2-gpu | `2-gpu-*` | GPU workloads (ollama, comfyui, stable-diffusion) | 500000 |
| 3-edge | `3-edge-*` | Edge services (cloudflared, headscale, technitium) | 300000 |
| 4-aux | `4-aux-*` | Auxiliary apps (vaultwarden, immich, freshrss) | 200000 |
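The prefix-to-tier mapping in the table above can be sketched as a small lookup. This is only an illustration of the naming convention, not the Kyverno policy itself; the helper names and the example namespaces (`0-core-vault`, `4-aux-freshrss`) are hypothetical.

```python
from typing import Optional

# Priority class values copied from the tier table above.
TIER_PRIORITY = {
    "0-core": 900000,
    "1-cluster": 700000,
    "2-gpu": 500000,
    "3-edge": 300000,
    "4-aux": 200000,
}


def tier_for_namespace(namespace: str) -> Optional[str]:
    """Return the tier whose `<tier>-` prefix matches the namespace, else None."""
    for tier in TIER_PRIORITY:
        if namespace.startswith(tier + "-"):
            return tier
    return None


def priority_for_namespace(namespace: str) -> Optional[int]:
    """Return the priority value for a namespace, or None if it has no tier prefix."""
    tier = tier_for_namespace(namespace)
    return None if tier is None else TIER_PRIORITY[tier]
```

A namespace without a tier prefix (for example `default`) returns `None`, which mirrors the tradeoff that the naming convention is a hard requirement for the automation to apply.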
## How It Works
### Physical Layer
The infrastructure runs on a single Dell R730 server with a Xeon E5-2699 v4 CPU and ~160GB RAM. Proxmox VE provides hypervisor capabilities with hardware passthrough support for the Tesla T4 GPU. The physical network interface (eno1) bridges to vmbr0 for physical network access, while vmbr1 provides VLAN-aware internal networking.
### Network Layer
pfSense (VMID 101) acts as the central gateway and firewall, routing traffic between:
- Physical network (192.168.1.0/24) via vmbr0
- Management VLAN 10 (10.0.10.0/24) via vmbr1:vlan10
- Kubernetes VLAN 20 (10.0.20.0/24) via vmbr1:vlan20
This three-tier network design isolates Kubernetes workloads from management infrastructure and provides controlled access to the physical network.
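As a quick way to reason about which segment an address belongs to, the three networks can be checked with Python's standard `ipaddress` module. The CIDRs come from the list above; the function name `network_for` is a hypothetical helper, not part of the repository.

```python
import ipaddress
from typing import Optional

# CIDRs taken from the Network Topology table.
NETWORKS = {
    "Physical": ipaddress.ip_network("192.168.1.0/24"),
    "Management (VLAN 10)": ipaddress.ip_network("10.0.10.0/24"),
    "Kubernetes (VLAN 20)": ipaddress.ip_network("10.0.20.0/24"),
}


def network_for(ip: str) -> Optional[str]:
    """Return the name of the network containing `ip`, or None if it is external."""
    addr = ipaddress.ip_address(ip)
    for name, net in NETWORKS.items():
        if addr in net:
            return name
    return None
```

For example, the k8s-master address `10.0.20.100` resolves to the Kubernetes VLAN, while the Proxmox host `192.168.1.127` resolves to the physical network; traffic between the two has to cross pfSense.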
### Compute Layer
The Kubernetes cluster consists of 5 nodes:
- **k8s-master (200)**: 8c/32GB control plane running kube-apiserver, etcd, controller-manager
- **k8s-node1 (201)**: 16c/32GB GPU node with Tesla T4 passthrough, tainted `nvidia.com/gpu=true:PreferNoSchedule` so that non-GPU workloads prefer other nodes
- **k8s-node2-4 (202-204)**: 8c/32GB workers running general-purpose workloads
GPU passthrough on node1 uses PCIe device 0000:06:00.0. The NVIDIA GPU Operator's gpu-feature-discovery auto-labels whichever node carries the card with `nvidia.com/gpu.present=true`; `null_resource.gpu_node_config` taints the same set of nodes with `nvidia.com/gpu=true:PreferNoSchedule`. No hostname is hardcoded, so moving the card to a different node requires no Terraform edits.
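The label and taint described above translate into two pod-spec fields. The sketch below builds them as a plain dict and checks the toleration against the taint using the standard Kubernetes matching rules. The function names are hypothetical, and the real stacks set these fields through Terraform rather than Python. Because the taint is `PreferNoSchedule`, the toleration is not strictly required for scheduling; it is shown for completeness.

```python
GPU_NODE_LABEL = "nvidia.com/gpu.present"
# Taint applied by null_resource.gpu_node_config, as described above.
GPU_TAINT = {"key": "nvidia.com/gpu", "value": "true", "effect": "PreferNoSchedule"}


def gpu_scheduling_fields() -> dict:
    """Pod-spec fields that make a workload follow the GPU card (illustrative)."""
    return {
        "nodeSelector": {GPU_NODE_LABEL: "true"},
        "tolerations": [
            {
                "key": GPU_TAINT["key"],
                "operator": "Equal",
                "value": GPU_TAINT["value"],
                "effect": GPU_TAINT["effect"],
            }
        ],
    }


def tolerates(toleration: dict, taint: dict) -> bool:
    """Apply the Kubernetes toleration/taint matching rules to one pair."""
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if key:
        if key != taint["key"]:
            return False
    elif operator != "Exists":
        return False  # an empty key is only valid together with Exists
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False  # an empty effect matches every effect
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")
```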
### Service Organization
Services are organized into 70+ individual Terraform stacks under `stacks/<service>/`. Each service belongs to a tier, which determines:
- Resource limits and quotas
- Scheduling priority (a lower tier number means higher priority, so tier 0 preempts tier 4)
- Default container resources
- QoS class (Guaranteed for tiers 0-2, Burstable for 3-4)
Kyverno policies automatically inject namespace labels, LimitRange, ResourceQuota, and PriorityClass based on the namespace tier prefix.
### Key Services
**Critical Services (Tier 0-1)**:
- **Traefik**: Ingress controller with automatic HTTPS (Let's Encrypt)
- **Authentik**: SSO/OIDC provider for all services
- **Vault**: Secrets management with auto-unseal
- **Cloudflared**: Cloudflare Tunnel for external access
- **Technitium**: Internal DNS server
- **Headscale**: Tailscale-compatible mesh VPN control plane
**Storage & Security**:
- **Proxmox NFS**: NFS storage served directly from Proxmox host (192.168.1.127) at `/srv/nfs` (HDD) and `/srv/nfs-ssd` (SSD)
- **Proxmox CSI**: Block storage via LVM-thin hotplug for databases
- **Vaultwarden**: Password manager
- **Immich**: Photo management
- **CrowdSec**: IPS/IDS with community threat intelligence
- **Kyverno**: Policy engine for admission control
**Monitoring & Observability**:
- **Prometheus**: Metrics collection
- **Grafana**: Visualization and dashboards
- **Loki**: Log aggregation
- **Alertmanager**: Alert routing
**Application Services**: Woodpecker CI, Gitea, PostgreSQL, MySQL, Redis, Ollama, ComfyUI, Stable Diffusion, FreshRSS, and 50+ more services.
## Configuration
### Key Files
| Path | Purpose |
|------|---------|
| `stacks/<service>/terragrunt.hcl` | Individual service configuration |
| `modules/k8s_app/` | Reusable Kubernetes app module |
| `modules/helm_app/` | Helm chart deployment module |
| `base.hcl` | Global Terragrunt configuration |
| `terraform.tfvars` | Global variables (git-ignored) |
### Terraform Organization
Each service lives in `stacks/<service>/` with its own Terragrunt configuration. Common patterns:
- Helm deployments use `modules/helm_app/`
- Custom manifests use `modules/k8s_app/`
- Databases use dedicated modules (`modules/postgres_app/`, `modules/mysql_app/`)
- Shared dependencies via `dependency` blocks in terragrunt.hcl
### Vault Paths
Secrets are stored in HashiCorp Vault under `secret/`:
- `secret/<service>/*` - Service-specific secrets
- `secret/cloudflare` - Cloudflare API tokens
- `secret/authentik` - OIDC client credentials
- `secret/backup` - Backup encryption keys
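When reading these paths over the HTTP API instead of the CLI, the path shape changes if the mount is a KV version 2 engine. The sketch below assumes `secret/` is a KV v2 mount, which this document does not state; the helper name `kv2_api_path` is hypothetical.

```python
def kv2_api_path(cli_path: str, mount: str = "secret") -> str:
    """Map a CLI-style path such as `secret/cloudflare` to its KV v2 read endpoint.

    Assumes a KV version 2 mount, where data lives under `<mount>/data/<path>`.
    """
    prefix = mount.rstrip("/") + "/"
    if not cli_path.startswith(prefix):
        raise ValueError(f"{cli_path!r} is not under the {prefix!r} mount")
    return f"/v1/{mount.rstrip('/')}/data/{cli_path[len(prefix):]}"
```

If the mount turns out to be KV version 1, the `data/` segment does not apply and the CLI path maps directly to `/v1/<path>`.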
## Decisions & Rationale
### Why Proxmox over bare-metal Kubernetes?
**Decision**: Run Kubernetes inside Proxmox VMs rather than directly on bare metal.
**Rationale**:
- **Flexibility**: Easy to snapshot, clone, and roll back VMs during upgrades
- **Isolation**: Management network (devvm) separated from Kubernetes
- **GPU passthrough**: Can dedicate GPU to a single node without tainting the entire host
- **Multi-purpose**: Same physical host can run non-K8s VMs (pfSense, Home Assistant)
**Tradeoff**: Slight performance overhead from virtualization (acceptable for homelab).
### Why five-tier namespace system?
**Decision**: Organize services into 5 tiers with automatic LimitRange/ResourceQuota via Kyverno.
**Rationale**:
- **Predictable scheduling**: Critical services (tier 0) always preempt auxiliary services (tier 4)
- **Resource protection**: Prevents a single service from consuming all cluster resources
- **Clear priorities**: Tier prefix makes service criticality obvious
- **Automation**: Kyverno auto-generates policies, reducing manual configuration
**Tradeoff**: Adds namespace naming convention requirement.
### Why no CPU limits cluster-wide?
**Decision**: Set CPU requests but no CPU limits on containers.
**Rationale**:
- **CFS throttling**: Linux CFS throttles containers to exact CPU limit even when CPU is idle, causing artificial slowdowns
- **Burstability**: Services can burst to unused CPU during idle periods
- **Memory is the constraint**: With ~160GB RAM across VMs, memory exhaustion occurs before CPU saturation
**Tradeoff**: A runaway process could monopolize CPU (mitigated by CPU requests reserving capacity).
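The throttling argument can be made concrete with a little arithmetic. A CPU limit is enforced as a CFS quota per scheduling period (100 ms by default), and parallel threads spend that quota faster than wall-clock time. The sketch below is a simplified model that assumes every thread is runnable for the whole period; the function names are hypothetical.

```python
CFS_PERIOD_US = 100_000  # default CFS period: 100 ms


def cfs_quota_us(cpu_limit_cores: float, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time (in microseconds) a container may use per period under a limit."""
    return int(round(cpu_limit_cores * period_us))


def throttled_us_per_period(
    cpu_limit_cores: float, busy_threads: int, period_us: int = CFS_PERIOD_US
) -> int:
    """Wall-clock time per period spent throttled, assuming all threads stay busy."""
    quota = cfs_quota_us(cpu_limit_cores, period_us)
    # N busy threads consume the quota N times faster than wall-clock time.
    running_wall_us = min(period_us, quota / busy_threads)
    return int(round(period_us - running_wall_us))
```

Under this model, a container with a 1-CPU limit and four busy threads runs for 25 ms and then sits throttled for 75 ms of every period, regardless of how idle the node is. With requests only, the same container would simply use the idle cores.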
### Why Goldilocks in Initial mode, not Auto?
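**Decision**: Run VPA Goldilocks in "Initial" mode (recommendations take effect only when a pod is created; running pods are never evicted) instead of "Auto" (running pods are evicted and resized).
**Rationale**:
- **Terraform conflicts**: Auto mode evicts pods and rewrites their resource requests, so live values drift from what Terraform declares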
- **Controlled changes**: Recommendations are reviewed and applied via Terraform, maintaining declarative workflow
- **Quarterly review**: Right-sizing happens deliberately every quarter, not continuously
**Tradeoff**: Requires manual review of VPA recommendations.
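For reference, the update mode is a single field on the VerticalPodAutoscaler object. The sketch below builds such a manifest as a dict and rejects unknown modes. Goldilocks normally creates these objects itself, so the builder function, the target name `freshrss`, and the namespace `4-aux-freshrss` are hypothetical illustrations only.

```python
# Update modes defined by the upstream VerticalPodAutoscaler API.
VPA_UPDATE_MODES = ("Off", "Initial", "Recreate", "Auto")


def vpa_manifest(name: str, namespace: str, update_mode: str = "Initial") -> dict:
    """Build a VerticalPodAutoscaler manifest targeting a Deployment of the same name."""
    if update_mode not in VPA_UPDATE_MODES:
        raise ValueError(f"unknown updateMode: {update_mode!r}")
    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "targetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": name},
            "updatePolicy": {"updateMode": update_mode},
        },
    }
```

In upstream VPA, `Off` only publishes recommendations, while `Initial` also applies them when a pod is created. Neither mode evicts running pods, which is the property this decision relies on.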
## Troubleshooting
### Pods stuck in Pending state
**Symptom**: Pod shows `status: Pending` with event `FailedScheduling`.
**Diagnosis**:
```bash
kubectl describe pod <pod-name> -n <namespace>
# Check events for:
# - "Insufficient memory" → ResourceQuota exceeded
# - "0/5 nodes available: 5 Insufficient memory" → LimitRange default too high
# - "0/5 nodes available: 1 node(s) had untolerated taint" → GPU taint
```
**Fix**:
- ResourceQuota exceeded: Increase quota in `modules/namespace_config/` for that tier
- LimitRange too high: Override pod resources in Terraform
- GPU taint: Add `tolerations` and `nodeSelector` for GPU pods
### OOMKilled pods
**Symptom**: Pod shows `status: OOMKilled` in events.
**Diagnosis**:
```bash
kubectl describe pod <pod-name> -n <namespace>
# Check LimitRange defaults:
kubectl get limitrange -n <namespace> -o yaml
```
**Fix**:
- If pod uses LimitRange default (256Mi or 512Mi): Set explicit memory request/limit in Terraform
- If pod has explicit limit: Increase memory based on Goldilocks VPA recommendation (upperBound × 1.2)
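The "upperBound × 1.2" rule can be sketched as a small calculation. The version below uses integer arithmetic to avoid floating-point rounding surprises and rounds up to the next whole Mi. The function names are hypothetical, and only the `Mi` and `Gi` suffixes are handled.

```python
UNIT_TO_MI = {"Mi": 1, "Gi": 1024}


def to_mi(quantity: str) -> int:
    """Parse a whole-number Kubernetes memory quantity such as `400Mi` or `1Gi`."""
    for suffix, factor in UNIT_TO_MI.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported quantity: {quantity!r}")


def limit_from_upper_bound(upper_bound: str) -> str:
    """Return upperBound * 1.2, rounded up to a whole Mi."""
    mi = to_mi(upper_bound)
    # ceil(mi * 12 / 10) using integers only
    return f"{(mi * 12 + 9) // 10}Mi"
```

For instance, an upperBound of `400Mi` yields a limit of `480Mi`. The resulting value is then set explicitly in the service's Terraform stack.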
### Democratic-CSI sidecars consuming excessive memory
**Symptom**: Pods with PVCs have 3-4 sidecar containers each using 256Mi (LimitRange default).
**Diagnosis**:
```bash
# Print each pod once if any of its containers has "csi" in its name
kubectl get pods -A -o json | jq -r '.items[] | select(any(.spec.containers[]; .name | contains("csi"))) | .metadata.name'
```
**Fix**: Democratic-CSI sidecars need explicit resources (32-80Mi each). Update Terraform to override sidecar resources.
### Tier 3-4 pods evicted during resource pressure
**Symptom**: Lower-tier pods show `status: Evicted` with reason `The node was low on resource: memory`.
**Diagnosis**: This is expected behavior. Tier 3-4 pods use Burstable QoS (request < limit) and priority 200K-300K, making them the first candidates for eviction.
**Fix**:
- Increase node memory if evictions are frequent
- Promote critical services to higher tier
- Reduce memory limits on tier 4 services
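To see why low-tier Burstable pods go first, it helps to model how the kubelet ranks pods under memory pressure. It first considers whether usage exceeds the request, then pod priority, then how far usage is above the request. The sketch below is a simplified model of that ordering; the class, the pod names, and all the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PodMemory:
    name: str
    priority: int    # PriorityClass value from the tier table
    request_mi: int  # memory request
    usage_mi: int    # current memory usage


def eviction_order(pods: List[PodMemory]) -> List[str]:
    """Return pod names in the order a simplified kubelet model would evict them."""

    def rank(pod: PodMemory):
        over_request = pod.usage_mi - pod.request_mi
        return (
            0 if over_request > 0 else 1,  # pods above their request go first
            pod.priority,                  # then lower priority first
            -over_request,                 # then the largest overage first
        )

    return [pod.name for pod in sorted(pods, key=rank)]
```

In this model a tier 0 pod that stays within its memory request is ranked last, while tier 4 pods bursting above their requests are ranked first. This is why setting requests close to real usage matters for any service that should survive memory pressure.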
## Related
- [Compute & Resource Management](compute.md) - Detailed resource management patterns
- [Multi-tenancy](multi-tenancy.md) - Namespace isolation and tier system
- [Monitoring](monitoring.md) - Resource usage dashboards
- [Runbooks: Node Maintenance](../../runbooks/node-maintenance.md)
- [Runbooks: Service Onboarding](../../runbooks/service-onboarding.md)