k8s-node1 was upgraded to Ubuntu 26.04 (kernel 7.0.0-15-generic) at some
point. NVIDIA has NOT published ubuntu26.04 driver images yet:
`skopeo list-tags docker://nvcr.io/nvidia/driver` returned 0 ubuntu26.04
tags, versus 779 for ubuntu22.04 and 206 for ubuntu24.04.
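
The tag check can be reproduced with a small script. The following is a minimal Python sketch, assuming driver tags end in an `-ubuntuXX.YY` suffix and that `skopeo list-tags` prints JSON with a `Tags` array. The helper names `count_tags_by_os` and `list_driver_tags` and the sample tags are illustrative and not part of this repository; its counts may differ from the figures above if those were obtained with a different filter.

```python
import json
import re
import subprocess
from collections import Counter

# Assumption: OS-specific tags end in a suffix such as "-ubuntu24.04".
_OS_SUFFIX = re.compile(r"-(ubuntu\d+\.\d+)$")


def count_tags_by_os(tags):
    """Count image tags per Ubuntu release suffix; other tags are ignored."""
    counts = Counter()
    for tag in tags:
        match = _OS_SUFFIX.search(tag)
        if match:
            counts[match.group(1)] += 1
    return counts


def list_driver_tags(repo="docker://nvcr.io/nvidia/driver"):
    """Return the tags reported by `skopeo list-tags` (needs skopeo and network access)."""
    result = subprocess.run(
        ["skopeo", "list-tags", repo],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)["Tags"]


# Hypothetical sample input, used so the sketch runs without registry access.
sample_tags = [
    "570.195.03-ubuntu22.04",
    "580.105.08-ubuntu22.04",
    "580.105.08-ubuntu24.04",
    "latest",
]
counts = count_tags_by_os(sample_tags)
for os_name in ("ubuntu22.04", "ubuntu24.04", "ubuntu26.04"):
    print(os_name, counts[os_name])  # a Counter returns 0 for missing keys
```

Against the real registry, `count_tags_by_os(list_driver_tags())` would replace the sample list.
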

Attempted fix today: bump gpu-operator chart v25.10.1 → v26.3.1 and
driver 570.195.03 → 580.105.08, and set kernelModuleType=open. The chart
applied cleanly, but the v26.3.1 operator auto-detects the host OS via
NFD labels and constructs `<version>-ubuntu26.04` image tags, which 404
on pull. Rolled back to chart v25.10.1 and pinned it explicitly here so
a future `terraform apply` doesn't hit the same trap again.
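
To illustrate why the pull fails, here is a rough Python sketch of the tag naming scheme described above. It is not the operator's actual code: `driver_image_tag` and `is_pullable` are hypothetical helpers, the label keys are the standard NFD OS-release labels and are assumed to be the ones consulted, and the published tag set is a made-up sample.

```python
# Standard NFD label keys describing the host OS release (assumed inputs).
OS_ID_LABEL = "feature.node.kubernetes.io/system-os_release.ID"
OS_VERSION_LABEL = "feature.node.kubernetes.io/system-os_release.VERSION_ID"


def driver_image_tag(driver_version, node_labels):
    """Build a `<version>-<os><release>` tag from a node's NFD labels."""
    os_id = node_labels[OS_ID_LABEL]
    os_version = node_labels[OS_VERSION_LABEL]
    return f"{driver_version}-{os_id}{os_version}"


def is_pullable(tag, published_tags):
    """A tag can only be pulled if the registry actually lists it."""
    return tag in published_tags


# Labels as they would look on a node detected as Ubuntu 26.04.
labels = {OS_ID_LABEL: "ubuntu", OS_VERSION_LABEL: "26.04"}

# Illustrative sample of published tags; no ubuntu26.04 entry exists.
published = {"570.195.03-ubuntu22.04", "570.195.03-ubuntu24.04"}

tag = driver_image_tag("570.195.03", labels)
print(tag, is_pullable(tag, published))  # 570.195.03-ubuntu26.04 False
```

Because the suffix is derived from the node labels rather than from the chart version, changing the chart alone does not change the tag that gets requested.
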

Note: chart rollback alone does NOT restore GPU functionality on
k8s-node1. The operators from both v25.10.1 and v26.3.1 now pick the
ubuntu26.04 suffix (the NFD label is sticky once detected). The actual
recovery path requires either (a) NVIDIA shipping ubuntu26.04 driver
images, or (b) rolling the host kernel back to 6.8.0-117-generic
(still installed in /boot, headers in /usr/src) plus `apt-mark hold` to
prevent re-upgrade. Option (b) needs explicit user authorization for a
node reboot, so it is left as the next action item on code-8vr0.

Files:
- stacks/nvidia/modules/nvidia/main.tf: explicit version pin,
  explanatory comment
- stacks/nvidia/modules/nvidia/values.yaml: comment block
  documenting the situation; driver pinned at 570.195.03
- docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md:
  full timeline, root causes, recovery procedure

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

| Name |
|---|
| architecture |
| benchmarks |
| plans |
| post-mortems |
| runbooks |
| README.md |

Infrastructure Documentation
This repository contains the configuration and documentation for a homelab Kubernetes cluster running on Proxmox. The infrastructure hosts 70+ services managed declaratively with Terraform and Terragrunt.
Quick Reference
Network Ranges
- Physical Network: 192.168.1.0/24 - Physical devices and host network
- Management VLAN 10: 10.0.10.0/24 - Infrastructure VMs and management
- Kubernetes VLAN 20: 10.0.20.0/24 - Kubernetes cluster network
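
As a quick illustration of how these ranges partition addresses, the sketch below uses Python's standard `ipaddress` module to map an address to one of the networks listed above. The `classify` helper and the example addresses are hypothetical and only serve the illustration.

```python
import ipaddress

# Network ranges copied from the list above.
NETWORKS = {
    "Physical Network": ipaddress.ip_network("192.168.1.0/24"),
    "Management VLAN 10": ipaddress.ip_network("10.0.10.0/24"),
    "Kubernetes VLAN 20": ipaddress.ip_network("10.0.20.0/24"),
}


def classify(address):
    """Return the name of the range containing `address`, or None if no range matches."""
    ip = ipaddress.ip_address(address)
    for name, network in NETWORKS.items():
        if ip in network:
            return name
    return None


print(classify("10.0.20.15"))   # Kubernetes VLAN 20
print(classify("192.168.1.42")) # Physical Network
print(classify("8.8.8.8"))      # None
```
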
Key URLs
- Public: viktorbarzin.me
- Internal: viktorbarzin.lan
Architecture Documentation
| Document | Description |
|---|---|
| Overview | Infrastructure overview, hardware specs, VM inventory, and service catalog |
| Networking | Network topology, VLANs, routing, and firewall rules |
| VPN | Headscale mesh VPN and Cloudflare Tunnel configuration |
| Storage | Proxmox host NFS, Proxmox CSI (LVM-thin + LUKS2), and persistent volume management |
| Authentication | Authentik SSO, OIDC flows, and service integration |
| Security | CrowdSec IPS, Kyverno policies, and security controls |
| Monitoring | Prometheus, Grafana, Loki, and observability stack |
| Secrets Management | HashiCorp Vault integration and secret rotation |
| CI/CD | Woodpecker CI pipeline and deployment automation |
| Backup & DR | Backup strategy, disaster recovery, and restore procedures |
| Compute | Proxmox VMs, GPU passthrough, K8s resource management, and VPA |
| Databases | PostgreSQL, MySQL, Redis, and database operators |
| Multi-tenancy | Namespace isolation, tier system, and resource quotas |
Operations
- Runbooks - Step-by-step operational procedures
- Plans - Infrastructure change plans and rollout strategies
Getting Started
- Review the Overview for a high-level understanding
- Read the Networking doc to understand connectivity
- Check Compute for resource management patterns
- Explore individual architecture docs based on your area of interest