All of Anca's photos are imported. The Job was declared as
kubernetes_job_v1.anca_elements_import — meaning every `terragrunt apply` of
the immich stack re-created it, despite the 2026-05-25 in-code comment saying
"After successful completion: REMOVE this resource block + apply again."
Nobody noticed for 22 days; the re-trigger today (2026-06-16) was the
6th IO-pressure incident — it scanned all 21,643 assets in pure read-scan
mode for 51 min, saturated sdc, starved etcd, crash-looped kube-apiserver.
Recovery actions taken before this commit:
- Throttled nfsd 64→8 on PVE host to give apiserver headroom
- `kubectl delete job -n immich anca-elements-import` + force-delete pod
- Restored nfsd to 64; cluster healthy
Code change here:
- Remove `kubernetes_job_v1.anca_elements_import` block
- Remove `module.nfs_anca_elements_host` (PVC `immich-anca-elements-host` —
no live consumer; videos batch deferred per user, source dump remains on
PVE at /srv/nfs/anca-elements, browseable via Nextcloud admin)
- Update 2026-05-25 post-mortem: 6th-incident section + new lesson that
one-shot Jobs do NOT belong in kubernetes_job_v1 (use a suspended CronJob
or a runbook-captured `kubectl create job` ad-hoc invocation instead).
|
||
|---|---|---|
| .. | ||
| adr | ||
| architecture | ||
| benchmarks | ||
| plans | ||
| post-mortems | ||
| runbooks | ||
| known-issues.md | ||
| README.md | ||
Infrastructure Documentation
This repository contains the configuration and documentation for a homelab Kubernetes cluster running on Proxmox. The infrastructure hosts 70+ services managed declaratively with Terraform and Terragrunt.
Quick Reference
Network Ranges
- Physical Network:
192.168.1.0/24- Physical devices and host network - Management VLAN 10:
10.0.10.0/24- Infrastructure VMs and management - Kubernetes VLAN 20:
10.0.20.0/24- Kubernetes cluster network
Key URLs
- Public:
viktorbarzin.me - Internal:
viktorbarzin.lan
Architecture Documentation
| Document | Description |
|---|---|
| Overview | Infrastructure overview, hardware specs, VM inventory, and service catalog |
| Networking | Network topology, VLANs, routing, and firewall rules |
| VPN | Headscale mesh VPN and Cloudflare Tunnel configuration |
| Storage | Proxmox host NFS, Proxmox CSI (LVM-thin + LUKS2), and persistent volume management |
| Authentication | Authentik SSO, OIDC flows, and service integration |
| Security | CrowdSec IPS, Kyverno policies, and security controls |
| Monitoring | Prometheus, Grafana, Loki, and observability stack |
| Secrets Management | HashiCorp Vault integration and secret rotation |
| CI/CD | Woodpecker CI pipeline and deployment automation |
| Backup & DR | Backup strategy, disaster recovery, and restore procedures |
| Compute | Proxmox VMs, GPU passthrough, K8s resource management, and VPA |
| Databases | PostgreSQL, MySQL, Redis, and database operators |
| Multi-tenancy | Namespace isolation, tier system, and resource quotas |
Operations
- Runbooks - Step-by-step operational procedures
- Plans - Infrastructure change plans and rollout strategies
Getting Started
- Review the Overview for a high-level understanding
- Read the Networking doc to understand connectivity
- Check Compute for resource management patterns
- Explore individual architecture docs based on your area of interest