AGENTS.md (63 lines): shared infra knowledge for any AI tool (Codex, Claude, Cursor). Covers: critical rules, architecture, storage, tiers, common ops. CLAUDE.md (23 lines): Claude-specific addons — skills, agents, user preferences. References AGENTS.md for shared knowledge. Removed generic agents (devops-engineer, fullstack-developer).
4.1 KiB
4.1 KiB
Infrastructure Repository — AI Agent Instructions
Critical Rules (MUST FOLLOW)
- ALL changes through Terraform/Terragrunt — NEVER
kubectl apply/edit/patch/deletefor persistent changes. Read-only kubectl is fine. - NEVER put secrets in committed files — use
terraform.tfvarsorsecrets/(git-crypt encrypted) - NEVER restart NFS on TrueNAS — causes cluster-wide mount failures across all pods
- NEVER commit secrets — triple-check before every commit
[ci skip]in commit messages when changes were already applied locally- Ask before
git push— always confirm with the user first
Execution
- Apply a service:
cd stacks/<service> && terragrunt apply --non-interactive - kubectl:
kubectl --kubeconfig $(pwd)/config - Health check:
bash scripts/cluster_healthcheck.sh --quiet - Plan all:
cd stacks && terragrunt run --all --non-interactive -- plan
Architecture
Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Proxmox VMs.
- 70+ services, each in
stacks/<service>/with its own Terraform state - Core platform:
stacks/platform/modules/(~22 modules: Traefik, Kyverno, monitoring, dbaas, etc.) - Public domain:
viktorbarzin.me(Cloudflare) | Internal:viktorbarzin.lan(Technitium DNS) - Secrets:
terraform.tfvars(git-crypt encrypted)
Key Paths
stacks/<service>/main.tf— service definitionstacks/platform/modules/<service>/— core infra modulesmodules/kubernetes/ingress_factory/— standardized ingress with auth, rate limiting, anti-AImodules/kubernetes/nfs_volume/— NFS volume module (CSI-backed, soft mount)terraform.tfvars— all secrets, DNS config, shared variablesscripts/cluster_healthcheck.sh— 25-check cluster health script
Storage
- NFS (
nfs-truenasStorageClass): For app data. Use thenfs_volumemodule, never inlinenfs {}blocks. - iSCSI (
iscsi-truenasStorageClass): For databases (PostgreSQL, MySQL). democratic-csi driver. - TrueNAS: 10.0.10.15. NFS exports managed via
secrets/nfs_exports.sh.
Shared Variables (never hardcode)
var.nfs_server (10.0.10.15), var.redis_host, var.postgresql_host, var.mysql_host, var.ollama_host, var.mail_host
Tier System
0-core | 1-cluster | 2-gpu | 3-edge | 4-aux — Kyverno auto-generates LimitRange + ResourceQuota per namespace based on tier label.
- Containers without explicit
resources {}get default limits (256Mi for edge/aux — causes OOMKill for heavy apps) - Always set explicit resources on containers that need more than defaults
- Opt-out: labels
resource-governance/custom-quota=true/resource-governance/custom-limitrange=true
Infrastructure
- Proxmox: 192.168.1.127 (Dell R730, 22c/44t, 142GB RAM)
- Nodes: k8s-master (10.0.20.100), node1 (GPU, Tesla T4), node2-4
- GPU:
node_selector = { "gpu": "true" }+ tolerationnvidia.com/gpu - Pull-through cache: 10.0.20.10 — docker.io (:5000), ghcr.io (:5010) only
- pfSense: 10.0.20.1 (gateway, firewall, DNS forwarding)
- MySQL InnoDB Cluster: 3 instances on iSCSI, anti-affinity excludes node2 (SIGBUS bug)
- SMTP:
var.mail_hostport 587 STARTTLS (not internal svc address — cert mismatch)
Common Operations
- Deploy new service: Use
stacks/<existing-service>/as template. Create stack, add DNS in tfvars, apply platform then service. - Fix crashed pods: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
- OOMKilled: Check
kubectl describe limitrange tier-defaults -n <ns>. Increaseresources.limits.memoryin the stack's main.tf. - Helm stuck: If Helm release is in
pending-upgrade/failed, checkreference/patterns.mdfor recovery. - NFS exports: Create dir on TrueNAS first, add to
secrets/nfs_directories.txt, runsecrets/nfs_exports.sh.
Detailed Reference
See .claude/reference/patterns.md for: NFS volume code examples, iSCSI details, Kyverno governance tables, anti-AI scraping layers, Terragrunt architecture, node rebuild procedure, archived troubleshooting runbooks index.