infra/AGENTS.md
Viktor Barzin dc85c34069
[ci skip] add AGENTS.md for model-agnostic knowledge, slim CLAUDE.md to Claude-specific layer
AGENTS.md (63 lines): shared infra knowledge for any AI tool (Codex, Claude,
Cursor). Covers: critical rules, architecture, storage, tiers, common ops.

CLAUDE.md (23 lines): Claude-specific addons — skills, agents, user preferences.
References AGENTS.md for shared knowledge.

Removed generic agents (devops-engineer, fullstack-developer).
2026-03-06 23:50:26 +00:00

4.1 KiB

Infrastructure Repository — AI Agent Instructions

Critical Rules (MUST FOLLOW)

  • ALL changes through Terraform/Terragrunt — NEVER kubectl apply/edit/patch/delete for persistent changes. Read-only kubectl is fine.
  • NEVER put secrets in committed files — use terraform.tfvars or secrets/ (git-crypt encrypted)
  • NEVER restart NFS on TrueNAS — causes cluster-wide mount failures across all pods
  • NEVER commit secrets — triple-check before every commit
  • [ci skip] in commit messages when changes were already applied locally
  • Ask before git push — always confirm with the user first

Execution

  • Apply a service: cd stacks/<service> && terragrunt apply --non-interactive
  • kubectl: kubectl --kubeconfig $(pwd)/config
  • Health check: bash scripts/cluster_healthcheck.sh --quiet
  • Plan all: cd stacks && terragrunt run --all --non-interactive -- plan

Architecture

Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Proxmox VMs.

  • 70+ services, each in stacks/<service>/ with its own Terraform state
  • Core platform: stacks/platform/modules/ (~22 modules: Traefik, Kyverno, monitoring, dbaas, etc.)
  • Public domain: viktorbarzin.me (Cloudflare) | Internal: viktorbarzin.lan (Technitium DNS)
  • Secrets: terraform.tfvars (git-crypt encrypted)

Key Paths

  • stacks/<service>/main.tf — service definition
  • stacks/platform/modules/<service>/ — core infra modules
  • modules/kubernetes/ingress_factory/ — standardized ingress with auth, rate limiting, anti-AI
  • modules/kubernetes/nfs_volume/ — NFS volume module (CSI-backed, soft mount)
  • terraform.tfvars — all secrets, DNS config, shared variables
  • scripts/cluster_healthcheck.sh — 25-check cluster health script

Storage

  • NFS (nfs-truenas StorageClass): For app data. Use the nfs_volume module, never inline nfs {} blocks.
  • iSCSI (iscsi-truenas StorageClass): For databases (PostgreSQL, MySQL). democratic-csi driver.
  • TrueNAS: 10.0.10.15. NFS exports managed via secrets/nfs_exports.sh.

Shared Variables (never hardcode)

var.nfs_server (10.0.10.15), var.redis_host, var.postgresql_host, var.mysql_host, var.ollama_host, var.mail_host

Tier System

0-core | 1-cluster | 2-gpu | 3-edge | 4-aux — Kyverno auto-generates LimitRange + ResourceQuota per namespace based on tier label.

  • Containers without explicit resources {} get default limits (256Mi for edge/aux — causes OOMKill for heavy apps)
  • Always set explicit resources on containers that need more than defaults
  • Opt-out: labels resource-governance/custom-quota=true / resource-governance/custom-limitrange=true

Infrastructure

  • Proxmox: 192.168.1.127 (Dell R730, 22c/44t, 142GB RAM)
  • Nodes: k8s-master (10.0.20.100), node1 (GPU, Tesla T4), node2-4
  • GPU: node_selector = { "gpu": "true" } + toleration nvidia.com/gpu
  • Pull-through cache: 10.0.20.10 — docker.io (:5000), ghcr.io (:5010) only
  • pfSense: 10.0.20.1 (gateway, firewall, DNS forwarding)
  • MySQL InnoDB Cluster: 3 instances on iSCSI, anti-affinity excludes node2 (SIGBUS bug)
  • SMTP: var.mail_host port 587 STARTTLS (not internal svc address — cert mismatch)

Common Operations

  • Deploy new service: Use stacks/<existing-service>/ as template. Create stack, add DNS in tfvars, apply platform then service.
  • Fix crashed pods: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
  • OOMKilled: Check kubectl describe limitrange tier-defaults -n <ns>. Increase resources.limits.memory in the stack's main.tf.
  • Helm stuck: If Helm release is in pending-upgrade/failed, check reference/patterns.md for recovery.
  • NFS exports: Create dir on TrueNAS first, add to secrets/nfs_directories.txt, run secrets/nfs_exports.sh.

Detailed Reference

See .claude/reference/patterns.md for: NFS volume code examples, iSCSI details, Kyverno governance tables, anti-AI scraping layers, Terragrunt architecture, node rebuild procedure, archived troubleshooting runbooks index.