diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index c5f4c4f8..ad4be989 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -1,69 +1,20 @@ -# Infrastructure Repository Knowledge +# Claude Code — Project Configuration -## Instructions -- **"remember X"**: Update this file, commit with `[ci skip]` -- **Skills**: `.claude/skills/` (7 active workflows). Archived runbooks in `.claude/skills/archived/` -- **Reference**: `.claude/reference/` — patterns.md (detailed procedures), service-catalog.md, proxmox-inventory.md, github-api.md, authentik-state.md -- **Agents**: `.claude/agents/` — `cluster-health-checker` (haiku, autonomous health checks) +> **Shared knowledge**: Read `AGENTS.md` at repo root for architecture, patterns, rules, and operations. This file adds Claude-specific features on top. -## Critical Rules -- **ALL changes through Terraform/Terragrunt** — never `kubectl apply/edit/patch` directly -- **NEVER put secrets in committed files** — use `terraform.tfvars` or `secrets/` (git-crypt) -- **NEVER restart NFS on TrueNAS** — causes cluster-wide mount failures -- **NEVER commit secrets** — triple-check every commit -- **New services need CI/CD** (Woodpecker) and **monitoring** (Prometheus/Uptime Kuma) -- **ALWAYS `[ci skip]`** in commit messages when already applied locally -- **Ask before pushing** to git. Commit specific files, not `git add -A` - -## Execution -- **Terragrunt**: `cd stacks/ && terragrunt apply --non-interactive` -- **kubectl**: `kubectl --kubeconfig $(pwd)/config` -- **Health check**: `bash scripts/cluster_healthcheck.sh --quiet` -- **Plan all**: `cd stacks && terragrunt run --all --non-interactive -- plan` +## Claude-Specific Resources +- **Skills**: `.claude/skills/` (7 active). Archived runbooks: `.claude/skills/archived/` +- **Agents**: `.claude/agents/cluster-health-checker` (haiku, autonomous health checks) +- **Reference**: `.claude/reference/` — patterns.md, service-catalog.md, proxmox-inventory.md, github-api.md, authentik-state.md - **GitHub API**: `curl` with tokens from tfvars (`gh` CLI blocked by sandbox) - -## Overview -Terragrunt-based homelab managing K8s cluster on Proxmox. Per-service stacks under `stacks/`. Git-crypt for secrets. -- **Public domain**: `viktorbarzin.me` (Cloudflare) | **Internal**: `viktorbarzin.lan` (Technitium DNS) -- **Cluster**: 5 nodes (master + node1-4, v1.34.2), GPU on node1 (Tesla T4) - **CI/CD**: Woodpecker CI — pushes to master auto-apply platform stack -## Key Paths -- `terraform.tfvars` — secrets, DNS, Cloudflare (git-crypt) -- `stacks//` — individual stacks | `stacks/platform/modules/` — core infra (~22 modules) -- `modules/kubernetes/ingress_factory/`, `nfs_volume/`, `setup_tls_secret/` — shared modules - -## Quick Patterns -- **NFS volumes**: Use `nfs_volume` module (see `reference/patterns.md`). StorageClass: `nfs-truenas`. Never use inline `nfs {}` blocks. -- **iSCSI (databases)**: StorageClass `iscsi-truenas` (democratic-csi). Used by PostgreSQL, MySQL. -- **SMTP**: `var.mail_host` port 587 STARTTLS. NOT `mailserver.mailserver.svc.cluster.local` (cert mismatch). -- **New service**: Use `setup-project` skill. Quick: create stack → add DNS in tfvars → apply platform → apply service. +## Instructions +- **"remember X"**: Update this file + `AGENTS.md` (if it's shared knowledge), commit with `[ci skip]` +- **New services need CI/CD** (Woodpecker) and **monitoring** (Prometheus/Uptime Kuma) +- **New service**: Use `setup-project` skill for full workflow - **Ingress**: `ingress_factory` module. Auth: `protected = true`. Anti-AI: on by default. -## Shared Variables (never hardcode) -`var.nfs_server` (10.0.10.15), `var.redis_host`, `var.postgresql_host`, `var.mysql_host`, `var.ollama_host`, `var.mail_host` - -## Infrastructure -- Proxmox (192.168.1.127) — see `reference/proxmox-inventory.md` -- Pull-through cache at `10.0.20.10` — docker.io (:5000) and ghcr.io (:5010) only -- GPU: `node_selector = { "gpu": "true" }` + `toleration { key = "nvidia.com/gpu", value = "true", effect = "NoSchedule" }` -- Node rebuild: see `reference/patterns.md` - -## Tier System -`0-core` (ingress, DNS, VPN, auth) | `1-cluster` (Redis, metrics) | `2-gpu` | `3-edge` (user-facing) | `4-aux` (optional) -- Auto-generated into `tiers.tf` — use `local.tiers.core`, `local.tiers.cluster`, etc. -- Kyverno governance: LimitRange defaults + ResourceQuota per namespace (see `reference/patterns.md`) -- **OOMKilled?** → Container without explicit resources gets 256Mi (edge/aux). Set explicit `resources {}`. -- **Won't schedule?** → Check `kubectl describe resourcequota tier-quota -n ` -- **Opt-out**: labels `resource-governance/custom-quota=true` and/or `resource-governance/custom-limitrange=true` - -## MySQL InnoDB Cluster (dbaas namespace) -- 3 instances on `iscsi-truenas`, anti-affinity excludes k8s-node2 (SIGBUS in init containers) -- `mysql` service selector includes `mysql.oracle.com/cluster-role: PRIMARY` -- GR bootstrap: `SET GLOBAL group_replication_bootstrap_group=ON; START GROUP_REPLICATION;` -- Service users NOT managed by Terraform — recreate manually after cluster rebuild -- `manualStartOnBoot: true` — GR doesn't auto-start, needs bootstrap after full restart - ## User Preferences - **Calendar**: Nextcloud at `nextcloud.viktorbarzin.me` - **Home Assistant**: ha-london (default), ha-sofia. "ha"/"HA" = ha-london diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 00000000..2a0f5300 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,63 @@ +# Infrastructure Repository — AI Agent Instructions + +## Critical Rules (MUST FOLLOW) +- **ALL changes through Terraform/Terragrunt** — NEVER `kubectl apply/edit/patch/delete` for persistent changes. Read-only kubectl is fine. +- **NEVER put secrets in committed files** — use `terraform.tfvars` or `secrets/` (git-crypt encrypted) +- **NEVER restart NFS on TrueNAS** — causes cluster-wide mount failures across all pods +- **NEVER commit secrets** — triple-check before every commit +- **`[ci skip]` in commit messages** when changes were already applied locally +- **Ask before `git push`** — always confirm with the user first + +## Execution +- **Apply a service**: `cd stacks/ && terragrunt apply --non-interactive` +- **kubectl**: `kubectl --kubeconfig $(pwd)/config` +- **Health check**: `bash scripts/cluster_healthcheck.sh --quiet` +- **Plan all**: `cd stacks && terragrunt run --all --non-interactive -- plan` + +## Architecture +Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Proxmox VMs. +- **70+ services**, each in `stacks//` with its own Terraform state +- **Core platform**: `stacks/platform/modules/` (~22 modules: Traefik, Kyverno, monitoring, dbaas, etc.) +- **Public domain**: `viktorbarzin.me` (Cloudflare) | **Internal**: `viktorbarzin.lan` (Technitium DNS) +- **Secrets**: `terraform.tfvars` (git-crypt encrypted) + +## Key Paths +- `stacks//main.tf` — service definition +- `stacks/platform/modules//` — core infra modules +- `modules/kubernetes/ingress_factory/` — standardized ingress with auth, rate limiting, anti-AI +- `modules/kubernetes/nfs_volume/` — NFS volume module (CSI-backed, soft mount) +- `terraform.tfvars` — all secrets, DNS config, shared variables +- `scripts/cluster_healthcheck.sh` — 25-check cluster health script + +## Storage +- **NFS** (`nfs-truenas` StorageClass): For app data. Use the `nfs_volume` module, never inline `nfs {}` blocks. +- **iSCSI** (`iscsi-truenas` StorageClass): For databases (PostgreSQL, MySQL). democratic-csi driver. +- **TrueNAS**: 10.0.10.15. NFS exports managed via `secrets/nfs_exports.sh`. + +## Shared Variables (never hardcode) +`var.nfs_server` (10.0.10.15), `var.redis_host`, `var.postgresql_host`, `var.mysql_host`, `var.ollama_host`, `var.mail_host` + +## Tier System +`0-core` | `1-cluster` | `2-gpu` | `3-edge` | `4-aux` — Kyverno auto-generates LimitRange + ResourceQuota per namespace based on tier label. +- Containers without explicit `resources {}` get default limits (256Mi for edge/aux — causes OOMKill for heavy apps) +- Always set explicit resources on containers that need more than defaults +- Opt-out: labels `resource-governance/custom-quota=true` / `resource-governance/custom-limitrange=true` + +## Infrastructure +- **Proxmox**: 192.168.1.127 (Dell R730, 22c/44t, 142GB RAM) +- **Nodes**: k8s-master (10.0.20.100), node1 (GPU, Tesla T4), node2-4 +- **GPU**: `node_selector = { "gpu": "true" }` + toleration `nvidia.com/gpu` +- **Pull-through cache**: 10.0.20.10 — docker.io (:5000), ghcr.io (:5010) only +- **pfSense**: 10.0.20.1 (gateway, firewall, DNS forwarding) +- **MySQL InnoDB Cluster**: 3 instances on iSCSI, anti-affinity excludes node2 (SIGBUS bug) +- **SMTP**: `var.mail_host` port 587 STARTTLS (not internal svc address — cert mismatch) + +## Common Operations +- **Deploy new service**: Use `stacks//` as template. Create stack, add DNS in tfvars, apply platform then service. +- **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts. +- **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n `. Increase `resources.limits.memory` in the stack's main.tf. +- **Helm stuck**: If Helm release is in `pending-upgrade`/`failed`, check `reference/patterns.md` for recovery. +- **NFS exports**: Create dir on TrueNAS first, add to `secrets/nfs_directories.txt`, run `secrets/nfs_exports.sh`. + +## Detailed Reference +See `.claude/reference/patterns.md` for: NFS volume code examples, iSCSI details, Kyverno governance tables, anti-AI scraping layers, Terragrunt architecture, node rebuild procedure, archived troubleshooting runbooks index.