2026-01-18 11:04:51 +00:00
# Infrastructure Repository Knowledge
[ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents
CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno
tables, anti-AI, node rebuild) to .claude/reference/patterns.md.
Kept: critical rules, quick patterns, key commands, tier overview, prefs.
Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to
scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16
entries (removed all infra-specific duplicates, kept cross-project prefs).
Agents: removed generic devops-engineer (885L) and fullstack-developer
(234L). Kept custom cluster-health-checker (48L).
2026-03-06 23:27:46 +00:00
## Instructions
- **"remember X"**: Update this file, commit with `[ci skip]`
- **Skills**: `.claude/skills/` (7 active workflows). Archived runbooks in `.claude/skills/archived/`
- **Reference**: `.claude/reference/` — patterns.md (detailed procedures), service-catalog.md, proxmox-inventory.md, github-api.md, authentik-state.md
- **Agents**: `.claude/agents/` — `cluster-health-checker` (haiku, autonomous health checks)
## Critical Rules
- **ALL changes through Terraform/Terragrunt** — never `kubectl apply/edit/patch` directly
- **NEVER put secrets in committed files** — use `terraform.tfvars` or `secrets/` (git-crypt)
- **NEVER restart NFS on TrueNAS** — causes cluster-wide mount failures
- **NEVER commit secrets** — triple-check every commit
- **New services need CI/CD** (Woodpecker) and **monitoring** (Prometheus/Uptime Kuma)
- **ALWAYS add `[ci skip]`** to the commit message when the change is already applied locally
- **Ask before pushing** to git. Commit specific files, not `git add -A`
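
The commit rules above can be sketched as a small shell helper. This is a minimal sketch and not part of the repository: the function names `ci_skip_message` and `commit_applied` are hypothetical. It stages only the files passed in, prefixes the message with `[ci skip]`, and deliberately never pushes.

```shell
# Hypothetical helpers (not in the repo) illustrating the commit rules.

# Prefix a commit message with "[ci skip]" unless it already starts with it.
ci_skip_message() {
  case "$1" in
    "[ci skip]"*) printf '%s\n' "$1" ;;
    *)            printf '[ci skip] %s\n' "$1" ;;
  esac
}

# Stage only the named files and commit. Pushing is left to the user.
commit_applied() {
  msg=$(ci_skip_message "$1")
  shift
  if [ "$#" -eq 0 ]; then
    echo "usage: commit_applied <message> <file>..." >&2
    return 1
  fi
  git add -- "$@" && git commit -m "$msg"
}
```

Example: `commit_applied "bump redis memory" stacks/redis/main.tf` (the path is illustrative only).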
## Execution
- **Terragrunt**: `cd stacks/<service> && terragrunt apply --non-interactive`
- **kubectl**: `kubectl --kubeconfig $(pwd)/config`
- **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
- **Plan all**: `cd stacks && terragrunt run --all --non-interactive -- plan`
- **GitHub API**: `curl` with tokens from tfvars (`gh` CLI blocked by sandbox)
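
The Terragrunt and kubectl commands above could be wrapped as follows. This is a sketch under assumptions: the names `run`, `apply_stack`, and `kc`, and the `DRY_RUN` variable, are hypothetical and not defined anywhere in the repository. Both wrappers expect to be called from the repository root.

```shell
# Hypothetical wrappers around the commands listed above.
# Set DRY_RUN=1 to print a command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Apply a single stack. The subshell keeps the caller's working directory.
apply_stack() {
  ( cd "stacks/$1" && run terragrunt apply --non-interactive )
}

# kubectl pinned to the repo-local kubeconfig.
kc() {
  run kubectl --kubeconfig "$(pwd)/config" "$@"
}
```

With `DRY_RUN=1`, `apply_stack <service>` prints the Terragrunt command it would run, which is a cheap way to check the wrapper before touching the cluster.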
## Overview
Terragrunt-based homelab managing K8s cluster on Proxmox. Per-service stacks under `stacks/`. Git-crypt for secrets.
- **Public domain**: `viktorbarzin.me` (Cloudflare) | **Internal**: `viktorbarzin.lan` (Technitium DNS)
- **Cluster**: 5 nodes (master + node1-4, v1.34.2), GPU on node1 (Tesla T4)
- **CI/CD**: Woodpecker CI — pushes to master auto-apply platform stack
## Key Paths
- `terraform.tfvars` — secrets, DNS, Cloudflare (git-crypt)
- `stacks/<service>/` — individual stacks | `stacks/platform/modules/` — core infra (~22 modules)
- `modules/kubernetes/ingress_factory/`, `nfs_volume/`, `setup_tls_secret/` — shared modules
## Quick Patterns
- **NFS volumes**: Use `nfs_volume` module (see `reference/patterns.md`). StorageClass: `nfs-truenas`. Never use inline `nfs {}` blocks.
- **iSCSI (databases)**: StorageClass `iscsi-truenas` (democratic-csi). Used by PostgreSQL, MySQL.
- **SMTP**: `var.mail_host` port 587 STARTTLS. NOT `mailserver.mailserver.svc.cluster.local` (cert mismatch).
- **New service**: Use `setup-project` skill. Quick: create stack → add DNS in tfvars → apply platform → apply service.
- **Ingress**: `ingress_factory` module. Auth: `protected = true`. Anti-AI: on by default.
## Shared Variables (never hardcode)
`var.nfs_server` (10.0.10.15), `var.redis_host`, `var.postgresql_host`, `var.mysql_host`, `var.ollama_host`, `var.mail_host`
## Infrastructure
- Proxmox (192.168.1.127) — see `reference/proxmox-inventory.md`
- Pull-through cache at `10.0.20.10` — docker.io (:5000) and ghcr.io (:5010) only
- GPU: `node_selector = { "gpu": "true" }` + `toleration { key = "nvidia.com/gpu", value = "true", effect = "NoSchedule" }`
- Node rebuild: see `reference/patterns.md`
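
The pull-through cache mapping above can be written down as a lookup. This is only an illustrative sketch: `cache_endpoint` is a hypothetical name, and how the cluster actually routes pulls through the cache (for example a container runtime mirror setting) is not stated here. The function fails for any registry other than the two listed.

```shell
# Hypothetical lookup: which cache endpoint fronts a given upstream registry.
cache_endpoint() {
  case "$1" in
    docker.io) echo "10.0.20.10:5000" ;;
    ghcr.io)   echo "10.0.20.10:5010" ;;
    *)
      # Any other registry is not cached and must be pulled directly.
      echo "no pull-through cache for $1" >&2
      return 1
      ;;
  esac
}
```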
## Tier System
`0-core` (ingress, DNS, VPN, auth) | `1-cluster` (Redis, metrics) | `2-gpu` | `3-edge` (user-facing) | `4-aux` (optional)
- Auto-generated into `tiers.tf` — use `local.tiers.core`, `local.tiers.cluster`, etc.
- Kyverno governance: LimitRange defaults + ResourceQuota per namespace (see `reference/patterns.md`)
- **OOMKilled?** → Container without explicit resources gets 256Mi (edge/aux). Set explicit `resources {}`.
- **Won't schedule?** → Check `kubectl describe resourcequota tier-quota -n <ns>`
- **Opt-out**: labels `resource-governance/custom-quota=true` and/or `resource-governance/custom-limitrange=true`
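
The troubleshooting checks above can be gathered into one read-only helper. This is a sketch, assuming it is run from the repository root and that the opt-out labels are set on the namespace (the text above does not say which object carries them). `tier_debug_cmds` is a hypothetical name. It only prints commands, so nothing is changed outside Terraform.

```shell
# Hypothetical helper: print the read-only commands for inspecting a
# namespace's tier governance. Nothing is executed here.
tier_debug_cmds() {
  ns=$1
  kc="kubectl --kubeconfig $(pwd)/config"
  # Quota usage versus limits (the "Won't schedule?" check).
  printf '%s describe resourcequota tier-quota -n %s\n' "$kc" "$ns"
  # Default requests/limits applied to containers without explicit resources.
  printf '%s describe limitrange -n %s\n' "$kc" "$ns"
  # Assumption: opt-out labels live on the namespace object.
  printf '%s get namespace %s --show-labels\n' "$kc" "$ns"
}
```

Review the printed commands, then run them with `tier_debug_cmds <ns> | sh` if they look right.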
## MySQL InnoDB Cluster (dbaas namespace)
- 3 instances on `iscsi-truenas`, anti-affinity excludes k8s-node2 (SIGBUS in init containers)
- `mysql` service selector includes `mysql.oracle.com/cluster-role: PRIMARY`
- GR bootstrap: `SET GLOBAL group_replication_bootstrap_group=ON; START GROUP_REPLICATION;`
- Service users NOT managed by Terraform — recreate manually after cluster rebuild
- `manualStartOnBoot: true` — GR doesn't auto-start, needs bootstrap after full restart
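
The GR bootstrap after a full restart can be sketched as below. `gr_bootstrap_sql` is a hypothetical helper name. The first two statements are the ones listed above. The third, which switches the bootstrap flag back off, is standard MySQL Group Replication practice and is added here as an assumption about the intended procedure. Bootstrap on one instance only; the remaining instances join with a plain `START GROUP_REPLICATION;`.

```shell
# Print the SQL for bootstrapping Group Replication on ONE instance.
# The last statement turns the bootstrap flag off again so that a later
# restart of this instance cannot start a second, separate group.
gr_bootstrap_sql() {
  cat <<'SQL'
SET GLOBAL group_replication_bootstrap_group=ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group=OFF;
SQL
}
```

The output is meant to be piped into a `mysql` client session on the chosen instance. Which pod to use and how to authenticate are not specified here.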
## User Preferences
- **Calendar**: Nextcloud at `nextcloud.viktorbarzin.me`
- **Home Assistant**: ha-london (default), ha-sofia. "ha"/"HA" = ha-london
- **Frontend**: Svelte for all new web apps
- **Tools**: Docker containers only — never `brew install` locally
- **Pod monitoring**: Never use `sleep` — spawn background subagent with `kubectl get pods -w`