From 307b7f6819baae1929e9d2b24576f3be2eba3e68 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sun, 15 Mar 2026 10:46:45 +0000
Subject: [PATCH] update claude knowledge: infra operational learnings from commit history [ci skip]

Add resource management patterns, networking resilience, service-specific notes, monitoring patterns, and NFS storage rules extracted from ~963 commits.
---
 .claude/CLAUDE.md | 33 +++++++++++++++++++++++++++++++++
 AGENTS.md         |  3 +++
 2 files changed, 36 insertions(+)

diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index edf9bb3c..6a84a47b 100755
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -27,6 +27,39 @@
 - **Complex types** (maps/lists like `homepage_credentials`, `k8s_users`) stored as JSON strings in KV, decoded with `jsondecode()` in consuming stack `locals` blocks.
 - **New stacks**: Add a `vault_kv_secret_v2` resource in vault/main.tf, then use `data "vault_kv_secret_v2" "secrets"` + `dependency "vault"` in the new stack.
 
+## Resource Management Patterns
+- **CPU**: All CPU limits removed cluster-wide (CFS throttling). Only set CPU requests based on actual usage.
+- **Memory**: Set explicit `requests=limits` based on Prometheus 7-day max. Overcommit ratio ~2x max.
+- **VPA (Goldilocks)**: Must be `Initial` mode (not `Auto`) — Auto conflicts with Terraform's declarative resource management.
+- **LimitRange**: Tier-based defaults silently apply to pods with `resources: {}`. Always set explicit resources on containers needing more than 256Mi (edge/aux default).
+- **Pin database versions**: Disable Diun (image update monitoring) for MySQL, PostgreSQL, Redis.
+
+## Networking & Resilience
+- **Critical path services scaled to 3**: Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared.
+- **PDBs**: minAvailable=2 on Traefik and Authentik.
+- **Fallback proxies**: basicAuth when Authentik is down, fail-open when poison-fountain is down.
+- **CrowdSec bouncer**: graceful degradation mode (fail-open on error).
+- **Rate limiting**: Return 429 (not 503). Per-service tuning: Immich/Nextcloud need higher limits.
+- **Retry middleware**: 2 attempts, 100ms — in default ingress chain.
+- **HTTP/3 (QUIC)**: Enabled cluster-wide via Traefik.
+
+## Service-Specific Notes
+| Service | Key Operational Knowledge |
+|---------|--------------------------|
+| Nextcloud | MaxRequestWorkers=150, needs 4Gi memory, very generous startup probe |
+| Immich | ML on SSD, disable ModSecurity (breaks streaming), CUDA for ML, frequent upgrades |
+| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3 |
+| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
+| Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
+| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
+| MySQL InnoDB | Enable auto-recovery, anti-affinity excludes node2 (SIGBUS), 4.4Gi req but ~1Gi used |
+
+## Monitoring & Alerting
+- Alert cascade inhibitions: if node is down, suppress pod alerts on that node.
+- Exclude completed CronJob pods from "pod not ready" alerts.
+- Every new service gets Prometheus scrape config + Uptime Kuma monitor.
+- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness.
+
 ## Known Issues
 - **CrowdSec Helm upgrade times out**: `terragrunt apply` on platform stack causes CrowdSec Helm release to get stuck in `pending-upgrade`. Workaround: `helm rollback crowdsec -n crowdsec`. Root cause: likely ResourceQuota CPU at 302% preventing pods from passing readiness probes. Needs investigation.
 - **OpenClaw config is writable**: OpenClaw writes to `openclaw.json` at runtime (doctor --fix, plugin auto-enable). Never use subPath ConfigMap mounts for it — use an init container to copy into a writable volume. Needs 2Gi memory + `NODE_OPTIONS=--max-old-space-size=1536`.
diff --git a/AGENTS.md b/AGENTS.md
index dffb9f33..103e46ff 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -62,6 +62,9 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
 - **NFS** (`nfs-truenas` StorageClass): For app data. Use the `nfs_volume` module, never inline `nfs {}` blocks.
 - **iSCSI** (`iscsi-truenas` StorageClass): For databases (PostgreSQL, MySQL). democratic-csi driver.
 - **TrueNAS**: 10.0.10.15. NFS exports managed via `secrets/nfs_exports.sh`.
+- **SQLite on NFS is unreliable** (fsync issues) — always use iSCSI or local disk for databases.
+- **NFS mount options**: Always `soft,timeo=30,retrans=3` to prevent uninterruptible sleep (D state).
+- **NFS export directory must exist** on TrueNAS before Terraform can create the PV.
 
 ## Shared Variables (never hardcode)
 `var.nfs_server` (10.0.10.15), `var.redis_host`, `var.postgresql_host`, `var.mysql_host`, `var.ollama_host`, `var.mail_host`
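
Reviewer note: the NFS mount-option rule added to AGENTS.md could be expressed in Terraform roughly as follows. This is only a sketch, not the repo's `nfs_volume` module, whose inputs are not shown in this patch. It assumes the Terraform Kubernetes provider's `kubernetes_persistent_volume` resource. The resource name, capacity, and export path are hypothetical; `var.nfs_server` and the `nfs-truenas` StorageClass are taken from AGENTS.md.

```hcl
# Hypothetical PV illustrating the "soft,timeo=30,retrans=3" rule.
# In this repo the equivalent would presumably live inside the nfs_volume module.
resource "kubernetes_persistent_volume" "example_nfs" {
  metadata {
    name = "example-nfs" # hypothetical name
  }

  spec {
    capacity = {
      storage = "10Gi" # hypothetical size
    }
    access_modes       = ["ReadWriteMany"]
    storage_class_name = "nfs-truenas"

    # Soft mounts return an I/O error after the retries are exhausted
    # instead of leaving processes stuck in uninterruptible sleep (D state).
    mount_options = ["soft", "timeo=30", "retrans=3"]

    persistent_volume_source {
      nfs {
        server = var.nfs_server      # shared variable, never hardcoded
        path   = "/mnt/example/path" # hypothetical; must already exist on TrueNAS
      }
    }
  }
}
```

Because the export directory must exist before the PV is created, a stack using something like this would need the export added through `secrets/nfs_exports.sh` first.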