From 307b7f6819baae1929e9d2b24576f3be2eba3e68 Mon Sep 17 00:00:00 2001
From: Viktor Barzin <viktorbarzin@meta.com>
Date: Sun, 15 Mar 2026 10:46:45 +0000
Subject: [PATCH] update claude knowledge: infra operational learnings from
 commit history [ci skip]

Add resource management patterns, networking resilience, service-specific
notes, monitoring patterns, and NFS storage rules extracted from ~963 commits.
---
 .claude/CLAUDE.md | 33 +++++++++++++++++++++++++++++++++
 AGENTS.md         |  3 +++
 2 files changed, 36 insertions(+)
diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index edf9bb3c..6a84a47b 100755
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -27,6 +27,39 @@
 - **Complex types** (maps/lists like `homepage_credentials`, `k8s_users`) stored as JSON strings in KV, decoded with `jsondecode()` in consuming stack `locals` blocks.
 - **New stacks**: Add a `vault_kv_secret_v2` resource in vault/main.tf, then use `data "vault_kv_secret_v2" "secrets"` + `dependency "vault"` in the new stack.
 
+## Resource Management Patterns
+- **CPU**: All CPU limits removed cluster-wide to avoid CFS throttling. Set only CPU requests, sized from actual usage.
+- **Memory**: Set explicit `requests=limits` based on the Prometheus 7-day max. Keep the overcommit ratio at ~2x or below.
+- **VPA (Goldilocks)**: Must run in `Initial` mode (not `Auto`), because `Auto` conflicts with Terraform's declarative resource management.
+- **LimitRange**: Tier-based defaults silently apply to pods with `resources: {}`. Always set explicit resources on containers that need more than 256Mi of memory (the edge/aux default).
+- **Pin database versions**: Disable Diun (image update monitoring) for MySQL, PostgreSQL, Redis.
+
+## Networking & Resilience
+- **Critical path services scaled to 3**: Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared.
+- **PDBs**: minAvailable=2 on Traefik and Authentik.
+- **Fallback proxies**: basicAuth when Authentik is down, fail-open when poison-fountain is down.
+- **CrowdSec bouncer**: graceful degradation mode (fail-open on error).
+- **Rate limiting**: Return 429 (not 503). Per-service tuning: Immich/Nextcloud need higher limits.
+- **Retry middleware**: 2 attempts, 100ms initial interval, in the default ingress chain.
+- **HTTP/3 (QUIC)**: Enabled cluster-wide via Traefik.
+
+## Service-Specific Notes
+| Service | Key Operational Knowledge |
+|---------|--------------------------|
+| Nextcloud | MaxRequestWorkers=150, needs 4Gi memory, very generous startup probe |
+| Immich | ML on SSD, disable ModSecurity (breaks streaming), CUDA for ML, frequent upgrades |
+| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3 |
+| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
+| Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
+| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
+| MySQL InnoDB | Enable auto-recovery, anti-affinity excludes node2 (SIGBUS), requests 4.4Gi but uses only ~1Gi |
+
+## Monitoring & Alerting
+- Alert cascade inhibitions: if a node is down, suppress pod alerts on that node.
+- Exclude completed CronJob pods from "pod not ready" alerts.
+- Every new service gets Prometheus scrape config + Uptime Kuma monitor.
+- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness.
+
 ## Known Issues
 - **CrowdSec Helm upgrade times out**: `terragrunt apply` on platform stack causes CrowdSec Helm release to get stuck in `pending-upgrade`. Workaround: `helm rollback crowdsec <rev> -n crowdsec`. Root cause: likely ResourceQuota CPU at 302% preventing pods from passing readiness probes. Needs investigation.
 - **OpenClaw config is writable**: OpenClaw writes to `openclaw.json` at runtime (doctor --fix, plugin auto-enable). Never use subPath ConfigMap mounts for it — use an init container to copy into a writable volume. Needs 2Gi memory + `NODE_OPTIONS=--max-old-space-size=1536`.
diff --git a/AGENTS.md b/AGENTS.md
index dffb9f33..103e46ff 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -62,6 +62,9 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
 - **NFS** (`nfs-truenas` StorageClass): For app data. Use the `nfs_volume` module, never inline `nfs {}` blocks.
 - **iSCSI** (`iscsi-truenas` StorageClass): For databases (PostgreSQL, MySQL). democratic-csi driver.
 - **TrueNAS**: 10.0.10.15. NFS exports managed via `secrets/nfs_exports.sh`.
+- **SQLite on NFS is unreliable** (fsync issues) — always use iSCSI or local disk for databases.
+- **NFS mount options**: Always `soft,timeo=30,retrans=3` (`timeo` is in tenths of a second, so 3s before a retry) to prevent uninterruptible sleep (D state).
+- **NFS export directory must exist** on TrueNAS before Terraform can create the PV.
 
 ## Shared Variables (never hardcode)
 `var.nfs_server` (10.0.10.15), `var.redis_host`, `var.postgresql_host`, `var.mysql_host`, `var.ollama_host`, `var.mail_host`