[ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents
CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno
tables, anti-AI, node rebuild) to .claude/reference/patterns.md.
Kept: critical rules, quick patterns, key commands, tier overview, prefs.
Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to
scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16
entries (removed all infra-specific duplicates, kept cross-project prefs).
Agents: removed generic devops-engineer (885L) and fullstack-developer
(234L). Kept custom cluster-health-checker (48L).
2026-03-06 23:27:46 +00:00
# Detailed Infrastructure Patterns
Reference file for patterns, procedures, and tables. Read on demand when the specific topic comes up.
## NFS Volume Pattern
2026-04-13 14:41:15 +00:00
Use the `nfs_volume` shared module for all NFS volumes (creates static PVs, CSI-backed, `soft,timeo=30,retrans=3` ):
[ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents
CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno
tables, anti-AI, node rebuild) to .claude/reference/patterns.md.
Kept: critical rules, quick patterns, key commands, tier overview, prefs.
Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to
scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16
entries (removed all infra-specific duplicates, kept cross-project prefs).
Agents: removed generic devops-engineer (885L) and fullstack-developer
(234L). Kept custom cluster-health-checker (48L).
2026-03-06 23:27:46 +00:00
```hcl
module "nfs_data" {
source = "../../modules/kubernetes/nfs_volume" # ../../../../ for platform modules, ../../../ for sub-stacks
name = "< service > -data" # Must be globally unique (PV is cluster-scoped)
namespace = kubernetes_namespace.< service > .metadata[0].name
2026-04-13 14:41:15 +00:00
nfs_server = var.nfs_server # 192.168.1.127 (Proxmox host)
nfs_path = "/srv/nfs/< service > " # HDD NFS, or "/srv/nfs-ssd/< service > " for SSD
[ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents
CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno
tables, anti-AI, node rebuild) to .claude/reference/patterns.md.
Kept: critical rules, quick patterns, key commands, tier overview, prefs.
Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to
scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16
entries (removed all infra-specific duplicates, kept cross-project prefs).
Agents: removed generic devops-engineer (885L) and fullstack-developer
(234L). Kept custom cluster-health-checker (48L).
2026-03-06 23:27:46 +00:00
}
# In pod spec: persistent_volume_claim { claim_name = module.nfs_data.claim_name }
```
2026-04-13 14:41:15 +00:00
**Note**: Some legacy PVs still reference `/mnt/main/<service>` paths (from the TrueNAS era). These work via compatibility on the Proxmox host. New PVs should use `/srv/nfs/` or `/srv/nfs-ssd/` .
[ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents
CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno
tables, anti-AI, node rebuild) to .claude/reference/patterns.md.
Kept: critical rules, quick patterns, key commands, tier overview, prefs.
Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to
scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16
entries (removed all infra-specific duplicates, kept cross-project prefs).
Agents: removed generic devops-engineer (885L) and fullstack-developer
(234L). Kept custom cluster-health-checker (48L).
2026-03-06 23:27:46 +00:00
**DO NOT use inline `nfs {}` blocks** — they mount with `hard,timeo=600` defaults which hang forever.
## Adding NFS Exports
2026-04-13 14:41:15 +00:00
1. Create dir on Proxmox host: `ssh root@192.168.1.127 "mkdir -p /srv/nfs/<service> && chmod 777 /srv/nfs/<service>"`
2. Edit `/etc/exports` on the Proxmox host — add the export entry
3. Reload exports: `ssh root@192.168.1.127 "exportfs -ra"`
4. Verify: `showmount -e 192.168.1.127`
[ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents
CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno
tables, anti-AI, node rebuild) to .claude/reference/patterns.md.
Kept: critical rules, quick patterns, key commands, tier overview, prefs.
Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to
scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16
entries (removed all infra-specific duplicates, kept cross-project prefs).
Agents: removed generic devops-engineer (885L) and fullstack-developer
(234L). Kept custom cluster-health-checker (48L).
2026-03-06 23:27:46 +00:00
2026-04-13 14:41:15 +00:00
## ~~iSCSI Storage~~ (REMOVED — replaced by proxmox-lvm)
> iSCSI via democratic-csi and TrueNAS has been fully removed (2026-04). All database storage now uses `StorageClass: proxmox-lvm` (Proxmox CSI, LVM-thin hotplug). TrueNAS has been decommissioned.
[ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents
CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno
tables, anti-AI, node rebuild) to .claude/reference/patterns.md.
Kept: critical rules, quick patterns, key commands, tier overview, prefs.
Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to
scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16
entries (removed all infra-specific duplicates, kept cross-project prefs).
Agents: removed generic devops-engineer (885L) and fullstack-developer
(234L). Kept custom cluster-health-checker (48L).
2026-03-06 23:27:46 +00:00
2026-05-10 00:04:37 +00:00
## Anti-AI Scraping (4 Active Layers) (Updated 2026-05-10)
[ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents
CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno
tables, anti-AI, node rebuild) to .claude/reference/patterns.md.
Kept: critical rules, quick patterns, key commands, tier overview, prefs.
Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to
scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16
entries (removed all infra-specific duplicates, kept cross-project prefs).
Agents: removed generic devops-engineer (885L) and fullstack-developer
(234L). Kept custom cluster-health-checker (48L).
2026-03-06 23:27:46 +00:00
Default `anti_ai_scraping = true` in ingress_factory. Disable per-service: `anti_ai_scraping = false` .
2026-05-10 00:04:37 +00:00
1. **Anubis PoW challenge** (per-site reverse proxy) — `modules/kubernetes/anubis_instance/` . Latest: `ghcr.io/techarohq/anubis:v1.25.0` . Difficulty 2 (~250 ms desktop / ~700 ms mobile), 30-day JWT cookie scoped to `viktorbarzin.me` so a single solve covers every Anubis-fronted subdomain. Active on: `viktorbarzin.me` , `kms.viktorbarzin.me` , `travel.viktorbarzin.me` . Add to a stack: `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://<svc>.<ns>.svc.cluster.local" }` , then point ingress_factory at `module.anubis.service_name` + `port = module.anubis.service_port` and set `anti_ai_scraping = false` . Shared ed25519 signing key in Vault `secret/viktor` -> `anubis_ed25519_key` . **Avoid putting Anubis in front of CLI/API/Git endpoints (Forgejo, APIs, WebDAV)** — clients without JS can't solve PoW.
2. **Bot blocking forwardAuth** (ForwardAuth → bot-block-proxy → poison-fountain) — global default for non-Anubis sites. `bot-block-proxy` (OpenResty in `traefik` ns) is fail-open with 100 ms connect / 200 ms read timeouts so a downed poison-fountain costs ≤200 ms per request. Source: `stacks/traefik/modules/traefik/main.tf` .
3. **X-Robots-Tag noai** — set by `traefik-anti-ai-headers` middleware. Anubis additionally serves a comprehensive `/robots.txt` (`SERVE_ROBOTS_TXT=true` ) to well-behaved bots.
4. **Tarpit/poison content** (standalone at poison.viktorbarzin.me, `stacks/poison-fountain/` ). Currently scaled to `replicas = 0` — fail-open path means no live traffic, no penalty.
Trap links (formerly a layer) removed April 2026 — rewrite-body plugin broken on Traefik v3.6.12 (Yaegi bugs). `strip-accept-encoding` and `anti-ai-trap-links` middlewares deleted.
2026-04-17 21:43:13 +00:00
Rybbit analytics injection now via Cloudflare Worker (`stacks/rybbit/worker/` , HTMLRewriter, wildcard route `*.viktorbarzin.me/*` , 28 site ID mappings).
2026-05-10 00:04:37 +00:00
Key files: `modules/kubernetes/anubis_instance/` , `stacks/poison-fountain/` , `stacks/rybbit/worker/` , `stacks/traefik/modules/traefik/main.tf`
[ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents
CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno
tables, anti-AI, node rebuild) to .claude/reference/patterns.md.
Kept: critical rules, quick patterns, key commands, tier overview, prefs.
Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to
scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16
entries (removed all infra-specific duplicates, kept cross-project prefs).
Agents: removed generic devops-engineer (885L) and fullstack-developer
(234L). Kept custom cluster-health-checker (48L).
2026-03-06 23:27:46 +00:00
## Terragrunt Architecture
- Root `terragrunt.hcl` : DRY providers, backend, variable loading, `generate "tiers"` block
- Each stack: `stacks/<service>/main.tf` , state at `state/stacks/<service>/terraform.tfstate`
- Platform modules: `stacks/platform/modules/<service>/` , shared: `modules/kubernetes/`
- Syntax: `--non-interactive` , `terragrunt run --all -- <command>` (not `run-all` )
- Tiers auto-generated into `tiers.tf` — never add `locals { tiers = {} }` manually
## Factory Pattern (Multi-User Services)
Structure: `stacks/<service>/main.tf` + `factory/main.tf` . Examples: `actualbudget` , `freedify` .
To add a user: export NFS share, add Cloudflare route in tfvars, add module block calling factory.
## Node Rebuild Procedure
1. Drain: `kubectl drain k8s-nodeX --ignore-daemonsets --delete-emptydir-data`
2. Delete: `kubectl delete node k8s-nodeX`
3. Destroy VM (remove from `stacks/infra/main.tf` )
4. Get fresh join command: `ssh wizard@10.0.20.100 'sudo kubeadm token create --print-join-command'` (tokens expire 24h)
5. Update `k8s_join_command` in `terraform.tfvars` , add VM to `stacks/infra/main.tf` , apply
6. GPU node (k8s-node1): apply platform stack to re-apply GPU label/taint
## Kyverno Resource Governance
### LimitRange Defaults (injected when no explicit `resources {}`)
| Tier | Default Mem | Max Mem | Default CPU | Max CPU |
|------|------------|---------|-------------|---------|
| 0-core | 512Mi | 8Gi | 500m | 4 |
| 1-cluster | 512Mi | 4Gi | 500m | 2 |
| 2-gpu | 2Gi | 16Gi | 1 | 8 |
| 3-edge / 4-aux | 256Mi | 4Gi | 250m | 2 |
| No tier | 256Mi | 2Gi | 250m | 1 |
### ResourceQuota (opt-out: `resource-governance/custom-quota=true`)
| Tier | lim CPU | lim Mem | Pods |
|------|---------|---------|------|
| 0-core | 32 | 64Gi | 100 |
| 1-cluster | 16 | 32Gi | 30 |
| 2-gpu | 48 | 96Gi | 40 |
| 3-edge / 4-aux | 8-16 | 16-32Gi | 20-30 |
Custom quotas: authentik, monitoring (opted out), nvidia (opted out), nextcloud, onlyoffice.
LimitRange opt-out: `resource-governance/custom-limitrange=true` + custom `kubernetes_limit_range` in stack.
### Other Policies
- `inject-priority-class-from-tier` (CREATE only), `inject-ndots` (ndots:2), `sync-tier-label`
- `goldilocks-vpa-auto-mode` : VPA `off` globally — Terraform owns resources, Goldilocks observe-only
- Security policies ALL Audit mode: `deny-privileged-containers` , `deny-host-namespaces` , `restrict-sys-admin` , `require-trusted-registries`
### Debugging Container Failures
1. **OOMKilled?** → `kubectl describe limitrange tier-defaults -n <ns>` . edge/aux default = 256Mi.
2. **Won't schedule?** → `kubectl describe resourcequota tier-quota -n <ns>` .
3. **Evicted?** → aux-tier pods (priority 200K, Never preempt) evicted first.
4. **Unexpected limits?** → LimitRange injects defaults. Always set explicit resources.
5. **Need more?** → Set explicit `resources {}` or add quota/limitrange opt-out labels.
## Authentik (Identity Provider)
- **URL**: `https://authentik.viktorbarzin.me` | **API** : `/api/v3/` | **Token** : `authentik_api_token` in tfvars
- 3 server + 3 worker + 3 PgBouncer + embedded outpost
- Forward auth: `protected = true` in ingress_factory
- OIDC for K8s: issuer `.../application/o/kubernetes/` , client `kubernetes` (public)
- See archived skills for management tasks and OIDC gotchas
## Archived Troubleshooting Runbooks
28 skills in `.claude/skills/archived/` — load when the specific issue arises.
Topics: authentik, bluestacks, clickhouse-nfs, coturn, crowdsec, fastapi-svelte-gpu,
grafana-datasource, helm-stuck, ingress-migration, image-caching, gpu-devices, hpa-storm,
nfs-mount, kubelet-manifest, llm-gpu, loki-helm, librespot, nextcloud-calendar, nfsv4-idmapd,
openclaw-deploy, pfsense-dnsmasq, pfsense-nat, proxmox-disk, python-sanitize, terraform-state,
traefik-helm, traefik-rewrite-body.