diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md new file mode 100755 index 00000000..93f4ba4c --- /dev/null +++ b/.claude/CLAUDE.md @@ -0,0 +1,147 @@ +# Infrastructure Repository Knowledge + +## Instructions for Claude +- **When the user says "remember" something**: Always update this file (`.claude/CLAUDE.md`) with the information so it persists across sessions +- **When discovering new patterns or versions**: Add them to the appropriate section below +- **Use `/update-knowledge` command**: Or edit this file directly to add learnings + +## Execution Environment (CRITICAL) +- **File operations** (Read, Edit, Write, Glob, Grep): Run locally at `/Volumes/wizard/code/infra` +- **Git commands**: Run locally (git status, git log, git diff, etc.) +- **ALL other commands**: Use the remote executor relay (kubectl, terraform, helm, python, etc.) + +### Remote Command Execution (ALWAYS USE THIS) +For any command that is not file editing or git, use the file-based relay: + +**To execute a remote command:** +```bash +# 1. Write command +echo "your-command-here" > /Volumes/wizard/code/infra/.claude/cmd_input.txt +# 2. Wait and check status +sleep 1 && cat /Volumes/wizard/code/infra/.claude/cmd_status.txt +# 3. Read output (when status is "done:*") +cat /Volumes/wizard/code/infra/.claude/cmd_output.txt +``` + +**Status values:** `ready` | `running` | `done:N` (N = exit code) + +**Requires user to start executor in another terminal:** +```bash +.claude/remote-executor.sh wizard@10.0.10.10 /home/wizard/code/infra +``` + +--- + +## Overview +Terraform-based infrastructure repository managing a home Kubernetes cluster on Proxmox VMs. Uses git-crypt for secrets encryption. + +## Directory Structure +- `main.tf` - Main Terraform entry point, imports all modules +- `modules/kubernetes/` - Kubernetes service deployments (one folder per service) +- `modules/create-vm/` - Proxmox VM creation module +- `secrets/` - Encrypted secrets (TLS certs, keys) via git-crypt +- `cli/` - Go CLI tool for infrastructure management +- `scripts/` - Helper scripts (cluster management, node updates) +- `playbooks/` - Ansible playbooks for node configuration +- `diagram/` - Infrastructure diagrams (Python-based) + +## Key Patterns +- Each service in `modules/kubernetes//main.tf` defines its own namespace, deployments, services, and ingress +- NFS storage from `10.0.10.15` for persistent data +- TLS secrets managed via `setup_tls_secret` module +- Ingress uses nginx-ingress with annotations for customization +- GPU workloads use `node_selector = { "gpu": "true" }` +- Services expose to `*.viktorbarzin.me` domains + +### NFS Volume Pattern +**Prefer inline NFS volumes** over separate PV/PVC resources. Use the `nfs {}` block directly in pod/deployment/cronjob specs: +```hcl +volume { + name = "data" + nfs { + server = "10.0.10.15" + path = "/mnt/main/" + } +} +``` +Only use PV/PVC when the Helm chart requires `existingClaim` (like the Nextcloud Helm chart). + +### Factory Pattern (for multi-user services) +Used when a service needs one instance per user. Structure: +``` +modules/kubernetes// +├── main.tf # Namespace, TLS secret, user module calls +└── factory/ + └── main.tf # Deployment, service, ingress templates with ${var.name} +``` +Examples: `actualbudget`, `freedify` + +To add a new user: +1. Export NFS share at `/mnt/main//` in TrueNAS +2. Add Cloudflare route in tfvars +3. Add module block in main.tf calling factory + +## Common Variables +- `tls_secret_name` - TLS certificate secret name +- `tier` - Deployment tier label +- Service-specific passwords passed as variables + +## Service Versions (as of 2025-01) +- Immich: v2.4.1 +- Freedify: latest (music streaming, factory pattern) + +## Useful Commands +```bash +# ALWAYS use -target for terraform apply (speeds up execution) +terraform apply -target=module.kubernetes_cluster.module. +terraform plan -target=module.kubernetes_cluster.module. +terraform fmt -recursive +kubectl get pods -A +``` + +**Terraform target examples:** +- `terraform apply -target=module.kubernetes_cluster.module.monitoring` - Apply monitoring +- `terraform apply -target=module.kubernetes_cluster.module.immich` - Apply immich +- `terraform apply -target=module.docker-registry-vm` - Apply docker registry VM +- Only skip `-target` when explicitly told to apply everything + +## Module Structure +Top-level modules in `main.tf`: +- `module.k8s-node-template` - K8s node VM template +- `module.non-k8s-node-template` - Non-k8s VM template +- `module.docker-registry-template` - Docker registry template +- `module.docker-registry-vm` - Docker registry VM +- `module.kubernetes_cluster` - Main K8s cluster (contains all services) + +### Kubernetes Services (under module.kubernetes_cluster.module.*) +Core (tier 0-1): +- `metallb`, `dbaas`, `technitium`, `nginx-ingress`, `crowdsec`, `cloudflared` +- `redis`, `metrics-server`, `authentik`, `nvidia`, `vaultwarden`, `reverse-proxy` +- `wireguard`, `headscale`, `xray`, `monitoring` + +GPU (tier 2): +- `immich`, `frigate`, `ollama`, `ebook2audiobook` + +Edge/Aux (tier 3-4): +- `blog`, `drone`, `hackmd`, `mailserver`, `privatebin`, `shadowsocks` +- `city-guesser`, `echo`, `url`, `webhook_handler`, `excalidraw`, `travel_blog` +- `dashy`, `send`, `ytdlp`, `uptime-kuma`, `calibre`, `audiobookshelf` +- `paperless-ngx`, `jsoncrack`, `servarr`, `ntfy`, `cyberchef`, `diun` +- `meshcentral`, `nextcloud`, `homepage`, `matrix`, `linkwarden`, `actualbudget` +- `owntracks`, `dawarich`, `changedetection`, `tandoor`, `n8n`, `real-estate-crawler` +- `tor-proxy`, `onlyoffice`, `forgejo`, `freshrss`, `navidrome`, `networking-toolbox` +- `tuya-bridge`, `stirling-pdf`, `isponsorblocktv`, `rybbit`, `wealthfolio` +- `kyverno`, `speedtest`, `freedify`, `netbox`, `f1-stream`, `kms`, `k8s-dashboard` +- `descheduler`, `reloader`, `infra-maintenance` + +## CI/CD +- Drone CI (`.drone.yml`) for automated deployments +- Auto-updates TLS certificates +- **ALWAYS add `[ci skip]` to commit messages** when you've already run `terraform apply` to avoid triggering CI redundantly +- **After committing, run `git push origin master`** to sync changes + +## Infrastructure +- Proxmox hypervisor for VMs +- Kubernetes cluster with GPU node +- NFS server at 10.0.10.15 for storage +- Redis shared service at `redis.redis.svc.cluster.local` diff --git a/.claude/remote-executor.sh b/.claude/remote-executor.sh new file mode 100755 index 00000000..b0f50526 --- /dev/null +++ b/.claude/remote-executor.sh @@ -0,0 +1,58 @@ +#!/bin/bash +# Remote Command Executor +# Run this in a terminal with SSH access to the remote machine +# +# Usage: ./remote-executor.sh [user@host] [remote_workdir] +# Example: ./remote-executor.sh wizard@10.0.10.10 /home/wizard/code/infra + +REMOTE_HOST="${1:-wizard@10.0.10.10}" +REMOTE_WORKDIR="${2:-/home/wizard/code/infra}" + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +CMD_FILE="$SCRIPT_DIR/cmd_input.txt" +OUTPUT_FILE="$SCRIPT_DIR/cmd_output.txt" +STATUS_FILE="$SCRIPT_DIR/cmd_status.txt" + +# Initialize files +echo "ready" > "$STATUS_FILE" +> "$CMD_FILE" +> "$OUTPUT_FILE" + +echo "╔════════════════════════════════════════════════════════════╗" +echo "║ Remote Command Executor Started ║" +echo "╠════════════════════════════════════════════════════════════╣" +echo "║ Remote: $REMOTE_HOST" +echo "║ Workdir: $REMOTE_WORKDIR" +echo "║ Watching: $CMD_FILE" +echo "╚════════════════════════════════════════════════════════════╝" +echo "" +echo "Waiting for commands..." + +# Watch for new commands +while true; do + # Check if there's a command to execute + if [ -s "$CMD_FILE" ]; then + CMD=$(cat "$CMD_FILE") + + # Clear the command file immediately + > "$CMD_FILE" + + # Update status + echo "running" > "$STATUS_FILE" + echo "[$(date '+%H:%M:%S')] Executing: $CMD" + + # Execute on remote and capture output + ssh "$REMOTE_HOST" "cd $REMOTE_WORKDIR && $CMD" > "$OUTPUT_FILE" 2>&1 + EXIT_CODE=$? + + # Append exit code to output + echo "" >> "$OUTPUT_FILE" + echo "---EXIT_CODE:$EXIT_CODE---" >> "$OUTPUT_FILE" + + # Update status + echo "done:$EXIT_CODE" > "$STATUS_FILE" + echo "[$(date '+%H:%M:%S')] Done (exit: $EXIT_CODE)" + fi + + sleep 0.2 +done diff --git a/modules/kubernetes/monitoring/prometheus_chart_values.tpl b/modules/kubernetes/monitoring/prometheus_chart_values.tpl old mode 100644 new mode 100755 index 0a85d08d..9f9cd042 --- a/modules/kubernetes/monitoring/prometheus_chart_values.tpl +++ b/modules/kubernetes/monitoring/prometheus_chart_values.tpl @@ -318,13 +318,39 @@ serverFiles: # severity: page # annotations: # summary: Pod stuck not ready. - #- alert: ReadyPodsInDeploymentLessThanSpec - # expr: kube_deployment_status_replicas_available - on(namespace, deployment) kube_deployment_spec_replicas < 0 - # for: 10m - # labels: - # severity: page - # annotations: - # summary: Number of ready pods in {{ $labels.deployment }} is less than what is defined in spec. + - alert: DeploymentReplicasMismatch + expr: | + ( + kube_deployment_spec_replicas + - on(namespace, deployment) kube_deployment_status_replicas_available + ) > 0 + for: 15m + labels: + severity: page + annotations: + summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has {{ $value }} unavailable replicas" + - alert: StatefulSetReplicasMismatch + expr: | + ( + kube_statefulset_replicas + - on(namespace, statefulset) kube_statefulset_status_replicas_ready + ) > 0 + for: 15m + labels: + severity: page + annotations: + summary: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has {{ $value }} unavailable replicas" + - alert: DaemonSetMissingPods + expr: | + ( + kube_daemonset_status_desired_number_scheduled + - on(namespace, daemonset) kube_daemonset_status_number_ready + ) > 0 + for: 15m + labels: + severity: page + annotations: + summary: "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has {{ $value }} missing pods" - alert: NoNodeLoadData expr: (node_load1 OR on() vector(0)) == 0 for: 10m