add prometheus alerts for deployment/statefulset/daemonset replica mismatches [ci skip]
- Add DeploymentReplicasMismatch alert
- Add StatefulSetReplicasMismatch alert
- Add DaemonSetMissingPods alert
- Add .claude/ directory with remote executor and knowledge base
parent 70952c2448
commit d751a5924c
3 changed files with 238 additions and 7 deletions
147
.claude/CLAUDE.md
Executable file
@@ -0,0 +1,147 @@
# Infrastructure Repository Knowledge

## Instructions for Claude
- **When the user says "remember" something**: Always update this file (`.claude/CLAUDE.md`) with the information so it persists across sessions
- **When discovering new patterns or versions**: Add them to the appropriate section below
- **Use `/update-knowledge` command**: Or edit this file directly to add learnings

## Execution Environment (CRITICAL)
- **File operations** (Read, Edit, Write, Glob, Grep): Run locally at `/Volumes/wizard/code/infra`
- **Git commands**: Run locally (git status, git log, git diff, etc.)
- **ALL other commands**: Use the remote executor relay (kubectl, terraform, helm, python, etc.)

### Remote Command Execution (ALWAYS USE THIS)
For any command that is not file editing or git, use the file-based relay:

**To execute a remote command:**
```bash
# 1. Write command
echo "your-command-here" > /Volumes/wizard/code/infra/.claude/cmd_input.txt
# 2. Wait and check status
sleep 1 && cat /Volumes/wizard/code/infra/.claude/cmd_status.txt
# 3. Read output (when status is "done:*")
cat /Volumes/wizard/code/infra/.claude/cmd_output.txt
```

**Status values:** `ready` | `running` | `done:N` (N = exit code)

**Requires user to start executor in another terminal:**
```bash
.claude/remote-executor.sh wizard@10.0.10.10 /home/wizard/code/infra
```
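
**Optional wrapper (a sketch, not part of this repository):** The three relay steps can be wrapped in one function that polls until the status reads `done:N`. `relay_run`, `RELAY_DIR`, the timeout argument, and the 124 timeout code are hypothetical names and choices. The sketch also resets the status file to `ready` before writing the command, on the assumption that a leftover `done:N` from the previous command could otherwise be read as the result of the new one; the documented protocol does not require this step.

```bash
#!/bin/bash
# Hypothetical wrapper around the file-based relay; not part of the repository.
RELAY_DIR="${RELAY_DIR:-/Volumes/wizard/code/infra/.claude}"

relay_run() {
  local cmd="$1"
  local timeout="${2:-60}"   # seconds to wait before giving up (assumed default)
  local waited=0
  local status

  # Assumption: clear any stale "done:N" left over from the previous command.
  echo "ready" > "$RELAY_DIR/cmd_status.txt"
  # 1. Write command
  echo "$cmd" > "$RELAY_DIR/cmd_input.txt"

  # 2. Poll the status file once per second
  while [ "$waited" -lt "$timeout" ]; do
    status="$(cat "$RELAY_DIR/cmd_status.txt" 2>/dev/null)"
    case "$status" in
      done:*)
        # 3. Print the output and return the remote exit code
        cat "$RELAY_DIR/cmd_output.txt"
        return "${status#done:}"
        ;;
    esac
    sleep 1
    waited=$((waited + 1))
  done

  echo "relay_run: no result after ${timeout}s (is the executor running?)" >&2
  return 124
}

# Usage: relay_run "kubectl get pods -A"
```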

---

## Overview
Terraform-based infrastructure repository managing a home Kubernetes cluster on Proxmox VMs. Uses git-crypt for secrets encryption.

## Directory Structure
- `main.tf` - Main Terraform entry point, imports all modules
- `modules/kubernetes/` - Kubernetes service deployments (one folder per service)
- `modules/create-vm/` - Proxmox VM creation module
- `secrets/` - Encrypted secrets (TLS certs, keys) via git-crypt
- `cli/` - Go CLI tool for infrastructure management
- `scripts/` - Helper scripts (cluster management, node updates)
- `playbooks/` - Ansible playbooks for node configuration
- `diagram/` - Infrastructure diagrams (Python-based)

## Key Patterns
- Each service in `modules/kubernetes/<service>/main.tf` defines its own namespace, deployments, services, and ingress
- NFS storage from `10.0.10.15` for persistent data
- TLS secrets managed via `setup_tls_secret` module
- Ingress uses nginx-ingress with annotations for customization
- GPU workloads use `node_selector = { "gpu": "true" }`
- Services expose to `*.viktorbarzin.me` domains

### NFS Volume Pattern
**Prefer inline NFS volumes** over separate PV/PVC resources. Use the `nfs {}` block directly in pod/deployment/cronjob specs:
```hcl
volume {
  name = "data"
  nfs {
    server = "10.0.10.15"
    path   = "/mnt/main/<service>"
  }
}
```
Only use PV/PVC when the Helm chart requires `existingClaim` (like the Nextcloud Helm chart).

### Factory Pattern (for multi-user services)
Used when a service needs one instance per user. Structure:
```
modules/kubernetes/<service>/
├── main.tf          # Namespace, TLS secret, user module calls
└── factory/
    └── main.tf      # Deployment, service, ingress templates with ${var.name}
```
Examples: `actualbudget`, `freedify`

To add a new user:
1. Export NFS share at `/mnt/main/<service>/<username>` in TrueNAS
2. Add Cloudflare route in tfvars
3. Add module block in main.tf calling factory

## Common Variables
- `tls_secret_name` - TLS certificate secret name
- `tier` - Deployment tier label
- Service-specific passwords passed as variables

## Service Versions (as of 2025-01)
- Immich: v2.4.1
- Freedify: latest (music streaming, factory pattern)

## Useful Commands
```bash
# ALWAYS use -target for terraform apply (speeds up execution)
terraform apply -target=module.kubernetes_cluster.module.<service_name>
terraform plan -target=module.kubernetes_cluster.module.<service_name>
terraform fmt -recursive
kubectl get pods -A
```

**Terraform target examples:**
- `terraform apply -target=module.kubernetes_cluster.module.monitoring` - Apply monitoring
- `terraform apply -target=module.kubernetes_cluster.module.immich` - Apply immich
- `terraform apply -target=module.docker-registry-vm` - Apply docker registry VM
- Only skip `-target` when explicitly told to apply everything
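
**Building the target address (a sketch, not part of this repository):** The examples above follow one rule: top-level modules from `main.tf` are addressed as `module.<name>`, and every Kubernetes service is nested under `module.kubernetes_cluster`. The helper below encodes that rule. `tf_target_addr` and `tf_target_cmd` are hypothetical names, and the helper assumes any name that is not a known top-level module is a service.

```bash
#!/bin/bash
# Hypothetical helpers; not part of the repository.

# Map a module or service name to its Terraform address.
tf_target_addr() {
  local name="$1"
  case "$name" in
    k8s-node-template|non-k8s-node-template|docker-registry-template|docker-registry-vm|kubernetes_cluster)
      # Top-level modules declared in main.tf
      echo "module.$name"
      ;;
    *)
      # Assumption: everything else is a service inside the cluster module
      echo "module.kubernetes_cluster.module.$name"
      ;;
  esac
}

# Print (not run) the terraform command, so it can be sent through the relay.
tf_target_cmd() {
  local action="$1"
  local name="$2"
  echo "terraform $action -target=$(tf_target_addr "$name")"
}
```

For example, `tf_target_cmd plan immich` prints `terraform plan -target=module.kubernetes_cluster.module.immich`.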

## Module Structure
Top-level modules in `main.tf`:
- `module.k8s-node-template` - K8s node VM template
- `module.non-k8s-node-template` - Non-k8s VM template
- `module.docker-registry-template` - Docker registry template
- `module.docker-registry-vm` - Docker registry VM
- `module.kubernetes_cluster` - Main K8s cluster (contains all services)

### Kubernetes Services (under module.kubernetes_cluster.module.*)
Core (tier 0-1):
- `metallb`, `dbaas`, `technitium`, `nginx-ingress`, `crowdsec`, `cloudflared`
- `redis`, `metrics-server`, `authentik`, `nvidia`, `vaultwarden`, `reverse-proxy`
- `wireguard`, `headscale`, `xray`, `monitoring`

GPU (tier 2):
- `immich`, `frigate`, `ollama`, `ebook2audiobook`

Edge/Aux (tier 3-4):
- `blog`, `drone`, `hackmd`, `mailserver`, `privatebin`, `shadowsocks`
- `city-guesser`, `echo`, `url`, `webhook_handler`, `excalidraw`, `travel_blog`
- `dashy`, `send`, `ytdlp`, `uptime-kuma`, `calibre`, `audiobookshelf`
- `paperless-ngx`, `jsoncrack`, `servarr`, `ntfy`, `cyberchef`, `diun`
- `meshcentral`, `nextcloud`, `homepage`, `matrix`, `linkwarden`, `actualbudget`
- `owntracks`, `dawarich`, `changedetection`, `tandoor`, `n8n`, `real-estate-crawler`
- `tor-proxy`, `onlyoffice`, `forgejo`, `freshrss`, `navidrome`, `networking-toolbox`
- `tuya-bridge`, `stirling-pdf`, `isponsorblocktv`, `rybbit`, `wealthfolio`
- `kyverno`, `speedtest`, `freedify`, `netbox`, `f1-stream`, `kms`, `k8s-dashboard`
- `descheduler`, `reloader`, `infra-maintenance`

## CI/CD
- Drone CI (`.drone.yml`) for automated deployments
- Auto-updates TLS certificates
- **ALWAYS add `[ci skip]` to commit messages** when you've already run `terraform apply` to avoid triggering CI redundantly
- **After committing, run `git push origin master`** to sync changes
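
**Adding the marker consistently (a sketch, not part of this repository):** A small shell function can append `[ci skip]` only when it is missing. `with_ci_skip` is a hypothetical name, and putting the marker at the end of the first line is an assumption; the rule above only says the marker must appear in the commit message.

```bash
#!/bin/bash
# Hypothetical helper; not part of the repository.
# Appends "[ci skip]" to the first line of a commit message unless the
# message already contains it somewhere.
with_ci_skip() {
  local msg="$1"
  local first rest
  case "$msg" in
    *"[ci skip]"*)
      printf '%s\n' "$msg"
      ;;
    *)
      first="${msg%%$'\n'*}"   # title line
      rest="${msg#"$first"}"   # remaining lines, including the leading newline
      printf '%s\n' "$first [ci skip]$rest"
      ;;
  esac
}
```

A possible use after a manual `terraform apply`: `git commit -m "$(with_ci_skip "add prometheus alerts")" && git push origin master`.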

## Infrastructure
- Proxmox hypervisor for VMs
- Kubernetes cluster with GPU node
- NFS server at 10.0.10.15 for storage
- Redis shared service at `redis.redis.svc.cluster.local`
58
.claude/remote-executor.sh
Executable file
@@ -0,0 +1,58 @@
#!/bin/bash
# Remote Command Executor
# Run this in a terminal with SSH access to the remote machine
#
# Usage: ./remote-executor.sh [user@host] [remote_workdir]
# Example: ./remote-executor.sh wizard@10.0.10.10 /home/wizard/code/infra

REMOTE_HOST="${1:-wizard@10.0.10.10}"
REMOTE_WORKDIR="${2:-/home/wizard/code/infra}"

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
CMD_FILE="$SCRIPT_DIR/cmd_input.txt"
OUTPUT_FILE="$SCRIPT_DIR/cmd_output.txt"
STATUS_FILE="$SCRIPT_DIR/cmd_status.txt"

# Initialize files
echo "ready" > "$STATUS_FILE"
> "$CMD_FILE"
> "$OUTPUT_FILE"

echo "╔════════════════════════════════════════════════════════════╗"
echo "║ Remote Command Executor Started ║"
echo "╠════════════════════════════════════════════════════════════╣"
echo "║ Remote: $REMOTE_HOST"
echo "║ Workdir: $REMOTE_WORKDIR"
echo "║ Watching: $CMD_FILE"
echo "╚════════════════════════════════════════════════════════════╝"
echo ""
echo "Waiting for commands..."

# Watch for new commands
while true; do
    # Check if there's a command to execute
    if [ -s "$CMD_FILE" ]; then
        CMD=$(cat "$CMD_FILE")

        # Clear the command file immediately
        > "$CMD_FILE"

        # Update status
        echo "running" > "$STATUS_FILE"
        echo "[$(date '+%H:%M:%S')] Executing: $CMD"

        # Execute on remote and capture output
        ssh "$REMOTE_HOST" "cd $REMOTE_WORKDIR && $CMD" > "$OUTPUT_FILE" 2>&1
        EXIT_CODE=$?

        # Append exit code to output
        echo "" >> "$OUTPUT_FILE"
        echo "---EXIT_CODE:$EXIT_CODE---" >> "$OUTPUT_FILE"

        # Update status
        echo "done:$EXIT_CODE" > "$STATUS_FILE"
        echo "[$(date '+%H:%M:%S')] Done (exit: $EXIT_CODE)"
    fi

    sleep 0.2
done
40
modules/kubernetes/monitoring/prometheus_chart_values.tpl
Normal file → Executable file
@@ -318,13 +318,39 @@ serverFiles:
           #    severity: page
           #  annotations:
           #    summary: Pod stuck not ready.
-          #- alert: ReadyPodsInDeploymentLessThanSpec
-          #  expr: kube_deployment_status_replicas_available - on(namespace, deployment) kube_deployment_spec_replicas < 0
-          #  for: 10m
-          #  labels:
-          #    severity: page
-          #  annotations:
-          #    summary: Number of ready pods in {{ $labels.deployment }} is less than what is defined in spec.
+          - alert: DeploymentReplicasMismatch
+            expr: |
+              (
+                kube_deployment_spec_replicas
+                - on(namespace, deployment) kube_deployment_status_replicas_available
+              ) > 0
+            for: 15m
+            labels:
+              severity: page
+            annotations:
+              summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has {{ $value }} unavailable replicas"
+          - alert: StatefulSetReplicasMismatch
+            expr: |
+              (
+                kube_statefulset_replicas
+                - on(namespace, statefulset) kube_statefulset_status_replicas_ready
+              ) > 0
+            for: 15m
+            labels:
+              severity: page
+            annotations:
+              summary: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has {{ $value }} unavailable replicas"
+          - alert: DaemonSetMissingPods
+            expr: |
+              (
+                kube_daemonset_status_desired_number_scheduled
+                - on(namespace, daemonset) kube_daemonset_status_number_ready
+              ) > 0
+            for: 15m
+            labels:
+              severity: page
+            annotations:
+              summary: "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has {{ $value }} missing pods"
           - alert: NoNodeLoadData
             expr: (node_load1 OR on() vector(0)) == 0
             for: 10m