diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index ad4be989..bdcc8e10 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -7,13 +7,14 @@ - **Agents**: `.claude/agents/cluster-health-checker` (haiku, autonomous health checks) - **Reference**: `.claude/reference/` — patterns.md, service-catalog.md, proxmox-inventory.md, github-api.md, authentik-state.md - **GitHub API**: `curl` with tokens from tfvars (`gh` CLI blocked by sandbox) -- **CI/CD**: Woodpecker CI — pushes to master auto-apply platform stack ## Instructions -- **"remember X"**: Update this file + `AGENTS.md` (if it's shared knowledge), commit with `[ci skip]` +- **"remember X"**: Update this file + `AGENTS.md` (if shared knowledge), commit with `[ci skip]` +- **Apply with SOPS**: Use `scripts/tg` wrapper instead of raw `terragrunt` — auto-decrypts secrets - **New services need CI/CD** (Woodpecker) and **monitoring** (Prometheus/Uptime Kuma) - **New service**: Use `setup-project` skill for full workflow - **Ingress**: `ingress_factory` module. Auth: `protected = true`. Anti-AI: on by default. +- **Docker images**: Always build for `linux/amd64` (`docker buildx build --platform linux/amd64`). Pull-through cache serves stale :latest — use versioned tags. ## User Preferences - **Calendar**: Nextcloud at `nextcloud.viktorbarzin.me` diff --git a/.woodpecker/k8s-portal.yml b/.woodpecker/k8s-portal.yml new file mode 100644 index 00000000..a540cf61 --- /dev/null +++ b/.woodpecker/k8s-portal.yml @@ -0,0 +1,49 @@ +when: + event: push + branch: master + path: + include: + - "stacks/platform/modules/k8s-portal/files/**" + +clone: + git: + image: woodpeckerci/plugin-git + settings: + attempts: 5 + backoff: 10s + +steps: + - name: build-and-push + image: woodpeckerci/plugin-docker-buildx + settings: + username: "viktorbarzin" + password: + from_secret: dockerhub-pat + repo: viktorbarzin/k8s-portal + dockerfile: stacks/platform/modules/k8s-portal/files/Dockerfile + context: stacks/platform/modules/k8s-portal/files + platforms: + - linux/amd64 + auto_tag: true + cache_from: "viktorbarzin/k8s-portal:latest" + cache_to: "type=inline" + + - name: deploy + image: bitnami/kubectl:latest + commands: + - "kubectl rollout restart deployment/k8s-portal -n k8s-portal" + - "kubectl rollout status deployment/k8s-portal -n k8s-portal --timeout=120s" + - "echo 'k8s-portal deployed successfully'" + + - name: slack + image: curlimages/curl + commands: + - | + curl -s -X POST -H 'Content-type: application/json' \ + --data "{\"text\":\"K8s Portal: build + deploy ${CI_PIPELINE_STATUS}\"}" \ + "$SLACK_WEBHOOK" || true + environment: + SLACK_WEBHOOK: + from_secret: slack_webhook + when: + status: [success, failure] diff --git a/AGENTS.md b/AGENTS.md index 2a0f5300..54a5ff1d 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -2,31 +2,44 @@ ## Critical Rules (MUST FOLLOW) - **ALL changes through Terraform/Terragrunt** — NEVER `kubectl apply/edit/patch/delete` for persistent changes. Read-only kubectl is fine. -- **NEVER put secrets in committed files** — use `terraform.tfvars` or `secrets/` (git-crypt encrypted) +- **NEVER put secrets in plaintext** — use `secrets.sops.json` (SOPS-encrypted) or `terraform.tfvars` (git-crypt, legacy) - **NEVER restart NFS on TrueNAS** — causes cluster-wide mount failures across all pods - **NEVER commit secrets** — triple-check before every commit - **`[ci skip]` in commit messages** when changes were already applied locally - **Ask before `git push`** — always confirm with the user first ## Execution -- **Apply a service**: `cd stacks/ && terragrunt apply --non-interactive` +- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets) +- **Legacy apply**: `cd stacks/ && terragrunt apply --non-interactive` (uses terraform.tfvars) - **kubectl**: `kubectl --kubeconfig $(pwd)/config` - **Health check**: `bash scripts/cluster_healthcheck.sh --quiet` - **Plan all**: `cd stacks && terragrunt run --all --non-interactive -- plan` +## Secrets Management (SOPS) +- **`config.tfvars`** — plaintext config (hostnames, IPs, DNS records, public keys) +- **`secrets.sops.json`** — SOPS-encrypted secrets (passwords, tokens, SSH keys, API keys) +- **`.sops.yaml`** — defines who can decrypt (age public keys: Viktor + CI) +- **`scripts/tg`** — wrapper that auto-decrypts SOPS before running terragrunt +- **Edit secrets**: `sops secrets.sops.json` (opens $EDITOR, re-encrypts on save) +- **Add a secret**: `sops set secrets.sops.json '["new_key"]' '"value"'` +- **Operators** push PRs → Viktor reviews → CI decrypts and applies. No encryption keys needed for operators. + ## Architecture Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Proxmox VMs. - **70+ services**, each in `stacks//` with its own Terraform state - **Core platform**: `stacks/platform/modules/` (~22 modules: Traefik, Kyverno, monitoring, dbaas, etc.) - **Public domain**: `viktorbarzin.me` (Cloudflare) | **Internal**: `viktorbarzin.lan` (Technitium DNS) -- **Secrets**: `terraform.tfvars` (git-crypt encrypted) +- **Onboarding portal**: `https://k8s-portal.viktorbarzin.me` — self-service kubectl setup + docs +- **CI/CD**: Woodpecker CI — PRs run plan, merges to master auto-apply platform stack ## Key Paths - `stacks//main.tf` — service definition - `stacks/platform/modules//` — core infra modules - `modules/kubernetes/ingress_factory/` — standardized ingress with auth, rate limiting, anti-AI - `modules/kubernetes/nfs_volume/` — NFS volume module (CSI-backed, soft mount) -- `terraform.tfvars` — all secrets, DNS config, shared variables +- `config.tfvars` — non-secret configuration (plaintext) +- `secrets.sops.json` — all secrets (SOPS-encrypted JSON) +- `terraform.tfvars` — legacy secrets file (git-crypt, kept for reference) - `scripts/cluster_healthcheck.sh` — 25-check cluster health script ## Storage @@ -47,16 +60,23 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro - **Proxmox**: 192.168.1.127 (Dell R730, 22c/44t, 142GB RAM) - **Nodes**: k8s-master (10.0.20.100), node1 (GPU, Tesla T4), node2-4 - **GPU**: `node_selector = { "gpu": "true" }` + toleration `nvidia.com/gpu` -- **Pull-through cache**: 10.0.20.10 — docker.io (:5000), ghcr.io (:5010) only +- **Pull-through cache**: 10.0.20.10 — docker.io (:5000), ghcr.io (:5010) only. Caches stale manifests for :latest tags — use versioned tags or pre-pull with `ctr --hosts-dir ''` to bypass. - **pfSense**: 10.0.20.1 (gateway, firewall, DNS forwarding) - **MySQL InnoDB Cluster**: 3 instances on iSCSI, anti-affinity excludes node2 (SIGBUS bug) - **SMTP**: `var.mail_host` port 587 STARTTLS (not internal svc address — cert mismatch) +## Contributor Onboarding +1. Get Authentik account + Headscale VPN access (ask Viktor) +2. Clone repo — `AGENTS.md` is auto-loaded by Codex +3. Create branch → edit → push → open PR +4. Viktor reviews → CI applies → Slack notification +5. Portal: `https://k8s-portal.viktorbarzin.me/onboarding` for full guide + ## Common Operations - **Deploy new service**: Use `stacks//` as template. Create stack, add DNS in tfvars, apply platform then service. - **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts. - **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n `. Increase `resources.limits.memory` in the stack's main.tf. -- **Helm stuck**: If Helm release is in `pending-upgrade`/`failed`, check `reference/patterns.md` for recovery. +- **Add a secret**: `sops set secrets.sops.json '["key"]' '"value"'` then commit. - **NFS exports**: Create dir on TrueNAS first, add to `secrets/nfs_directories.txt`, run `secrets/nfs_exports.sh`. ## Detailed Reference